23C3 - 1.5
23rd Chaos Communication Congress
Who can you trust?
Speakers | |
---|---|
Robert |
Schedule | |
---|---|
Day | 3 |
Room | Saal 3 |
Start time | 18:30 |
Duration | 01:00 |
Info | |
ID | 1498 |
Event type | Lecture |
Track | Science |
Language | English |
Mining Search Queries
How to discover additional knowledge in the AOL query logs
AOL recently and intentionally published over 34M weakly anonymized search queries from its users. This lecture gives an overview of the results of an extensive statistical analysis and data mining procedure on this dataset. In doing so, it presents a methodology for frequency analysis, search trend mining, topic detection, and even user profiling and identification.
The lecture will give an overview of knowledge discovery techniques applied to a sample dataset of real search queries released by AOL. Although AOL anonymized the records by hiding the user name of the sender, this lecture will show how much knowledge can still be gained from those web logs. The lecture aims to show the dangers of progressive data collection and aggregation, particularly the mining of rich user profiles from search query logs.
The talk is split into the following parts:
Introduction:
- Origin of the data
- Aftermath of publication
- Structure, Size
- Representativeness
- Distribution over time
- Distribution over users
- Clickthrough of ranked sites
- What topics do users search for?
- Query distribution follows Zipf's law
- Statistical analysis of topic categories
- Time slicing the dataset
- Difference analysis of search queries in consecutive slices
- Do search queries correlate with current events?
- Generating user profiles out of search queries
- Categorization of usage frequency, user's interests, competencies
- Methods of user identification
- Possible identification patterns
- A broad spectrum of additional knowledge can be derived despite anonymization of data
- User identification is possible
- Consequences for your searching behavior
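
As a rough illustration of two items from the outline above (the Zipf-like query distribution and the difference analysis of consecutive time slices), here is a minimal Python sketch. It is not the speaker's method. The records are made-up toy data in an assumed `(user id, query, timestamp)` layout, and all function names are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime

# Hypothetical toy records: (anonymized user id, query, timestamp).
# The layout is an assumption for illustration, not the real log format.
LOG = [
    (1, "google", "2006-03-01 10:00:00"),
    (2, "google", "2006-03-02 11:30:00"),
    (3, "Google", "2006-03-05 09:15:00"),
    (1, "ebay", "2006-03-07 20:00:00"),
    (2, "ebay", "2006-03-09 21:45:00"),
    (3, "weather", "2006-03-10 08:00:00"),
    (1, "google", "2006-04-01 10:00:00"),
    (2, "easter eggs", "2006-04-10 12:00:00"),
    (3, "easter eggs", "2006-04-12 13:00:00"),
    (4, "easter eggs", "2006-04-14 14:00:00"),
    (4, "ebay", "2006-04-15 19:00:00"),
]


def normalize(query):
    """Lower-case and trim a query so trivial variants are counted together."""
    return query.strip().lower()


def rank_frequency(queries):
    """Return (rank, query, count) tuples, most frequent query first."""
    counts = Counter(normalize(q) for q in queries)
    return [(rank, q, c)
            for rank, (q, c) in enumerate(counts.most_common(), start=1)]


def zipf_slope(ranked):
    """Least-squares slope of log(count) against log(rank).

    A slope near -1 is the classic Zipf shape. Needs at least two ranks.
    """
    xs = [math.log(rank) for rank, _, _ in ranked]
    ys = [math.log(count) for _, _, count in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slice_by_month(log):
    """Group query counts into one Counter per calendar month ('YYYY-MM')."""
    slices = {}
    for _, query, timestamp in log:
        month = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m")
        slices.setdefault(month, Counter())[normalize(query)] += 1
    return slices


def rising_queries(earlier, later):
    """Queries whose count grew between two slices, largest gain first."""
    gains = ((q, later[q] - earlier.get(q, 0)) for q in later)
    return sorted((g for g in gains if g[1] > 0), key=lambda g: (-g[1], g[0]))


if __name__ == "__main__":
    ranked = rank_frequency(query for _, query, _ in LOG)
    print("rank/frequency:", ranked)
    print("log-log slope:", zipf_slope(ranked))
    slices = slice_by_month(LOG)
    print("rising in April:", rising_queries(slices["2006-03"], slices["2006-04"]))
```

On a dataset this small the fitted slope means little; the point is only the shape of the procedure. Comparing consecutive slices in this way is one simple approach to spotting queries tied to current events, such as a seasonal term that appears only in the later slice.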