23C3 - 1.5

23rd Chaos Communication Congress
Who can you trust?

Day 3
Room Saal 3
Start time 18:30
Duration 01:00
ID 1498
Event type Lecture
Track Science
Language English

Mining Search Queries

How to discover additional knowledge in the AOL query logs

AOL recently and intentionally published over 34M weakly anonymized search queries from its users. This lecture gives an overview of the results of an extensive statistical analysis and data mining procedure applied to this dataset. It presents a methodology for frequency analysis, search trend mining, topic detection, and even user profiling and identification.

The lecture will give an overview of knowledge discovery techniques applied to a sample dataset of real search queries released by AOL. Although AOL anonymized the records by hiding the user name of the sender, this lecture will show how much knowledge can still be gained from those web logs. The lecture aims to show the dangers of progressive data collection and aggregation, particularly the mining of rich user profiles from search query logs.

This talk is split into the following parts:


  • Origin of the data
  • Aftermaths of publication
General analysis of dataset:
  • Structure, Size
  • Representativeness
  • Distribution over time
  • Distribution over user
  • Clickthrough of ranked sites
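The general analysis steps listed above can be illustrated with a minimal Python sketch. It assumes each record has five fields (anonymous user ID, query, timestamp, rank of the clicked result, clicked URL); this layout, the timestamp format, and the sample rows are assumptions made for illustration and are not taken from the talk.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows in the assumed layout:
# (anon_id, query, query_time, item_rank, click_url)
SAMPLE_ROWS = [
    ("1", "weather berlin", "2006-03-01 08:15:00", "1", "http://example.org"),
    ("1", "weather berlin", "2006-03-01 08:16:10", "", ""),
    ("2", "train schedule", "2006-03-02 19:02:45", "3", "http://example.com"),
    ("1", "cheap flights", "2006-03-02 21:30:00", "1", "http://example.net"),
]


def summarize(rows):
    """Count queries per user, queries per day, and clicks per result rank."""
    per_user = Counter()
    per_day = Counter()
    clicks_by_rank = Counter()
    for anon_id, _query, query_time, item_rank, _url in rows:
        per_user[anon_id] += 1
        day = datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S").date()
        per_day[day.isoformat()] += 1
        # An empty rank field is taken to mean the user clicked no result.
        if item_rank:
            clicks_by_rank[int(item_rank)] += 1
    return per_user, per_day, clicks_by_rank


per_user, per_day, clicks_by_rank = summarize(SAMPLE_ROWS)
```

The three counters correspond to the distribution over users, the distribution over time, and the clickthrough of ranked sites. On the real dataset the rows would be read from the log files instead of a hard-coded list.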
Topic analysis:
  • What topics do users search for?
  • Query distribution follows Zipf's law
  • Statistical analysis of topic categories
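One way to check the Zipf's law claim from the list above is to rank queries by frequency and fit a straight line in log-log space; a slope close to -1 suggests Zipf-like behavior. The following is a minimal sketch under that assumption. The function names and the least-squares fit are illustrative choices, not necessarily the method used in the talk.

```python
import math
from collections import Counter


def rank_frequency(queries):
    """Return (rank, frequency) pairs, most frequent query first."""
    counts = Counter(q.strip().lower() for q in queries)
    return [(rank, freq)
            for rank, (_, freq) in enumerate(counts.most_common(), start=1)]


def zipf_exponent(rank_freq):
    """Estimate s in freq ~ C / rank**s by least squares on log-log values.

    Requires at least two distinct ranks.
    """
    xs = [math.log(rank) for rank, _ in rank_freq]
    ys = [math.log(freq) for _, freq in rank_freq]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # negate the slope so a Zipf-like curve gives s > 0


# Synthetic queries whose frequencies are exactly 60 / rank.
synthetic = (["q1"] * 60 + ["q2"] * 30 + ["q3"] * 20
             + ["q4"] * 15 + ["q5"] * 12)
exponent = zipf_exponent(rank_frequency(synthetic))
```

For the synthetic input the estimated exponent is 1 up to floating-point error. Real query logs will only approximate a straight line, typically with deviations at the head and tail of the distribution.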
Search trend mining:
  • Time slicing the dataset
  • Difference analysis of search queries in consecutive slices
  • Do search queries correlate with current events at the time?
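The time slicing and difference analysis listed above can be sketched as follows. The example groups queries into daily slices and ranks queries by how much their share of all queries changes between two consecutive slices. The slice width, the scoring by difference in relative frequency, and all names are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


def slice_by_day(rows):
    """Group (query, timestamp) pairs into one Counter per calendar day."""
    slices = defaultdict(Counter)
    for query, query_time in rows:
        day = datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S").date()
        slices[day][query.strip().lower()] += 1
    # Return the slices in chronological order.
    return [slices[day] for day in sorted(slices)]


def rising_queries(prev, curr, top=3):
    """Rank queries by the change in their share of all queries."""
    prev_total = sum(prev.values()) or 1
    curr_total = sum(curr.values()) or 1
    delta = {
        q: curr[q] / curr_total - prev[q] / prev_total
        for q in set(prev) | set(curr)
    }
    return sorted(delta, key=lambda q: delta[q], reverse=True)[:top]


# Hypothetical sample data: two consecutive days.
rows = [
    ("weather", "2006-03-01 08:00:00"),
    ("weather", "2006-03-01 09:00:00"),
    ("news", "2006-03-01 10:00:00"),
    ("news", "2006-03-01 11:00:00"),
    ("world cup", "2006-03-02 08:00:00"),
    ("world cup", "2006-03-02 09:00:00"),
    ("world cup", "2006-03-02 10:00:00"),
    ("weather", "2006-03-02 11:00:00"),
]
day1, day2 = slice_by_day(rows)
trending = rising_queries(day1, day2)
```

Queries that rise sharply between slices are candidates for correlation with current events. Comparing shares rather than raw counts keeps slices with different total traffic comparable.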
User profiling:
  • Generating user profiles out of search queries
  • Categorization of usage frequency, user's interests, competencies
  • Methods of user identification
  • Possible identification patterns
  • A broad spectrum of additional knowledge can be derived despite anonymization of data
  • User identification possible
  • Consequences for your searching behavior