The web is vast, but the information a searcher wants usually lives in a much smaller portion of it. A system that knows our individual preferences can optimize its document search by starting from documents we are likely to prefer. By learning those preferences, the system can also improve how it presents results, either by biasing document rankings towards documents similar in theme to ones we preferred in the past, or by disambiguating our queries based on the interpretations we used in the past.
As an example of handling multiple meanings (polysemy), if we've searched for leopard and tiger we're showing an interest in something like the large cats topic, and when we search for jaguar we probably want the cat. However, if we've recently searched for lexus and bmw our interest is likely in the luxury cars topic and when searching for jaguar we probably want the car. Searching for jaguar using Google, on January 30th 2011, puts the car at the top – probably accurate in general but certainly not always.
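The jaguar example can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the `QUERY_TOPICS` and `SENSES` tables are hypothetical and hand-built, whereas a real system would have to learn the mapping from queries to topics, and the function name `disambiguate` is made up for this sketch.

```python
from collections import Counter

# Hypothetical, hand-built lookup tables (assumptions for illustration only).
QUERY_TOPICS = {
    "leopard": "large cats",
    "tiger": "large cats",
    "lexus": "luxury cars",
    "bmw": "luxury cars",
}

# Candidate interpretations of an ambiguous query, keyed by topic.
SENSES = {
    "jaguar": {
        "large cats": "jaguar (animal)",
        "luxury cars": "Jaguar (car maker)",
    },
}


def disambiguate(query, history, default_topic):
    """Pick the sense of `query` whose topic appears most often in `history`."""
    senses = SENSES.get(query)
    if senses is None:
        return query  # not an ambiguous query we know about

    # Count how often each known topic shows up in the past searches.
    counts = Counter(QUERY_TOPICS[q] for q in history if q in QUERY_TOPICS)

    # Only topics that are valid senses of this query are candidates.
    best = max(senses, key=lambda topic: counts[topic])
    if counts[best] == 0:
        best = default_topic  # no evidence: fall back to the general default
    return senses[best]


print(disambiguate("jaguar", ["leopard", "tiger"], "luxury cars"))
print(disambiguate("jaguar", ["lexus", "bmw"], "luxury cars"))
```

The `default_topic` argument plays the role of the general-purpose ranking: with no search history the sketch returns the car, and the history only overrides that default when it contains evidence for another topic.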
This method can also be used to help with words whose meaning changes over time (dynamic referents). For example, if we search for Rahm Emanuel, Obamacare, and then US president, we're likely looking for Barack Obama. But if we had searched for Saddam Hussein, Dick Cheney, and then US president, the chances that we are looking for George Bush are much higher than in the previous sequence of searches, and it would be worthwhile to rank some George Bush related results highly.
Performing these disambiguation tasks and other personalization techniques relies on understanding the searcher's topics of interest. To quantify the preferences of web users we'll begin by introducing a way to describe preferences that is understandable and useful. We can consider user preferences as being broken down into a finite set of "topics", each of which the user has some degree of preference for. For easy visualization, we'll take three topics: Business, Sports, and Health (inspired by Google News). Let's consider a user who is moderately interested in Business, uninterested in Sports, and quite interested in Health. We can represent each of these interests on a separate line graph, one per topic, with something like the following:
The large black dot indicates interest in a topic and the farther the dot is towards the right end of the arrow the more interested the user is in that topic. We can then consider each topic as a dimension in 3-dimensional space and plot the user's topic preferences in this space with each axis as a topic:
We now have a single black dot to indicate the user's topic preferences. Tracing horizontally left from the dot to the green line, we can determine the user's preference in the Health dimension; tracing vertically down to the blue line, we can determine the user's preference in the Business dimension. The Sports dimension is represented by the axis going into the page; the arrow helps us see the dot's distance in this dimension by tracing down and then diagonally along the orange line.
Now, back to web search. A convenient feature of breaking down user preferences by topics is that we can also break down a document's content by topics. Perhaps the dot above also represents the topics making up a document about health in the business setting, which mentions that sports can contribute to health. Given the division of documents into topics, we can visualize our preferences by placing the documents we like in a topic space just as above. (With an appropriate "distaste" weighting we can also incorporate disliked documents into the same method.)
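As a rough sketch of this idea, a user's position in topic space can be estimated from the topic vectors of the documents they liked and disliked. Everything specific here is an assumption made for illustration: the function name `preference_vector`, the choice of averaging, the `distaste` factor of 0.5, and the clamping at zero are not taken from the research described in this post.

```python
TOPICS = ("Business", "Sports", "Health")


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def preference_vector(liked, disliked=(), distaste=0.5):
    """Estimate a user's topic preferences from document topic vectors.

    Assumed formula: mean(liked) - distaste * mean(disliked),
    with negative components clamped to zero.
    """
    pref = mean_vector(list(liked))
    if disliked:
        penalty = mean_vector(list(disliked))
        pref = [p - distaste * q for p, q in zip(pref, penalty)]
    return [max(0.0, p) for p in pref]


# Each document is a (Business, Sports, Health) vector with values in [0, 1].
liked = [(0.5, 0.1, 0.9), (0.7, 0.1, 0.7)]
disliked = [(0.0, 1.0, 0.0)]

for topic, value in zip(TOPICS, preference_vector(liked, disliked)):
    print(f"{topic}: {value:.2f}")
```

With these made-up documents the result sits high on Health, moderately high on Business, and at zero on Sports, which matches the example user described earlier.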
Before we become too excited about this new way to map our preferences, a fair question to ask is: do we even have significant and meaningful preferences in topic space? That is, if we place the documents we like into topic space, will they be clustered around some specific preferences, or will they be more or less evenly distributed and therefore give us little useful information? We can answer this question empirically by examining user clicks in a search engine log.
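One simple way to put a number on "clustered versus evenly distributed" is to measure how far, on average, a user's clicked documents lie from their centroid in topic space. The sketch below is an assumption about how such a check could look, not the measure used in the research; the names `centroid` and `spread` are hypothetical, the data is invented, and `math.dist` requires Python 3.8 or later.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def spread(points):
    """Mean Euclidean distance from each point to the centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


# Invented topic vectors for documents one user clicked on.
clustered = [(0.6, 0.1, 0.8), (0.5, 0.2, 0.9), (0.7, 0.1, 0.7)]

# A deliberately scattered baseline: the eight corners of the unit cube.
scattered = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]

print(spread(clustered) < spread(scattered))
```

A spread that is small relative to a scattered baseline suggests the clicks concentrate around particular topic preferences; a spread close to the baseline suggests they carry little preference information.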
To test the topical distribution of preferences we took the documents clicked on by an anonymous user, broke these documents into 3 topics (for easy visualization), and plotted the documents in 3-dimensional space:
The benefits of this sort of document reordering were tested in sketchily formulated experiments, which produced encouraging, but inconclusive, results. We stress that the benefit of this approach is not that we are simply providing documents similar to documents the user liked in the past. The benefit is that we are providing documents similar to the topics associated with previously liked documents. This is the nuanced, but essential, benefit of searching with an integrated topic model.
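To make the idea of topic-based reordering concrete, here is a minimal sketch that blends a search engine's original score with the cosine similarity between each document's topic vector and the user's topic preferences. The blending formula, the `bias` parameter, the function names, and the example data are all assumptions for illustration; they are not the method evaluated in the experiments.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(results, preference, bias=0.5):
    """Reorder results by a blend of base score and topic similarity.

    `results` is a list of (doc_id, base_score, topic_vector) tuples.
    `bias` in [0, 1] controls how strongly topics influence the ranking.
    """
    def score(result):
        _, base, topics = result
        return (1 - bias) * base + bias * cosine(topics, preference)

    return [r[0] for r in sorted(results, key=score, reverse=True)]


# Invented (Business, Sports, Health) preferences and search results.
preference = (0.6, 0.0, 0.8)
results = [
    ("sports-news", 0.9, (0.0, 1.0, 0.0)),
    ("health-at-work", 0.7, (0.5, 0.1, 0.9)),
    ("stock-tips", 0.8, (1.0, 0.0, 0.0)),
]

print(rerank(results, preference))
print(rerank(results, preference, bias=0.0))
```

Note that the comparison is made against the user's position in topic space rather than against any individual previously liked document, which is the distinction drawn above. Setting `bias=0.0` leaves the original ordering untouched.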
- A very accessible introduction to meaning in text is Geometry and Meaning by Dominic Widdows.
- Research and figures are taken from the draft paper Applying Diversity and Novelty to Personalized Search; academic references can be found within it.