Satisfying Needs by Diversifying Topics

Thursday, February 24, 2011

When retrieving documents for a search query, a simplistic approach that ranks the most relevant documents highest will leave users completely unsatisfied if it incorrectly interprets the query. Incorrect interpretation is an inevitable possibility when dealing with ambiguous queries, which in some experiments were shown to represent over 16% of all queries [1]. Significant ambiguity can result from query terms that have many meanings. We can hypothesize that the amount of ambiguity will increase as the internet (and search indexes) grow, simply because the probability of the same term being used in different contexts increases as the number of terms increases.

As an example, consider the query "ajax", which could refer to the web development technique of Asynchronous JavaScript and XML, to the Amsterdam soccer team, to the household cleaning product, to the Greek warrior, etc. Ranking solely by relevance score will likely lead to a first page of results that satisfies only the web development meaning, due to the number of highly interconnected online documents about this topic. In fact, when searching on Yahoo! these are the only results one receives (all searches performed February 14th, 2011). If we happened to be searching for one of the other meanings, we'd need to adjust our query. When searching on Google Netherlands we have the opposite problem: all the results relate to the Amsterdam soccer team. However, when searching on Google English we see a result about the Amsterdam soccer team and a town in Calgary mixed into the top 10.

When addressed in a more general Artificial Intelligence setting, as opposed to the Information Retrieval setting we're focusing on here, this problem is occasionally referred to as the "Paris Hilton problem": is a searcher who enters this query interested in the woman or the Parisian hotel? Although the results are dominated by websites about the Hilton heiress, Google does provide one site, at rank 9, satisfying those of us seeking to book a stay at the Hilton Arc de Triomphe Paris.

Recently, information retrieval research has started to put more focus on the problem of ambiguous queries. The primary solution has been to alter the search result list so that the top results are not only highly relevant but also cover the multiple meanings a query could represent -- to diversify the search results. This is analogous to the "explore and exploit" strategy in reinforcement learning, in which we make what appears to be a sub-optimal decision so that we can gain information and make better decisions in the future. There is also a relationship to risk management in portfolio theory: by cautiously covering multiple meanings we reduce the worst-case outcome when we incorrectly predict the user's intended meaning. The Google English results appear to have incorporated something of a diversification strategy.
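To make the portfolio intuition concrete, here is a toy sketch (not from any particular paper) of hedging across query intents: given a made-up distribution over what "ajax" means and hypothetical per-intent relevance scores, we greedily pick the result that most increases the probability that a user with unknown intent finds at least one useful document. All names and numbers below are invented for illustration.

```python
# Hypothetical intent distribution for the query "ajax"
intents = {"web-dev": 0.5, "soccer": 0.3, "cleaner": 0.2}

# Hypothetical relevance of each document to each intent, in [0, 1]
docs = {
    "xmlhttprequest-tutorial": {"web-dev": 0.9, "soccer": 0.0, "cleaner": 0.0},
    "jquery-ajax-guide":       {"web-dev": 0.8, "soccer": 0.0, "cleaner": 0.0},
    "afc-ajax-fixtures":       {"web-dev": 0.0, "soccer": 0.9, "cleaner": 0.0},
    "ajax-cleanser-review":    {"web-dev": 0.0, "soccer": 0.0, "cleaner": 0.7},
}

def greedy_diversify(docs, intents, k=3):
    """Pick k documents maximizing the probability that a user with an
    unknown intent finds at least one relevant result."""
    p_unsat = dict(intents)  # probability mass of each still-unsatisfied intent
    ranking, remaining = [], dict(docs)
    for _ in range(k):
        # marginal gain = intent mass this document newly satisfies
        best = max(remaining, key=lambda d: sum(
            p_unsat[i] * remaining[d][i] for i in intents))
        ranking.append(best)
        for i in intents:
            p_unsat[i] *= 1.0 - docs[best][i]
        del remaining[best]
    return ranking

print(greedy_diversify(docs, intents))
# → ['xmlhttprequest-tutorial', 'afc-ajax-fixtures', 'ajax-cleanser-review']
```

Note that a pure relevance ranking would place the second web-development page at rank 2; the greedy hedge instead spends that slot on the soccer intent, since the web-development intent is already almost certainly satisfied.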

A key challenge in result diversification is determining the underlying topics of the returned search results. Knowing the topics our results cover tells us how to diversify them. We recently presented work on precisely this problem; the image below shows a summary of the system we built [2].

Diversification System Diagram

The basic strategy is to apply a topic modeling algorithm to the fetched results and then use a reordering algorithm to ensure we highly rank relevant documents that have a high probability of belonging to different topics. The specific implementation we designed used Probabilistic Latent Semantic Analysis to generate topics, but any other algorithm (e.g. Latent Dirichlet Allocation, the Correlated Topic Model, etc.) could be substituted.
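The two-stage strategy can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it uses scikit-learn's Latent Dirichlet Allocation as a stand-in for PLSA (scikit-learn does not ship PLSA), and the snippets, retrieval scores, and trade-off parameter are all invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

# Hypothetical result snippets for the query "ajax", with made-up
# retrieval scores standing in for the original relevance ranking
snippets = [
    "asynchronous javascript and xml requests in the browser",
    "javascript xml api tutorial for web developers",
    "ajax amsterdam wins the dutch league title match",
    "ajax cleaning powder removes kitchen stains",
]
relevance = np.array([0.9, 0.8, 0.6, 0.5])

# Stage 1: infer a topic mixture for each fetched result
X = CountVectorizer().fit_transform(snippets)
doc_topics = LatentDirichletAllocation(
    n_components=3, random_state=0).fit_transform(X)

# Stage 2: greedily reorder, discounting documents whose topics
# are already covered by higher-ranked picks
def rerank(doc_topics, relevance, lam=0.5):
    covered = np.zeros(doc_topics.shape[1])
    order, remaining = [], list(range(len(relevance)))
    while remaining:
        scores = [lam * relevance[d] - (1 - lam) * covered @ doc_topics[d]
                  for d in remaining]
        pick = remaining[int(np.argmax(scores))]
        order.append(pick)
        covered += doc_topics[pick]
        remaining.remove(pick)
    return order

print(rerank(doc_topics, relevance))
```

The discount term plays the role of the reordering algorithm: once a topic is well covered, further documents from it must overcome a penalty, so lower-scored documents from uncovered topics can rise.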

The inquisitive reader might be wondering: if I know the topics I like and the search engine knows the topics representing documents, why bother diversifying search results? Why not simply provide me with documents related to my preferences? Relating this back to the previous post about search users' topic preferences, there is clearly a viable path of research here. We look forward to presenting details about our work integrating these techniques.

References

[1] M. Sanderson. Ambiguous queries: test collections need more sense. SIGIR '08, pp. 499-506, 2008.

[2] P. Lubell-Doughtie and K. Hofmann. Improving Result Diversity using Probabilistic Latent Semantic Analysis. DIR '11, pp. 24-27, 2011.

Are our Preferences in a Sub-Space?

Sunday, January 30, 2011

The web is really big, but when one searches for information, the topics one is looking for exist within a much smaller portion of the web. A system that knows our individual preferences can optimize its document search by starting from documents we likely prefer. Additionally, by learning our preferences, the system can improve its presentation of results by biasing document rankings towards documents similar in theme to those we preferred in the past, or by disambiguating our queries based on the interpretations we used in the past.

As an example of handling multiple meanings (polysemy): if we've searched for leopard and tiger, we're showing an interest in something like a large cats topic, and when we then search for jaguar we probably want the cat. However, if we've recently searched for lexus and bmw, our interest is likely in a luxury cars topic, and when searching for jaguar we probably want the car. Searching for jaguar using Google, on January 30th, 2011, puts the car at the top – probably accurate in general, but certainly not always.

This method can also be used to help with words whose meaning changes over time (dynamic referents). For example, if we search for Rahm Emanuel, Obamacare, and then US president, we're likely looking for Barack Obama. But if we had searched for Saddam Hussein, Dick Cheney, and then US president, the chances we are looking for George Bush are much higher than in the previous sequence of searches, and it would be worthwhile to highly rank some George Bush related results.

Performing these disambiguation tasks and other personalization techniques relies on coming to an understanding of the searcher's topics of interest. To quantify the preferences of web users, we'll begin by introducing a method of describing preferences that is understandable and useful. We can consider user preferences as broken down into a finite set of "topics", each of which the user has some preference for. For easy visualization, we'll take three topics: Business, Sports, and Health (inspired by Google News). Consider a user who is moderately interested in Business, uninterested in Sports, and quite interested in Health. We can represent each of these interests on a separate line graph for each topic, with something like the following:

The large black dot indicates interest in a topic: the farther the dot is towards the right end of the arrow, the more interested the user is in that topic. We can then consider each topic as a dimension in 3-dimensional space and plot the user's topic preferences in this space, with each axis as a topic:

We now have a single black dot indicating the user's topic preferences. Tracing horizontally left from the dot to the green line, we can determine the user's preference in the Health dimension; tracing vertically down to the blue line, we can determine the user's preference in the Business dimension. The Sports dimension is represented by the axis going into the page; the arrow helps us see the dot's distance in this dimension by tracing down and then diagonally along the orange line.

Now, back to web search. A convenient feature of breaking down user preferences by topics is that we can also break down a document's content by topics. Perhaps the dot above also represents the topic mixture of a document about health in the business setting, which mentions that sports can contribute to health. Given the division of documents into topics, we can visualize our preferences by placing the documents we like in a topic space just as above. (With an appropriate "distaste" weighting we can also incorporate disliked documents into the same method.)
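Since preferences and documents live in the same topic space, matching one against the other reduces to simple vector arithmetic. A minimal sketch, with the three topics above and invented numeric values:

```python
import numpy as np

topics = ["Business", "Sports", "Health"]

# Moderately interested in Business, uninterested in Sports,
# quite interested in Health (values are made up for illustration)
preferences = np.array([0.5, 0.0, 0.9])

# A document's topic mixture lives in the same space: mostly Health,
# some Business, a mention of Sports
doc_mixture = np.array([0.3, 0.1, 0.6])

# The dot product scores how well the document matches the user's tastes;
# a "distaste" scheme could use negative preference weights the same way.
score = float(preferences @ doc_mixture)
print(score)  # 0.5*0.3 + 0.0*0.1 + 0.9*0.6 = 0.69
```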

Before we become too excited about this new way to map our preferences, a fair question to ask is: do we even have significant and meaningful preferences in topic space? That is, if we place the documents we like into topic space, will they be clustered around some specific preferences, or will they be more or less evenly distributed, and therefore not give us much useful information? We can empirically answer this question by examining user clicks in a search engine log.

To test the topical distribution of preferences we took the documents clicked on by an anonymous user, broke these documents into 3 topics (for easy visualization), and plotted the documents in 3 dimensional space:

In this plot each dot represents a document, and the color of the dot helps us distinguish location on the z-axis (into the page) – redder dots are closer and darker dots are farther. We see that this user's clicked documents are located in a well-defined sub-space of the entire topic space, and within this sub-space they form some additional clumps, i.e. they are not evenly distributed. The dotted lines shown form a plane (a 2-dimensional surface or manifold) that is the best-fit plane (using regression) to the 3-dimensional points.
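The best-fit plane itself comes from an ordinary least-squares regression of one topic coordinate on the other two. A sketch with synthetic points standing in for the clicked documents (the true data comes from a search log; these coefficients are made up):

```python
import numpy as np

# Synthetic clicked-document coordinates lying near the plane
# z = 0.4x + 0.2y + 0.1, plus a little noise
rng = np.random.default_rng(0)
x, y = rng.random(50), rng.random(50)
z = 0.4 * x + 0.2 * y + 0.1 + rng.normal(0, 0.01, 50)

# Solve [x y 1] @ [a, b, c] ≈ z in the least-squares sense
A = np.column_stack([x, y, np.ones_like(x)])
(a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)

print(a, b, c)  # recovers roughly 0.4, 0.2, 0.1
```

The residuals of this fit quantify how tightly the clicks hug the plane, i.e. how strongly the user's preferences are confined to a sub-space.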
This can be very useful information if we are helping the user look for documents. Going back to jaguars for an example: suppose that when we place documents about the cars Lexus, BMW, etc. in the topic space they end up in the far-top-right, and when we place documents about leopards, tigers, etc. they end up in the close-bottom-left. Given our model, the user's preferences are near the big cats, so we can surmise the user is more interested in big cats than luxury cars. Then, if this user searches for jaguar, we will highly rank documents about jaguar the animal (apparently Panthera onca).
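This inference amounts to comparing the user's preference point against cluster centroids in topic space. A toy version, with invented coordinates for the two clusters and the user:

```python
import numpy as np

# Hypothetical topic-space coordinates: car documents in the far-top-right,
# big-cat documents in the close-bottom-left
luxury_cars = np.array([[0.9, 0.8, 0.9], [0.8, 0.9, 0.8]])
big_cats    = np.array([[0.1, 0.2, 0.1], [0.2, 0.1, 0.2]])

# The user's preference point, estimated from past clicks (made up here)
user_pref = np.array([0.25, 0.20, 0.15])

def likely_meaning(pref, cars, cats):
    """Return the meaning whose cluster centroid is nearer the user."""
    d_cars = np.linalg.norm(pref - cars.mean(axis=0))
    d_cats = np.linalg.norm(pref - cats.mean(axis=0))
    return "car" if d_cars < d_cats else "animal"

print(likely_meaning(user_pref, luxury_cars, big_cats))  # → animal
```

A production system would of course use the full topic model rather than Euclidean distance to centroids, but the decision rule has the same shape: rank the interpretation nearest the user's region of topic space first.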

The benefits of this sort of document reordering were tested in roughly formulated experiments and produced encouraging, but inconclusive, results. We stress that the benefit of this approach is not that we are simply providing documents similar to documents the user liked in the past. The benefit is that we are providing documents similar to the topics associated with previously liked documents. This is the nuanced, but essential, benefit of searching with an integrated topic model.
