Helioid Preview Now Open

Friday, March 04, 2011

This version of Helioid is a meta-search engine offering category-based personalization. Below is a partial screenshot of the results.

By interacting with the categories on the left you can choose which results are shown on the right. Please let us know if you have any suggestions, comments, criticisms, complaints, anything. Try out the Helioid preview now.

Satisfying Needs by Diversifying Topics

Thursday, February 24, 2011

When retrieving documents for a search query, a simplistic approach that ranks the most relevant documents highest will leave users completely unsatisfied if it incorrectly interprets the query. Incorrect interpretation is an inevitable possibility when dealing with ambiguous queries, which in some experiments have been shown to represent over 16% of all queries [1]. Significant ambiguity can result from query terms that have many possible meanings. We can hypothesize that the amount of ambiguity will increase as the internet (and search indexes) grow, simply because the probability of the same term being used in different contexts increases as the number of terms increases.

As an example, consider the query "ajax" which could refer to web development methods of Asynchronous JavaScript and XML, to the Amsterdam soccer team, to the household cleaning product, to the Greek warrior, etc. Ranking solely by relevance score will likely lead to a first page of results that only satisfy the web development meaning, due to the number of highly interconnected online documents about this topic. In fact, when searching on Yahoo! these are the only results one receives (all searches performed February 14th 2011). If we happened to be searching for one of the other meanings we'd need to adjust our query. When searching on Google Netherlands we end up with the opposite problem and all the results relate to the Amsterdam soccer team. However, when searching on Google English we see a result about the Amsterdam soccer team and a town in Calgary mixed into the top 10.

When addressed in a more general Artificial Intelligence setting, as opposed to the Information Retrieval setting we're focusing on here, this problem is occasionally referred to as the "Paris Hilton problem." Is a searcher who enters this query interested in the woman or the Parisian hotel? Although the results are dominated by web sites about the Hilton heiress, Google does provide one site, at rank 9, satisfying those of us seeking to book a stay at the Hilton Arc de Triomphe Paris.

Recently, information retrieval research has started to put more focus on addressing the problem of ambiguous queries. The primary solution has been to alter the search result list so that the top results are not only highly relevant, but also cover the multiple meanings a query could represent -- that is, to diversify the search results. This is analogous to the reinforcement learning strategy of "explore and exploit", in which we make what appears to be a sub-optimal decision so that we can gain information and make better decisions in the future. There is also a relationship to risk management in portfolio theory: by cautiously covering multiple meanings we reduce the worst-case outcome when we incorrectly predict the user's intended meaning. The results of Google English appear to have incorporated something of a diversification strategy.

A key challenge in result diversification is to determine the underlying topics of the returned search results. Knowing our results' topics tells us how to diversify them. We recently presented work on precisely this problem; the image below shows a summary of the system we built [2].

Diversification System Diagram

The basic strategy is to apply a topic modeling algorithm to the fetched results and then use a reordering algorithm to ensure that the highest-ranked documents are both relevant and likely to belong to different topics. The specific implementation we designed uses Probabilistic Latent Semantic Analysis to generate topics, but any other algorithm (e.g. Latent Dirichlet Allocation, Correlated Topic Model, etc.) could be substituted.
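
As a rough sketch of this strategy (not the system from [2]), the Python below uses scikit-learn's LDA in place of PLSA and a simple greedy reranker; the diversify function, its lambda trade-off weight, and the relevance scores are illustrative assumptions.

```python
# A minimal sketch of topic-based result diversification. LDA stands in
# for the PLSA used in [2]; the greedy reranker and its lambda weight are
# illustrative placeholders, not Helioid's actual implementation.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def diversify(snippets, relevance, n_topics=5, lam=0.7, k=10):
    """Reorder results so the top k are relevant *and* cover many topics."""
    counts = CountVectorizer(stop_words="english").fit_transform(snippets)
    doc_topics = LatentDirichletAllocation(
        n_components=n_topics, random_state=0).fit_transform(counts)
    covered = np.zeros(n_topics)      # topic mass already shown to the user
    remaining = list(range(len(snippets)))
    order = []
    while remaining and len(order) < k:
        # Relevance minus a penalty for repeating topics we already covered.
        scores = [lam * relevance[i] - (1 - lam) * doc_topics[i] @ covered
                  for i in remaining]
        best = remaining.pop(int(np.argmax(scores)))
        covered += doc_topics[best]
        order.append(best)
    return order
```

Given the fetched result snippets and their relevance scores, the returned ordering tends to place documents from different topics near the top while still favoring the most relevant ones.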

The inquisitive reader might be wondering: if I know the topics I like, and the search engine knows the topics representing documents, why bother diversifying search results at all? Why not simply provide me with documents related to my preferences? Relating this back to the previous post about search users' topic preferences, there is clearly a viable path of research here. We look forward to presenting details about our work integrating these techniques.

References

[1] M. Sanderson. Ambiguous queries: test collections need more sense. SIGIR '08, pp. 499-506, 2008.

[2] P. Lubell-Doughtie and K. Hofmann. Improving Result Diversity using Probabilistic Latent Semantic Analysis. DIR '11, pp. 24-27, 2011.

Are our Preferences in a Sub-Space?

Sunday, January 30, 2011

The web is really big, but the topics a person searches for occupy a much smaller portion of it. A system that knows our individual preferences can optimize its document search by starting from documents that we are likely to prefer. Additionally, by learning our preferences, the system can improve its presentation of results by biasing document rankings towards documents that are similar in theme to documents we preferred in the past, or by disambiguating our queries based on the interpretations we used in the past.

As an example of handling multiple meanings (polysemy), if we've searched for leopard and tiger we're showing an interest in something like the large cats topic, and when we search for jaguar we probably want the cat. However, if we've recently searched for lexus and bmw our interest is likely in the luxury cars topic and when searching for jaguar we probably want the car. Searching for jaguar using Google, on January 30th 2011, puts the car at the top – probably accurate in general but certainly not always.

This method can also be used to help with words whose meaning changes over time (dynamic referents). For example, if we search for Rahm Emanuel, Obamacare, and then US president, we're likely looking for Barack Obama. But if we had searched for Saddam Hussein, Dick Cheney, and then US president, the chances we are looking for George Bush are much higher than in the previous sequence of searches, and it would be worthwhile to highly rank some George Bush related results.
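
As a toy illustration of this kind of disambiguation, the sketch below hard-codes topic mixtures for a few past queries and for two senses of "jaguar", then picks the sense closest to the user's recent interests. In a real system those mixtures would come from a topic model over clicked documents; every number and name here is made up.

```python
# A toy sketch of history-based disambiguation. The topic vectors for
# queries and senses are hard-coded placeholders; a real system would
# infer them from a topic model over clicked documents.
import numpy as np

QUERY_TOPICS = {            # assumed mixtures over [big cats, luxury cars]
    "leopard": np.array([0.9, 0.1]),
    "tiger":   np.array([0.9, 0.1]),
    "lexus":   np.array([0.1, 0.9]),
    "bmw":     np.array([0.1, 0.9]),
}

SENSES = {                  # candidate interpretations of "jaguar"
    "jaguar (animal)": np.array([1.0, 0.0]),
    "jaguar (car)":    np.array([0.0, 1.0]),
}

def preferred_sense(history):
    """Pick the sense of an ambiguous query closest to recent interests."""
    interest = np.mean([QUERY_TOPICS[q] for q in history], axis=0)
    return max(SENSES, key=lambda s: SENSES[s] @ interest)

print(preferred_sense(["leopard", "tiger"]))   # -> jaguar (animal)
print(preferred_sense(["lexus", "bmw"]))       # -> jaguar (car)
```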

Performing these disambiguation tasks and other personalization techniques relies on coming to an understanding of the searcher's topics of interest. To quantify the preferences of web users we'll begin by introducing a method of describing preferences that is understandable and useful. We can consider user preferences as being broken down into a finite set of "topics", each of which the user has some preference for. For easy visualization, we'll take three topics: Business, Sports, and Health (inspired by Google News). Let's consider a user who is moderately interested in Business, uninterested in Sports, and quite interested in Health. We can represent each of these interests on a separate line graph, one per topic, with something like the following:

The large black dot indicates interest in a topic and the farther the dot is towards the right end of the arrow the more interested the user is in that topic. We can then consider each topic as a dimension in 3-dimensional space and plot the user's topic preferences in this space with each axis as a topic:

We now have a single black dot to indicate the user's topic preferences. Tracing horizontally left from the dot, to the green line, we can determine the user's preference in the Health-dimension; tracing vertically down, to the blue line, we can determine the user's preference in the Business-dimension. The Sports-dimension is represented by the axis going into the page; the arrow helps us see the dot's distance in this dimension by tracing down and then diagonally along the orange line.

Now, back to web search. A convenient feature of breaking down user preferences by topics is that we can also break down a document's content by topics: perhaps the dot above also represents the topics making up a document about health in a business setting, which mentions that sports can contribute to health. Given the division of a document into topics, we can visualize our preferences by placing the documents we like in a topic-space just as above. (With an appropriate "distaste" weighting we can also incorporate disliked documents into the same method.)
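
A minimal sketch of that placement, assuming each document already comes with a topic mixture over Business, Sports, and Health (the numbers and the subtractive distaste weight below are made up):

```python
# Sketch: summarize a user's preferences as a single point in topic space.
# Topic mixtures and the distaste weight below are made-up illustrations.
import numpy as np

def preference_point(liked, disliked=(), distaste=0.5):
    """Centroid of liked documents, pushed away from disliked ones."""
    point = np.asarray(liked, dtype=float).mean(axis=0)
    if len(disliked):
        point = point - distaste * np.asarray(disliked, dtype=float).mean(axis=0)
    return np.clip(point, 0.0, 1.0)   # keep the point inside the topic cube

# Topic order: [Business, Sports, Health]
liked = [[0.5, 0.1, 0.8], [0.4, 0.0, 0.9]]
disliked = [[0.1, 0.9, 0.1]]
print(preference_point(liked, disliked))
```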

Before we become too excited about this new way to map our preferences, a fair question to ask is: do we even have significant and meaningful preferences in topic space? That is, if we place the documents we like into topic space, will they be clustered around some specific preferences, or will they be more or less evenly distributed and therefore not give us much useful information? We can empirically answer this question by examining user clicks in a search engine log.

To test the topical distribution of preferences we took the documents clicked on by an anonymous user, broke these documents into 3 topics (for easy visualization), and plotted the documents in 3 dimensional space:

In this plot each dot represents a document, and the color of the dot helps us distinguish the location on the z-axis (into the page) – redder dots are closer and darker dots are farther. We see that this user's clicked documents are located in a well-defined sub-space of the entire topic space, and within this sub-space they form some additional clumps, i.e. they are not evenly distributed. The dotted lines shown form a plane (a 2-dimensional surface or manifold) that is the best-fit plane (found using regression) to the 3-dimensional points.
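
The plane itself can be fit with ordinary least squares. A minimal sketch, assuming each clicked document is already represented as a point in the 3-topic space (the points below are invented):

```python
# Sketch: fit the best-fit plane z = a*x + b*y + c to documents plotted
# in a 3-topic space, via ordinary least squares. The points are made up.
import numpy as np

points = np.array([            # columns: topic1, topic2, topic3
    [0.8, 0.1, 0.7],
    [0.7, 0.2, 0.6],
    [0.9, 0.1, 0.8],
    [0.6, 0.3, 0.5],
])
x, y, z = points[:, 0], points[:, 1], points[:, 2]

# Solve [x y 1] * [a b c]^T = z in the least-squares sense.
A = np.column_stack([x, y, np.ones_like(x)])
(a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)
print(f"plane: z = {a:.2f}*x + {b:.2f}*y + {c:.2f}")
```
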
This can be very useful information if we are helping the user look for documents. Going back to jaguars for an example, suppose that when we place documents about the cars Lexus, BMW, etc. in the topic space they end up in the far-top-right, and when we place documents about leopards, tigers, etc. they end up in the close-bottom-left. Given our model, this shows us that the user's preferences are near the big cats, and we can surmise the user is more interested in big cats than luxury cars. Then if this user searches for jaguar we will highly rank documents about jaguar the animal (apparently Panthera onca).

The benefits of this sort of document reordering were tested in sketchily formulated experiments and produced encouraging, but inconclusive, results. We stress that the benefit of this approach is not that we are simply providing documents similar to documents the user liked in the past. The benefit is that we are providing documents similar to the topics associated with previously liked documents. This is the nuanced, but essential, benefit of searching with an integrated topic model.

How Helioid Benefits Users

Tuesday, November 03, 2009

The simple answer to how Helioid benefits users is that Helioid represents information and information navigation in a more efficient manner. This gets more complex when looking at how each individual uses the internet and searches for information, but the core remains the same. A current issue with web search, as Google's Marissa Mayer "explains":http://www.techcrunch.com/2008/09/10/marissa-mayer-clarifies-search-is-only-10-done-not-90/, is that it is still undeveloped: "Think of it like biology and physics in the 1500s or 1600s: it's a new science where we make big and exciting breakthroughs all the time."

h4. Some specific problems in this new science:

  • No representation of relationships between search results on a content or conceptual level
    • Information connections are "in the dark"
  • No (or extremely little) representation of relationships between search results on a physical hyperlink level
  • Searchers must play a guessing game of coming up with the correct keywords in order to get more relevant results
  • No ability to refine a search or to narrow in on the correct result or concept
    • Searchers must start over with every new search they execute

h4. Some of the solutions Helioid provides:

  • Clear representations of connections between search results
    • On a concept level
    • On a link level
  • Clear representations of the categories that a search result, or a group of search results, falls into
    • Representation of the subject matter shared between different results for the same query
  • Can refine a search and therefore not have to start over from scratch
    • Can eliminate results or terms from the search results to refine the search and zoom in on wanted information
    • Can save results and the search query and then continue searching later

If Microsoft's Thumbtack was Intelligent

Saturday, January 17, 2009

In early December Microsoft Live Labs released "Thumbtack":http://livelabs.com/blog/introducing-thumbtack/, which is said to "[use] machine learning and natural language techniques to understand the information you give it." Looking through the interface one notices some interesting tools, such as a gadget that creates plots based on attributes of the items you collect and a "Layout Gadget" that I assume creates layouts but currently appears to only work with "IE7":http://thumbtack.livelabs.com/FAQ.aspx. Intelligent parsing of information, on-demand analysis, visualization: there are great ideas here. The remaining obstacle is how to give users access to them in an intuitive and simple fashion.

The improvements that could make Thumbtack a more useful tool include:

1. Automatically parse attributes in content added to Thumbtack.

bq. In a "FAQ video":http://video.msn.com/?mkt=en-US&playlist=videoByUuids:uuids:fa6082f9-a8e0-4067-9c32-53ef1ae4ab42&showPlaylist=true&from=msnvideo it says that Thumbtack can automatically create properties from the attributes. How to make this happen is unintuitive and the video later goes on to say that the attributes for the automobile feature plots it creates have been added manually as name (key) value pairs. Having a properties gadget that actually presents an interface asking users to add properties as "name" and "value" pairs is fine in a debugger (or, in this context, as an advanced option) but this invites confusion and estrangement in a user interface.

bq. When a user adds content, Thumbtack should automatically parse that content into keys and values, as sets, or into another most useful representation (maybe a graph). If the parse gets things wrong the user should be able to change the values or labels, and the parsing engine will learn and hopefully know better next time. If the parsing engine's at a complete loss it can prompt the user for interactive guidance.

bq. With this capability built in, the information would be much more useful to the user. Because newly added information would be quickly integrated into existing information through the parsed metadata, the user would explicitly see the value of adding and correctly categorizing new information. Additionally, the boundaries and definitions of user-created categories would likely become more clearly defined.
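
To illustrate the kind of automatic parsing suggested above (this is not how Thumbtack actually works), a crude heuristic could pull "name: value" pairs out of pasted content and leave the user to correct whatever it gets wrong:

```python
# Sketch of the kind of automatic attribute parsing suggested above:
# pull "name: value" style pairs out of pasted content with a heuristic
# regex. This is not Thumbtack's behavior; it only illustrates the idea.
import re

PAIR = re.compile(r"^\s*([A-Za-z][\w ]{0,30}?)\s*[:=]\s*(.+?)\s*$")

def parse_attributes(text):
    """Return {name: value} pairs found on lines like 'Horsepower: 300'."""
    pairs = {}
    for line in text.splitlines():
        match = PAIR.match(line)
        if match:
            pairs[match.group(1).strip()] = match.group(2)
    return pairs

sample = """2009 Sedan
Horsepower: 300
Price = $45,000
MPG: 22 city / 30 highway"""
print(parse_attributes(sample))
```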

2. Integrate Thumbtack into web search.

bq. The Thumbtack interface appears to be completely divorced from web search. The interface has a text field labeled "search" in the upper right that does not search the web but instead searches over items already present in Thumbtack. One of the primary times that users need to categorize and keep track of web information is when they are conducting searches, whether goal-directed or exploratory.

bq. Thumbtack could be integrated with Microsoft's Live Search "API":http://msdn.microsoft.com/en-us/library/bb251794.aspx so that search results are displayed in the Thumbtack canvas and incrementally parsed for properties as the user looks through the results. Search results brought into the interface could be compared, arranged, and saved, all while learning algorithms run in the background to determine what new information and what types of organization would be most helpful to the user.

3. Give the Thumbtack "bookmarklet" more features.

bq. Another "video":http://video.msn.com/?mkt=en-US&playlist=videoByUuids:uuids:6a905d98-0332-4c3f-8b25-75737cd9b675&showPlaylist=true&from=msnvideo demonstrates the "bookmarklet" that pops up over web pages so that while browsing you can add information into thumbtack. Unfortunately, this tool can only be used to add copied content, a title, and tags to a specific collection (items in Thumbtack are grouped into collections). Some things a user may want to do is add their content to more than one collection, assign some special properties to their content, maybe even share their content on the spot, or find things similar to this piece of content.

bq. In addition to giving the user more options for how they treat the content they are adding, the tool would be very useful as an "inspector" palette for the current page. It could display information about the current page and show how it relates to items already in your Thumbtack collections. Microsoft Live Search data about the current content could be pulled up, and the user could use it to explore similar content or narrow in on what they are viewing.

bq. Users should also be able to use Thumbtack's gadgets from within this popup. Maybe they want to see a quick plot of this content integrated into their existing collections, or maybe they want to see how adding this content would change the layout of already existing content. Forcing the user to switch between two environments is unnecessarily burdensome.

Microsoft Thumbtack is an impressive tool with a lot of potential. If it were given a more intuitive interface and a clear direction it could be very useful. Sadly, it seems that since its release it has been slowly drifting into obscurity.

Google's SearchWiki as a step towards more user control

Wednesday, December 03, 2008

Google recently released their new SearchWiki feature, which allows users who are logged into a Google account to rearrange search results (by clicking on arrows that move them up or down one slot), remove results from the returned list, and comment on results (all comments are made public). More information is in this Google "blog article":http://googleblog.blogspot.com/2008/11/searchwiki-make-search-your-own.html.

It's encouraging to see Google taking user responses into account. It has always been our opinion that this is something sadly missing from the mainstream search world. Google also states that the results' movements, removals, and comments will not be used as input to their search algorithms. Well, at least not yet.

Google knows that customization specific to the searcher's individual information is important to delivering relevant results. In this "blog article":http://googleblog.blogspot.com/2008/07/more-transparency-in-customized-search.html about customized search transparency they detail the introduction of messages that inform users when a search has been customized based on location, recent searches, or web history (for those with Google accounts). Using the links in these messages you can also re-run the search with the specified customization removed.

Whether meant this way or not, the option to remove customization is an important step towards users controlling the way Google's algorithms function. For a relatively unlikely example, suppose you're in New York using "Tor":http://tor.eff.org/ to proxy your connection and the exit node is in San Francisco, so Google thinks you're in San Francisco. Your search for "new york bagel" might bring up the "New York Bagel" café in Mill Valley, which is entirely irrelevant. You can then tell Google to strip out their local search and re-searching will bring you the relevant results you want.

The sort of customization offered by SearchWiki is in a different vein. You're not meant to interact with or influence the search algorithm. If you perform a search and then start removing results you can continue until there are none left. This also means that if you perform a search for "new brunswick":http://www.google.com/search?q=new+brunswick, remove all the results mentioning Canada, and then go on to the second page of results, you'll still get a whole bunch of results mentioning Canada. Given your actions, it is unlikely you're looking for information about Canada's New Brunswick province, but after having removed all those irrelevant results Google will still present more of them.

One way Google could address this problem is to bias the returned results once you begin removing and reordering them. If they end up getting things wrong, the user should be able to tell them so explicitly through customization options. The user should also have an option to turn this functionality off – or to compare results with it on and with it off – if they wish. Helioid's algorithms can tell implicitly when they are getting things wrong by looking at the results the user removes and the way the user navigates through the returned results.
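
As a sketch of this idea (not Helioid's actual algorithm), results that look similar to previously removed results could be pushed down on later pages; the scoring function and penalty weight below are illustrative:

```python
# Sketch of biasing later result pages using earlier removals: results
# whose snippets look like removed results get pushed down. The scoring
# and 0.5 penalty weight are illustrative, not Helioid's actual algorithm.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rerank(results, removed, penalty=0.5):
    """Order result snippets by dissimilarity to previously removed ones."""
    if not removed:
        return results
    tfidf = TfidfVectorizer(stop_words="english").fit(results + removed)
    sims = cosine_similarity(tfidf.transform(results),
                             tfidf.transform(removed)).max(axis=1)
    scores = [1.0 - penalty * s for s in sims]
    return [r for _, r in sorted(zip(scores, results), reverse=True)]

page2 = ["New Brunswick, Canada tourism", "Rutgers in New Brunswick, NJ"]
removed = ["New Brunswick is a province of Canada"]
print(rerank(page2, removed))
```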

It's great to see Google innovating. There's a lot more that can be done.