On Kosmix and Needles in Haystacks

Friday, January 16, 2009

A little over a week ago, TechCrunch featured an “article”:http://www.techcrunch.com/2008/12/08/kosmix-raises-20-million-more-for-its-universal-search-engine/ on the latest round of funding raised by Kosmix, a rising star in web search. In that round, Kosmix managed to rake in an impressive $20 million from a wide range of investors, led by Time Warner, bringing the search engine’s total funding to $55 million. Upon paying a visit to their site, it becomes immediately apparent what all the hoopla’s about. Kosmix pulls the top search results from a variety of popular sources in a variety of different categories, including video sites like YouTube and Truveo, info sites like Wikipedia and HowStuffWorks, and shopping sites like Amazon and eBay, in order to create a mash-up of all the possible kinds of information you might be interested in. A collection of related subjects is also presented in the left margin of the results page, in order to facilitate some degree of search refinement. Without a doubt, Kosmix provides a search experience quite distinct from the major search engines, and I feel fairly safe in saying that most searches performed with Ask would yield more fulfilling results with Kosmix. And yet, after playing with Kosmix for a while, I felt as though something was amiss.

In the various articles I’ve read about Kosmix, and on founder Anand Rajaraman’s own blog, the approach employed is described as “horizontal search” or alternatively as “creating a homepage for any topic.” The idea is that if the user doesn’t know exactly what he is looking for, he can enter a query and get a cross section of everything the web has to offer on the subject, in any conceivably relevant context. For example, searching for “fusion” gets you a snippet from a Wikipedia article on Nuclear fusion, a few price quotes for recent Ford Fusion models, a couple of HowStuffWorks articles on modern nuclear fusion reactors, a couple of audio clips from Sat Mahaori’s album, the Khmer Fusion Project, a pair of articles from Helium.com on fusion energy and nuclear fusion bombs, a video of a live jazz fusion performance, a news article on recent scientific advancements in nuclear fusion, and the list goes on and on. Certainly, if I had literally no idea what I wanted other than that it involved the word “fusion,” I would be in hog heaven. Of course, searching for “fusion” on Google gets a similarly wide range of different types of results, as the word has a wide range of uses. However, Google displays its results simply in order of popularity and, for queries as loaded as “fusion,” offers a set of related but more specific queries. So if the user is looking for quick satisfaction in the form of a specific piece of information, Google provides it in the vast majority of cases.

Rajaraman justifies the Kosmix approach of collecting a myriad of different sets of search results on one page in a blog article entitled “Searching For a Needle or Exploring the Haystack?” Obviously, the approach taken by Google and the other major search engines is identified as needle-searching. Rajaraman explains that Kosmix is best suited for exploring the web without knowing exactly what you’re looking for. This definitely resonates with some of the applications we’ve envisioned for Helioid, but something still seems to be missing, and part of that missing something is identified in the comments on Rajaraman’s blog post. One commenter notes that the noise-to-signal ratio in Kosmix is unacceptably high, to which Rajaraman responds by saying that this is bound to be the case in web “exploration.” Another commenter says that he understands the appeal of a broad range of information, but feels as though the floodgates are simply being opened in Kosmix, and he would prefer a more concise representation of the spectrum of available information, with the ability to “drill down” into specific topics or categories. Yet another commenter says that he too understands the appeal of haystack exploration, but that at different times one wants either to search for specific information or to explore the web aimlessly. And there are many more along these lines.

It seems as though a fair number of the people giving Kosmix a try are seeing the same problems that I am, and it also seems like these same people would benefit from the solutions we intend to offer when Helioid branches into general web search. The first blog article I wrote for Helioid was about the decline of Google’s efficacy as the number of pages being searched over increases exponentially. It included the argument that, as the number of pages increases exponentially without a sufficient decrease in the noise-to-signal ratio of Google’s search algorithms, you’ll see more noise in the first few pages of results, eventually displacing some valuable search results and making it less likely, on average, that you would find what you’re looking for on the first page, or even in the first few pages. I further argued that one solution to this problem also lent itself quite handily to the issue of getting as much breadth and depth as one wants in a series of web searches, and that was simply organizing the search engine in such a way as to maximize the dexterity of the user in navigating the search results (the article was written before Google started widely using related search terms). The user should be able to eliminate all the noise of a particular type in the results, or focus the results around a particular category, or the intersection between categories. Although Kosmix provides related search terms, clicking on these suggested queries does little in the way of “drilling down” into more specific collections of information. The results are always presented in the same “homepage for any topic” form, and thus will always favor breadth over depth, even to the point of making it nearly impossible to get any real depth of information about a given topic. And the depth achieved by simply selecting a category to drill down into still, in general, falls far short of what ought to be afforded to the user.
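
To make the idea of eliminating one type of noise, or focusing on an intersection of categories, a little more concrete, here is a rough Python sketch. It is only an illustration and not how Helioid or Kosmix actually works: the `Result` type, the `exclude` and `focus` helpers, the sample titles, and the category labels are all made up for this example, and it assumes each result already comes tagged with a set of categories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """A hypothetical search result tagged with the categories it belongs to."""
    title: str
    categories: frozenset


def exclude(results, category):
    """Drop every result tagged with the unwanted category (the 'noise')."""
    return [r for r in results if category not in r.categories]


def focus(results, *categories):
    """Keep only results that belong to ALL of the given categories."""
    wanted = set(categories)
    return [r for r in results if wanted <= r.categories]


# Made-up results for the query "fusion", loosely modeled on the example above.
RESULTS = [
    Result("Nuclear fusion (encyclopedia entry)", frozenset({"physics", "reference"})),
    Result("Ford Fusion price quotes", frozenset({"cars", "shopping"})),
    Result("Live jazz fusion performance", frozenset({"music", "video"})),
    Result("Advances in fusion reactors", frozenset({"physics", "news"})),
]

if __name__ == "__main__":
    # Eliminate one type of noise: no shopping results, please.
    for r in exclude(RESULTS, "shopping"):
        print("without shopping:", r.title)

    # Focus on the intersection of two categories.
    for r in focus(RESULTS, "physics", "news"):
        print("physics AND news:", r.title)
```

The point of the sketch is simply that each refinement is a cheap operation on results the user already has, so broadening or narrowing never requires starting over with a new query.
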
By allowing users to broaden or narrow the categories over which they are searching, by providing the means to collect information from a range of sources into a coherent answer to a given question, and by supporting intuitive navigation from a given web page to related pages, allowing for the exploration of a “topic space” which can be as focused or as unfocused as the user chooses, Helioid will maximize the dexterity with which anyone can browse the web, and so provide a one-stop shop for searching for needles AND exploring haystacks.