Kosmix and the Semantic Web

Saturday, February 07, 2009

I recently ran across an “interview”:http://www.beet.tv/2008/06/kosmix-topical.html with Anand Rajaraman, founder of Kosmix, and something said toward the end of it piqued my interest. The subject of the Semantic Web came up, and Anand claimed that it is far more likely to be brought about by apps “mining intelligence” out of the internet’s squall of information than by the universal adoption of a common semantic framework like RDF. We certainly agree with that, as we believe that the winners of the race to establish the next generation of web search will be the ones who mine intelligence the most efficiently. However, something stuck in my craw about Kosmix being held up as an example of the various expeditions presently being made in this general direction. That isn’t to say that I think Kosmix is not on such an expedition, but rather that I felt the same vague perturbation I felt when I first made an expedition of my own through the flurry of noise on Kosmix, after hearing about the explorative experience their search engine is meant to support.

Even though Kosmix can most certainly be counted among the various search engines taking steps, in one way or another, toward bringing to life what folks call the Semantic Web, the steps it is taking range from insufficiently ambitious for my taste to simply misguided (if one’s goal is to intelligence-mine one’s way to the Semantic Web). Let’s start with a property of Kosmix which I’ve already voiced my complaints over: the noise. In my last blog post, I responded to a blog article written by Mr. Rajaraman, in which he likened Kosmix searches to “exploring haystacks” rather than looking for needles, as in most keyword searches. Several commenters on that article, whom I quoted in my post, noted in various terms that there was simply too much noise in the search results, and that it was too hard to bring the search into focus when one wanted to. These complaints remind me of an “article”:http://www.techcrunch.com/2008/04/17/web-30-will-be-about-reducing-the-noise%E2%80%94and-twhirl-isnt-helping/ by Erick Schonfeld that I caught on TechCrunch a long while back, entitled “Web 3.0 Will be about Reducing Noise—And Twhirl Isn’t Helping.” As one would expect, among Erick’s various complaints about Twhirl is the claim that the next generation of web search and exploration will be very much concerned with noise reduction, and that this is not incompatible with many people’s view that it will be about the establishment of a Semantic Web. But it seems pretty clear to me that these two visions of the future are not only compatible, they are necessarily intertwined.

Tim Berners-Lee has described the transition to the Semantic Web as being centered around a shift from a view of the internet as a collection of documents to a view of it as a collection of data or knowledge. Building the Semantic Web means turning the internet from a collection of documents, over which keyword searches retrieve pieces of information, into a coherent, navigable corpus of knowledge, from which one can retrieve pieces or bodies of information of arbitrary breadth and depth. Making this transition means seeking to establish a representation of every piece of information in every document online, as well as the many ways in which each piece of information relates to other pieces of information. Establishing the coherence needed to support the sort of web search we dream about, in which one can retrieve anything from a simple answer to a simple question to a crash course in an academic discipline, means applying an overwhelming amount of structure to the web, and presenting users with a representation of the web (or small parts of it at least) that reveals that underlying structure and lets them freely navigate it. Kosmix may seem to apply such structure: for any query submitted, it seeks out all the possible types of results the user may be looking for and assembles them into a “profile page” for that topic, which amounts to a little bit of everything you could possibly have been asking for. But if one spends a fair amount of time really trying to explore a subject, whatever structure is represented only becomes increasingly obscured. Attempting to refine the search by selecting a related item “In the Kosmos” from the right margin only leads to Kosmix casting yet another wide net, and another mess of results which could be what you were looking for.
As I’ve said before, I respect their attempt to support the sort of web exploration we want to see made possible, but the lack of dexterity afforded to the user in navigating the search results makes it all but impossible to truly explore. Furthermore, simply retrieving every peripherally related search result for a given query, and trying to fit as many of them on one page as possible, does nothing to reveal any underlying structure. Any intelligence they may be mining out of the internet on the back end is lost in the noise on the front end.

Again, we count Kosmix among the groups participating in the gradual progression toward the Semantic Web. We greatly respect what they’re trying to do, and Anand’s metaphorical contrast between exploring haystacks and searching for needles on the web certainly resonates with our goals. But as I’ve written previously, the degree to which the user can effectively engage in such exploration is closely correlated with the dexterity with which the user can sift through search results. When there’s no way for the user to focus a search at will and clear out the noise, that dexterity is greatly limited. Kosmix needs to restructure the way they present their search results, as well as how they let users navigate those results, so that they’re not just returning haystacks and the user can effectively explore “The Kosmos.” In the interview I mentioned above, Anand notes that the Semantic Web will most likely be brought about by the efforts of a number of different companies. Without a doubt, the push toward the Semantic Web will draw upon the collective efforts of a diverse range of organizations taking a number of different approaches to mining intelligence out of the web, and many of these organizations would greatly benefit from collaborating with, or learning from, others taking different approaches. Just as we’re always looking for ways to better allow our users to forage the web, and for new audiences or organizations who would find our search and research services particularly useful, we hope the Kosmix team will consider venturing a bit further from the traditional way users interact with a search engine, in order to maximize the usefulness of their own.

On Kosmix and Needles in Haystacks

Friday, January 16, 2009

A little over a week ago, TechCrunch featured an “article”:http://www.techcrunch.com/2008/12/08/kosmix-raises-20-million-more-for-its-universal-search-engine/ on the latest round of funding raised by Kosmix, a rising star in web search. In this latest round, Kosmix managed to rake in an impressive $20 million from a wide range of investors, led by Time Warner, bringing the search engine’s total funding to $55 million. Upon paying a visit to their site, it becomes immediately apparent what all the hoopla’s about. Kosmix pulls the top search results from a variety of popular sources in a variety of different categories, including video sites like YouTube and Truveo, info sites like Wikipedia and HowStuffWorks, and shopping sites like Amazon and eBay, in order to create a mash-up of all the possible kinds of information you might be interested in. A collection of related subjects is also presented in the left margin of the results page, in order to facilitate some degree of search refinement. Without a doubt, Kosmix provides a search experience quite distinct from that of the major search engines, and I feel fairly safe in saying that most searches performed with Ask would yield more fulfilling results with Kosmix. And yet, after playing with Kosmix for a while, I felt as though something was amiss.

In the various articles I’ve read about Kosmix, and on founder Anand Rajaraman’s own blog, the approach employed is described as “horizontal search” or alternatively as “creating a homepage for any topic.” The idea is that if the user doesn’t know exactly what he is looking for, he can enter a query and get a cross section of everything the web has to offer on the subject, in any conceivably relevant context. For example, searching for “fusion” gets you a snippet from a Wikipedia article on nuclear fusion, a few price quotes for recent Ford Fusion models, a couple of HowStuffWorks articles on modern nuclear fusion reactors, a couple of audio clips from Sat Mahaori’s album, the Khmer Fusion Project, a pair of articles from Helium.com on fusion energy and nuclear fusion bombs, a video of a live jazz fusion performance, a news article on recent scientific advancements in nuclear fusion, and the list goes on and on. Certainly, if I had literally no idea what I wanted other than that it involved the word “fusion,” I would be in hog heaven. Of course, searching for “fusion” on Google gets a similarly wide range of different types of results, as the word has a wide range of uses. However, Google displays its results simply in order of popularity and, for queries as loaded as “fusion,” offers a set of related but more specific queries. So if the user is looking for quick satisfaction in the form of a specific piece of information, in the vast majority of cases Google provides it.

Rajaraman justifies the Kosmix approach of collecting a myriad of different sets of search results on one page in a blog article entitled “Searching For a Needle or Exploring the Haystack?” Unsurprisingly, the approach taken by Google and the other major search engines is identified as needle-searching. Rajaraman explains that Kosmix is best suited for exploring the web without knowing exactly what you’re looking for. This definitely resonates with some of the applications we’ve envisioned for Helioid, but something still seems to be missing, and part of that missing something is identified in the comments on Rajaraman’s blog post. One commenter notes that the noise-to-signal ratio in Kosmix is unacceptably high, to which Rajaraman responds by saying that this is bound to be the case in web “exploration.” Another commenter says that he understands the appeal of a broad range of information, but feels as though the floodgates are simply being opened in Kosmix, and he would prefer a more concise representation of the spectrum of available information, with the ability to “drill down” into specific topics or categories. Yet another commenter says that he too understands the appeal of haystack exploration, but that at some times one wants to search for specific information, and at others to explore the web aimlessly. And there are many more along these lines.

It seems as though a fair number of the people giving Kosmix a try are seeing the same problems that I am, and it also seems like these same people would benefit from the solutions we intend to offer when Helioid branches into general web search. The first blog article I wrote for Helioid was about the decline of Google’s efficacy as the number of pages being searched over increases exponentially. It included the argument that, if the number of pages increases exponentially and the noise-to-signal ratio in Google’s search algorithms is not sufficiently decreased, you’ll see more noise in the first few pages of results, eventually displacing some valuable results and making it less likely, on average, that one would find what one is looking for on the first page, or even in the first few pages. I further argued that one solution to this problem also addresses the issue of getting as much breadth and depth as one wants in a series of web searches: simply organizing the search engine in such a way as to maximize the dexterity of the user in navigating the search results (the article was written before Google started widely using related search terms). The user should be able to eliminate all the noise of a particular type in the results, or focus the results around a particular category, or around the intersection between categories. Although Kosmix provides related search terms, clicking on these suggested queries does little in the way of “drilling down” into more specific collections of information. The results are always presented in the same “homepage for any topic” form, and thus will always favor breadth over depth, even to the point of making it nearly impossible to get any manner of depth of information about a given topic. And the depth achieved by simply selecting a category to drill down into still, in general, falls far short of what ought to be afforded to the user.
By allowing users to broaden or narrow the categories over which they are searching, by providing the means to collect information from a range of sources into a coherent answer to a given question, and by supporting intuitive navigation from a given web page to related pages, allowing for the exploration of a “topic space” which can be as focused or as unfocused as the user chooses, Helioid will maximize the dexterity with which anyone can browse the web, and so provide a one-stop shop for searching for needles AND exploring haystacks.
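
To make the idea of “dexterity” a little more concrete, here is a minimal Python sketch of what focusing a result set by category might look like, using the “fusion” query from above. Everything in it is an assumption made up for illustration: the @Result@ type, the category labels, and the @focus@ function are hypothetical, and none of it describes how Kosmix, Google, or Helioid actually works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """A hypothetical search result tagged with one or more categories."""
    title: str
    categories: frozenset


def focus(results, include=(), exclude=()):
    """Keep results tagged with every category in `include` and none in `exclude`.

    Passing several categories to `include` focuses on their intersection;
    passing a category to `exclude` clears out all results of that type.
    """
    include, exclude = set(include), set(exclude)
    return [
        r for r in results
        if include <= r.categories and not (exclude & r.categories)
    ]


# Illustrative, made-up results for the query "fusion".
RESULTS = [
    Result("Nuclear fusion (encyclopedia entry)", frozenset({"science", "reference"})),
    Result("Ford Fusion price quotes", frozenset({"shopping", "cars"})),
    Result("Jazz fusion live performance", frozenset({"music", "video"})),
    Result("News: fusion reactor milestone", frozenset({"science", "news"})),
]

if __name__ == "__main__":
    # Clear out one type of noise: drop everything tagged as shopping.
    for r in focus(RESULTS, exclude={"shopping"}):
        print("no shopping:", r.title)

    # Focus on the intersection of two categories.
    for r in focus(RESULTS, include={"science", "news"}):
        print("science AND news:", r.title)
```

The point of the sketch is only that narrowing and broadening are cheap operations once results carry some structure. The hard part, which the sketch takes for granted, is assigning those categories in the first place.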