Yes, Keyword Search is About to Hit its Breaking Point

Tuesday, May 27, 2008

A debate rippled across a few tech blog sites following Erick Schonfeld’s reiteration, a few weeks ago, of some claims made by Nova Spivack concerning the fate of traditional keyword search. As Schonfeld explains, Spivack is of the opinion that as the number of web pages a search engine has to sift through explodes exponentially, the efficacy of a simple keyword search will drop off. Spivack himself explains the problem as follows:

“Keyword search engines return haystacks, but what we really are looking for are the needles. The problem with keyword search such as Google’s approach is that only highly cited pages make it into the top results. You get a huge pile of results, but the page you want, the ‘needle’ you are looking for, may not be highly cited by other pages and so it does not appear on the first page. This is because keyword search engines don’t understand your question, they just find pages that match the words in your question.”

Obviously, the idea is to herald the rise of the Semantic Web and semantic search technologies, which Spivack believes will inevitably supplant the traditional Google search. But such boasts can’t long go unchallenged, and naysayers had started saying their nays within a few hours of Schonfeld’s article appearing on TechCrunch. Chris Edwards and Ian Lurie promptly trumpeted their enduring allegiance to good old-fashioned keyword search. Lurie’s case for the object of his loyalty is a pragmatic one: keyword search seems to work pretty well for 99% of the population, and the Semantic Web will never happen because it requires universal adoption of a new standard. Edwards takes his argument a little further, pointing out that keyword search isn’t quite as simple as searching for documents in which a word or words appear. Rather, keyword search engines use statistical machinery that has been evolving since the 1970s to retrieve the documents that are probably the most closely aligned with what the user means by a particular query, and this statistical machinery will keep evolving to meet future challenges. And also, the Semantic Web will never happen because it requires universal adoption of a new standard.

I can’t argue with the claims that keyword search has done a bang-up job so far or that the Semantic Web is far from an inevitability, but these arguments are missing the point. The overwhelming majority of Googlers venture out only as far as the third page of their search results, at most. In those first couple of pages, there will be a few results that are close to what the user was looking for, a few that are way off, and hopefully a few that are close enough to count as the ‘right’ results. Since keyword search engines use statistical methods, you’d expect there to be some degree of noise in the results, in some form or another. And there is: all those results that have nothing to do with what the user is looking for. Add to that the results that may be pretty close in an objective sense but are still of no use to the user, and a few redundant results that just waste space on the results page without saying much more than the result before, and you start to see how cluttered with noise those first few pages of the search can be. Now, if the number of web pages being searched increases dramatically, the number of pages returned for the average search will see a dramatic increase as well, and the amplitude of that noise sitting on top of the results will also increase. The undeniable consequence is that, all other things being equal, as the results become significantly greater in number, it becomes less likely that the best result will actually be on the first page of results, or the second, or even the third.
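To make that "all other things being equal" argument a little more concrete, here is a toy simulation in Python. It is only a sketch under assumptions I am making up for illustration, not a model of any real search engine: one result is the true best match, every other result has a relevance drawn uniformly between 0 and 1, the ranker sees every relevance through the same amount of random noise, and "the first three pages" means the top 30 results. The function name and all parameter values are hypothetical.

```python
import random


def chance_best_in_top_k(n_results, k=30, noise=0.2, trials=200, seed=0):
    """Estimate how often the single best result is ranked in the top k.

    Assumptions (illustrative only): the best result has true relevance 1.0,
    every competitor has relevance uniform in [0, 1), and the ranker adds
    Gaussian noise with standard deviation `noise` to every score.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    hits = 0
    for _ in range(trials):
        best_score = 1.0 + rng.gauss(0, noise)
        outranked_by = 0
        for _ in range(n_results - 1):
            competitor_score = rng.random() + rng.gauss(0, noise)
            if competitor_score > best_score:
                outranked_by += 1
                if outranked_by >= k:
                    break  # already pushed past the first k slots
        if outranked_by < k:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Same ranker, same noise level; only the size of the result set changes.
    for n in (100, 1000, 10000):
        share = chance_best_in_top_k(n)
        print(f"{n:>6} results: best match in top 30 in {share:.0%} of trials")
```

With the noise level held fixed, the estimated share falls as the result set grows, which is the claim above in miniature. How fast it falls depends entirely on the made-up noise level and relevance distribution, so the specific numbers mean nothing beyond the direction of the trend.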

As Edwards points out, the statistical methods used by the big search engines are being constantly optimized. But are they being optimized fast enough to keep up with an exponentially exploding problem? A study conducted by iProspect and Jupiter Research concluded in 2006 that the ability of search engines to ensure that the object of the user’s search was in the first page or two of the results was still increasing. However, it’s clear from the data that the increase was swiftly decelerating, apparently approaching the exact sort of plateau Spivack prophesies, with only a 2% increase in average search engine efficacy from 2004 to 2006. What keyword loyalists are missing is that, regardless of how effective keyword search has been in the past or how unlikely Semantic Web technologies are to equal such success, there are dark clouds looming large on the horizon for the search engine status quo.

But if a problem is indeed developing, one that requires something of a tectonic shift in the way search is done, and the Semantic Web isn’t the answer, then what is? I stumbled across yet another response to Schonfeld’s article, which I believe contains about half the answer to this question. As the author notes, when we’re faced with too much information for focused searches to be productive, we rely on an unfocused method of discovery to absorb interesting information. It’s hard to know what queries to give in searching for new music, because there are just so many different things out there, so we rely on recommendations from friends, magazines, or websites, or chance encounters with singles. There’s no doubt that people will come to rely more and more on discovery-based web browsing in addition to search, as just in the last couple of years we’ve seen an explosion in the popularity of sites that facilitate such discoveries, such as Digg, Reddit and Twitter. So, at least some part of the answer to the above question seems to be that, in order to satisfy people’s needs for interesting information, new tools will have to be developed for efficiently facilitating the serendipitous encounter of new and intriguing material on the web.

I do believe that the tools that would allow users to more efficiently discover useful information on the web in this manner, and the tools that would better enable users to filter out the noise I described above in focused searches, might just be one and the same. By providing users with an intuitive representation of the composition of the body of search results and the conceptual interconnections between them, along with an unprecedented level of control and dexterity in manipulating search results, one enables the user to throw out the noise and focus on what’s interesting, or follow a hot trail and stumble upon new information well off the beaten path. That sounds like a pretty good solution to me.