The Intentional Web

Thursday, February 26, 2009

Most of the time, when one browses or searches the web there is a goal in mind: find the location of a coffee shop, learn more about cloud computing, see if there are any interesting new movies, be distracted and procrastinate. Each of these instances of web use has objectives and implicitly defines a success predicate. When an agent (a person) interacts with a system (the web, a computer, or simply a body of information), the system's knowledge, or discovery, of an explicit representation of the agent's objectives and their success predicates greatly enhances its capability to assist the agent in accomplishing those objectives.

Suppose I enter the query, "lee's art supplies nyc," into a search engine. A currently standard keyword search (ranked with link analysis such as PageRank or HITS) returns a list of results containing the keywords, sorted by page popularity. A semantic web search might parse out "lee's" as a title, "art supplies" as an object, and "nyc" as a location. Using this meaning-based information, the search could return the intersection of the sets of pages in which each component appears as the correct semantic type, ordered by some weighting of nearest-neighbor relevance and popularity. We would then be returned a list of pages with things related to the title "lee's", objects related to "art supplies", and locations related to "nyc". Neither of the above methods takes into account the user's intent in conducting the search.
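As a toy illustration (not how any production engine works), here is a minimal sketch in Python of the parse-and-intersect idea just described; the gazetteer, the type labels, and the 0.6/0.4 blend of relevance and popularity are all invented for the example.

```python
# A toy sketch of tagging query phrases with semantic types via a gazetteer,
# then intersecting candidate pages and ranking by a relevance/popularity blend.

GAZETTEER = {
    "lee's": "title",
    "art supplies": "object",
    "nyc": "location",
}

def parse_query(query):
    # Greedily match the longest gazetteer phrases, tagging each with its type.
    tokens = query.lower().split()
    parsed, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):           # longest match first
            phrase = " ".join(tokens[i:j])
            if phrase in GAZETTEER:
                parsed.append((phrase, GAZETTEER[phrase]))
                i = j
                break
        else:
            parsed.append((tokens[i], "unknown"))
            i += 1
    return parsed

def semantic_search(parsed, index, relevance, popularity):
    # Intersect the sets of pages where each phrase occurs as its expected type,
    # then rank by a blend of (hypothetical) relevance and popularity scores.
    sets = [index.get(item, set()) for item in parsed]
    pages = set.intersection(*sets) if sets else set()
    return sorted(pages, key=lambda p: 0.6 * relevance[p] + 0.4 * popularity[p], reverse=True)

print(parse_query("lee's art supplies nyc"))
# [("lee's", 'title'), ('art supplies', 'object'), ('nyc', 'location')]
```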

There are a number of plausible intents one could have when entering the above example query. Maybe I want to find the location of..., learn the history of..., list the competitors of..., etc. Whatever the objective is, it is opaque to the query string. The intent is not calculated; it is not known.

Consider the same example, but suppose that one can enter an objective along with the query (or that the engine can determine a probability distribution over objectives from the query and subsequent user actions). Assume my objective is something along the lines of learn the history of.... It needn't be exactly determined by the algorithm, as it likely isn't exactly determined or known by the user. Abstractly, we suppose that for each user query there exists a set of objectives and a distribution of importance over the accomplishment of these objectives. Given this set of objectives, an information-gathering agent is constructed. The agent's objective is to have the user accomplish the user's objectives. In the current example the agent's objectives may become something like user learn..., which is then divided into self learn... and self teach user.... Each of these tasks is then further decomposed into simpler components as necessary, and all tasks are completed bottom up.
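A rough sketch of this bottom-up decomposition might look like the following; the task names and the decomposition table are entirely hypothetical and purely illustrative.

```python
# A toy sketch of the decomposition described above: the user's objective is
# split into sub-tasks, sub-tasks are split further as needed, and leaves are
# completed before their parents, so the whole tree is satisfied bottom up.

DECOMPOSITION = {
    "user learn the history of lee's": [
        "self learn the history of lee's",
        "self teach user the history of lee's",
    ],
    "self learn the history of lee's": [
        "gather documents mentioning lee's",
        "extract dated events from the documents",
    ],
    "self teach user the history of lee's": [
        "order the events into a timeline",
        "present the timeline to the user",
    ],
}

def complete(task, depth=0):
    # Recursively decompose a task, completing its children before itself.
    for subtask in DECOMPOSITION.get(task, []):   # leaves have no entry
        complete(subtask, depth + 1)
    print("  " * depth + "done: " + task)

complete("user learn the history of lee's")
```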

The characteristic of the intentional web described above is its ability to ascribe goals to users and to conduct itself in a manner appropriate to those ascribed goals. Arguably, this is a change in the web's perception of user actions and not essentially a change in the web itself. A change in the web itself comes when we stop viewing the web and its components as just documents and data, as just network components (graphs) and purveyors of meaning (semantic units).

Each sub-graph of the web is seen as an agent capable of transitively affecting every connected component, and itself, in a non-well-founded manner. If there exists a broad objective that would increase the fitness of a large number of individual agents, the community of web agents will transiently organize into larger units, as necessary, to accomplish this objective. The web becomes a community of organisms interacting with each other to accomplish their goals and increase their fitness.

As an example, consider the agent financial news gatherer, whose objective is to maintain connections to all pieces of financial news available throughout the web. The graph (group of sub-agents) that makes up this agent may contain the link gatherer, the financial news identifier (which itself may be a connection between the news identifier and the finance identifier), and so on. If any of the sub-agents of the financial news gatherer increases its fitness, the financial news gatherer increases its fitness as well. What's more, when any agent whatsoever is improved, any other agent that contains the first agent will (generally) be improved as a welcome side effect. Suppose the link gatherer is improved: any application that gathers links through it will improve with no required change to its own structure.

Decomposing specialties and organizing programs as links between specialized components is essential to modern programming and is described by the concept of design patterns. Web programming has been moving in this direction with the popularity of model-view-controller (MVC) frameworks and plug-in architectures. These are positive movements, but they transform and increase efficiency only on a per-site basis, not on a network or community basis.

The trend towards application programming interfaces (APIs) and web services is significantly more relevant to the development of the intentional web. As an example, I've recently created a data-aggregation web service, called Pairwise, and another service that pulls photos from Flickr, runs them through Pairwise, and presents them to the user. We can think of Pairwise as a sort function, sort, and Flickr as a store of data, data. With these components built and maintained, all a user must do to sort photos is the rather intuitive sort(data). This is obviously still a ways away from what I've described above, yet it incorporates the essential component of specialized functionality.
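To make the composition concrete, here is a rough sketch of what sort(data) could look like in code. The Pairwise URL is a made-up placeholder (the real service's interface isn't shown here), and the Flickr fetch is stubbed with canned data rather than a real API call.

```python
# A sketch of the sort(data) composition: one service supplies data, another
# service sorts it, and the user simply composes the two.
import json
import urllib.request

def fetch_flickr_photos(tag):
    # Stand-in for a call to Flickr's API; returns canned photo records.
    return [{"id": "1", "url": "https://example.com/a.jpg"},
            {"id": "2", "url": "https://example.com/b.jpg"}]

def pairwise_sort(items):
    # Stand-in for posting items to a Pairwise-like sorting service over HTTP.
    request = urllib.request.Request(
        "https://pairwise.example.com/sort",   # hypothetical endpoint
        data=json.dumps(items).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# From the user's point of view the composition is simply sort(data):
ranked = pairwise_sort(fetch_flickr_photos("sunsets"))
```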

Web mash-up building services, such as Yahoo! Pipes, provide an interface for users to create arbitrary combinations of data from throughout the web. With Pipes, people use a simple web interface to combine icons representing various programming functions. For example, one can drag a fetch data icon onto a reverse icon, and one can also embed "pipes" within each other. Pipes is a good example of integrating an abstract programming interface with internet data. But to add functions beyond what Pipes offers, one must build an external web service and call it from within Pipes, which can be a nuisance if a large amount of additional functionality is needed. Another problem with Pipes, as with all mash-up builders, is API interoperability.
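For a sense of what that kind of composition amounts to in code, here is a small sketch of a Pipes-like pipeline in which steps are chained and a pipeline can itself be embedded as a step inside another; the step names and canned data are illustrative only.

```python
# A small sketch of Pipes-style composition: individual steps (fetch, reverse,
# ...) are chained, and a pipeline can be nested inside another pipeline.

class Pipe:
    def __init__(self, *steps):
        self.steps = steps

    def __call__(self, value):
        for step in self.steps:   # each step is any callable, including another Pipe
            value = step(value)
        return value

def fetch(url):
    # Placeholder for a "fetch data" step; a real step would download the feed.
    return ["item 1", "item 2", "item 3"]

def reverse(items):
    return list(reversed(items))

inner = Pipe(reverse)                     # a pipe embedded within another pipe
feed = Pipe(fetch, inner)
print(feed("https://example.com/feed"))   # ['item 3', 'item 2', 'item 1']
```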

A significant current problem in moving towards an internet of connected services revolves around standards. To retrieve photos from Flickr, or to use any other web service, one must learn the application interface and write (or reuse) code that operates within the domain of that interface (in many cases one must also sign up for an account with the service). There are no standards (or none commonly used throughout the web) that one can follow when writing an API and refer to when interfacing with an arbitrary API; APIs are standard only on a per-application basis. In operating systems development it was quickly realized that having no specification for APIs (a standardized API for APIs, or meta-API) was extremely inefficient, and a set of standards known as POSIX was developed in response. We need a set of POSIX-like open standards for the internet. For APIs to be completely non-interoperable until specialized code is written on a per-API basis is highly unproductive.
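To make the idea of a meta-API concrete, here is a sketch of what a shared descriptor format might look like; the format, field names, and the example service below are all invented for illustration and do not correspond to any existing standard.

```python
# A sketch of a POSIX-like "API for APIs": each service publishes a descriptor
# in one shared format, so a generic client can discover and invoke any
# conforming operation the same way.

PHOTO_SERVICE_DESCRIPTOR = {
    "service": "example-photos",
    "operations": {
        "search_photos": {
            "transport": "http-get",
            "endpoint": "https://api.example-photos.com/search",   # hypothetical
            "parameters": {"tags": "string", "per_page": "integer"},
            "returns": "list of photo records",
        },
    },
}

def call(descriptor, operation, **arguments):
    # Generic client: look up the operation in the descriptor and dispatch it.
    # A real implementation would build and send the HTTP request it describes;
    # here we just return what would be sent.
    spec = descriptor["operations"][operation]
    return {"endpoint": spec["endpoint"], "transport": spec["transport"], "arguments": arguments}

print(call(PHOTO_SERVICE_DESCRIPTOR, "search_photos", tags="sunset", per_page=10))
```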

Without API standards, the development of an intentional web, in the sense of a network that organizes information based on the inferred goals of its members, is still possible. The difficulty is that the programs which mine goals (intention) from web information would either need to be primarily self-contained or would need to link together other services that themselves use proprietary APIs. Because large-scale adoption of new technologies by internet developers can be slow and is normally not undertaken without justifiable cause, an effective approach may be to build an intentional web by linking other services together, but linking them through an API interpreter that converts arbitrary APIs into an open standard. Those who wish to use the intentional features of the web, or any web service, can use the open standards. Those who wish to develop services integrated into the intentional web, or accessible by other services using the open standards, can either write their API in conformance with the standards or write an interpreter that translates their API into them. Additionally, anyone who wants a conforming version of an existing API can write an interpreter and thereby make that API available to all users. Preferably, there would be a discovery agent that builds interpretations of arbitrary APIs into the open standards and documents each API both in the open-standards format and in its original format. After processing an API, the discovery agent would monitor it for changes and update the interpretation and documentation as necessary to maintain current functionality and add newly introduced functionality.
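A minimal sketch of such an API interpreter, assuming the hypothetical open-standard interface sketched above: a thin adapter exposes a proprietary service through the standard interface, so intentional-web agents only ever program against the standard. Every class and method name here is invented for illustration.

```python
# A sketch of the API-interpreter idea as an adapter: proprietary calls are
# translated into a standard interface, and results are normalized back.

class StandardPhotoSource:
    # The open-standard interface an agent would code against.
    def search(self, keywords):
        raise NotImplementedError

class ProprietaryPhotoAPI:
    # Stand-in for some existing service's non-conforming API.
    def find_pics(self, q, fmt="json"):
        return [{"pic_id": 1, "matched": q}]

class ProprietaryAdapter(StandardPhotoSource):
    # The interpreter: translates standard calls into the proprietary ones
    # and normalizes the results back into the standard shape.
    def __init__(self, api):
        self.api = api

    def search(self, keywords):
        raw = self.api.find_pics(" ".join(keywords))
        return [{"id": item["pic_id"]} for item in raw]

source = ProprietaryAdapter(ProprietaryPhotoAPI())
print(source.search(["lee's", "art", "supplies"]))   # [{'id': 1}]
```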

There are certainly more hurdles before a complete intentional web is developed but, without either open standards or automated API interpreters, only hubs of services capable of communicating with each other will develop. These will likely be controlled by Google, Microsoft, Amazon, and other big industry companies. With systems made interoperable based on open standards we would be able to connect one service to another without hassle. As all standards must, meta-API standards must have flexibility and extensibility built in as a core component. Ideally, these standards would emerge naturally from the existing APIs on the web and gracefully bend to fit changes occurring in the web over time. An eventual change in the web will be a movement beyond an intentional web. Systems should be developed to accommodate radical alteration of the fundamental structure that defines them.

Kosmix and the Semantic Web

Saturday, February 07, 2009

I recently ran across an "interview":http://www.beet.tv/2008/06/kosmix-topical.html with Anand Rajaraman, founder of Kosmix, and something said toward the end of the interview piqued my interest. The subject of the Semantic Web came up, and Anand claimed that its existence would far more likely be brought about by apps "mining intelligence" out of the internet's squall of information than by universal adoption of a common semantic ontology like RDF. We certainly agree with that, as we believe that the winners of the race to establish the next generation of web search will be the ones who mine intelligence most efficiently. However, something stuck in my craw about Kosmix being held up as an example of the various expeditions presently being made in this general direction. Which isn't to say that I think Kosmix is not on such an expedition, but rather that I seem to be feeling the same vague perturbation I felt when I first made an expedition of my own through the flurry of noise on Kosmix, after hearing about the explorative experience supported by their search engine.

Even though Kosmix can certainly be counted among the various search engines taking steps, in one way or another, toward bringing to life what folks call the Semantic Web, the steps being taken range from insufficiently ambitious for my taste to simply misguided (if one's goal is to intelligence-mine one's way to the Semantic Web). Let's start with a property of Kosmix I've already voiced my complaints over: the noise. In my last blog post, I responded to a blog article written by Mr. Rajaraman, in which he likened Kosmix searches to "exploring haystacks" rather than looking for needles, as in most keyword searches. Several commenters on that post, whom I quoted in my article, noted in various terms that there was simply too much noise in the search results and that it was too hard to bring the search into focus when one wanted. These complaints remind me of an "article":http://www.techcrunch.com/2008/04/17/web-30-will-be-about-reducing-the-noise%E2%80%94and-twhirl-isnt-helping/ I caught on TechCrunch a long while back by Erick Schonfeld, entitled "Web 3.0 Will be about Reducing Noise—And Twhirl Isn't Helping." As the title suggests, alongside Erick's various complaints about Twhirl is the claim that the next generation of web search/exploration will be very much concerned with noise reduction, and that this is not incompatible with many people's view that it will be about the establishment of a Semantic Web. It seems pretty clear to me that these two visions of the future are not merely compatible; they are necessarily intertwined.

Tim Berners-Lee has described the transition to the Semantic Web as centering on a shift from a view of the internet as a collection of documents to a view of it as a collection of data or knowledge. Building the Semantic Web means turning the internet from a collection of documents, over which keyword searches allow for retrieving pieces of information, into a coherent, navigable corpus of knowledge, from which one can similarly retrieve pieces or bodies of information of arbitrary breadth and depth. Making this transition means seeking to establish a representation of every piece of information in every document online, as well as of the many ways in which each piece of information relates to the others. Establishing the coherence needed to support the sort of web search we dream about, in which one can retrieve anything from a simple answer to a simple question to a crash course in an academic discipline, means applying an overwhelming amount of structure to the web, and presenting users with a representation of the web (or at least of small parts of it) that reveals that underlying structure and lets them freely navigate it.

Kosmix may seem to apply structure to the web in this way: for any query submitted, it seeks out all the possible types of results the user may be looking for and assembles them into a "profile page" for that topic, which amounts to a little bit of everything you could possibly have been asking for. But if one spends a fair amount of time really trying to explore a subject, whatever structure is represented only becomes increasingly obscured. Attempting to refine the search by selecting a related item "In the Kosmos" from the right margin only leads to Kosmix casting yet another wide net, and another mess of results which could be what you were looking for. As I've said before, I respect their attempt to support the sort of web exploration we want to see made possible, but the lack of dexterity afforded the user in navigating the results makes it impossible to truly explore. Furthermore, simply retrieving every peripherally related result for a given query, and trying to fit as many of them on one page as possible, does nothing to reveal any underlying structure. Any intelligence they may be mining out of the internet on the back end is lost in the noise on the front.

Again, we count Kosmix among the groups participating in the gradual progression towards the Semantic Web. We greatly respect what they're trying to do, and Anand's metaphorical contrast between exploring haystacks and searching for needles certainly resonates with our goals. But as I've written previously, the degree to which the user can effectively engage in such exploration is closely tied to the dexterity with which the user can sift through search results. When there's no way for the user to focus a search at will and clear out the noise, that dexterity is greatly limited. Kosmix needs to restructure the way they present their search results, as well as how they let users navigate them, so that they're not just returning haystacks and the user can effectively explore "The Kosmos." In the interview I mentioned above, Anand notes that the Semantic Web will most likely be brought about by the efforts of a number of different companies. Without a doubt, the push toward the Semantic Web will draw upon the collective efforts of a diverse range of organizations taking a number of different approaches to mining intelligence out of the web, and many of these organizations would benefit greatly from collaborating with or learning from others taking different approaches. Just as we're always looking for ways to better allow our users to forage the web, and for new audiences and organizations who would find our search and research services particularly useful, we hope the Kosmix team will consider venturing a bit further from the traditional way users interact with a search engine, in order to maximize the usefulness of their own.

Yes, Keyword Search is About to Hit its Breaking Point

Tuesday, May 27, 2008

A debate rippled across a few tech blog sites following Erick Schonfeld's reiteration, a few weeks ago, of some claims made by Nova Spivack concerning the fate of traditional keyword search. As Schonfeld explains, Spivack is of the opinion that as the number of web pages a search engine has to sift through explodes exponentially, the efficacy of a simple keyword search will drop off. Spivack himself explains the problem as follows:

"Keyword search engines return haystacks, but what we really are looking for are the needles. The problem with keyword search such as Google's approach is that only highly cited pages make it into the top results. You get a huge pile of results, but the page you want the — 'needle' you are looking for — may not be highly cited by other pages and so it does not appear on the first page. This is because keyword search engines don't understand your question, they just find pages that match the words in your question."

Obviously, the idea is to herald the rise of the Semantic Web and semantic search technologies, which Spivack believes will inevitably supplant the traditional Google search. But such boasts can't long go unchallenged, as naysayers had started saying their nays within a few hours of Schonfeld's article's appearance on TechCrunch. Chris Edwards and Ian Lurie promptly trumpeted their enduring allegiance to good old-fashioned keyword search. Lurie's case for the object of his loyalty is a pragmatic one: keyword search seems to work pretty well for 99% of the population, and the Semantic Web will never happen because it requires universal adoption of a new standard. Edwards takes his argument a little further, pointing out that keyword search isn't quite as simple as searching for documents in which a word or words appear. Rather, keyword search engines use statistical machinery that has been evolving since the 70s to retrieve the documents that are probably the most closely aligned with what the user means by a particular query, and this statistical machinery will keep evolving to meet future challenges. And also, the Semantic Web will never happen because it requires universal adoption of a new standard.

I can't argue with the claims that keyword search has done a bang-up job so far or that the Semantic Web is far from an inevitability, but these arguments miss the point. The overwhelming majority of Googlers venture no further than the third page of their search results. In those first couple of pages, there will be a few results that are close to what the user was looking for, a few that are way off, and hopefully a few that are close enough to count as the 'right' results. Since keyword search engines use statistical methods, you'd expect there to be some degree of noise in the results, in some form or another. And there is: all those results that have nothing to do with what the user is looking for. Add to that the results that may be pretty close in an objective sense but are still of no use to the user, and the redundant results that waste space on the page without saying much more than the result before them, and you start to see how cluttered with noise those first few pages of a search can be. Now, if the number of web pages being searched increases dramatically, the number of pages returned for the average search will increase dramatically as well, and the amplitude of the noise sitting on top of the results will increase with it. The undeniable consequence is that, all other things being equal, as the results become significantly greater in number, it becomes less likely that the best result will actually be on the first page of results, or the second, or even the third.

As Edwards points out, the statistical methods used by the big search engines are constantly being optimized. But are they being optimized fast enough to keep up with an exponentially exploding problem? A study conducted by iProspect and Jupiter Research concluded in 2006 that the ability of search engines to ensure that the object of the user's search appeared in the first page or two of results was still increasing. However, it's clear from the data that the increase was swiftly decelerating, apparently approaching exactly the sort of plateau Spivack prophesies, with only a 2% increase in average search engine efficacy from 2004 to 2006. What keyword loyalists are missing is that, regardless of how effective keyword search has been in the past or how unlikely Semantic Web technologies are to equal such success, there are dark clouds looming large on the horizon for the search engine status quo.

But if a problem is indeed developing, one that requires something of a tectonic shift in the way search is done, and the Semantic Web isn't the answer, then what is? I stumbled across yet another response to Schonfeld's article, which I believe contains about half the answer to this question. As the author notes, when we're faced with too much information for focused searches to be productive, we rely on an unfocused method of discovery to absorb interesting information. It's hard to know what queries to give when searching for new music, because there are just so many different things out there, so we rely on recommendations from friends, magazines, or websites, or on chance encounters with singles. There's no doubt that people will come to rely more and more on discovery-based web browsing in addition to search; just in the last couple of years we've seen an explosion in the popularity of sites that facilitate such discoveries, such as Digg, Reddit, and Twitter. So, at least some part of the answer to the above question seems to be that, in order to satisfy people's needs for interesting information, new tools will have to be developed for efficiently facilitating the serendipitous encounter of new and intriguing material on the web.

I do believe that the tools that would allow users to more efficiently discover useful information on the web in this manner, and the tools that would better enable users to filter out the noise in the focused searches I alluded to above, might just be one and the same. By providing users with an intuitive representation of the composition of a body of search results and the conceptual interconnections between them, along with an unprecedented level of control and dexterity in manipulating those results, one enables the user to throw out the noise and focus on what's interesting, or to follow a hot trail and stumble upon new information well off the beaten path. That sounds like a pretty good solution to me.