Literature Review 2008 - 2009

Friday, November 06, 2009

Machine Learning

G. Lebanon, Y. Mao, and J. Dillon. The Locally Weighted Bag of Words Framework for Document Representation. Journal of Machine Learning Research 8 (Oct):2405-2441, 2007.

Summary: The lowbow framework captures topic trends in a document by applying a "local smoothing kernel to smooth the original word sequence temporally" (3).  This provides a metric for the distance between two documents: the integral of the pointwise distance along the document curves (11) or a diffusion kernel (11).  Dynamic time warping can be used to reduce the effects of word order (25), although this proved negligibly helpful in experiments, perhaps because of a homogeneous corpus or the robustness of the algorithm (26).  Linear differential operators can be applied to the curve to find topic trends (the tangent vector field) and document variability (the integral of the curvature tensor norm) (12).
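As a rough illustration of the smoothing step only (a minimal sketch using a Gaussian kernel and invented parameter values, not the paper's exact kernel or normalization), each sample point along the document gets a locally weighted word histogram:

    # Sketch of locally weighted bag-of-words smoothing (illustrative only;
    # the paper's kernel and normalization details are not reproduced here).
    import numpy as np

    def lowbow_curve(word_ids, vocab_size, num_points=20, sigma=0.1):
        """Return a (num_points x vocab_size) matrix: one smoothed
        word histogram per sample point along the document."""
        n = len(word_ids)
        positions = np.arange(n) / max(n - 1, 1)      # word positions in [0, 1]
        samples = np.linspace(0.0, 1.0, num_points)   # where the curve is evaluated
        curve = np.zeros((num_points, vocab_size))
        for i, mu in enumerate(samples):
            weights = np.exp(-((positions - mu) ** 2) / (2 * sigma ** 2))
            weights /= weights.sum()
            for w, wt in zip(word_ids, weights):
                curve[i, w] += wt                     # locally weighted counts
        return curve

    # Example: a toy 6-word document over a 4-word vocabulary.
    print(lowbow_curve([0, 1, 1, 2, 3, 0], vocab_size=4).shape)  # (20, 4)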


D. Blei and J. Lafferty. A Correlated Topic Model of Science. Annals of Applied Statistics 1(1):17-35, April 2007. (A shorter version appeared in NIPS 18.)

Summary: The CTM is able to capture correlation between topics (unlike LDA).  Applied algorithm to the Science journal corpus to generate a topic model.  Used a graphing algorithm to describe connections between latent topics.  Associated each document with a latent vector of topics, which allows a user to browse documents by topic(s) they are related to.
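The correlation is carried by a logistic-normal prior on the topic proportions, whose covariance matrix captures which topics tend to co-occur. A toy sketch of that draw (the covariance values are invented for illustration):

    # Toy illustration of the CTM's logistic-normal prior: correlated topic
    # proportions come from a multivariate Gaussian pushed through a softmax.
    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.zeros(3)                                 # 3 topics (hypothetical)
    Sigma = np.array([[1.0, 0.8, -0.5],              # positive covariance => topics
                      [0.8, 1.0, -0.4],              #   1 and 2 tend to co-occur
                      [-0.5, -0.4, 1.0]])

    eta = rng.multivariate_normal(mu, Sigma)         # document-level draw
    theta = np.exp(eta) / np.exp(eta).sum()          # topic proportions
    print(theta)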


G. Kumaran and J. Allan. Effective and Efficient User Interaction for Long Queries, Proceedings of the 31st Annual International ACM SIGIR Conference, pp. 11-18. July 2008.

Summary: Selective interactive reduction and expansion (SIRE) of queries is used so users can interactively narrow returned search results.


X. Yi and J. Allan. Evaluating Topic Models for Information Retrieval. To appear in the Proceedings of the ACM 17th Conference on Information and Knowledge Management, Napa Valley, CA, October 26-30, 2008.
Summary: From the abstract, "(1) topic models are effective for document smoothing; (2) more elaborate topic models that capture topic dependencies provide no additional gains; (3) smoothing documents by using their similar documents is as effective as smoothing them by using topic models; (4) topics discovered on the whole corpus are too coarse-grained to be useful for query expansion. Experiments to measure topic models' ability to predict held-out likelihood confirm past results on small corpora, but suggest that simple approaches to topic model are better for large corpora."
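Point (1), document smoothing, is commonly realized by interpolating a document's language model with a topic-based model; a hedged sketch of that interpolation (not necessarily the paper's exact estimator, and the parameter values are invented):

    # Sketch of topic-model document smoothing for retrieval: interpolate the
    # maximum-likelihood document model with a topic-based model.
    def smoothed_prob(w, doc_counts, doc_len, topic_word, doc_topics, lam=0.7):
        """p(w|d) = lam * p_ml(w|d) + (1 - lam) * sum_z p(w|z) p(z|d)."""
        p_ml = doc_counts.get(w, 0) / doc_len
        p_topic = sum(topic_word[z].get(w, 0.0) * p_z for z, p_z in doc_topics.items())
        return lam * p_ml + (1 - lam) * p_topic

    # Toy example with two topics.
    topic_word = {0: {"fusion": 0.02, "energy": 0.03}, 1: {"jazz": 0.04}}
    doc_topics = {0: 0.9, 1: 0.1}
    print(smoothed_prob("energy", {"fusion": 2}, 10, topic_word, doc_topics))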

D. Mimno, A. McCallum. Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression, To appear in UAI, 2008.
Summary: The authors propose Dirichlet-multinomial regression (DMR), which generates the prior distribution over topics specific to each document based upon its observed features.  In the results they predict topics from authors' documents, and also define a prior over topics and use it for author prediction of documents.  Impressive performance in comparison with the LDA and AT models (5).  The sampling phase in DMR is no more complex (slow) than in LDA (7).  Could potentially be extended to a hierarchical model (7).
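The core mechanism is a log-linear map from a document's observed features to its Dirichlet prior over topics; a minimal sketch of that step (the feature and parameter values are invented):

    # Sketch of the DMR prior: each document's Dirichlet parameters over topics
    # are an exponentiated linear function of its observed features.
    import numpy as np

    def dmr_alpha(x, Lambda):
        """x: feature vector for one document (with a bias term);
        Lambda: (num_topics x num_features) regression parameters.
        Returns the document-specific Dirichlet parameter vector."""
        return np.exp(Lambda @ x)

    # Toy example: 2 features (bias, "author is A") and 3 topics.
    Lambda = np.array([[0.1, 1.2],
                       [0.1, -0.5],
                       [0.1, 0.0]])
    x = np.array([1.0, 1.0])          # document written by author A
    print(dmr_alpha(x, Lambda))       # prior concentration per topic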

Topic and Role Discovery in Social Networks with Experiments on Enron and Academic Email. Andrew McCallum, Xuerui Wang and Andres Corrada-Emmanuel. Journal of Artificial Intelligence Research (JAIR), 2007.
Summary: Presents the Author-Recipient-Topic (ART) model, in which each topic is a multinomial distribution over words and each (author, recipient) pair is associated with a multinomial distribution over topics. People can be clustered based upon topics to elicit roles independent of the network of people they are connected to (2). Role-Author-Recipient-Topic (RART) generates explicit roles and conditions the topics on them (3).  A potential extension is to incorporate temporal information into the model (7).  Can generate topic relevance measures for an author relative to the roles they occupy (20).

G. Xue, et al. Implicit Link Analysis for Small Web Search. 2003.
Summary: Differentiates navigational links, recommendation links, and implicit recommendation links based on usage patterns.  Implicit links correspond to log entries; an implicit path is a path through implicit links; an explicit path is a path from one implicit link to another with explicit links in between.  The weight of an edge in the implicit link graph is its normalized support (3).  Experimentally, the higher the support the more precise the implicit link, and an implicit link is more likely to be a recommendation link than an explicit link is (4).  In experiments using pages ranked by people as the ideal, implicit PageRank showed considerable improvements over full text, explicit PageRank, DirectHit, and modified-HITS algorithms (5).  Shows that in small webs, links do not necessarily meet the requirement of being recommendations that PageRank needs in order to function effectively.
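A rough sketch of one way to mine implicit links from session logs and keep only pairs whose normalized support clears a threshold (the function names and threshold are hypothetical, and the paper's exact two-phase definition is not reproduced):

    # Sketch: build an implicit link graph from user sessions. An implicit link
    # (a -> b) is counted whenever b follows a in a session; its weight is the
    # normalized support, and low-support pairs are dropped.
    from collections import Counter

    def implicit_links(sessions, min_support=0.01):
        pair_counts = Counter()
        for pages in sessions:
            for a, b in zip(pages, pages[1:]):
                if a != b:
                    pair_counts[(a, b)] += 1
        total = sum(pair_counts.values()) or 1
        return {pair: c / total for pair, c in pair_counts.items()
                if c / total >= min_support}

    sessions = [["home", "docs", "faq"], ["home", "docs"], ["faq", "home"]]
    print(implicit_links(sessions))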

C. Wang, D. Blei, and D. Heckerman. Continuous time dynamic topic models. In Uncertainty in Artificial Intelligence (UAI), 2008.

X. Wang, A. McCallum. Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. KDD, August 2006.

A. Gruber, M. Rosen-Zvi, Y. Weiss. Latent Topic Models for Hypertext. 2008.

C. Sutton, A. McCallum. An Introduction to Conditional Random Fields for Relational Learning.


Medical


G. Luo, C. Tang, H. Yang, and X. Wei. MedSearch: A Specialized Search Engine for Medical Information Retrieval. [pdf] Proc. 2008 ACM Conf. on Information and Knowledge Management (CIKM'08), industry session, Napa Valley, CA, Oct. 2008, pp. ?-?.
Summary: Medical search differs because users' searches are more exploratory and they search through related information.  Queries are generally longer, and Google has a 32-word limit; other engines generally have length limits as well.  MedSearch drops unimportant terms, returns diverse search results, and suggests medical phrases relevant to the query.  It uses an established hierarchical ontology of medical terms (MeSH) to identify and rank medical phrases in the returned top web pages (3).

G. Luo, C. Tang. On Iterative Intelligent Medical Search. [pdf] Proc. 2008 Int. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'08), Singapore, July 2008, pp. 3-10.
Summary: First, relevant symptoms and signs are automatically suggested based on the searcher's description of his situation. Second, instead of taking for granted the searcher's answers to the questions, iMed ranks and recommends alternative answers according to their likelihoods of being the correct answers. Third, related MeSH medical phrases are suggested to help the searcher refine his situation description.

Enterprise

Hawking, D. (2004). Challenges in Enterprise Search. In Proc. Fifteenth Australasian Database Conference (ADC2004), Dunedin, New Zealand. CRPIT, 27. Schewe, K.-D. and Williams, H. E., Eds. ACS. 15-24.
Summary: Review of enterprise search problems, including "3.5 Estimating importance of non-web documents" (6), to which topic-based probabilistic linking seems a good solution.  Exploiting context is also mentioned (3.6, 6).  The proposal to convert documents to HTML so that they can be analyzed as normal web pages is criticized because the converted documents would lack link information.

Parsing

Bayesian Modeling of Dependency Trees Using Hierarchical Pitman-Yor Priors. Hanna Wallach, Charles Sutton, Andrew McCallum. In International Conference on Machine Learning, Workshop on Prior Knowledge for Text and Language Processing. (ICML WS), 2008.
Summary: Describes two hierarchical models for Bayesian dependency trees and uses them to parse sentences.  Latent variables cluster parent-child dependencies, resulting in improved parse accuracy.

Bibliometrics

B. Zhang, et al.  Intelligent Fusion of Structural and Citation-Based Evidence for Text Classification. Annual ACM Conference on Research and Development in Information Retrieval.  2005.
Summary: Uses GP to combine various bibliometric measures with +, *, /, and sqrt (why no -?).  Found that GP trees perform better than individual metrics and slightly better than SVM, although SVM greatly outperforms their method on some categories.  Is there a way to choose the better of SVM and GP per category?  The paper is poorly written.
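To make the combination concrete, here is a toy evaluator for the kind of expression tree GP would evolve over citation-based features (the feature names and the tree are invented for illustration):

    # Toy evaluation of a GP expression tree combining citation features with
    # +, *, / and sqrt, as in the paper's operator set (the tree is invented).
    import math

    def evaluate(node, features):
        if isinstance(node, str):                 # leaf: a named feature
            return features[node]
        op, *args = node
        vals = [evaluate(a, features) for a in args]
        if op == "+": return vals[0] + vals[1]
        if op == "*": return vals[0] * vals[1]
        if op == "/": return vals[0] / vals[1] if vals[1] else 0.0
        if op == "sqrt": return math.sqrt(max(vals[0], 0.0))
        raise ValueError(op)

    tree = ("+", ("sqrt", "cocitation"), ("*", "bib_coupling", "text_sim"))
    print(evaluate(tree, {"cocitation": 4.0, "bib_coupling": 0.5, "text_sim": 0.8}))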

User Interaction Design
M. Schmettow. User Interaction Design Patterns for Information Retrieval  Systems.

ML Software & Toolkits

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

R. Bekkerman. MDC Documentation

Summary: The MDC Toolkit takes a contingency table and parameters file and returns clusters.

How Helioid Benefits Users

Tuesday, November 03, 2009

The simple answer to how Helioid benefits users is that Helioid represents information and information navigation in a more efficient manner. This gets complex when looking at how each individual uses the internet and searches for information, but the core is still the same. A current issue with web search, as Google's Marissa Mayer "explains":http://www.techcrunch.com/2008/09/10/marissa-mayer-clarifies-search-is-only-10-done-not-90/, is that it is undeveloped and not advanced: "Think of it like biology and physics in the 1500s or 1600s: it's a new science where we make big and exciting breakthroughs all the time."

h4. Some specific problems in this new science:

  • No representation of relationships between search results on a content or conceptual level
    • Information connections are "in the dark"
  • No (or extremely little) representation of relationships between search results on a physical hyperlink level
  • Searchers must play a guessing game of coming up with the correct keywords in order to get more relevant results
  • No ability to refine a search or to narrow in on the correct result or concept
    • Searchers must start over with every new search they execute

h4. Some of the solutions Helioid provides:

  • Clear representations of connections between search results
    • On a concept level
    • On a link level
  • Clear representations of the categories that a search result and groups of search results fall into
    • Representation of subject matter shared between different results for the same query
  • Can refine a search and therefore do not have to start over from scratch
    • Can eliminate results or terms from search results to refine the search and zoom in on wanted information
    • Can save results and the search query and then continue searching later

On Kosmix and Needles in Haystacks

Friday, January 16, 2009

A little over a week ago, TechCrunch featured an "article":http://www.techcrunch.com/2008/12/08/kosmix-raises-20-million-more-for-its-universal-search-engine/ on the latest round of funding raised by rising web-search star Kosmix. In said latest round, Kosmix managed to rake in an impressive $20 million from a wide range of investors, led by Time Warner, bringing the search engine's total funding to $55 million. Upon paying a visit to their site, it becomes immediately apparent what all the hoopla's about. Kosmix pulls the top search results from a variety of popular sources in a variety of different categories, including video sites YouTube and Truveo, info sites like Wikipedia and HowStuffWorks, and shopping sites like Amazon and eBay, in order to create a mash-up of all the possible kinds of information you might be interested in. A collection of related subjects are also presented in the left margin of the results page, in order to facilitate some degree of search refinement. Without a doubt, Kosmix provides a search experience quite distinct from the major search engines, and I feel fairly safe in saying that most searches performed with Ask would yield more fulfilling results with Kosmix. And yet, after playing with Kosmix for a while, I felt as though something was amiss.

In the various articles I've read about Kosmix, and on founder Anand Rajaraman's own blog, the approach employed is described as “horizontal search” or alternatively as “creating a homepage for any topic.” The idea is that if the user doesn't know exactly what he is looking for, he can enter a query and get a cross section of everything the web has to offer on the subject, in any conceivably relevant context. For example, searching for “fusion” gets you a snippet from a Wikipedia article on Nuclear fusion, a few price quotes for recent Ford Fusion models, a couple HowStuffWorks articles on modern nuclear fusion reactors, a couple audio clips from Sat Mahaori's album, the Khmer Fusion Project, a pair of articles from Helium.com on fusion energy and nuclear fusion bombs, a video of a live jazz fusion performance, a news article on recent scientific advancements in nuclear fusion, and the list goes on and on. Certainly, if I had literally no idea what I wanted other than that it involved the word "fusion" I would be in hog heaven. Of course, searching for “fusion” on Google gets a similarly wide range of different types of results, as the word has a wide range of uses. However, by displaying the results simply in order of popularity and, for queries as loaded as “fusion,” providing a set of related but more specific queries, Google delivers quick satisfaction in the form of a specific piece of information in the vast majority of cases when that is what the user is looking for.

Rajaraman justifies the Kosmix approach of collecting a myriad of different sets of search results on one page in a blog article entitled “Searching For a Needle or Exploring the Haystack?” Obviously, the approach taken by Google and the other major search engines is identified as needle-searching. Rajaraman explains that Kosmix is best suited for exploration of the web, without knowing what exactly you're looking for. This definitely resonates with some of the applications we've envisioned for Helioid, but something still seems to be missing, and part of that missing something is identified in the comments made to Rajaraman's blog post. One commenter notes that the noise-to-signal ratio in Kosmix is unacceptably high, to which Rajaraman responds by saying that this is bound to be the case in web "exploration." Another commenter says that he understands the appeal of a broad range of information, but feels as though the floodgates are simply being opened in Kosmix, and he would prefer a more concise representation of the spectrum of available information, with the ability to “drill down” into specific topics or categories. Yet another commenter says that he too understands the appeal of haystack exploration, but that at different times one wants to search for specific information and at other times to explore the web aimlessly. And there are many more along these lines.

It seems as though a fair number of the people giving Kosmix a try are seeing the same problems that I am, and it also seems like these same people would benefit from the solutions we intend to offer when Helioid branches into general web search. The first blog article I wrote for Helioid was about the decline of Google's efficacy as the number of pages being searched over increases exponentially. It included the argument that, as the number of pages increases exponentially, and the noise-to-signal ratio in Google's search algorithms is not sufficiently decreased, you'll see more noise in the results in the first few pages, eventually displacing some valuable search results and making it less likely, on average, that one would find what they're looking for on the first page, or even in the first few pages. I further argued that one solution to this problem also lent itself quite handily to the issue of getting as much breadth and depth as one wants in a series of web searches, and that was simply organizing the search engine in such a way as to maximize the dexterity of the user in navigating the search results (the article was written before Google started widely using related search terms). The user should be able to eliminate all the noise of a particular type in the results, or focus the results around a particular category, or the intersection between categories. Although Kosmix provides related search terms, clicking on these suggested queries does little in the way of "drilling down" into more specific collections of information. The results are always presented in the same "homepage for any topic" form, and thus will always favor breadth over depth, even to the point of making it near impossible to get any manner of depth of information about a given topic. And the depth achieved by simply selecting a category to drill down into still, in general, falls far short of what ought to be afforded to the user. By allowing users to broaden or narrow the categories over which they are searching, by providing the means to collect information from a range of sources into a coherent answer to a given question, and by supporting the intuitive navigation from a given web page to related pages, allowing for the exploration of a "topic space" which can be as focused or as unfocused as the user chooses, Helioid will maximize the dexterity with which anyone can browse the web, and so provide a one-stop-shop for searching for needles AND exploring haystacks.

Composing Inverse Functions to Measure Model Fitness

Friday, December 05, 2008

This article concerns a method for evaluating the fitness of content topic models and topic document models based on the dissonance between a set of documents and the set of documents regenerated from it by composing the two kinds of model: a document generating function is applied to the output of a topic generating function that is applied to the original set of documents. To compare topics instead, one can compare the original set of topics with the set of topics generated by applying a topic generating function to the documents generated by applying a document generating function to the original set of topics.

In the notation below let D be a corpus, T be a set of topics, F be a set of content topic models such that each member has domain a subset of D and range a subset of T, and let G be a set of topic document models such that each member has domain a subset of T and range a subset of D. (In other words f({d1, ...}) = {t1, ...} and g({t1, ...}) = {d1, ...}.) Let bold face represent sets and italics represent functions or singletons, thus: f represents a member of F, d a member of D, etc. Let "*" represent the dot operator when applied to sets and multiplication when applied to scalars, |A| is a scalar representing the cardinality of the set A.

To measure the difference between d and g(f(d)) we may look at the value of:

(1) d * g(f(d))

It is arguable that a more fit f or g will result in greater values for (1). We can also look at a comparison between g(f1(d)) and g(f2(d)). Under certain circumstances, it may be reasonable to assume that if:

(2) d * g(f1(d)) > d * g(f2(d))

the topic generating function f1 is more fit than f2.

What (2) says is that the number of documents both in our original set and in the set of documents generated (by applying a document generating function to a topic generating function applied to that original set) is larger if we use the topic generating function f1 versus f2. As an analogy, suppose I show a bunch of books to two people, Brin and Sergey. Brin says all my books are from one set of genres and Sergey says they're from another (the genres they choose may overlap). Now suppose I take Brin's genres and Sergey's genres and separately feed them into Amazon which returns two sets of books: Brin's books and Sergey's books. If it turns out that there are only a couple of Brin's books in common with my original books, while more than half of Sergey's books are in my original set of books, I'd consider this a good reason to assume Sergey's genres were more accurate than Brin's.

Of course it could be that Sergey just got lucky with his topics, or that Brin was unlucky. It could also have been that the Amazon algorithm returned more documents for Sergey's genres and his chances were thereby improved, or that the Amazon algorithm is wrong and its incorrect results are merely misleading me into thinking Sergey's genres are relevant. Additionally, Sergey could have given me a slew of genres to improve his chances. The three problems here are (A) the size of the set of documents generated, (B) the size of the set of topics given, and (C) the fitness of the document generation algorithm.

To guard against (A), the size of the set of documents that is generated, we bias towards a large number of generated documents overlapping with our original documents and against generated documents that are not in the original set. To do this we compute the dot product between d and g(f(d)) and then normalize.

(3) (d * g(f(d)))/(|d| * |g(f(d))|)

To discount (B) the size of the set of topics generated we normalize for the size of that set.

(4) (d * g(f(d)))/(|d| * |g(f(d))| * |f(d)|)

To account for (C) poor fitness of the document generation algorithm we may take the sum of (4) over a set of document generating algorithms applied to the same topic generating algorithm.

(5) sum_i (d * gi(f(d)))/(|d| * |gi(f(d))| * |f(d)|)
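Treating each set as a set of document ids, so that the dot product d * g(f(d)) is just the size of the intersection, here is a minimal sketch of (3)-(5) (the lookup models f, g1, g2 are invented stand-ins for real topic and document models):

    # Sketch of equations (3)-(5): sets of document ids stand in for indicator
    # vectors, so the dot product d * g(f(d)) is |d & g(f(d))|.
    def overlap_score(d, f, g):
        """Equation (4): normalized overlap between d and g(f(d))."""
        topics = f(d)                       # f: set of docs -> set of topics
        regen = g(topics)                   # g: set of topics -> set of docs
        denom = len(d) * len(regen) * len(topics)
        return len(d & regen) / denom if denom else 0.0

    def topic_model_fitness(d, f, document_models):
        """Equation (5): sum of (4) over several document models g_i,
        guarding against any single poor document generator."""
        return sum(overlap_score(d, f, g) for g in document_models)

    # Toy example with hand-built lookup models (purely illustrative).
    f = lambda docs: {"genre_a"} if "d1" in docs else {"genre_b"}
    g1 = lambda topics: {"d1", "d2"} if "genre_a" in topics else {"d9"}
    g2 = lambda topics: {"d1", "d3", "d4"}
    print(topic_model_fitness({"d1", "d2", "d3"}, f, [g1, g2]))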

This representation has given us a way to compare the accuracy of Brin's and Sergey's book genres that is robust against both gaming by giving us lots of genres and the quality of the algorithm that chooses which books are in the genres we are given. So, back to content topic models: we can take a set of topic models F and assign a fitness to each one using equation (5) and a subset of the document models in G. We can then generate a fitness for each document model in G and weight it according to the fitness of the topic models used. Define s(f) (or s(g)) to be the fitness assigned to f (or g) by equation (5) or by one of (6) or (7) below.

(6) sum_i s(fi) * (d * g(fi(d)))/(|d| * |g(fi(d))| * |fi(d)|)

(7) sum_i s(gi) * (d * gi(f(d)))/(|d| * |gi(f(d))| * |f(d)|)

Equations (6) and (7) can be repeatedly evaluated to tune the fitness assigned to the members of F and G.
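A sketch of that feedback loop, alternating (7) and (6) and renormalizing after each pass (the renormalization and the fixed number of rounds are choices the text leaves open; overlap_score is the function from the previous sketch):

    # Sketch of iterating equations (6) and (7): topic-model scores weight the
    # update of document-model scores and vice versa, then both are renormalized.
    def iterate_fitness(d, topic_models, document_models, overlap_score, rounds=5):
        s_f = {f: 1.0 for f in topic_models}      # initial fitness of each f
        s_g = {g: 1.0 for g in document_models}   # initial fitness of each g
        for _ in range(rounds):
            # Equation (7): score each topic model using weighted document models.
            s_f = {f: sum(s_g[g] * overlap_score(d, f, g) for g in document_models)
                   for f in topic_models}
            # Equation (6): score each document model using weighted topic models.
            s_g = {g: sum(s_f[f] * overlap_score(d, f, g) for f in topic_models)
                   for g in document_models}
            for s in (s_f, s_g):                  # renormalize to avoid blow-up
                total = sum(s.values()) or 1.0
                for k in s:
                    s[k] /= total
        return s_f, s_g

    # e.g. iterate_fitness({"d1", "d2", "d3"}, [f], [g1, g2], overlap_score)
    # using the toy models from the previous sketch.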

The abstraction of the basic idea is quite simple: use a consortium of functions, weighted by their fitness, to evaluate the fitness of another function. There are many variations and extensions that could be used. We are still developing this idea, and more exploration regarding the use of a feedback loop to tune fitness is in order.