Literature Review 2008 - 2009

Friday, November 06, 2009

Machine Learning

G. Lebanon, Y. Mao, and J. Dillon. The Locally Weighted Bag of Words Framework for Document Representation. Journal of Machine Learning Research 8 (Oct):2405-2441, 2007.

Summary: The lowbow framework captures topic trends in a document by applying a "local smoothing kernel to smooth the original word sequence temporally" (3). This yields a metric for the distance between two documents: the integral of the pointwise distance along the documents' curves (11) or a diffusion kernel (11). Dynamic time warping can be used to reduce the effects of word order (25), although this proved negligibly helpful in experiments, perhaps because of a homogeneous corpus or the robustness of the algorithm (26). Linear differential operators can be applied to the curve to find topic trends (the tangent vector field) and document variability (the integral of the curvature tensor norm) (12).
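
To make the idea concrete, here is a minimal sketch of a lowbow-style representation, assuming a Gaussian smoothing kernel and an integer-coded vocabulary (the paper's kernels and normalization differ in detail; all names here are mine):

```python
import numpy as np

def lowbow_curve(word_ids, vocab_size, mus, sigma=0.1):
    """Locally weighted bag of words: at each curve point mu, build a
    word histogram in which each word is weighted by a Gaussian kernel
    centered at mu over its (normalized) position in the sequence."""
    n = len(word_ids)
    positions = (np.arange(n) + 0.5) / n      # word positions in [0, 1]
    curve = np.zeros((len(mus), vocab_size))
    for i, mu in enumerate(mus):
        weights = np.exp(-((positions - mu) ** 2) / (2 * sigma ** 2))
        weights /= weights.sum()              # kernel weights sum to 1
        for w, wt in zip(word_ids, weights):
            curve[i, w] += wt                 # accumulate weighted counts
    return curve  # one local word distribution per value of mu
```

A plain bag of words is the limiting case of a very wide kernel; a narrow kernel preserves more of the word-order information that the distance and differential-operator machinery above exploits.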


D. Blei and J. Lafferty. A Correlated Topic Model of Science. Annals of Applied Statistics 1(1):17-35, April 2007. (A shorter version appeared in NIPS 18.)

Summary: The CTM is able to capture correlation between topics (unlike LDA). The authors applied the algorithm to the Science journal corpus to generate a topic model, used a graphing algorithm to describe connections between latent topics, and associated each document with a latent vector of topics, which allows a user to browse documents by the topics they relate to.


G. Kumaran and J. Allan. Effective and Efficient User Interaction for Long Queries, Proceedings of the 31st Annual International ACM SIGIR Conference, pp. 11-18. July 2008.

Summary: Selective interactive reduction and expansion (SIRE) of queries is used so users can interactively narrow returned search results.


X. Yi and J. Allan. Evaluating Topic Models for Information Retrieval. Proceedings of the ACM 17th Conference on Information and Knowledge Management (CIKM'08), Napa Valley, CA, October 26-30, 2008.
Summary: From the abstract: "(1) topic models are effective for document smoothing; (2) more elaborate topic models that capture topic dependencies provide no additional gains; (3) smoothing documents by using their similar documents is as effective as smoothing them by using topic models; (4) topics discovered on the whole corpus are too coarse-grained to be useful for query expansion. Experiments to measure topic models' ability to predict held-out likelihood confirm past results on small corpora, but suggest that simple approaches to topic model are better for large corpora."

D. Mimno, A. McCallum. Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression. UAI, 2008.
Summary: The authors propose Dirichlet-multinomial regression (DMR), which generates the prior distribution over topics for each document based upon its observed features. In the experiments they predict topics from documents' authors and, conversely, define a prior over topics that is used to predict documents' authors. Performance is impressive in comparison with the LDA and Author-Topic (AT) models (5). The sampling phase in DMR is no more complex (slow) than in LDA (7). The model could potentially be extended to a hierarchical form (7).

A. McCallum, X. Wang, and A. Corrada-Emmanuel. Topic and Role Discovery in Social Networks with Experiments on Enron and Academic Email. Journal of Artificial Intelligence Research (JAIR), 2007.
Summary: Presents the Author-Recipient-Topic (ART) model, in which each topic is a multinomial distribution over words and each (author, recipient) pair is a multinomial distribution over topics. People can be clustered based upon topics to elicit roles independent of the network of people they are connected to (2). Role-Author-Recipient-Topic (RART) generates explicit roles and conditions the topics upon them (3). A potential extension is to incorporate temporal information into the model (7). The model can generate topic relevance measures for an author relative to the roles they occupy (20).

G. Xue, et al. Implicit Link Analysis for Small Web Search. Proc. SIGIR, 2003.
Summary: Differentiates navigational links, recommendation links, and implicit recommendation links that are based on usage patterns. Implicit links correspond to log entries; an implicit path is a path through implicit links, while an explicit path is a path from one implicit link to another with explicit links in between. The weight of an edge in the implicit link graph is its normalized support (3). Experimentally, the higher the support, the more precise the implicit link, and an implicit link is more likely to be a recommendation link than an explicit link is (4). In experiments using pages ranked by people as the ideal, implicit PageRank showed considerable improvements over full-text, explicit PageRank, DirectHit, and modified-HITS algorithms (5). Shows that in small webs, links do not necessarily meet the requirement of being recommendations that PageRank needs to function effectively.
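
As a hedged sketch of the edge-weighting idea (the function and threshold are my own; the paper's exact definitions of support and normalization may differ):

```python
from collections import Counter

def implicit_link_weights(sessions, min_support=2):
    """Build an implicit link graph from browsing sessions: an implicit
    link (a -> b) is recorded whenever page b follows page a in a
    session, and its weight is the pair's support normalized by the
    total support leaving page a. Low-support pairs are dropped."""
    support = Counter()
    for pages in sessions:
        for a, b in zip(pages, pages[1:]):
            if a != b:
                support[(a, b)] += 1
    out_total = Counter()
    for (a, _b), count in support.items():
        out_total[a] += count
    return {(a, b): count / out_total[a]
            for (a, b), count in support.items()
            if count >= min_support}

# Example: implicit_link_weights([["A", "B"], ["A", "B", "C"], ["A", "C"]])
# returns {("A", "B"): 2/3}; A->C and B->C fall below min_support.
```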

C. Wang, D. Blei, and D. Heckerman. Continuous time dynamic topic models. In Uncertainty in Artificial Intelligence (UAI), 2008.

X. Wang, A. McCallum. Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. KDD, August 2006.

A. Gruber, M. Rosen-Zvi, Y. Weiss. Latent Topic Models for Hypertext. 2008.

C. Sutton and A. McCallum. An Introduction to Conditional Random Fields for Relational Learning.


Medical


G. Luo, C. Tang, H. Yang, and X. Wei. MedSearch: A Specialized Search Engine for Medical Information Retrieval. Proc. 2008 ACM Conf. on Information and Knowledge Management (CIKM'08), industry session, Napa Valley, CA, Oct. 2008.
Summary: Medical search is different because users' searches are more exploratory and move through related information. Queries are generally longer, and Google has a 32-word limit (other engines generally have length limits too). MedSearch drops unimportant terms, returns diverse search results, and suggests medical phrases relevant to the query. It uses an established hierarchical ontology of medical terms (MeSH) to identify and rank medical phrases in the returned top web pages (3).

G. Luo, C. Tang. On Iterative Intelligent Medical Search. Proc. 2008 Int. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'08), Singapore, July 2008, pp. 3-10.
Summary: First, relevant symptoms and signs are automatically suggested based on the searcher's description of his situation. Second, instead of taking for granted the searcher's answers to the questions, iMed ranks and recommends alternative answers according to their likelihoods of being the correct answers. Third, related MeSH medical phrases are suggested to help the searcher refine his situation description.

Enterprise

Hawking, D. (2004). Challenges in Enterprise Search. In Proc. Fifteenth Australasian Database Conference (ADC2004), Dunedin, New Zealand. CRPIT, 27. Schewe, K.-D. and Williams, H. E., Eds. ACS. 15-24.
Summary: A review of enterprise search problems, including "3.5 Estimating importance of non-web documents" (6), to which topic-based probability linking seems a good solution. Exploiting context is also mentioned (3.6, 6). The proposal to convert documents to HTML so that they can be analyzed as normal is criticized because the converted documents would lack link information.

Parsing

H. Wallach, C. Sutton, A. McCallum. Bayesian Modeling of Dependency Trees Using Hierarchical Pitman-Yor Priors. In International Conference on Machine Learning, Workshop on Prior Knowledge for Text and Language Processing (ICML WS), 2008.
Summary: Describes two hierarchical models for Bayesian dependency trees and uses them to parse sentences. Latent variables cluster parent-child dependencies, resulting in improved parse accuracy.

Bibliometrics

B. Zhang, et al. Intelligent Fusion of Structural and Citation-Based Evidence for Text Classification. Annual ACM Conference on Research and Development in Information Retrieval (SIGIR), 2005.
Summary: Uses genetic programming (GP) to combine various bibliometrics with +, *, /, and sqrt (why no -?). Found that GP trees perform better than individual metrics and slightly better than SVMs, although SVMs greatly outperform their method on some categories. Is there a way to choose the better of SVM and GP per category? The paper is poorly written.

User Interaction Design
M. Schmettow. User Interaction Design Patterns for Information Retrieval Systems.

ML Software & Toolkits

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

R. Bekkerman. MDC Documentation

Summary: The MDC Toolkit takes a contingency table and a parameters file and returns clusters.

Composing Inverse Functions to Measure Model Fitness

Friday, December 05, 2008

This article concerns a method for evaluating the fitness of content topic models and document topic models, based on the dissonance between a set of documents and the set of documents generated by composing inverse functions and applying them to the original set. A document generating function is applied to the output of a topic generating function that is applied to the original set of documents. To compare topics, one can likewise compare the original set of topics to the set of topics generated by applying a topic generating function to the documents generated by applying a document generating function to the original set of topics.

In the notation below let D be a corpus, T be a set of topics, F be a set of content topic models such that each member has as domain a subset of D and as range a subset of T, and let G be a set of topic document models such that each member has as domain a subset of T and as range a subset of D. (In other words f({d1, ...}) = {t1, ...} and g({t1, ...}) = {d1, ...}.) Let bold face represent sets and italics represent functions or singletons, thus: f represents a member of F, d a member of D, etc. Let "*" represent the dot operator when applied to sets (counting the elements the two sets share, as the discussion below makes clear) and multiplication when applied to scalars, and let |A| be a scalar representing the cardinality of the set A.

To measure the difference between d and g(f(d)) we may look at the value of:

(1) d * g(f(d))

It is arguable that a more fit f or g will result in greater values for (1). We can also compare g(f1(d)) and g(f2(d)). Under certain circumstances, it may be reasonable to assume that if:

(2) d * g(f1(d)) > d * g(f2(d))

the topic generating function f1 is more fit than f2.
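
As a minimal sketch in code (treating the set dot as intersection cardinality, which matches the reading given in the next paragraph; all names are mine):

```python
def set_dot(a, b):
    """The dot of two sets: the number of elements they share."""
    return len(a & b)

def raw_fitness(d, f, g):
    """Equation (1): d * g(f(d)) for a document set d, a topic
    generating function f, and a document generating function g."""
    return set_dot(d, g(f(d)))

# Per (2), f1 looks more fit than f2 (for a fixed g) when
#   raw_fitness(d, f1, g) > raw_fitness(d, f2, g)
```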

What (2) says is that the number of documents both in our original set and in the set of documents generated (by applying a document generating function to a topic generating function applied to that original set) is larger if we use the topic generating function f1 versus f2. As an analogy, suppose I show a bunch of books to two people, Brin and Sergey. Brin says all my books are from one set of genres and Sergey says they're from another (the genres they choose may overlap). Now suppose I take Brin's genres and Sergey's genres and separately feed them into Amazon which returns two sets of books: Brin's books and Sergey's books. If it turns out that there are only a couple of Brin's books in common with my original books, while more than half of Sergey's books are in my original set of books, I'd consider this a good reason to assume Sergey's genres were more accurate than Brin's.

Of course it could be that Sergey just got lucky with his topics, or that Brin was unlucky. It could also have been that the Amazon algorithm returned more documents for Sergey's genres and his chances were thereby improved, or that the Amazon algorithm is wrong and its incorrect results are merely misleading me into thinking Sergey's genres are relevant. Additionally, Sergey could have given me a slew of genres to improve his chances. The three problems here are (A) the size of the set of documents generated, (B) the size of the set of topics given, and (C) the fitness of the document generation algorithm.

To guard against (A), the size of the set of documents that is generated, we bias towards a large number of generated documents overlapping with our original documents and against generated documents that are not in the original set. To do this we compute the dot product between d and g(f(d)) and then normalize.

(3) (d * g(f(d)))/(|d| * |g(f(d))|)

To discount (B), the size of the set of topics generated, we normalize by the size of that set.

(4) (d * g(f(d)))/(|d| * |g(f(d))| * |f(d)|)

To account for (C), poor fitness of the document generation algorithm, we may take the sum of (4) over a set of document generating algorithms applied to the same topic generating algorithm.

(5) sum_i (d * gi(f(d)))/(|d| * |gi(f(d))| * |f(d)|)
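
Continuing the sketch from above (assumptions as before), equations (4) and (5) become:

```python
def normalized_fitness(d, f, g):
    """Equation (4): the set dot normalized by |d|, |g(f(d))|, and
    |f(d)| to guard against problems (A) and (B)."""
    topics = f(d)
    docs = g(topics)
    denom = len(d) * len(docs) * len(topics)
    return set_dot(d, docs) / denom if denom else 0.0

def topic_model_fitness(d, f, doc_generators):
    """Equation (5): sum the (4)-scores over several document
    generating algorithms so that no single generator's quirks
    dominate (problem (C))."""
    return sum(normalized_fitness(d, f, g) for g in doc_generators)
```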

This representation has given us a way to compare the accuracy of Brin's and Sergey's book genres that is robust both against gaming by giving us lots of genres and against the quality of the algorithm that chooses which books are in the genres we are given. So, back to content topic models: we can take a set of topic models F and assign a fitness to each one using equation (5) and a subset of the document models in G. We can then generate a fitness for each document model in G, weighting by the fitness of the topic models used. Define s(f) (or s(g)) to be the fitness assigned to f (or g) by equation (5) or by one of (6) or (7) below.

(6) sum_i s(fi) * (d * g(fi(d)))/(|d| * |g(fi(d))| * |fi(d)|)

(7) sum_i s(gi) * (d * gi(f(d)))/(|d| * |gi(f(d))| * |f(d)|)

Equations (6) and (7) can be repeatedly evaluated to tune the fitness assigned to the members of F and G.
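
One way such a feedback loop might look in the sketch's terms (the initial scores and the absence of between-round normalization are my assumptions; the post leaves both open):

```python
def iterate_fitness(d, topic_models, doc_generators, rounds=5):
    """Alternate (7) and (6): re-score each topic model f using the
    current generator scores as weights, then re-score each document
    generator g using the new topic-model scores."""
    s_f = {f: 1.0 for f in topic_models}
    s_g = {g: 1.0 for g in doc_generators}
    for _ in range(rounds):
        s_f = {f: sum(s_g[g] * normalized_fitness(d, f, g)
                      for g in doc_generators)
               for f in topic_models}
        s_g = {g: sum(s_f[f] * normalized_fitness(d, f, g)
                      for f in topic_models)
               for g in doc_generators}
    return s_f, s_g
```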

The abstraction of the basic idea is quite simple: use a consortium of functions, weighted by their fitness, to evaluate the fitness of another function. There are many variations and extensions that could be used. We are still developing this idea, and more exploration regarding the use of a feedback loop to tune fitness is in order.

Deficiencies of non-Non-linear Learning

Tuesday, November 25, 2008

Non-linear learning is at the heart of the algorithms that make Helioid's search effective. To examine non-linear learning, let us first define the term in contrast with linear learning. (I would like to briefly note that these definitions, though culled from the machine learning and logic literature, are primarily my own, and I expect some to disagree with them.)

Linear learning occurs when each step in a procedure depends solely on deductions made from the previous steps. Furthermore, the deductions must be valid according to some set of rules. To take the example of modus ponens from formal logic, suppose we know "it is daytime" and that 'if "it is daytime" then "we can see the sun"'. Given a standard model theory we may deduce "we can see the sun" by detachment. In search, the reasonable assumption is made that 'if "Item A appears above Item B in search results" then "Item A is more relevant than Item B"', and that Item A should be displayed with more prominence to the user, normally closer to the top of the list of search results. This line of reasoning is perfectly acceptable, but there are times when it quickly runs into trouble for the searcher.

Suppose I am writing music visualization software and I need information about the Processing graphics library's particle system. My first query might be "processing graphics library particle system". Try it on "Google":http://www.google.com/search?q=processing+graphics+library+particle+system. The results I get aren't bad for what I was looking for. The 6th entry links to a "page on processing.org":http://processing.org/reference/libraries/ from which I can find another link to Jeffrey Traer Bernstein's "page":http://www.cs.princeton.edu/~traer/physics/, which has a particle system library for Processing.

But, going back to Google's results, a look through the top 500 finds no direct link to Traer's page! Looking through the results, the problem may be phrase-based: there are lots of results for "graphics" and "library" and "particle" and "system". Being a well-trained Google user, I'd wrap "particle system" in quotes and "re-search":http://www.google.com/search?q=processing+graphics+library+%22particle+system%22. I also might switch "library" to "language" and try to find a phrase I could combine "processing" with.

But still I'd have no luck getting Traer's page within the top 500 results, and most of the results would be irrelevant to my query (the two queries linked above were tested on November 24, 2008). One problem is that I can't interactively say, "this type of result is irrelevant, get rid of it." I could add some exclusionary keywords or phrases to my query, but that would require a good amount of guesswork on my part. I would need to find the right keywords or key-phrases to exclude, and there is no guarantee I would be able to do this effectively or even helpfully.

Another problem is that I can't say, "this type is on the right track, give me more of these," without making adjustments to my query. And when I make those adjustments, the initial work I've done in sorting through the results is lost when my new results are shown.

Google's failure to present the user with relevant results stems from an inability to capture user feedback and from the lack of a method for users to provide it. It also stems in part from the assumption that a user wants to go through their search results in a linear fashion. Here are a couple of ways in which Helioid's non-linear approach to searching helps solve this problem.

1. Let the user tell the search engine what they want.

The user knows when a result is irrelevant, and they should be able to get rid of it. The user also knows when a result is relevant, and they should be able to bias the algorithm towards results of this type. Some users may know when an entire class of results is irrelevant (or relevant); they should be able to bias their search results against (or for) showing results that are members of this class. In the example given above I know I don't care about physics papers, so, using Helioid, I can tell the search engine my opinion and it will bias the results it returns against them.

This is a non-linear method because the search results (the deductions) are being modified in a way that is inconsistent with the rules by which they were generated (the theory).
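
Helioid's actual ranking machinery isn't described here, but classical Rocchio-style relevance feedback gives the flavor of "bias towards/against marked results" (a sketch under that assumption, with vector-space document and query representations):

```python
import numpy as np

def rocchio_update(query_vec, relevant, irrelevant,
                   alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query vector towards the centroid of results the user
    marked relevant and away from the centroid of results marked
    irrelevant; re-ranking with the new vector biases future results."""
    q = alpha * np.asarray(query_vec, dtype=float)
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(irrelevant):
        q = q - gamma * np.mean(irrelevant, axis=0)
    return q
```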

2. Have the search engine learn from what the user does.

In many cases it may be unnecessary for the user to explicitly tell the search engine what they want. Going back to our "Processing library" example, after the user is seen ignoring all the pages related to academic papers, it may be appropriate to assume the user is uninterested in them, and perhaps even to ask the user if they'd like them removed.

For the same reasons as above this is another example of non-linear learning.

Assuming that the algorithm knows the answer tout court is a dreadful mistake. The most important input to the algorithm should be the desires of the user as they are expressed in keywords and through actions. If a user is found scrolling through multiple pages, this is an input, and the search results should be adjusted to take this new knowledge into account. Ignoring user input is the same as not adjusting your reefing after the wind shifts: you'll go off-course.