Literature Review 2008 - 2009

Friday, November 06, 2009

Machine Learning

G. Lebanon, Y. Mao, and J. Dillon. The Locally Weighted Bag of Words Framework for Document Representation. Journal of Machine Learning Research 8 (Oct):2405-2441, 2007.

Summary: the lowbow framework captures topic trends in a document by applying a "local smoothing kernel to smooth the original word sequence temporally." (3)  This provides a metric for the distance between two documents: the integral of the path distance over the documents (11) or a diffusion kernel (11).  Dynamic time warping can be used to reduce the effects of word order (25), although this proved negligibly helpful in experiments, perhaps because of a homogeneous corpus or the robustness of the algorithm (26).  Linear differential operators can be applied to the curve to find topic trends (tangent vector field) and document variability (integral of the curvature tensor norm) (12).
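
A minimal sketch of the local smoothing idea, not the authors' implementation. It assumes a Gaussian kernel over normalized word positions, simple additive smoothing (the constant c), and the Fisher geodesic distance arccos(sum_i sqrt(p_i q_i)) between local histograms; the integral over the document is approximated by averaging over a grid. All function names are made up for illustration.

```python
import math

def lowbow_histogram(words, vocab, mu, sigma=0.1, c=0.01):
    """Local word histogram of a document at normalized position mu in [0, 1]."""
    n = len(words)
    # Place word i at the centre of its slot in [0, 1].
    positions = [(i + 0.5) / n for i in range(n)]
    # Gaussian kernel weights centred at mu (assumed kernel shape).
    weights = [math.exp(-0.5 * ((p - mu) / sigma) ** 2) for p in positions]
    total = sum(weights)
    hist = {v: 0.0 for v in vocab}
    for word, weight in zip(words, weights):
        hist[word] += weight / total
    # Additive smoothing keeps every probability strictly positive.
    z = 1.0 + c * len(vocab)
    return {v: (hist[v] + c) / z for v in vocab}

def fisher_distance(p, q):
    """Geodesic distance on the simplex under the Fisher metric (assumed here)."""
    s = sum(math.sqrt(p[v] * q[v]) for v in p)
    return math.acos(min(1.0, s))  # clamp guards against rounding above 1

def lowbow_distance(doc_a, doc_b, vocab, steps=20, sigma=0.1):
    """Approximate the integrated pointwise distance between two lowbow curves."""
    mus = [(k + 0.5) / steps for k in range(steps)]
    return sum(
        fisher_distance(lowbow_histogram(doc_a, vocab, m, sigma),
                        lowbow_histogram(doc_b, vocab, m, sigma))
        for m in mus
    ) / steps

# Two toy documents with identical bags of words but opposite word order.
vocab = ["a", "b"]
doc1 = ["a", "a", "a", "b", "b", "b"]
doc2 = ["b", "b", "b", "a", "a", "a"]
print(lowbow_distance(doc1, doc2, vocab))
```

The two toy documents are indistinguishable as plain bags of words, so any nonzero distance here comes from the local, position-aware histograms.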


D. Blei and J. Lafferty. A correlated topic model of Science. Annals of Applied Statistics. 1:1 17-35. April 2007. (Shorter version from NIPS 18.)

Summary: The CTM captures correlation between topics, which LDA cannot.  The authors apply the model to the Science journal corpus to generate a topic model, use a graphing algorithm to describe connections between latent topics, and associate each document with a latent vector of topics, which lets a user browse documents by the topic(s) they are related to.
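
A rough sketch of how the CTM can express topic correlation: topic proportions are drawn from a logistic normal, i.e. a Gaussian vector eta ~ N(mu, Sigma) pushed through a softmax, so off-diagonal entries of Sigma let topics co-occur. The helper names and the toy covariance below are assumptions for illustration; inference, which is the hard part, is not shown.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L * L^T = a, for a symmetric positive-definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def sample_topic_proportions(mu, sigma, rng):
    """Draw theta from a logistic normal: eta ~ N(mu, sigma), theta = softmax(eta)."""
    L = cholesky(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    eta = [m + sum(L[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mu)]
    top = max(eta)  # subtract the max for numerical stability
    expd = [math.exp(e - top) for e in eta]
    total = sum(expd)
    return [e / total for e in expd]

# Toy covariance: topics 0 and 1 are strongly correlated, topic 2 is independent.
mu = [0.0, 0.0, 0.0]
sigma = [[1.0, 0.9, 0.0],
         [0.9, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
theta = sample_topic_proportions(mu, sigma, random.Random(0))
print(theta)
```

LDA's Dirichlet prior has no comparable parameter for pairwise topic covariance, which is the contrast drawn in the summary.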


G. Kumaran and J. Allan. Effective and Efficient User Interaction for Long Queries, Proceedings of the 31st Annual International ACM SIGIR Conference, pp. 11-18. July 2008.

Summary: Selective interactive reduction and expansion (SIRE) of queries is used so users can interactively narrow returned search results.


X. Yi and J. Allan. Evaluating Topic Models for Information Retrieval, to appear in the Proceedings of ACM 17th Conference on Information and Knowledge Management, Napa Valley, CA, October 26-30, 2008.
Summary: From the abstract, "(1) topic models are effective for document smoothing; (2) more elaborate topic models that capture topic dependencies provide no additional gains; (3) smoothing documents by using their similar documents is as effective as smoothing them by using topic models; (4) topics discovered on the whole corpus are too coarse-grained to be useful for query expansion. Experiments to measure topic models' ability to predict held-out likelihood confirm past results on small corpora, but suggest that simple approaches to topic model are better for large corpora."

D. Mimno, A. McCallum. Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression, To appear in UAI, 2008.
Summary: The authors propose Dirichlet-multinomial regression (DMR), which generates a prior distribution over topics that is specific to each document and based upon its observed features.  In the experiments they predict topics from a document's authors, and also use the prior over topics to predict the author of a document.  Performance is impressive in comparison with the LDA and AT models (5).  The sampling phase in DMR is no more complex (slow) than in LDA (7).  The model could potentially be extended to a hierarchical model (7).
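
A small sketch of the document-specific prior, assuming the log-linear form alpha_t = exp(x . lambda_t), where x is the document's feature vector and lambda_t a per-topic weight vector. The feature layout (a bias term plus author indicators) and all numbers are hypothetical, and learning the lambdas is not shown.

```python
import math

def dmr_prior(features, lambdas):
    """Dirichlet parameters for one document: alpha_t = exp(x . lambda_t)."""
    return [
        math.exp(sum(x * w for x, w in zip(features, lam)))
        for lam in lambdas
    ]

# Hypothetical features: [bias, written_by_author_A, written_by_author_B].
lambdas = [
    [-1.0, 2.0, 0.0],   # topic 0 is favoured by author A
    [-1.0, 0.0, 2.0],   # topic 1 is favoured by author B
]
alpha_doc_a = dmr_prior([1.0, 1.0, 0.0], lambdas)
alpha_doc_b = dmr_prior([1.0, 0.0, 1.0], lambdas)
print(alpha_doc_a, alpha_doc_b)
```

Given these alphas, sampling proceeds as in LDA with a per-document prior, which is consistent with the note that the sampling phase is no slower.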

Topic and Role Discovery in Social Networks with Experiments on Enron and Academic Email. Andrew McCallum, Xuerui Wang and Andres Corrada-Emmanuel. Journal of Artificial Intelligence Research (JAIR), 2007.
Summary: Presents the Author-Recipient-Topic (ART) model, in which each topic is a multinomial distribution over words and each author-recipient pair has a multinomial distribution over topics. People can be clustered based upon topics to elicit roles independent of the network of people they are connected to (2). The Role-Author-Recipient-Topic (RART) model generates explicit roles and conditions the topics upon them (3).  A potential extension is to incorporate temporal information into the model (7).  The model can generate topic relevance measures for an author relative to the roles they occupy (20).
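
A sketch of the ART generative story as described above: for each word, pick one of the message's recipients, draw a topic from the author-recipient pair's topic distribution, then draw a word from that topic. Choosing the recipient uniformly is an assumption here, and the names, people, and probabilities are invented for illustration.

```python
import random

def generate_message(author, recipients, n_words, theta, phi, rng):
    """Generate words for one message under a simplified ART-style process."""
    words = []
    for _ in range(n_words):
        # Assumed: the recipient for each word is chosen uniformly.
        recipient = rng.choice(recipients)
        topic_dist = theta[(author, recipient)]
        topic = rng.choices(list(topic_dist), weights=list(topic_dist.values()))[0]
        word_dist = phi[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words

# theta: topic distribution per (author, recipient) pair; phi: word distribution per topic.
theta = {
    ("ann", "bob"): {"scheduling": 0.8, "legal": 0.2},
    ("ann", "cat"): {"scheduling": 0.1, "legal": 0.9},
}
phi = {
    "scheduling": {"meeting": 0.6, "friday": 0.4},
    "legal": {"contract": 0.7, "review": 0.3},
}
print(generate_message("ann", ["bob", "cat"], 8, theta, phi, random.Random(1)))
```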

G. Xue, et al. Implicit Link Analysis for Small Web Search. 2003.
Summary: Differentiates navigational links, recommendation links, and implicit recommendation links that are based on usage patterns.  Implicit links correspond to log entries, an implicit path is a path through implicit links, an explicit path is a path from one implicit link to another with explicit links in between.  The weight of an edge in the implicit link graph is its normalized support (3).  Experimentally, the higher the support the more precise the implicit link, and an implicit link is more likely to be a recommendation link than an explicit link (4).  In experiments using pages ranked by people as the ideal, implicit PageRank showed considerable improvements over full text, explicit PageRank, DirectHit, and modified-HITS algorithms (5).  Shows that in small webs links do not necessarily meet the requirement of being recommendations that is needed for PageRank to function effectively.
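
A hedged sketch of turning access logs into weighted implicit links. The notes do not say how support is counted or normalized, so this assumes that support is the number of times one page is followed by another within a small window in a session, and that weights are normalized per source page so they could feed a PageRank-style walk. Function and parameter names are hypothetical.

```python
from collections import Counter, defaultdict

def implicit_link_weights(sessions, window=2, min_support=1):
    """Count ordered page pairs within a window and normalize per source page."""
    support = Counter()
    for pages in sessions:
        for i, src in enumerate(pages):
            # Pair src with each of the next `window` pages in the session.
            for dst in pages[i + 1 : i + 1 + window]:
                if src != dst:
                    support[(src, dst)] += 1
    # Drop rare pairs, since higher support is reported to give more precise links.
    kept = {pair: s for pair, s in support.items() if s >= min_support}
    out_total = defaultdict(int)
    for (src, _dst), s in kept.items():
        out_total[src] += s
    return {(src, dst): s / out_total[src] for (src, dst), s in kept.items()}

sessions = [["home", "courses", "cs101"], ["home", "courses"], ["home", "cs101"]]
print(implicit_link_weights(sessions, window=1))
```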

C. Wang, D. Blei, and D. Heckerman. Continuous time dynamic topic models. In Uncertainty in Artificial Intelligence (UAI), 2008.

X. Wang, A. McCallum. Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. KDD, August 2006.

A. Gruber, M. Rosen-Zvi, Y. Weiss. Latent Topic Models for Hypertext. 2008.

An Introduction to Conditional Random Fields for Relational Learning


Medical


G. Luo, C. Tang, H. Yang, and X. Wei. MedSearch: A Specialized Search Engine for Medical Information Retrieval. Proc. 2008 ACM Conf. on Information and Knowledge Management (CIKM'08), industry session, Napa Valley, CA, Oct. 2008, pp. ?-?.
Summary: Medical search is different because users search in a more exploratory way, looking through related information.  Queries are generally longer; Google has a 32-word limit, and other engines generally have length limits as well.  MedSearch drops unimportant terms, returns diverse search results, and suggests medical phrases relevant to the query.  Uses an established hierarchical ontology of medical terms (MeSH) to identify and rank medical phrases in the returned top web pages (3).

G. Luo, C. Tang. On Iterative Intelligent Medical Search. Proc. 2008 Int. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'08), Singapore, July 2008, pp. 3-10.
Summary: First, relevant symptoms and signs are automatically suggested based on the searcher's description of his situation. Second, instead of taking for granted the searcher's answers to the questions, iMed ranks and recommends alternative answers according to their likelihoods of being the correct answers. Third, related MeSH medical phrases are suggested to help the searcher refine his situation description.

Enterprise

Hawking, D. (2004). Challenges in Enterprise Search. In Proc. Fifteenth Australasian Database Conference (ADC2004), Dunedin, New Zealand. CRPIT, 27. Schewe, K.-D. and Williams, H. E., Eds. ACS. 15-24.
Summary: Review of enterprise search problems, including "3.5 Estimating importance of non-web documents" (6), to which topic-based probability linking seems a good solution.  Exploiting context is also mentioned (3.6, 6).  The proposal to convert documents to HTML so that they can be analyzed as normal is criticized because the converted documents would lack link information.

Parsing

Bayesian Modeling of Dependency Trees Using Hierarchical Pitman-Yor Priors. Hanna Wallach, Charles Sutton, Andrew McCallum. In International Conference on Machine Learning, Workshop on Prior Knowledge for Text and Language Processing. (ICML WS), 2008.
Summary: Describes two hierarchical models for Bayesian dependency trees and uses them to parse sentences.  Uses latent variables to cluster parent-child dependencies, resulting in improved parse accuracy.

Bibliometrics

B. Zhang, et al.  Intelligent Fusion of Structural and Citation-Based Evidence for Text Classification. Annual ACM Conference on Research and Development in Information Retrieval.  2005.
Summary: Uses genetic programming (GP) to combine various bibliometrics with +, *, /, sqrt (why no -?).  Found that GP trees perform better than individual metrics and slightly better than SVM, although SVM greatly outperforms their method on some categories.  Is there a way to choose the better of SVM and GP per category?  Paper is poorly written.
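
To make the GP representation concrete, here is a minimal sketch of evaluating one candidate tree that combines similarity metrics with the four operators listed above. The metric names, the example tree, and the protected division (returning 1.0 on a zero denominator, a common GP convention) are assumptions, not details taken from the paper; the evolutionary search itself is not shown.

```python
import math

def evaluate(tree, metrics):
    """Evaluate a GP expression tree over named metric values."""
    if isinstance(tree, str):          # leaf: look up a metric by name
        return metrics[tree]
    if isinstance(tree, (int, float)): # leaf: numeric constant
        return float(tree)
    op, *args = tree
    vals = [evaluate(arg, metrics) for arg in args]
    if op == "+":
        return vals[0] + vals[1]
    if op == "*":
        return vals[0] * vals[1]
    if op == "/":
        # Protected division avoids crashing on a zero denominator.
        return vals[0] / vals[1] if vals[1] != 0 else 1.0
    if op == "sqrt":
        return math.sqrt(abs(vals[0]))
    raise ValueError(f"unknown operator: {op}")

# Hypothetical evidence scores between a document and a category.
metrics = {"cocitation": 0.25, "cosine": 0.5, "title_sim": 0.4}
tree = ("+", ("sqrt", "cocitation"), ("*", "cosine", "title_sim"))
print(evaluate(tree, metrics))
```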

User Interaction Design

M. Schmettow. User Interaction Design Patterns for Information Retrieval Systems.

ML Software & Toolkits

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

R. Bekkerman. MDC Documentation

Summary: The MDC Toolkit takes a contingency table and parameters file and returns clusters.