Composing Inverse Functions to Measure Model Fitness

Friday, December 05, 2008

This article concerns a method for evaluating the fitness of content topic models and topic document models based on the dissonance between a set of documents and the set of documents generated by composing the two kinds of function and applying the composition to the original set. To compare documents, a document generating function is applied to the output of a topic generating function that was applied to the original set of documents. To compare topics, one can look at the original set of topics compared to the set of topics generated by applying a topic generating function to the documents generated by applying a document generating function to the original set of topics.

In the notation below let D be a corpus, T be a set of topics, F be a set of content topic models such that each member has domain a subset of D and range a subset of T, and let G be a set of topic document models such that each member has domain a subset of T and range a subset of D. (In other words f({d1, ...}) = {t1, ...} and g({t1, ...}) = {d1, ...}.) Let bold face represent sets and italics represent functions or singletons, thus: f represents a member of F, d a member of D, etc. Let "*" represent the dot operator when applied to sets (the cardinality of their intersection, so A * B = |A ∩ B|) and multiplication when applied to scalars; |A| is a scalar representing the cardinality of the set A.
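To make the notation concrete, here is a minimal sketch in Python. The lookup tables and the names set_dot, f, and g are invented for illustration; real models would be learned rather than hard-coded.

```python
def set_dot(a: set, b: set) -> int:
    """The "*" operator applied to sets: the cardinality of the intersection."""
    return len(a & b)

# Toy models: a content topic model f maps document sets to topic sets,
# and a topic document model g maps topic sets back to document sets.
DOC_TOPICS = {"d1": {"t1"}, "d2": {"t1", "t2"}, "d3": {"t3"}}
TOPIC_DOCS = {"t1": {"d1", "d2"}, "t2": {"d2", "d4"}, "t3": {"d3", "d5"}}

def f(docs: set) -> set:
    return set().union(*(DOC_TOPICS.get(x, set()) for x in docs))

def g(topics: set) -> set:
    return set().union(*(TOPIC_DOCS.get(x, set()) for x in topics))

d = {"d1", "d2"}
set_dot(d, g(f(d)))  # the quantity in expression (1) below; here it is 2
```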

To measure the difference between a document set d and g(f(d)) we may look at the value of:

(1) d * g(f(d))

It is arguable that a more fit f or g will result in greater values for (1). We can also look at a comparison between g(f1(d)) and g(f2(d)). Under certain circumstances, it may be reasonable to assume that if:

(2) d * g(f1(d)) > d * g(f2(d))

then the topic generating function f1 is more fit than f2.

What (2) says is that the number of documents both in our original set and in the set of documents generated (by applying a document generating function to a topic generating function applied to that original set) is larger if we use the topic generating function f1 versus f2. As an analogy, suppose I show a bunch of books to two people, Brin and Sergey. Brin says all my books are from one set of genres and Sergey says they're from another (the genres they choose may overlap). Now suppose I take Brin's genres and Sergey's genres and separately feed them into Amazon which returns two sets of books: Brin's books and Sergey's books. If it turns out that there are only a couple of Brin's books in common with my original books, while more than half of Sergey's books are in my original set of books, I'd consider this a good reason to assume Sergey's genres were more accurate than Brin's.
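Making the arithmetic of the analogy explicit, with all of the book sets below invented to match the story:

```python
my_books     = set(range(10))                        # the 10 books I showed them
brin_books   = {0, 1} | set(range(100, 148))         # 50 books, only 2 of mine
sergey_books = set(range(9)) | set(range(200, 206))  # 15 books, 9 of mine

print(len(my_books & brin_books))    # 2
print(len(my_books & sergey_books))  # 9, more than half of Sergey's books
```

Per inequality (2), 9 > 2 is the reason to favor Sergey's genres.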

Of course it could be that Sergey just got lucky with his topics, or that Brin was unlucky. It could also have been that the Amazon algorithm returned more documents for Sergey's genres and his chances were thereby improved, or that the Amazon algorithm is wrong and its incorrect results are merely misleading me into thinking Sergey's genres are relevant. Additionally, Sergey could have given me a slew of genres to improve his chances. The three problems here are (A) the size of the set of documents generated, (B) the size of the set of topics given, and (C) the fitness of the document generation algorithm.

To guard against (A), the size of the set of documents that is generated, we bias towards a large number of generated documents overlapping with our original documents and against generated documents that are not in the original set. To do this we compute the dot product between d and g(f(d)) and then normalize:

(3) (d * g(f(d)))/(|d| * |g(f(d))|)
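Continuing the Python sketch from above, expression (3) might be computed as follows; score3 is a hypothetical name, and empty sets are scored zero here to avoid dividing by zero:

```python
def score3(d: set, f, g) -> float:
    """Expression (3): overlap normalized by the sizes of both document sets."""
    generated = g(f(d))
    if not d or not generated:
        return 0.0
    return len(d & generated) / (len(d) * len(generated))
```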

To discount (B), the size of the set of topics generated, we normalize by the size of that set:

(4) (d * g(f(d)))/(|d| * |g(f(d))| * |f(d)|)
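The same sketch extends to expression (4) with one extra factor in the denominator:

```python
def score4(d: set, f, g) -> float:
    """Expression (4): expression (3) further divided by |f(d)|."""
    topics = f(d)
    generated = g(topics)
    if not d or not topics or not generated:
        return 0.0
    return len(d & generated) / (len(d) * len(generated) * len(topics))
```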

To account for (C), poor fitness of the document generation algorithm, we may take the sum of (4) over a set of document generating algorithms applied to the same topic generating algorithm:

(5) sum_i (d * gi(f(d)))/(|d| * |gi(f(d))| * |f(d)|)
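In the running sketch, reading (5) as the sum over document generating functions gi described above:

```python
def score5(d: set, f, gs) -> float:
    """Expression (5): sum of (4) over several document generators g_i,
    so that no single g dominates the fitness assigned to f."""
    return sum(score4(d, f, g) for g in gs)
```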

This representation has given us a way to compare the accuracy of Brin's and Sergey's book genres that is robust against both gaming (handing us a slew of genres) and variation in the quality of the algorithm that chooses which books belong to the genres we are given. So, back to content topic models: we can take a set of topic models F and assign a fitness to each one using equation (5) and a subset of the document models in G. We can then generate a fitness for each document model in G, weighting each topic model according to its fitness. Define s(f) (or s(g)) to be the fitness assigned to f (or g) by equation (5) or by one of (6) or (7) below.

(6) sum_i s(fi) * (d * g(fi(d)))/(|d| * |g(fi(d))| * |fi(d)|)

(7) sum_i s(gi) * (d * gi(f(d)))/(|d| * |gi(f(d))| * |f(d)|)
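Continuing the sketch, with the current fitness values s kept in a dictionary keyed by model:

```python
def score6(d: set, fs, g, s) -> float:
    """Expression (6): fitness of a document model g, weighting each
    topic model f_i by its current fitness s(f_i)."""
    return sum(s[f] * score4(d, f, g) for f in fs)

def score7(d: set, f, gs, s) -> float:
    """Expression (7): fitness of a topic model f, weighting each
    document model g_i by its current fitness s(g_i)."""
    return sum(s[g] * score4(d, f, g) for g in gs)
```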

Equations (6) and (7) can be evaluated repeatedly to tune the fitness assigned to the members of F and G.
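One possible reading of that repeated evaluation, sketched under the same assumptions: start every model at uniform fitness, then alternate between re-scoring the topic models with (7) and the document models with (6). The renormalization step is an assumption added here to keep the scores comparable across rounds; nothing above prescribes it.

```python
def tune(d: set, fs, gs, rounds: int = 10) -> dict:
    """A sketch of the feedback loop: alternate (7) and (6), then
    renormalize so the fitness values stay comparable across rounds."""
    s = {m: 1.0 for m in list(fs) + list(gs)}
    for _ in range(rounds):
        for f in fs:
            s[f] = score7(d, f, gs, s)
        for g in gs:
            s[g] = score6(d, fs, g, s)
        total = sum(s.values()) or 1.0
        s = {m: v / total for m, v in s.items()}
    return s
```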

The abstraction of the basic idea is quite simple: use a consortium of functions, weighted by their fitness, to evaluate the fitness of another function. There are many variations and extensions that could be used. We are still developing this idea, and more exploration of the feedback loop used to tune fitness is in order.