Business Intelligence must be Automated

Sunday, December 06, 2009

Business intelligence is now taking off, with companies like Endeca and NetSuite fighting for market share alongside more established companies such as IBM and Microsoft. The idea of real-time integration of all types of business data, so that one can make more informed predictions and decisions, is simple and obvious. Yet pulling this off in a business environment, with enough data format acronyms to make a reverse engineer cringe, is complex.

The November 26th FT article The final frontier of business advantage shares experiences indicating that it is "the automatic roll-up of data to a central repository" where most missteps in implementing a business intelligence plan are made. As a result, many business intelligence initiatives turn into costly time sinks. One company mentioned in the article says that its "answer for its customers is to convert all the data to one consistent type and store it in one repository." The problems we see with this approach are efficiency and transparency.

For this approach to work the data must be converted, and they probably mean ALL the data. Then, I assume, all new data must be collected only in this new format. This process is extremely likely to be costly and onerous, but suppose it's not. We've converted decades of business data to a new format and are placing all our new data in this format. But what if we merge with another company? How do we integrate their data? The same time-consuming way? And what if the format is insufficient? In practice, any format in use eventually becomes insufficient, and assuming otherwise is a dangerous path to take.

Helioid's business intelligence tools offer a different solution. We build a proxy between proprietary internal company data and our tools, integrating data in a transparent manner. That is, a company can keep its data in whatever format it likes and the data will be seamlessly integrated into our products and services. Furthermore, we use common open source formats so that other companies or organizations can integrate your data with their services and formats.

Another interesting quote in the article comes from James McGeever: "I believe that if the same piece of data exists in two places then one will be wrong." Hmm... what our algorithms do is evaluate both pieces of data and, based on how they fit into the larger structure of an organization's information architecture, assign each a probability of being correct (or wrong, whichever measure is most useful in the situation at hand). More importantly, it is assumed that both pieces of data, being information, will be useful in making the company more efficient and more intelligently run. Perhaps the duplication reveals a collection process that was inefficient in the past; converting the data into one format erases this evidence of a former problem, a problem that could be investigated and learned from.
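As a toy sketch of this idea (the record values and the smoothing scheme here are illustrative assumptions, not our actual algorithm), conflicting copies of a field can be scored by how often each agrees with values observed in related records:

```python
from collections import Counter

def duplicate_confidence(candidates, related_values):
    """Score each conflicting copy of a field by how often it agrees
    with values observed in related records. Laplace smoothing gives
    an unseen value a nonzero probability of being correct."""
    counts = Counter(related_values)
    total = sum(counts.values())
    return {c: (counts[c] + 1) / (total + len(candidates))
            for c in candidates}

# Two copies of a customer address disagree; related shipping records
# mostly agree with the first copy.
scores = duplicate_confidence(
    ["42 Main St", "42 Main Street"],
    ["42 Main St", "42 Main St", "42 Main Street"])
```

Rather than discarding the less likely copy, both scores are kept: a persistent disagreement is itself evidence about the collection process that produced it.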

Improved business intelligence offers the opportunity to improve concrete results, like sales or revenue; however, the path to these improvements must enable companies to be self-reflective with respect to their own business processes. A new format and a dashboard that lets you see trends in sales data are the steps being taken in current business intelligence products. These will allow for better decision making and will produce concrete results. The algorithms we develop not only improve decisions by making data available but also address how to improve the use of this data and the dashboard. They address continually improving companies' workflow.

Literature Review 2008 - 2009

Friday, November 06, 2009

Machine Learning

G. Lebanon, Y. Mao, and J. Dillon. The Locally Weighted Bag of Words Framework for Document Representation. Journal of Machine Learning Research 8 (Oct):2405-2441, 2007.

Summary: the lowbow framework captures topic trends in a document by applying a "local smoothing kernel to smooth the original word sequence temporally." (3)  This provides a metric for the distance between two documents: the integrand of the path distance over the documents (11) or a diffusion kernel (11).  Dynamic time warping can be used to reduce the effects of word order (25), although this proved negligibly helpful in experiments, perhaps because of a homogeneous corpus or the robustness of the algorithm (26).  Linear differential operators can be applied to the curve to find topic trends (tangent vector field) and document variability (integrand of the curvature tensor norm) (12).
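A minimal sketch of the smoothing idea (using a plain Gaussian kernel over word positions; not the authors' exact kernel or normalization):

```python
import math

def lowbow_curve(words, vocab, sigma=1.0):
    """At each position t, represent the document as a locally weighted
    bag of words: each occurrence at position i contributes weight
    exp(-(i-t)^2 / 2*sigma^2), normalized so every point of the curve
    lies on the vocabulary simplex."""
    curve = []
    for t in range(len(words)):
        weights = dict.fromkeys(vocab, 0.0)
        for i, w in enumerate(words):
            weights[w] += math.exp(-((i - t) ** 2) / (2 * sigma ** 2))
        z = sum(weights.values())
        curve.append({w: c / z for w, c in weights.items()})
    return curve

# A tiny document whose topic drifts from "cats" to "dogs": the curve
# shifts mass from one word to the other as t advances.
curve = lowbow_curve(["cats", "cats", "dogs", "dogs"], {"cats", "dogs"})
```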

D. Blei and J. Lafferty. A Correlated Topic Model of Science. Annals of Applied Statistics, 1(1):17-35, April 2007. (A shorter version appeared in NIPS 18.)

Summary: The CTM is able to capture correlation between topics (unlike LDA).  Applied algorithm to the Science journal corpus to generate a topic model.  Used a graphing algorithm to describe connections between latent topics.  Associated each document with a latent vector of topics, which allows a user to browse documents by topic(s) they are related to.

G. Kumaran and J. Allan. Effective and Efficient User Interaction for Long Queries, Proceedings of the 31st Annual International ACM SIGIR Conference, pp. 11-18. July 2008.

Summary: Selective interactive reduction and expansion (SIRE) of queries is used so users can interactively narrow returned search results.

X. Yi and J. Allan. Evaluating Topic Models for Information Retrieval. In Proceedings of the ACM 17th Conference on Information and Knowledge Management (CIKM'08), Napa Valley, CA, October 26-30, 2008.
Summary: From the abstract: "(1) topic models are effective for document smoothing; (2) more elaborate topic models that capture topic dependencies provide no additional gains; (3) smoothing documents by using their similar documents is as effective as smoothing them by using topic models; (4) topics discovered on the whole corpus are too coarse-grained to be useful for query expansion. Experiments to measure topic models' ability to predict held-out likelihood confirm past results on small corpora, but suggest that simple approaches to topic model are better for large corpora."

D. Mimno, A. McCallum. Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression, To appear in UAI, 2008.
Summary: The authors propose Dirichlet-multinomial regression (DMR), which generates a prior distribution over topics specific to each document, based upon its observed features.  In the results they predict topics from authors' documents and also define a prior over topics, using it to predict the authors of documents.  Impressive performance in comparison with LDA and AT models (5).  The sampling phase in DMR is no more complex (slow) than LDA's (7).  Could potentially be extended to a hierarchical model (7).

Topic and Role Discovery in Social Networks with Experiments on Enron and Academic Email. Andrew McCallum, Xuerui Wang and Andres Corrada-Emmanuel. Journal of Artificial Intelligence Research (JAIR), 2007.
Summary: Presents the Author-Recipient-Topic (ART) model, in which each topic is a multinomial distribution over words and each (author, recipient) pair is a multinomial distribution over topics. People can be clustered based upon topics to elicit roles independent of the network of people they are connected to (2). Role-Author-Recipient-Topic (RART) generates explicit roles and conditions the topics based upon them (3).  Potential extension: incorporate temporal information into the model (7).  Can generate topic relevance measures for an author relative to the roles they occupy (20).

G. Xue, et al. Implicit Link Analysis for Small Web Search. 2003.
Summary: Differentiates navigational links, recommendation links, and implicit recommendation links based on usage patterns.  Implicit links correspond to log entries; an implicit path is a path through implicit links; an explicit path is a path from one implicit link to another with explicit links in between.  The weight of an edge in the implicit link graph is its normalized support (3).  Experimentally, the higher the support the more precise the implicit link, and an implicit link is more likely to be a recommendation link than an explicit link is (4).  In experiments using pages ranked by people as the ideal, implicit PageRank showed considerable improvements over full text, explicit PageRank, DirectHit, and modified-HITS algorithms (5).  Shows that in small webs links do not necessarily meet the requirement of being recommendations that is needed for PageRank to function effectively.

C. Wang, D. Blei, and D. Heckerman. Continuous time dynamic topic models. In Uncertainty in Artificial Intelligence (UAI), 2008. (PDF).

X. Wang, A. McCallum. Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. KDD, August 2006.

A. Gruber, M. Rosen-Zvi, Y. Weiss. Latent Topic Models for Hypertext. 2008.

An Introduction to Conditional Random Fields for Relational Learning


G. Luo, C. Tang, H. Yang, and X. Wei. MedSearch: A Specialized Search Engine for Medical Information Retrieval. Proc. 2008 ACM Conf. on Information and Knowledge Management (CIKM'08), industry session, Napa Valley, CA, Oct. 2008, pp. ?-?.
Summary: Medical search is different because users' searches are more exploratory, moving through related information.  Queries are generally longer, and Google has a 32-word limit; other engines generally have length limits as well.  MedSearch drops unimportant terms, returns diverse search results, and suggests medical phrases relevant to the query.  Uses an established hierarchical ontology of medical terms (MeSH) to identify and rank medical phrases in the returned top web pages (3).

G. Luo, C. Tang. On Iterative Intelligent Medical Search. Proc. 2008 Int. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'08), Singapore, July 2008, pp. 3-10.
Summary: First, relevant symptoms and signs are automatically suggested based on the searcher's description of his situation. Second, instead of taking for granted the searcher's answers to the questions, iMed ranks and recommends alternative answers according to their likelihoods of being the correct answers. Third, related MeSH medical phrases are suggested to help the searcher refine his situation description.


Hawking, D. (2004). Challenges in Enterprise Search. In Proc. Fifteenth Australasian Database Conference (ADC2004), Dunedin, New Zealand. CRPIT, 27. Schewe, K.-D. and Williams, H. E., Eds. ACS. 15-24.
Summary: Reviews enterprise search problems, including "3.5 Estimating importance of non-web documents" (6), to which topic-based probability linking seems a good solution.  Exploiting context is also mentioned (3.6, 6).  The proposal to convert documents to HTML so that they can be analyzed as normal is criticized because the converted documents would lack link information.


Bayesian Modeling of Dependency Trees Using Hierarchical Pitman-Yor Priors. Hanna Wallach, Charles Sutton, Andrew McCallum. In International Conference on Machine Learning, Workshop on Prior Knowledge for Text and Language Processing. (ICML WS), 2008.
Summary: Describes two hierarchical models for Bayesian dependency trees and uses them to parse sentences.  Uses latent variables to cluster parent-child dependencies, resulting in improved parse accuracy.


B. Zhang, et al.  Intelligent Fusion of Structural and Citation-Based Evidence for Text Classification. Annual ACM Conference on Research and Development in Information Retrieval.  2005.
Summary: Uses GP to combine various bibliometrics with +, *, /, sqrt (why no -?).  Found GP trees perform better than individual metrics and slightly better than SVMs, although SVMs greatly outperform their method on some categories.  Is there a way to choose the better of SVM and GP per category?  Paper is poorly written.

User Interaction Design
M. Schmettow. User Interaction Design Patterns for Information Retrieval Systems.

ML Software & Toolkits

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

R. Bekkerman. MDC Documentation

Summary: The MDC Toolkit takes a contingency table and parameters file and returns clusters.

The Intentional Web

Thursday, February 26, 2009

The majority of the time one browses or searches the web there is a goal in mind. Find the location of a coffee shop, learn more about cloud computing, see if there are any interesting new movies, be distracted and procrastinate. Each of these instances of web use has objectives and implicitly defines a success predicate. When one (the agent) interacts with the web, a computer, or simply information (the system), that system's knowledge or discovery of an explicit representation of the agent's objectives, and of the success predicates for these objectives, greatly enhances its capability to assist the agent in accomplishing its objectives.

Suppose I enter the query, "lee's art supplies nyc," into a search engine. A currently standard keyword search (PageRank, HITS, etc.) returns a list of results containing the keywords sorted based on page popularity. A semantic web search might parse out "lee's" as a title, "art supplies" as an object, and "nyc" as a location. Using this meaning based information the search could return the intersection of sets where each component appears as the correct semantic type and order by some bias between nearest neighbors based relevance and popularity. We are then returned a list of pages with things related to the title "lee's" and objects related to "art supplies" and locations related to "nyc". Neither of the above methods takes into account the user's intent in conducting their search.
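A toy sketch of the semantic variant described above (the index, page ids, and popularity scores are invented for illustration):

```python
# Hypothetical index mapping each (semantic type, value) pair to the
# set of pages where that value appears as that semantic type.
index = {
    ("title", "lee's"): {"p1", "p2", "p3"},
    ("object", "art supplies"): {"p2", "p3", "p4"},
    ("location", "nyc"): {"p3", "p5"},
}

def semantic_search(parsed_query, index, popularity):
    """Intersect the page sets for each semantic component of the
    parsed query, then order the surviving pages by popularity."""
    sets = [index.get(pair, set()) for pair in parsed_query]
    hits = set.intersection(*sets) if sets else set()
    return sorted(hits, key=lambda p: popularity.get(p, 0), reverse=True)

results = semantic_search(
    [("title", "lee's"), ("object", "art supplies"), ("location", "nyc")],
    index, popularity={"p3": 10})
```

Note that nothing in this pipeline represents why the user issued the query; both the keyword and the semantic version stop at matching.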

There are a number of plausible intents one could have when entering the above example query. Maybe I want to find the location of..., learn the history of..., list the competitors of..., etc. Whatever the objective is, it is opaque to the query string. The intent is not calculated, it is not known.

Consider the same example but suppose that one can enter an objective along with their query (or that the engine can determine a probability distribution amongst objectives from the query and subsequent user actions). Assume my objective is something along the lines of, learn the history of.... It needn't be exactly determined by the algorithm, as it likely isn't exactly determined or known by the user. Abstractly, we suppose that for each user query there exists a set of objectives and a distribution of importance over the accomplishment of these objectives. Given this set of objectives an information gathering agent is constructed. The agent's objective is to have the user accomplish the user's objectives. In the current example the agent's objectives may become something like user learn... which is then divided into self learn... and self teach user.... Each of these tasks is then further deconstructed into its simpler components as necessary and all tasks are completed bottom-up.
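The decomposition can be sketched as a recursion over a task table (the tasks and the table here are hypothetical stand-ins for whatever the agent derives):

```python
# Hypothetical table mapping a task to its sub-tasks; a task absent
# from the table is primitive and can be completed directly.
SUBTASKS = {
    "user learns history": ["agent learns history", "agent teaches user"],
    "agent learns history": ["gather sources", "summarize sources"],
}

def accomplish(task, completed):
    """Deconstruct a task into its simpler components as necessary and
    complete all tasks bottom-up, recording the completion order."""
    for sub in SUBTASKS.get(task, []):
        accomplish(sub, completed)
    completed.append(task)

order = []
accomplish("user learns history", order)
```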

The above characteristic of the intentional web is its ability to ascribe goals to users and conduct itself in a manner appropriate given those ascribed goals. Arguably, this is a change in the web's perception of user actions and not essentially a change in the web itself. A change in the web itself comes when we stop viewing the web and its components as just documents and data; as just network components (graphs) and purveyors of meaning (semantic units).

Each sub-graph of the web is seen as an agent capable of transitively affecting every connected component and itself in a non-well-founded manner. If there exists a broad objective that will increase the fitness of a large number of individual agents, the web agent community will transiently organize into larger units if necessary to accomplish this objective. The web becomes a community of organisms interacting with each other to accomplish their goals and increase their fitness.

As an example, consider the agent: financial news gatherer, whose objective it is to maintain connections to all pieces of financial news available throughout the web. The graph (group of sub-agents) that makes up this agent may contain the link gatherer, the financial news identifier (which itself may be a connection between the news identifier and the finance identifier), etc. If any of the sub-agents of the financial news gatherer increases their fitness, the financial news gatherer increases its fitness as well. What's more, when any agent whatsoever is improved, any other agent that contains the first agent will (generally) be improved as a welcome side-effect. Suppose the link gatherer is improved; any application that gathers links will improve with no required change to its own structure.

Decomposing specialties and organizing programs as links between specialized components is essential to modern programming and described by the concept of design patterns. Web programming has been moving in this direction with the popularity of model-view-controller (MVC) frameworks and plug-in architectures. These are positive movements but only transform and increase efficiency on a site level basis, not a network or community basis.

The trend towards application programming interfaces (APIs) and web services is significantly more relevant to the development of the intentional web. As an example, I've recently created a data aggregation web service, called Pairwise, and another service that pulls photos from Flickr, runs them through Pairwise and presents them to the user. We can think of Pairwise as a sort function, sort, and Flickr as a store of data, data. With these components built and maintained all a user must do to sort photos is the rather intuitive: sort(data). This is obviously still a ways away from what I've described above, yet it incorporates the essential component of specialized functionality.
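In code, the composition amounts to something like the following (both functions are stand-ins; neither the real Flickr API nor Pairwise is called here):

```python
def flickr_photos(tag):
    """Stand-in for fetching photos tagged `tag` from Flickr."""
    return [{"title": "pier", "views": 5}, {"title": "dunes", "views": 9}]

def pairwise_sort(items, key):
    """Stand-in for the Pairwise service: rank items by a chosen key."""
    return sorted(items, key=lambda item: item[key], reverse=True)

# sort(data): compose the two specialized services.
photos = pairwise_sort(flickr_photos("sunset"), key="views")
```

The value is in the division of labor: each service does one thing, and the user only writes the composition.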

Web mash-up building services, such as Yahoo! Pipes, provide an interface for users to create arbitrary combinations of data from throughout the web. With Pipes, people use a simple web interface to combine icons representing various programming functions. For example, one can drag a fetch data icon to a reverse icon; one can also embed "pipes" within each other. Pipes is a good example of integrating an abstract programming interface with internet data. But to add functions beyond what Pipes offers, one must build an external web service and call it from within Pipes, which can be a nuisance if a large amount of additional functionality is needed. Another problem with Pipes, as well as all mash-up builders, is related to API interoperability.

A current significant problem in moving towards an internet of connected services revolves around standards. To retrieve photos from Flickr, or to use any other web service, one must learn the application interface and write (or use preexisting) code that operates within the domain of that interface (in many cases one must also sign up for an account with the service). There are no standards (or no standards commonly used throughout the web) that one can follow when writing an API and that one can refer to when interfacing with an arbitrary API; APIs are standardized on a per-application basis. In operating systems development it was quickly realized that not having specifications for APIs (a standardized API for APIs, or meta-API) was extremely inefficient, and a set of standards known as POSIX was developed in response. We need a set of POSIX-like open standards for the internet. For APIs to be completely non-interoperable until specialized code is written on a per-API basis is highly unproductive.

Without API standards the development of an intentional web, in the sense of a network that organizes information based on the determined goals of the other members of that network, is still possible. The difficulty would be that the programs which mine goals (intention) from web information would either need to be primarily self-contained or link together other services that use proprietary APIs themselves. Because large-scale adoption of new technologies by internet developers can be slow and is normally not done without justifiable cause, an effective approach may be to build an intentional web by linking other services together but linking them through an API-interpreter that converts arbitrary APIs into an open standard. Those who wish to use the intentional features of the web or any web service can use the open standards. Those who wish to develop services integrated into the intentional web, or accessible by other services using the open standards, can either write their API in conformance with the standards or write an interpreter that translates their API into the open standards. Additionally, anyone who wants a conforming version of an existing API can write an interpreter and thereby make this API available to all users. Preferably, there would be a discovery agent that builds an interpretation of arbitrary APIs into open standards as well as documents APIs in open standards format and in their original format. After processing an API, the discovery agent would monitor the API for changes and update the interpretation and documentation as necessary to maintain current functionality and add newly introduced functionality.

There are certainly more hurdles before a complete intentional web is developed, but without either open standards or automated API interpreters, only hubs of services capable of communicating with each other will develop. These will likely be controlled by Google, Microsoft, Amazon, and other big industry companies. With systems made interoperable based on open standards we would be able to connect one service to another without hassle. As all standards must, meta-API standards must have flexibility and extensibility built in as a core component. Ideally, these standards would emerge naturally from the existing APIs on the web and gracefully bend to fit changes occurring in the web over time. An eventual change in the web will be a movement beyond an intentional web. Systems should be developed to accommodate radical alteration of the fundamental structure that defines them.

Composing Inverse Functions to Measure Model Fitness

Friday, December 05, 2008

This article concerns a method for evaluating the fitness of content topic models and document topic models based on the dissonance between a set of documents and the set of documents generated by composing inverse functions and applying them to the original set. A document generating function is applied to the output of a topic generating function that is applied to the original set of documents. To compare topics, one can compare the original set of topics to the set of topics generated by applying a topic generating function to the documents generated by applying a document generating function to the original set of topics.

In the notation below let D be a corpus, T be a set of topics, F be a set of content topic models such that each member has domain a subset of D and range a subset of T, and let G be a set of topic document models such that each member has domain a subset of T and range a subset of D. (In other words f({d1, ...}) = {t1, ...} and g({t1, ...}) = {d1, ...}.) Let bold face represent sets and italics represent functions or singletons, thus: f represents a member of F, d a member of D, etc. Let "*" represent the dot operator when applied to sets and multiplication when applied to scalars, |A| is a scalar representing the cardinality of the set A.

To measure the difference between d and g(f(d)) we may look at the value of:

(1) d * g(f(d))

It is arguable that a more fit f or g will result in greater values for (1). We can also look at a comparison between g(f1(d)) and g(f2(d)). Under certain circumstances, it may be reasonable to assume that if:

(2) d * g(f1(d)) > d * g(f2(d))

the topic generating function f1 is more fit than f2.

What (2) says is that the number of documents both in our original set and in the set of documents generated (by applying a document generating function to a topic generating function applied to that original set) is larger if we use the topic generating function f1 versus f2. As an analogy, suppose I show a bunch of books to two people, Brin and Sergey. Brin says all my books are from one set of genres and Sergey says they're from another (the genres they choose may overlap). Now suppose I take Brin's genres and Sergey's genres and separately feed them into Amazon which returns two sets of books: Brin's books and Sergey's books. If it turns out that there are only a couple of Brin's books in common with my original books, while more than half of Sergey's books are in my original set of books, I'd consider this a good reason to assume Sergey's genres were more accurate than Brin's.

Of course it could be that Sergey just got lucky with his topics, or that Brin was unlucky. It could also have been that the Amazon algorithm returned more documents for Sergey's genres and his chances were thereby improved, or that the Amazon algorithm is wrong and its incorrect results are merely misleading me into thinking Sergey's genres are relevant. Additionally, Sergey could have given me a slew of genres to improve his chances. The three problems here are (A) the size of the set of documents generated, (B) the size of the set of topics given, and (C) the fitness of the document generation algorithm.

To guard against (A), the size of the set of documents generated, we bias towards a large number of generated documents overlapping with our original documents and against generated documents that are not in the original set. To do this we compute the dot product between d and g(f(d)), then normalize.

(3) (d * g(f(d)))/(|d| * |g(f(d))|)

To discount (B), the size of the set of topics generated, we normalize for the size of that set.

(4) (d * g(f(d)))/(|d| * |g(f(d))| * |f(d)|)

To account for (C), poor fitness of the document generation algorithm, we may take the sum of (4) over a set of document generating algorithms applied to the same topic generating algorithm.

(5) sum_i (d * gi(f(d)))/(|d| * |gi(f(d))| * |f(d)|)

This representation has given us a way to compare the accuracy of Brin's and Sergey's book genres that is robust against both gaming by giving us lots of genres and the quality of the algorithm that chooses which books are in the genres we are given. So, back to content topic models, we can take a set of topic models F and assign a fitness to each one using equation (5) and a subset of the document models in G. We can then generate a fitness for each document model in G and weight it according to the fitness of the topic model used. Define s(f) (or s(g)) to be equal to the fitness assigned to f (or g) by equation (5) or one of (6) or (7) below.

(6) sum_i s(fi) * (d * g(fi(d)))/(|d| * |g(fi(d))| * |fi(d)|)

(7) sum_i s(gi) * (d * gi(f(d)))/(|d| * |gi(f(d))| * |f(d)|)

Equations (6) and (7) can be repeatedly evaluated to tune the fitness assigned to the members of F and G.
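Treating documents and topics as finite sets, so that the dot product of indicator vectors is just an intersection size, the normalized overlap and its sum over document models can be sketched as follows (the toy corpus and models are invented to show the mechanics):

```python
def overlap_score(d, f, g):
    """Normalized overlap: |d ∩ g(f(d))| divided by |d|, |g(f(d))|,
    and |f(d)|, guarding against empty sets."""
    topics = f(d)
    regenerated = g(topics)
    denom = len(d) * len(regenerated) * len(topics)
    return len(d & regenerated) / denom if denom else 0.0

def fitness(d, f, gs):
    """Sum the normalized overlap over several document generating
    functions g_i applied to the same topic model f, discounting any
    single generator's quirks."""
    return sum(overlap_score(d, f, g) for g in gs)

# Toy corpus and models: f1's topic regenerates the original documents,
# f2's topic regenerates an unrelated one.
d = {"d1", "d2"}
f1 = lambda docs: {"t1"}
f2 = lambda docs: {"t2"}
g = lambda topics: {"d1", "d2"} if "t1" in topics else {"d3"}
```

With these definitions f1 scores strictly higher than f2, matching the Brin/Sergey intuition above.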

The abstraction of the basic idea is quite simple: use a consortium of functions, weighted by their fitness, to evaluate the fitness of another function. There are many variations and extensions that could be used. We are still developing this idea, and more exploration regarding the use of a feedback loop to tune fitness is in order.

Deficiencies of non-Non-linear Learning

Tuesday, November 25, 2008

Non-linear learning is at the heart of the algorithms that underlie what makes Helioid's search effective. To examine non-linear learning let us first define the term in contrast with linear learning. (I would like to briefly note that these definitions, though culled from machine learning and logic literature, are primarily my own and I expect some to disagree with them).

Linear learning occurs when each step in a procedure depends solely on deductions made from the previous steps. Furthermore, the deductions must be valid according to some set of rules. To take the example of modus ponens from formal logic, suppose we know '"it is daytime"' and that 'if "it is daytime" then "we can see the sun"'. Given a standard model theory we may deduce "we can see the sun" by detachment. In search, the reasonable assumption is made that 'if "Item A appears above Item B in search results" then "Item A is more relevant than Item B"' and Item A should be displayed with more prominence to the user, normally closer to the top of the list of search results. This line of reasoning is perfectly acceptable, but there are times when it quickly runs into trouble for the searcher.

Suppose I am writing music visualization software and I need information about the Processing graphics library's particle system. My first query might be "processing graphics library particle system". Try it on Google: the results I get aren't bad for what I was looking for. The 6th entry links to a page from which I can find another link to Jeffrey Traer Bernstein's page, which has a particle system library for Processing.

But, going back to Google's results, a look through the top 500 finds no direct link to Traer's page! Looking through the results, the problem may be phrase based. There are lots of results for "graphics" and "library" and "particle" and "system". Being a well-trained Google user, I'd wrap "particle system" in quotes and re-search. I also might switch "library" to "language" and try to find a phrase I could combine "processing" with.

But still I'd have no luck getting Traer's page within the top 500 results, and most of the results would be irrelevant to my query (the two queries above were tested on November 24, 2008). One problem is that I can't interactively say, "this type of result is irrelevant, get rid of it." I could add some exclusionary keywords or phrases to my query, but that would require a good amount of guesswork on my part. I would need to find the right keywords or key-phrases to exclude, and there is no guarantee I will be able to do this effectively or even helpfully.

Another problem is that I can't say, "this type is on the right track, give me more of these," without making adjustments to my query. The problem here is that when I make these adjustments to my query the initial work I've done in sorting through the results will be lost when my new results are shown.

Google's failure to present the user with relevant results stems from an inability to capture user feedback and from the lack of a method for users to provide it. The failure stems in part from the assumption that a user wants to go through their search results in a linear fashion. Here are a couple of ways in which Helioid's non-linear approach to searching helps solve this problem.

1. Let the user tell the search engine what they want.

The user knows when a result is irrelevant, and they should be able to get rid of it. The user also knows when a result is relevant, and they should be able to bias the algorithm towards results of this type. Some users may know when an entire class of results is irrelevant (or relevant). They should be able to bias their search results against (or for) showing results that are members of this class. In the example given above I know I don't care about physics papers so, using Helioid, I can tell the search engine my opinion and it will bias the results it returns against them.

This is a non-linear method because the search results (the deductions) are being modified in a way that is inconsistent with the rules by which they were generated (the theory).
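A minimal sketch of this kind of feedback (the result records and the scoring rule are invented for illustration, not Helioid's ranking algorithm):

```python
def rerank(results, liked_terms, blocked_terms):
    """Drop results whose snippet matches a term the user excluded, and
    boost results whose snippet matches a term the user endorsed."""
    kept = [r for r in results
            if not any(t in r["snippet"] for t in blocked_terms)]
    def score(r):
        return r["score"] + sum(t in r["snippet"] for t in liked_terms)
    return sorted(kept, key=score, reverse=True)

# The physics paper is excluded outright; the library page is boosted.
results = rerank(
    [{"url": "u1", "snippet": "physics paper on particle systems", "score": 3},
     {"url": "u2", "snippet": "particle system library for processing", "score": 1},
     {"url": "u3", "snippet": "graphics tutorial", "score": 1}],
    liked_terms=["library"], blocked_terms=["physics"])
```

The key property is that the modification happens after ranking, outside the rules that generated the original ordering.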

2. Have the search engine learn from what the user does.

In many cases it may be unnecessary for the user to explicitly tell the search engine what they want. Going back to our "Processing library" example, after the user is seen ignoring all the pages related to academic papers, it may be appropriate to assume the user is uninterested in them. The engine might even ask the user if they'd like them removed.

For the same reasons as above this is another example of non-linear learning.

Assuming that the algorithm knows the answer tout court is a dreadful mistake. The most important input to the algorithm should be the desires of the user as they are expressed in keywords and through actions. If a user is found scrolling through multiple pages, this is an input, and the search results should be adjusted to take this new knowledge into account. Ignoring user input is like not adjusting your reefing after the wind shifts: you'll go off-course.