These are answers to questions posed by my friend Daniel Lemire, who is preparing a course on information retrieval and unstructured data.
What is your current position and affiliation? What other positions have you held in the recent past? Feel free to give a short bio.
I am currently a Research Scientist at NASA’s Ames Research Center, where I work on autonomy architecture (that is, getting things–in our case, helicopters–to do things on their own–in our case, fly around and take pictures of things without crashing). Just prior to this, I worked at the Canadian National Research Council’s Institute for Information Technology centre in New Brunswick, where I worked on translation technologies. I have also worked in a number of start-up companies on (among other things) language technologies.
What is unstructured data and why should an organization care about it?
The flip answer is that it is data which doesn’t have any structure. But, of course, this cannot be the case: the only truly unstructured data would be data generated by some kind of random process. Typically, “unstructured data” means data that do not have (SQL) database schemata defined for them for a particular purpose. So, a database of the names and addresses of job applicants is likely to be structured, but the resumes the applicants sent in are likely to be unstructured.
It’s often said that some large percentage of an institution’s information is unstructured data–just check out this Google search pattern: percent "unstructured data". Getting at information buried in this data can be crucial to the needs of a company or institution.
Much of this data is in documents, and the term text mining is often used for this important subset.
Almost all of the information available on the world wide web is unstructured data, of course. In fact, many say that the explosion of information available on the web is due to how easy it is to put data out on the web without having to carefully structure it first.
Please tell us about some of your research results and major contributions in the field of “unstructured data”?
Because this field is so broad, I wouldn’t consider myself an “unstructured data” researcher. Still, there are two areas of my work that relate to unstructured data research.
One general area relates to using signals in the data to get qualitative descriptions of the data out. In my work on multimodal user interfaces, for example, I’ve created a modeling language for describing how time-varying events can predict user intentions. See, for example, our paper “Multimodal Event Parsing for Intelligent User Interfaces.” This describes a system that allows system designers to “parse” the events that come into a system in real time in order to understand user intentions and other qualitative descriptions of the data. My more recent work has been in refining this concept, based (I think) on better models of the underlying data. Not much has been published on this yet, but if you’re really interested, you can read a recent technical conference paper, “An architecture for intelligent management of aerial observation missions.” This really focuses on the autonomy area, but the appendix has hints, at least, on the monitoring/parsing language we are developing at Ames. One piece of this theory is in the paper written with Daniel Lemire and Martin Brooks, “Quasi-monotonic segmentation of state variable behavior for reactive control.” If anything, these papers show both the wide applicability of these approaches and how difficult it is to come up with just the right models!
But here’s an easier example of this: determining the language in which a text is written. For example, is the following text French or English:
A truly great book should be read in youth, again in maturity and once more in old age, as a fine building should be seen by morning light, at noon and by moonlight. - Robertson Davies
It turns out to be relatively easy to do so, even without a dictionary of English and French words. Take a large corpus of English text, and one in French, and count how frequently n-grams of letters occur in each. (An “n-gram” is just a sequence of n consecutive letters.) For example, in a bigram character model, ‘th’ is very frequent in English, ‘le’ is less so; the opposite is true in French. N-grams seem to follow a “Zipf’s law” distribution (in fact, Zipf based his original research on this). There are a number of formulations of Zipf’s law. Here’s one:
$$f(r) \propto r^{-a}$$
where $f(r)$ is the frequency of the item of rank $r$ (i.e., its position in a list sorted by frequency) and $a$ is an empirical constant.
To make use of these facts, there are two stages: a training stage and a testing stage. In training, I compute the relative frequency of each n-gram in each corpus. I treat the relative frequency as a probability $p_i$ and calculate each n-gram’s information value, $iv_i = -\lg(p_i)$. This gives the number of bits needed to encode the n-gram most efficiently in the corpus; treating the corpus as the “population,” it gives the most efficient encoding, in general, for texts in that language. Then, to test which language a probe text is in, I sum the information values of the probe’s n-grams under each language’s model; the language with the lowest total is the best guess.
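The two stages described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names, the tiny unaccented toy corpora, and the fixed penalty for n-grams never seen in training are all invented here and are not part of the original description. A real system would train on large corpora and would likely use a more careful smoothing scheme.

```python
import math
from collections import Counter

# Toy training corpora (assumption: far too small for real use, accents removed).
ENGLISH = (
    "the quick brown fox jumps over the lazy dog and then the dog sleeps "
    "in the shade of the old tree while the children read their books "
    "at home in the evening"
)
FRENCH = (
    "le renard brun saute par dessus le chien paresseux et puis le chien "
    "dort a l ombre du vieil arbre pendant que les enfants lisent leurs "
    "livres a la maison le soir"
)


def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(corpus, n=2):
    """Training stage: map each n-gram to its information value -lg(p_i)."""
    counts = Counter(char_ngrams(corpus, n))
    total = sum(counts.values())
    model = {gram: -math.log2(count / total) for gram, count in counts.items()}
    # Assumption: an n-gram never seen in training is charged as if it had
    # been seen "half a time", so that the sum below stays finite.
    unseen = -math.log2(0.5 / total)
    return model, unseen


def information_value(text, trained, n=2):
    """Testing stage: total bits needed to encode the probe under one model."""
    model, unseen = trained
    return sum(model.get(gram, unseen) for gram in char_ngrams(text, n))


def guess_language(text, models, n=2):
    """Return the language whose model gives the lowest information value."""
    return min(models, key=lambda lang: information_value(text, models[lang], n))


models = {"english": train(ENGLISH), "french": train(FRENCH)}
print(guess_language("the children read the books", models))
print(guess_language("le chien dort a la maison", models))
```

The penalty for unseen n-grams matters in practice: without it, a single n-gram missing from a training corpus would make the information value infinite for that language.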
“Code-switching” is the term sociolinguists use to describe the phenomenon of a person switching from one language to another. The research problem is to identify code-switching within a document. So, this becomes a micro-language identification problem. Consider this type of sentence, familiar enough to Canadians, if not to people in the U.S.:
Bonjour! S’il vous plait, leave a message at the sound of the tone. Merci.
It starts in French, switches to English, and ends in French. How can character n-grams be used to identify where the switches occur? One can take “sliding windows” of text, calculate the best guess for each window, and thereby guess where the switches in language occur. Note in particular that “message” is a word in both English and French. This suggests that knowing the probability of the language identification for a (short) string is important (i.e., we can’t just depend on the “best guess”), and that we’ll need some statistical techniques and/or heuristics to decide whether a switch in fact occurs.
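The sliding-window idea can be sketched as follows. Again, this is a hedged illustration rather than the actual research system: the word-based windows, the toy corpora, the unseen-n-gram penalty, and the names `label_windows` and `switch_points` are all assumptions made for the example. Each window is reported with the margin (in bits) between the best and second-best language, since a small margin signals a window where the best guess alone should not be trusted.

```python
import math
from collections import Counter

# Toy training corpora (assumption: far too small for real use, accents removed).
ENGLISH = (
    "the quick brown fox jumps over the lazy dog and then the dog sleeps "
    "in the shade of the old tree while the children read their books "
    "at home in the evening"
)
FRENCH = (
    "le renard brun saute par dessus le chien paresseux et puis le chien "
    "dort a l ombre du vieil arbre pendant que les enfants lisent leurs "
    "livres a la maison le soir"
)


def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(corpus, n=2):
    """Map each n-gram to -lg(p_i); unseen n-grams get a fixed penalty."""
    counts = Counter(char_ngrams(corpus, n))
    total = sum(counts.values())
    model = {gram: -math.log2(count / total) for gram, count in counts.items()}
    return model, -math.log2(0.5 / total)  # penalty is an assumption


def information_value(text, trained, n=2):
    """Total bits needed to encode the text under one language model."""
    model, unseen = trained
    return sum(model.get(gram, unseen) for gram in char_ngrams(text, n))


def label_windows(text, models, width=6, n=2):
    """Slide a window of `width` words over the text, one word at a time.

    Returns (window, best_language, margin_in_bits) for each window.
    Requires at least two language models.
    """
    words = text.split()
    results = []
    for start in range(max(1, len(words) - width + 1)):
        window = " ".join(words[start:start + width])
        scores = sorted(
            (information_value(window, trained, n), lang)
            for lang, trained in models.items()
        )
        (best_bits, best_lang), (next_bits, _) = scores[0], scores[1]
        results.append((window, best_lang, next_bits - best_bits))
    return results


def switch_points(results):
    """Indices of windows whose best guess differs from the previous window."""
    return [i for i in range(1, len(results)) if results[i][1] != results[i - 1][1]]


models = {"english": train(ENGLISH), "french": train(FRENCH)}
results = label_windows(
    "the children read the books le chien dort a la maison", models
)
for window, lang, margin in results:
    print(f"{lang:8s} {margin:6.1f}  {window}")
print("switches at windows:", switch_points(results))
```

A heuristic one might layer on top is to accept a switch only when the margin stays above some threshold for several consecutive windows; the right threshold would have to be found empirically.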
A second area is using dynamic planning for parsing conversations in real time by making only weak commitments to the models used while parsing; see the workshop paper “Item descriptions add value to plans.” More closely related to text mining is a relatively early paper which used a model of typical questions asked/questions answered to create hypertext systems; see “Using natural language processing to construct large-scale hypertext systems.”
How has the management of unstructured data evolved in recent years, and what factors contributed to this evolution? What are the major difficulties in information retrieval and what big discoveries are yet to be made?
The most important factor has been the explosion of information available on the world wide web. It is hard to overestimate the importance of this. Here are some interesting present challenges in information retrieval:
Dealing with untrustworthy data. Unscrupulous people are constantly placing untrustworthy data on the web. For example, web robots attempt to place comments on my weblog entries in order to fool the search engines about how important other websites are. In fact, entire weblogs are “weblog spam.” We’re all familiar with email spam. What are good techniques for dealing with this?
The size of the web. How large is the web? How much of it has a search engine indexed? This is a recent controversy with conflicting claims by Yahoo, Google and Microsoft. An NCSA study, “A Comparison of the Size of the Yahoo! and Google Indices,” compared Yahoo and Google. On the one hand, they examined actual results returned from the search engines; on the other, they investigated only English searches. The latter makes it very unlikely that the comparison is a fair one. Jean Veronis has written a series of posts examining reported results from Google, Yahoo and MSN. For example, one looks at whether Yahoo is indexing 19 billion pages (see his post). This has raised a lot of questions about, for example, what counts as a “page.” His posts are well worth reading.
Tagging. Many sites, such as Flickr, are allowing users to “tag” entries with their own keywords. These are not connected to any formal system or ontology. And it seems to work pretty well for retrieval.
The Long Tail. The Zipf’s Law distribution (or similar distributions) mentioned above seems to come up again and again. For example, the most popular songs in Apple’s iTunes catalog are very, very popular, but it’s been claimed that every song in the catalog has been sold at least once. See the excellent graph at Chris Anderson’s Long tail weblog for music data from the Rhapsody service. The interesting search question is how to find interesting things in the “long tail,” where it is theoretically harder to find them.
Issues of time. Increasingly, the web is “out of date,” meaning that the state of a webpage (say) changes from one access to another. For example, the NCSA study referenced above was changed after criticisms were made of it, and one of the authors’ names was dropped (this is reported in the paper, but I believe it was not there at first). And, of course, web addresses change, items are moved, and it’s hard to find them again. I think this adds an interesting temporal aspect to the question of information retrieval: it’s not only important to know where to find something, but also at what point in time it was originally created, modified, etc.
Do you agree to put this interview in the public domain?
This work is licensed under a Creative Commons Attribution 2.5 License. You are free:
to copy, distribute, display, and perform the work
to make derivative works
to make commercial use of the work
Under the following conditions: You must attribute the work in the manner specified by the author. For any reuse or distribution, you must make clear to others the license terms of this work. Any of these conditions can be waived if you get permission from the copyright holder.