I recently concluded a year-long research project with Statistics Professor Dean Foster, Computer Science Professor Lyle Ungar, and Statistics Ph.D. student Adam Kapelner. The subject of our research was Natural Language Processing (NLP): a quickly growing field at the intersection of Statistics, Computer Science and Linguistics. Many people are familiar with applications of the field through Amazon's famous recommendation engine, which is notoriously successful at recommending products to consumers based on past purchasing history and the text of consumer reviews.
At the highest level, NLP is a subfield of artificial intelligence dedicated to the study of human communication: enabling computers to understand, derive meaning from, and act usefully on human language. Given this definition, the first example that comes to my mind is Star Wars, in which robots like R2D2 interact with humans in human language. While that may still seem futuristic, there are many examples of NLP in everyday life. One is Google Translate: Google's tool for translating text from one language to another. A second is sentiment analysis: the process of recognizing whether a given sentence expresses positive, negative or neutral sentiment about some object or idea. Hedge funds, for example, are using automated sentiment analysis of Twitter posts to detect changes in sentiment toward securities and predict price movements before they occur.
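To make the sentiment analysis idea concrete, here is a minimal sketch of the simplest possible approach: count matches against hand-picked positive and negative word lists. The word lists are my own toy examples, and real systems are far more sophisticated, but the input/output contract is the same as described above.

```python
# Toy lexicon-based sentiment classifier. The word lists below are
# illustrative examples, not from any real sentiment lexicon.
POSITIVE = {"great", "good", "love", "excellent", "gain"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "loss"}

def sentiment(sentence: str) -> str:
    words = sentence.lower().split()
    # Score = (# positive words) - (# negative words).
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this excellent product"))   # → positive
print(sentiment("The earnings report was terrible"))  # → negative
```

A real trading system would of course need to handle negation ("not good"), sarcasm, and domain-specific vocabulary, which is what makes the problem genuinely hard.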
Our specific area of research was word-sense disambiguation (WSD). An example illustrates the definition: in the sentence "I went to the bank today and deposited a check," which sense of the word "bank" best matches its usage: (a) the land area nearest to a river, (b) a financial institution, or (c) a movement made by a plane in flight? This question is fundamental to natural language processing: how could a computer understand human language if it could not distinguish the proper sense of the word "bank" in a sentence? While the task may seem trivial in isolation, it is crucial for many end uses of NLP. For Google Translate to translate this sentence into Russian, for example, it must recognize "bank" as a financial institution, because the river and airplane senses are represented by different words in the Russian language.
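A classic automated baseline for WSD, in the spirit of the Lesk algorithm, picks the sense whose dictionary gloss shares the most words with the surrounding sentence. The sketch below uses hand-written glosses for the three "bank" senses above (not any real lexical resource), just to show the mechanism:

```python
# Simplified gloss-overlap disambiguation (Lesk-style sketch).
# The glosses below are hand-written for illustration only.
SENSES = {
    "river bank": "sloping land area beside a river or stream of water",
    "financial bank": "financial institution that accepts deposits, a deposited check or money",
    "banking turn": "movement made by a plane tilting in flight",
}

def disambiguate(sentence: str) -> str:
    context = set(sentence.lower().replace(".", "").split())
    # Pick the sense whose gloss overlaps the sentence in the most words.
    return max(SENSES, key=lambda s: len(context & set(SENSES[s].split())))

print(disambiguate("I went to the bank today and deposited a check."))
# → financial bank ("deposited" and "check" match the gloss)
```

This crude overlap heuristic already resolves the easy example, but it fails exactly where humans also struggle: when the candidate senses are close and the context offers few distinguishing words, which is the regime our research set out to characterize.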
The specific question within the subfield of WSD we sought to answer was this: when is it easy for a human to perform WSD, and when is it hard? In the example above, WSD is very easy for the average human. In many other cases, however, the word to be disambiguated has several similar senses that are hard to distinguish. To answer this question, we started by collecting data using a crowdsourcing approach. We used Amazon's Mechanical Turk service, which allows researchers and companies to post simple tasks that are performed by "turkers": people around the world looking to make money. Turkers are paid for each task performed, usually $0.01 – $0.05 per task, and each task usually takes less than a minute to complete. We built a website that generated WSD tasks for turkers as they accessed our site, and were able to collect 10,000 completed tasks in two days.
With 10,000 data points, we then used regression to test which features of a task are significant in predicting its difficulty. These features included the number of possible senses of a word to choose from, the number of rephrasings of the correct sense, the frequency of the given word in English, and many others. Using these features, we built a model to predict task difficulty. Finally, we also looked at voting schemes: if you want a WSD task to be performed, how many turkers should you ask? We found that asking two turkers, and using a third to break a tie if one occurs, generated an overall accuracy a smidge above 80% on our data set and seemed to be the most economical solution (in contrast, asking ten turkers to perform the same task and taking the most popular answer generated an overall accuracy of about 86%, not much higher given the extra seven to eight turkers you need to employ for each task).
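The economics of the two-plus-tiebreak scheme can be sketched with a quick simulation. The model here is a deliberate simplification I'm assuming for illustration (each turker is independently correct with some probability p, and wrong answers always coincide), so its numbers won't match our empirical results exactly, but it shows how the scheme trades labels for accuracy:

```python
import random

def two_plus_tiebreak(p: float, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate accuracy of the ask-two, tiebreak-with-a-third scheme.

    Toy model: each turker is independently correct with probability p,
    and all wrong answers agree with each other (a simplification).
    Closed form under this model: p**2 + 2*p*(1-p)*p = p**2 * (3 - 2*p).
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        a, b = rng.random() < p, rng.random() < p
        if a == b:
            verdict = a                 # the two turkers agree
        else:
            verdict = rng.random() < p  # a third turker breaks the tie
        correct += verdict
    return correct / trials

print(two_plus_tiebreak(0.8))  # close to 0.8**2 * (3 - 2*0.8) = 0.896
```

Note the cost side: a tie occurs with probability 2p(1-p), so the expected number of labels per task is only 2 + 2p(1-p) (about 2.3 for p = 0.8) versus a flat 10 for the majority-of-ten scheme. The toy model's independence assumption also overstates accuracy relative to our data, since real task difficulty is correlated across turkers.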
Why are these results significant? First, anyone building a WSD system would want to know, quantitatively, which features actually drive difficulty in performing WSD (i.e., why is WSD difficult, and what makes certain WSD tasks harder than others?), which is the central question our paper answers. Second, our paper confirms several results previously reported in the academic literature, such as that words with more senses are generally harder to disambiguate and that more common words are harder to disambiguate. And third, our research demonstrates that it is possible to build a WSD system that uses Mechanical Turk to process thousands of tasks per day, giving a practitioner a sense of the cost involved.
At the conclusion of this research project, we wrote a paper and submitted it to COLING 2012, the 24th International Conference on Computational Linguistics, held in December of this year in Mumbai, India. Adam presented the paper at the conference, and the published version will be available online shortly afterwards.