Springer, 2012. — 127 p.
The impact of computer systems that can understand natural language will be tremendous. To develop this capability we need to be able to automatically and efficiently analyze large amounts of text. Manually devised rules cannot provide the coverage needed to handle the complex structure of natural language, necessitating systems that can automatically learn from examples. To handle the flexibility of natural language, it has become standard practice to use statistical approaches, in which probabilities are assigned to the different readings of a word and to the plausibility of grammatical constructions.
Unfortunately, building and working with rich probabilistic models for real-world problems has proven to be very challenging. Automatically learning highly articulated probabilistic models is difficult, at the very least in terms of parameter estimation. And even if we succeed in learning a good model, inference can be prohibitively slow.
Coarse-to-fine reasoning is an idea that has enabled great advances in scale across a wide range of problems in artificial intelligence. The general idea is simple: when a model is too complex to work with, we construct simpler approximations of it and use those to guide the learning or inference procedures. In computer vision, various coarse-to-fine approaches have been proposed, for example for face detection and general object recognition (Fleuret et al. 2001). Similarly, when building a system that detects humans in images, one might first search for faces and then for the rest of the torso (Lu et al. 2006). Activity recognition in video sequences can also be broken up into smaller parts at different scales (Cuntoor and Chellappa 2007), and similar ideas have been applied to speech recognition (Tang et al. 2006). Despite the intuitive appeal of such methods, it was not obvious how they might be applied to natural language processing (NLP) tasks, where the search spaces are often highly structured and dynamic programming is used to compute probability distributions over the output space.
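As a concrete illustration of this kind of dynamic program, the following minimal Python sketch fills a CKY-style inside chart for a toy probabilistic context-free grammar; the grammar, the sentence and all probabilities are invented for the example and are not taken from this work.

# A minimal sketch of the kind of dynamic program used in NLP:
# computing inside probabilities of a tiny PCFG with a CKY-style chart.
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (parent, left, right) -> prob
binary_rules = {
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 1.0,
}
# Hypothetical lexicon: (tag, word) -> prob
lexical_rules = {
    ("NP", "she"): 0.5,
    ("NP", "fish"): 0.5,
    ("V", "eats"): 1.0,
}

def inside_chart(words):
    """Fill chart[i][j][A] = total probability that category A spans words[i:j]."""
    n = len(words)
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                      # length-1 spans from the lexicon
        for (tag, word), p in lexical_rules.items():
            if word == w:
                chart[i][i + 1][tag] += p
    for width in range(2, n + 1):                      # longer spans, bottom up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):                  # split point
                for (parent, left, right), p in binary_rules.items():
                    chart[i][j][parent] += p * chart[i][k][left] * chart[k][j][right]
    return chart

chart = inside_chart(["she", "eats", "fish"])
print(chart[0][3]["S"])  # probability mass of all parses rooted in S: 0.25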
In this work, we propose a principled framework in which learning and inference can be seen as two sides of the same coarse-to-fine coin. On both sides we have a hierarchy of models, ranging from an extremely simple initial model to a fully refined final model. During learning, we start with a minimal model and use latent variables to induce increasingly refined models, introducing complexity gradually. Because each learning step introduces only a limited amount of new complexity, estimation is more manageable and requires less supervision. Our coarse-to-fine strategy leads to better parameter estimates, improving the state of the art across different domains and metrics.
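As a rough sketch of this refinement idea (an illustration under assumed details, not the training procedure developed in this work), the snippet below grows a symbol hierarchy by splitting each category into two latent subcategories per level; the perturbed_weights initializer and the symbol names are hypothetical, and the projection back to the coarse parent anticipates the inference regime described next.

# A minimal sketch of growing a model hierarchy by splitting each category
# into latent subcategories, so each level adds only limited new complexity.
import random

def split_symbols(symbols):
    """Each coarse symbol spawns two latent subsymbols at the next level."""
    return [f"{s}-{i}" for s in symbols for i in (0, 1)]

def perturbed_weights(symbols, base=1.0, noise=0.01):
    """Hypothetical initializer: copy the parent's weight and add a tiny
    random perturbation so the subsymbols can specialize during training."""
    return {s: base + random.uniform(-noise, noise) for s in symbols}

level0 = ["NP", "VP"]                  # minimal initial model
level1 = split_symbols(level0)         # ['NP-0', 'NP-1', 'VP-0', 'VP-1']
level2 = split_symbols(level1)         # four NP and four VP subcategories
weights1 = perturbed_weights(level1)   # one (toy) parameter per level-1 subsymbol

def project(symbol):
    """Map a refined subsymbol back to its coarse parent category."""
    return symbol.split("-")[0]

print(level2, [project(s) for s in level2])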
However, because natural language is complex, our final models will necessarily be complex as well. To make inference efficient, we also follow a coarse-to-fine regime: we start with simple, coarse models that resolve easy ambiguities first, while preserving the uncertainty over more difficult constructions. The more complex, fine-grained models are then used only in those places where their rich expressive power is required. The intermediate models of the coarse-to-fine hierarchy are obtained by means of clustering and projection, and allow us to apply models of the appropriate granularity where needed. Our empirical results show that coarse-to-fine inference outperforms other approximate inference techniques on a range of tasks, because it prunes only low-probability regions of the search space and therefore makes very few search errors.
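The pruning regime can be sketched as follows; the items, scores and threshold are invented for illustration and do not reproduce the models evaluated in this work.

# A minimal sketch of coarse-to-fine pruning: a cheap coarse model scores all
# candidate items, low-probability items are pruned, and the expensive fine
# model is evaluated only on the survivors.
PRUNE_THRESHOLD = 1e-3  # hypothetical coarse posterior threshold

def coarse_to_fine(items, coarse_posterior, fine_score):
    """coarse_posterior and fine_score are callables mapping an item to a number."""
    survivors = [x for x in items if coarse_posterior(x) >= PRUNE_THRESHOLD]
    # Only the (few) surviving items pay the cost of the fine-grained model.
    scored = {x: fine_score(x) for x in survivors}
    return max(scored, key=scored.get)

# Toy usage with made-up scores: most readings are cheaply ruled out coarsely.
items = ["reading-A", "reading-B", "reading-C"]
coarse = {"reading-A": 0.6, "reading-B": 0.39, "reading-C": 0.0001}.get
fine = {"reading-A": 0.45, "reading-B": 0.55, "reading-C": 0.9}.get
print(coarse_to_fine(items, coarse, fine))  # 'reading-B'; C never reaches the fine model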
Latent Variable Grammars for Natural Language Parsing
Discriminative Latent Variable Grammars
Structured Acoustic Models for Speech Recognition
Coarse-to-Fine Machine Translation Decoding
Conclusions and Future Work