Armstrong S., Church K., Isabelle P., Manzi S., Tzoukermann E., Yarowsky D. (eds.) Natural Language Processing Using Very Large Corpora

File format: djvu
Size: 3.61 MB

Springer, 1999. — 315 p.

This book is intended for researchers who want to keep abreast of current developments in corpus-based natural language processing. It is not meant as an introduction to this field; for readers who need one, several entry-level texts are available, including Church and Mercer (1993), Charniak (1993), and Jelinek (1997).
This book captures the essence of a series of highly successful workshops held in the last few years. The response in 1993 to the initial Workshop on Very Large Corpora (Columbus, Ohio) was so enthusiastic that we were encouraged to make it an annual event. The following year, we staged the Second Workshop on Very Large Corpora in Kyoto. As a way of managing these annual workshops, we then decided to register a special interest group called SIGDAT with the Association for Computational Linguistics. The demand for international forums on corpus-based NLP has been expanding so rapidly that in 1995 SIGDAT was led to organize not only the Third Workshop on Very Large Corpora (Cambridge, Mass.) but also a complementary workshop entitled From Texts to Tags (Dublin).
Obviously, the success of these workshops was in some measure a reflection of the growing popularity of corpus-based methods in the NLP community. But first and foremost, it was due to the fact that the workshops attracted so many high-quality papers.
The importance of this material for the field is such that it deserves to be made more readily available than harder-to-find or out-of-print workshop proceedings. We are grateful to Kluwer for providing us with the opportunity to publish here what we view as an outstanding collection of papers. Space constraints forced us to make hard editorial choices among all available papers presented at the workshops. One of the criteria we used in selecting among papers was the need to maintain a reasonable thematic balance.
The chapters are organized in a structure that unfolds "bottom-up", from local to more global phenomena. Section 1 presents some techniques for assigning part-of-speech tags to the words of a text on the basis of their local context. Section 2 extends the notion of word tag to the semantic domain, introducing methods for disambiguating between the various senses of a word. Section 3 features two attempts at describing possible combinations of words: one based on the identification of idiomatic expressions, the other on techniques for modeling word sequences beyond traditional n-gram methods. Section 4 examines some ways of improving the performance of syntactic parsers on real-life texts. Finally, Sections 5 and 6 introduce techniques that venture beyond the level of the single document: in the first case, the goal is to capture cross-document similarities in parallel texts of different languages; in the second, to capture cross-document dissimilarities in different texts of the same language.

Implementation and Evaluation of a German HMM for POS Disambiguation.
Improvements in Part-of-Speech Tagging with an Application to German.
Unsupervised Learning of Disambiguation Rules for Part-of-Speech Tagging.
Tagging French without Lexical Probabilities – Combining Linguistic Knowledge and Statistical Learning.
Example-Based Sense Tagging of Running Chinese Text.
Disambiguating Noun Groupings with Respect to WordNet Senses.
A Comparison of Corpus-based Techniques for Restoring Accents in Spanish and French Text.
Beyond Word N-Grams.
Statistical Augmentation of a Chinese Machine-Readable Dictionary.
Text Chunking Using Transformation-based Learning.
Prepositional Phrase Attachment through a Backed-off Model.
On the Unsupervised Induction of Phrase-Structure Grammars.
Robust Bilingual Word Alignment for Machine Aided Translation.
Iterative Alignment of Syntactic Structures for a Bilingual Corpus.
Trainable Coarse Bilingual Grammars for Parallel Text Bracketing.
Comparative Discourse Analysis of Parallel Texts.
Comparing the Retrieval Performance of English and Japanese Text Databases.
Inverse Document Frequency (IDF): A Measure of Deviations from Poisson.