Jauhiainen T., Zampieri M., Baldwin T., Lindén K. Automatic Language Identification in Texts

File format: PDF
Size: 6.25 MB

Added by user Евгений Машеров on 07.01.2024 at 11:10
Description edited on 08.01.2024 at 00:46

Springer, 2024. — 155 p.

This book provides readers with a brief account of the history of Language Identification (LI) research and a survey of the features and methods most commonly used in the LI literature. LI is the problem of determining the language in which a document is written, and it is a crucial part of many text processing pipelines. The authors use a unified notation to clarify the relationships between common LI methods. The book introduces LI performance evaluation methods and takes a detailed look at LI-related shared tasks. The authors identify open issues, discuss the applications of LI and related tasks, and propose future directions for research in LI.

Foreword
Acknowledgments
References
About the Authors
Introduction to Language Identification
A Brief History of Language Identification (LI)
What is LI Used For?
What are the Main Challenges that Make LI Difficult?
Features and Methods
On Notation
What Textual Features Are Used for LI and How Are They Collected and Calculated?
Feature Smoothing
What Classification Methods Are Used for LI and How Do They Work?
Decision Rules, Trees and Random Forests
Simple Scoring
Sum or Average of Values
Product of Values
Similarity Measures
Logistic Regression
Support Vector Machines
Neural Networks
Ensemble Methods
Machine Learning Toolkits and Libraries
Evaluation and Measurement
How is LI Performance Evaluated? What Are the Measures …
What Material Can Be Used in Training and Evaluating Language Identifiers?
LI Shared Tasks
Specific Challenges of Variation and Text Types
Language Similarity
LI for Similar Languages, Varieties, and Dialects
Low-Resource Languages
Orthography and Its Variations
Short Texts
Large Scale, Multi-domain Language Identification
Number of Languages
Unseen Languages
Multilingual Texts
Domain Compatibility
Applications and Related Tasks
Applications
Monolingual NLP Components
Machine Translation
Multilingual Document Storage and Retrieval
Related Tasks
Native Language Identification
Author Profiling and Identification
Conclusion and Future Directions