Baayen R.H. Word Frequency Distributions

File format: djvu
Size: 2.75 MB

Added by user Shushimora on 24.10.2013 18:15
Description edited on 27.06.2021 04:44

Kluwer, 2001. — 359 p.

This book is an introduction to the statistical analysis of word frequency distributions, intended for linguists, psycholinguists, researchers working in the field of quantitative stylistics, and anyone interested in quantitative aspects of lexical structure. Word frequency distributions are characterized by very large numbers of rare words. This property leads to strange statistical phenomena such as mean frequencies that systematically keep changing as the number of observations is increased, relative frequencies that even in large samples are not fully reliable estimators of population probabilities, and model parameters that emerge as functions of the text size.
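The first of these phenomena can be illustrated with a small simulation. The following is a minimal sketch, not taken from the book or from its lexstats software: it assumes a hypothetical vocabulary of 50,000 word types whose probabilities are proportional to 1/rank (a simple Zipf-like assumption), draws tokens independently, and reports the mean frequency (tokens divided by distinct types) and the share of types seen exactly once for growing sample sizes. The function names and all parameter values are illustrative.

```python
import random
from collections import Counter


def zipf_sample(n_tokens, n_types=50_000, seed=0):
    """Draw n_tokens word-type ids independently, with P(rank r) proportional to 1/r.

    The vocabulary size and the 1/r weights are assumptions made for illustration.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same sample
    weights = [1.0 / rank for rank in range(1, n_types + 1)]
    return rng.choices(range(n_types), weights=weights, k=n_tokens)


def mean_frequency(tokens):
    """Mean frequency: number of tokens divided by number of distinct types."""
    return len(tokens) / len(set(tokens))


def hapax_share(tokens):
    """Share of the observed types that occur exactly once in the sample."""
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


tokens = zipf_sample(100_000)
for n in (1_000, 10_000, 100_000):
    prefix = tokens[:n]  # the first n tokens act as a smaller sample
    print(n, round(mean_frequency(prefix), 2), round(hapax_share(prefix), 2))
```

Under these assumptions the mean frequency does not settle on a stable value as the sample grows, and a substantial share of the observed types still occurs only once, even in the largest sample.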
Special statistical techniques for the analysis of distributions with large numbers of rare events can be found in various technical journals. The aim of this book is to make these techniques more accessible for non-specialists. Chapter 1 introduces some basic concepts and notation. Chapter 2 describes non-parametric methods for the analysis of word frequency distributions. Chapter 3 describes in detail three parametric models: the lognormal model, the Yule-Simon Zipfian model, and the generalized inverse Gauss-Poisson model. Chapter 4 introduces the concept of mixture distributions. Chapter 5 explores the effect of non-randomness in word use on the accuracy of the non-parametric and parametric models, all of which are based on the assumption that words occur independently and randomly in texts. Chapter 6 presents examples of applications.
Throughout the book, concepts of probability theory and statistics necessary to understand the analysis of word frequency distributions are carefully introduced. However, as this is not an introductory textbook on statistics and probability theory, readers with little background knowledge in these fields will find it useful to consult introductory textbooks such as Ross (1988) and Rice (1988). In order to make the text generally accessible, non-technical summaries precede the more technical sections, while the mathematical derivations have been kept simple by going through the proofs in small steps. This leads to the paradoxical situation that some pages may look very scary while being quite easy to read. A great many figures illustrate key concepts and results. Chapters 1 and 6 are relatively non-technical and should be generally accessible. Chapters 2-5 require some knowledge of mostly elementary calculus. As sections 2.5, 2.6.2, 3.2-3.4, and 4.2 are fairly technical, some readers may want to restrict themselves to the non-technical summaries preceding these sections.
Four appendices are included. Appendix A is a list of symbols. Appendix B gives solutions to the exercises found at the end of the first four chapters. Appendix C provides the documentation for the programs on the CD-ROM that comes with this book. As there is at present no generally available software for carrying out the kind of statistical analyses described in this book, I am making lexstats available under the GNU General Public License. Lexstats is a suite of programs written in C, including a graphical user interface written in Tcl/Tk. Updates can be obtained from the author by e-mail at baayen@mpi.nl. Lexstats is supported for Linux only. It should run without problems on Unix platforms, and the individual C programs will probably run on other platforms as well. All C programs require input and produce output in the data frame format of R and Splus, so that the user is not limited to the functionality provided by the graphical user interface. Finally, Appendix D summarizes the frequency distributions of the main data sets analyzed in this book.

1 Word Frequencies
2 Non-parametric models
3 Parametric models
4 Mixture distributions
5 The Randomness Assumption
6 Examples of Applications
A List of Symbols
B Solutions to the exercises
C Software
D Data sets