Los Angeles: University of California Press, 2015. 252 p. ISBN: 978-0-520-28097-7
We live in a world of big data: the amount of information collected on human behavior each day is staggering, and exponentially greater than at any time in the past. Additionally, powerful algorithms are capable of churning through seas of data to uncover patterns. Providing a simple and accessible introduction to data mining, Paul Attewell and David B. Monaghan discuss how data mining substantially differs from conventional statistical modeling familiar to most social scientists. The authors also empower social scientists to tap into these new resources and incorporate data mining methodologies in their analytical toolkits. Data Mining for the Social Sciences demystifies the process by describing the diverse set of techniques available, discussing the strengths and weaknesses of various approaches, and giving practical demonstrations of how to carry out analyses using tools in various statistical software packages.
Concept
What Is Data Mining?
The Goals of This Book
Software and Hardware for Data Mining
Basic Terminology
Contrasts with the Conventional Statistical Approach
Predictive Power in Conventional Statistical Modeling
Hypothesis Testing in the Conventional Approach
Heteroscedasticity as a Threat to Validity in Conventional Modeling
The Challenge of Complex and Nonrandom Samples
Bootstrapping and Permutation Tests
Nonlinearity in Conventional Predictive Models
Statistical Interactions in Conventional Models
Some General Strategies Used in Data Mining
Cross-Validation
Overfitting
Boosting
Calibrating
Measuring Fit: The Confusion Matrix and ROC Curves
Identifying Statistical Interactions and Effect Heterogeneity in Data Mining
Bagging and Random Forests
The Limits of Prediction
Big Data Is Never Big Enough
Important Stages in a Data Mining Project
When to Sample Big Data
Building a Rich Array of Features
Feature Selection
Feature Extraction
Constructing a Model
Worked Examples
Preparing Training and Test Datasets
The Logic of Cross-Validation
Cross-Validation Methods: An Overview
Variable Selection Tools
Stepwise Regression
The LASSO
VIF Regression
Creating New Variables Using Binning and Trees
Discretizing a Continuous Predictor
Continuous Outcomes and Continuous Predictors
Binning Categorical Predictors
Using Partition Trees to Study Interactions
Extracting Variables
Principal Component Analysis
Independent Component Analysis
Classifiers
K-Nearest Neighbors
Naive Bayes
Support Vector Machines
Optimizing Prediction across Multiple Classifiers
Classification Trees
Partition Trees
Boosted Trees and Random Forests
Neural Networks
Clustering
Hierarchical Clustering
K-Means Clustering
Normal Mixtures
Self-Organized Maps
Latent Class Analysis and Mixture Models
Latent Class Analysis
Latent Class Regression
Mixture Models
Association Rules
Notes