No Starch Press, 2023. — 514 p.
Data-science investigations have brought journalism into the 21st century, and, guided by The Intercept’s infosec expert Micah Lee, this book is your blueprint for uncovering hidden secrets in hacked datasets.
In the current age of hacking and whistleblowing, the internet contains massive troves of leaked information. These complex datasets can be goldmines of revelations in the public interest, if you know how to access and analyze them. For investigative journalists, hacktivists, and amateur researchers alike, this book provides the technical expertise needed to find and transform unintelligible files into groundbreaking reports.
Using Python or other programming languages, you can give your computer precise instructions for performing tasks that existing tools or shell scripts don’t allow. For example, you could write a Python script that scours a million pieces of video metadata to determine where the videos were filmed. In my experience, Python is also simpler, easier to understand, and less error-prone than shell scripts. This chapter provides a crash course on the fundamentals of Python programming. You’ll learn to write and execute Python scripts and use the interactive Python interpreter. You’ll also use Python to do math, define variables, work with strings and Boolean logic, loop through lists of items, and use functions. Future chapters rely on your understanding of these basic skills.
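As a rough illustration of the basics listed above (variables, strings, Boolean logic, loops over lists, and functions), here is a minimal sketch that is not taken from the book. The records, the field names (`filename`, `latitude`, `longitude`), and the `has_location` helper are all hypothetical, made up to echo the video-metadata example.

```python
# Hypothetical metadata records; the field names and values are invented
# for illustration and do not come from any real dataset.
videos = [
    {"filename": "clip_001.mp4", "latitude": 38.8899, "longitude": -77.0091},
    {"filename": "clip_002.mp4", "latitude": None, "longitude": None},
    {"filename": "clip_003.mp4", "latitude": 40.7128, "longitude": -74.0060},
]


def has_location(video):
    """Return True if the record carries both GPS coordinates."""
    # Boolean logic: both conditions must hold.
    return video["latitude"] is not None and video["longitude"] is not None


# Loop through the list and collect the filenames that can be placed on a map.
located = []
for video in videos:
    if has_location(video):
        located.append(video["filename"])

# Math and string formatting: report what share of the videos is geotagged.
share = len(located) / len(videos) * 100
summary = f"{len(located)} of {len(videos)} videos have GPS coordinates ({share:.0f}%)"
print(summary)
```

With a million records instead of three, the same loop would still apply. Only the way the records are loaded would need to change.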
Guided by renowned investigative journalist and infosec expert Micah Lee, who helped secure Edward Snowden’s communications with the press, you’ll learn the tools, technologies, and programming basics needed to crack open and interrogate datasets freely available on the internet or your own private datasets obtained directly from sources. Each chapter features hands-on exercises using real hacked data from governments, companies, and political groups, as well as interesting nuggets from datasets that never made it into published stories. You’ll dig into hacked files from the BlueLeaks law enforcement records, analyze social-media traffic related to the 2021 attack on the U.S. Capitol, and get the exclusive story of privately leaked data from anti-vaccine group America’s Frontline Doctors. Along the way, you’ll learn:
How to secure and authenticate datasets and safely communicate with sources
Python programming basics needed for data science investigations
Security concepts, like disk encryption
How to work with data in EML, MBOX, JSON, CSV, and SQL formats
Tricks for using the command-line interface to explore datasets packed with secrets
“Micah’s book is a fantastic and friendly introduction for journalists, activists, and anyone else who is interested in learning to analyze large data sets but has been too intimidated by the technical details. I hope this book will inspire more people to find the stories inside the data.” - Eva Galperin, Director of Cybersecurity at the Electronic Frontier Foundation
About the Author:
Micah Lee is the Director of Information Security at The Intercept and is known for helping secure Edward Snowden’s communications while he leaked secret NSA documents. He previously worked for the Electronic Frontier Foundation and is currently an advisor to the transparency collective Distributed Denial of Secrets. He is also a co-founder of the Freedom of the Press Foundation and a Tor Project core contributor, and he develops open-source security and privacy tools like OnionShare and Dangerzone.
Acknowledgments
Introduction
Sources and Datasets
Protecting Sources and Yourself
Acquiring Datasets
Tools of the Trade
The Command Line Interface
Exploring Datasets in the Terminal
Docker, Aleph, and Making Datasets Searchable
Reading Other People’s Email
Python Programming
An Introduction to Python
Working with Data in Python
Structured Data
BlueLeaks, Black Lives Matter, and the CSV File Format
BlueLeaks Explorer
Parler, the January 6 Insurrection, and the JSON File Format
Epik Fail, Extremism Research, and SQL Databases
Case Studies
Pandemic Profiteers and COVID-19 Disinformation
Neo-Nazis and Their Chat Rooms
Afterword
Appendix A: Solutions to Common WSL Problems
Appendix B: Scraping the Web
Index