researchers on social sites is a still-evolving ethical domain. Regardless of who technically has access to their information, people tend to have a mental model of who they expect to read their posts, and feel that their trust is violated when someone outside that model does so. When the Library of Congress announced in 2010 that they’d be archiving every single tweet, Twitter users had to update their mental models for a previously ephemeral website. Many reacted by posting tongue-in-cheek instructions or commentary to future historians. Several people took advantage of the opportunity to make the august institution expand its holdings of choice four-letter words, while others asked, “What’s up, posterity?” or noted, “Please index all my kitten pictures properly under ‘kitteh’ as well as ‘kitten’ now that you’re saving my tweets.” Not much came of it, in the end: the Library of Congress changed course in 2017, restricting their Twitter archive to tweets that met stricter criteria of newsworthiness.

A less benign social media data controversy happened in 2018, when the British political consulting firm Cambridge Analytica was discovered to have obtained personal data from millions of Facebook users in 2015 by convincing people to link a personality quiz with their Facebook accounts. The personal data derived from the quiz was then used to target voters and potentially sway elections. The Library of Congress and Cambridge Analytica represent two extremes, but less-publicized researchers have continued mining for data on social media, restricted only by terms of service and their own senses of fair play.
In this book, I have for the most part restricted my citations to social media data in aggregate, not linked to individual users, or examples which are already cited and anonymized in research papers. But where I’ve needed to pull out individual examples, I’ve aimed for those in which the writers are already clearly having a metalinguistic discussion, like the tweets addressing the Library of Congress archivists. Quoting people’s innocent chatter about their lunch or deeply personal heart-to-hearts felt to me uncomfortably like spying, but quoting comments about internet language in a book about internet language is, I hope, a way of entering into a conversation. After all, if you’re going to address your tweets to posterity, perhaps you shouldn’t be surprised when posterity addresses you back.
Twitter research is especially fruitful because about 1 to 2 percent of people who post on Twitter tag their tweets with their exact geographic coordinates. A reasonably competent data miner can therefore code up a county-level map of where Americans tweet “pop” versus “soda,” where they switch from “y’all” to “you guys,” or which states prefer which swear words—all in less time than it took Edmond Edmont to bike from Paris to Marseille. As a simple proof of concept, let’s look at the work of the linguist Jacob Eisenstein, who found that geo-tagged tweets containing “hella” (as in “That movie was hella long”) are most likely to occur in Northern California, while those containing “yinz” (as in “I’ll see yinz later”) are clustered around Pittsburgh. Both of these findings are consistent with previous linguistic research done in the labor-intensive interview style. Other features he found on Twitter probably wouldn’t have shown up in an interview: a later study by Eisenstein and colleagues found that the abbreviation “ikr” (“I know, right?”) was especially popular in Detroit, the emoticon ^_^ (happy) was characteristic of Southern California, and the spelling “suttin” (“something”) was popular in New York City.
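The counting step behind maps like Eisenstein’s is simple at heart: bucket each geo-tagged post into a region by its coordinates, then tally which regional terms appear in it. Here is a minimal sketch of that idea; the posts, bounding boxes, and term list are all invented for illustration (real studies use millions of tweets and far finer-grained geography), not taken from any actual dataset.

```python
import re
from collections import Counter

# Hypothetical sample of geo-tagged posts: (latitude, longitude, text).
# These few rows are invented purely to illustrate the counting step.
POSTS = [
    (37.77, -122.42, "that movie was hella long"),
    (38.58, -121.49, "hella tired today"),
    (40.44, -79.99, "i'll see yinz later"),
    (40.46, -80.01, "yinz going to the game?"),
    (40.71, -74.01, "i got suttin to say"),
]

# Very rough bounding boxes standing in for dialect regions
# (lat_min, lat_max, lon_min, lon_max) -- assumed values, not real survey areas.
REGIONS = {
    "Northern California": (36.0, 40.0, -124.0, -119.0),
    "Pittsburgh": (40.2, 40.7, -80.3, -79.7),
    "New York City": (40.5, 41.0, -74.3, -73.6),
}

TERMS = ["hella", "yinz", "suttin"]

def region_of(lat, lon):
    """Return the first region whose bounding box contains the point, or None."""
    for name, (lat_min, lat_max, lon_min, lon_max) in REGIONS.items():
        if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            return name
    return None

def count_terms_by_region(posts):
    """Tally how often each regional term shows up in each region's posts."""
    counts = {name: Counter() for name in REGIONS}
    for lat, lon, text in posts:
        region = region_of(lat, lon)
        if region is None:
            continue  # post falls outside every region we're tracking
        for term in TERMS:
            # \b word boundaries keep "hella" from matching inside other words
            if re.search(rf"\b{term}\b", text.lower()):
                counts[region][term] += 1
    return counts
```

A real version would swap the toy bounding boxes for county polygons and normalize the counts by each region’s total tweet volume, since a raw tally mostly reflects population rather than dialect.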
Some of the linguistics research happening on Twitter wouldn’t be possible at all without the internet. The linguist Jack Grieve researches constructions like “might could,” “may can,” and “might should” in the American South—things like “We might should close the window,” where speakers of other dialects would say, “Maybe we should close the window.” Grieve has pointed out that as recently as 1973, prominent linguists said that it would simply be impossible to research these constructions: they’re vanishingly rare in edited text, and occur maybe once an hour, if you’re lucky, in a spontaneous spoken interview. That’s a heck of a lot of audio to transcribe for a tiny amount of data. But on Twitter, Grieve and his collaborators combed through nearly a billion geo-coded tweets and unearthed thousands of examples. Beyond just reinforcing the informal intuition that these constructions (known as double modals) exist, they’ve been able to make detailed county-level maps showing that they can actually be divided into two groups: some, like “might could” and “may can,” map onto the Upper South, while