Satoshi Nakamoto

Stylometry: Part I

by John Menick

“Stylometry” is the first of a two-part series by artist and programmer John Menick. Part one investigates the field of stylometry—the computational study of writing style as a means of author identification. Drawing on methods for analyzing anonymous or disputed documents, stylometry has legal as well as academic and literary applications, ranging from the question of the authorship of Shakespeare’s works to forensic linguistics.


It is possible that every literary work—every essay, short story, novel, poem—has a collection of characteristics that unambiguously belongs to one author and one author only. Every written work, literary or not—instant messenger transcript, email, tweet—may also have these characteristics. These characteristics are discoverable; they can be grouped, enumerated, quantified, diagrammed, published, studied. Once these stylistic characteristics are known, once they are taken in aggregate, it is possible that their contours could be as unambiguous
as the loops and whorls of a physical fingerprint. Whether an author possesses one stylistic fingerprint or ten is difficult to say. It is also unknown whether the fingerprint or fingerprints are evident to the casual reader, or whether these characteristics, like latent fingerprints, become apparent only after complex, expert processing. But if these characteristics do exist, if they can be enumerated and quantified, then this same author, through modest tricks and creative reshuffling, could also evade stylistic profiling. One writer, like a safecracker wearing latex fingertips, could masquerade as another author. Like a safecracker sanding down fingerprints, he or she could erase identity, becoming anonymous, a statistical non-entity.


Metzger, Dowdeswell & Co. LLC is a New York computer security firm founded in 2003 by computer scientists Perry Metzger and Roland Dowdeswell. The firm’s website is a single page consisting solely of the company title plus three sentences: “This web site does not yet have content. Please come back later. Click here for the Cryptography mailing list.” The last sentence, a hyperlink, links a visitor to the policy statement for “a low-noise moderated mailing list devoted to cryptographic technology and its political impact.” On Friday, October 31, 2008, at 2:10pm Eastern Daylight Time, someone using the name “Satoshi Nakamoto” posted a message to this list with the subject line: “Bitcoin P2P e-cash paper.” The message contained a link to an academic paper outlining a “purely peer-to-peer version of electronic cash [that] would allow online payments to be sent directly from one party to another without going through a financial institution.”
Within a few days, list members responded to Nakamoto’s paper. Members questioned the amount of bandwidth the system would require over time, the security implications of verifying transactions, and the basic rules of the system. Satoshi Nakamoto civilly responded to each objection, sending many emails over the next few months. All exchanges were carried out in English, and, at least in Nakamoto’s case, very good English. The English was British in its spelling, and his sentences were separated by double spaces—an idiosyncratic stylistic choice. The Bitcoin paper uses the same patterns, with double spaces and British spellings. By 2009, Nakamoto had released the first version of the Bitcoin software implementing the ideas outlined in the paper. That same year, Nakamoto joined the project’s online forum, again using British spellings and double spaces. Over a period of almost one year, he posted 574 messages on that forum alone. In April 2011, Satoshi Nakamoto sent an email to a fellow Bitcoin developer, stating that he was no longer participating in the community. He wrote that he was “moving on to other things.” After that, Bitcoin’s Satoshi Nakamoto disappeared.


Thank you for participating in this study. The study has three tasks, to be completed in the following order. For the first part, please submit 6500 words of your own writing. All submitted writing must have been done for a “formal purpose”—essays for publication, school papers, professional reports, etc. All citations, editing notes, and writing that is not your own should be removed. Quotations are to be kept to a minimum.
The second part should be a piece of new writing, 500 words, written in a manner that obscures your writing style. This new, “obfuscated” work should describe your neighborhood to someone who has never visited the place. It should be a work of description, although please feel free to include any other relevant details regarding your neighborhood.
Before you begin the third part, please read the short work of fiction provided with the study. The work is an excerpt from Cormac McCarthy’s 2006 novel, The Road. After reading the excerpt, write 500 words describing a day of your life in the style of Cormac McCarthy. The style of the piece should be as close to McCarthy’s style as possible. Write the piece from a third-person perspective. If you wish, you may only describe part of your day. You can include actual or fictitious events.


The field of stylometry is devoted to identifying authorship through linguistic analysis. Given an unattributed written work and a large corpus of works by potential authors, stylometry is able to match the unattributed work to an author. Stylometric techniques can also compare anonymous works to each other and then attribute these works to multiple Jane and John Does. The stylistic features used by stylometry are numerous, including sentence length, vocabulary diversity, word-length distribution, Gunning fog and Flesch-Kincaid readability tests, punctuation, function words, specific jargon, grammatical errors, idiosyncratic usage, and cultural differences in spelling. Contemporary stylometry has become a thoroughly computational field, with genetic algorithms and neural networks doing the heavy lifting. Despite this, there is controversy as to whether stylometry is an accurate enough science to be used in court cases, and, within the field, there is no consensus regarding a standard set of stylometric techniques.
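A few of the features named above can be computed in a handful of lines. The sketch below, in Python, is purely illustrative: the function-word list is a tiny stand-in for the hundreds of words real studies use, and the feature set is an assumption of this example, not the toolkit of any particular researcher.

```python
import re
from collections import Counter

# A short, illustrative list of function words; actual stylometric
# studies typically track hundreds of them.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "with"]

def style_features(text):
    """Compute a few simple stylometric features for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    n = len(words)
    return {
        # Average sentence length, in words.
        "avg_sentence_len": n / len(sentences),
        # Type-token ratio: a crude measure of vocabulary diversity.
        "type_token_ratio": len(counts) / n,
        # Average word length, in characters.
        "avg_word_len": sum(len(w) for w in words) / n,
        # Relative frequency of each function word.
        **{f"freq_{w}": counts[w] / n for w in FUNCTION_WORDS},
    }
```

Run over a large corpus, vectors like these can be compared between an anonymous text and each candidate author; the classification itself is where the genetic algorithms and neural networks mentioned above come in.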
Among researchers, a high point of stylometric analysis is the problem of attributing the authorship of 12 of the 85 Federalist Papers. Published anonymously by Alexander Hamilton, John Jay, and James Madison in 1787 and 1788, the Federalist Papers attempted to persuade the citizens of New York State to ratify the United States Constitution. Most of the essays have been attributed to the three authors, with 12 not attributed to any of the three. Stylometry, in part, gave some weight to the theory that James Madison was the author of all of the unattributed papers, though there was not complete agreement among stylometric studies; and, for some scholars, stylometry only affirmed what had already been determined through other means.
Until recently, few in the field considered the possibility that an author could dodge stylometric detection by consciously changing stylistic tics: swapping words for synonyms, shrinking or expanding vocabulary, or reshuffling habitual punctuation. The author could also employ pastiche, masquerading as another author and consequently causing a false attribution to that second author. A text might be a collaborative effort, too, with either several authors contributing different passages, or an unknown number of collaborators offering stylistic cover for the original author. Software as well might be used to introduce consistent stylistic changes to the text, either by using a custom application, or by processing a text repeatedly through an online translation service. In 2012, Michael Brennan, Sadia Afroz, and Rachel Greenstadt raised these questions in their paper “Adversarial Stylometry: Circumventing Authorship Recognition to Preserve Privacy and Anonymity.” When the three researchers tested the same stylometric methods on both “adversarial” and non-adversarial texts, the “honest” texts were attributed correctly almost all of the time, whereas the adversarial texts by the same authors scored at about random, depending on the methods used and tested. Three methods were used to produce the adversarial texts. In the first, the participants were free to use any method they saw fit to evade detection. The results were enormously effective: with only five participants, none of whom had any prior experience in adversarial stylometry, detection rates dropped from above 70 percent to less than 20—that is, below random chance. The second technique—writing in the style of another author, in this case Cormac McCarthy—was less effective than the first, but still dropped detection rates to between 50 and 60 percent.
Unexpectedly, the only technique that did not work well, dropping detection only slightly, was translating the text through Google Translate and Bing Translator. The researchers found that two-step translation, where the text is sent through two languages, rather than just one, proved more effective. For those who wished to evade detection, the news was good: one need only attempt to change one’s style and, most likely, it will work.
In their paper, Brennan, Afroz and Greenstadt explicitly state that they are concerned with preserving online anonymity. The three cite as an example a whistleblower who is writing an anonymous blog about her workplace and who might be in need of a tool that helps evade stylometric detection. They quote WikiLeaks’ Daniel Domscheit-Berg as saying that if the WikiLeaks organizational documents had been subjected to stylometry, it would have become obvious that only two people wrote most of the public releases. (Publicly, WikiLeaks was claiming it employed more than two people.) Along with their research, Brennan, Afroz and Greenstadt also developed an application, Anonymouth—what they call an “authorship evasion (anonymization) framework.” Although their research suggests that an amateur can evade detection simply by changing his or her style, software could make that process more likely to succeed. As of today, this software does not exist for general use. Until it does, stylometric evasion will remain a specialist preoccupation, available only to spies, hackers, linguists, and criminals who might also take an interest in academic computer science journals.


The first decade of online identity can be characterized by discontinuity, by multiple digital selves scattered across bulletin boards, comment sections, IRC rooms, and mailing lists. Each site and service required a new user account, each user account began a different identity, each identity isolated from all others. The reasons were technical: without a standard like OAuth, a user’s data trails were isolated on a host’s server, unable to be pooled across services and cross-referenced. In a single day, you could join a bulletin board devoted to socialist politics, argue about Hollywood careers on a Usenet group, and participate in a mailing list devoted to erotic fiction, without governments, corporations, or other users linking the accounts to you. Citizens of this disjointed republic were not technically anonymous; they were pseudonymous. With identities siloed, a user was a username, often several usernames, and revelations of offline life were rare and untrustworthy. While pseudonymity created freer speech, with this freer speech came violence: stalking, intimidation, trolling. Sock puppets—multiple accounts created by a single person—could artificially amplify a user’s opinions, tipping the balance of a debate away from more honest participants. Without reputation systems to rate online exchanges, online marketplaces, too, were shaded with uncertainty and fraud. The Internet was seen as being prone to massive con jobs; it was a place of unverifiable personalities and untrustworthy services—all of which provided material for magazine feature stories and deathly serious public service announcements. If corporations were going to survive a digital economy, the neighborhood needed gentrifying.
To do so, Silicon Valley created its own ontological watchdogs. In Facebook’s terms of service, for example, a username must be tied to a real person. If the person does not exist, the account will be banned. With one account per person, a user’s identity was more probable. With a more probable identity, data could be more reliably mined—and, of course, advertising could be better targeted. Age, marital status, reading habits, mood, health, geographic location—not to mention visual representation and financial data—produced an online subject whose financial desires could be modeled, whose future purchases could be predicted. Facebook gained an advantage in this standardization of identity by beginning on American university campuses. Admission to the site was only granted to students with an .edu email account, therefore making it more likely the account matched an offline identity. When the site opened to high school students, it was on an invitation-only basis—again improving the chances that each account represented a person. When Facebook opened its platform to third-party developers, the company gained the ability to track users across multiple websites, thus allowing for data aggregation far beyond the confines of Facebook itself. Like an updated Alphonse Bertillon crossed with the Pinkerton Agency, Facebook created an unprecedented profiling and worldwide surveillance system, with every digital trace tightening the focus on a user’s identity.
While Silicon Valley was diligently converting users into marketable subjects, several counter-movements emerged. The image-sharing site 4chan was launched one year before Facebook, and while 4chan was not built with a conscious political program, anonymity is central to its anarchic culture. Like its Japanese template, 2chan, a website devoted to anime discussions, 4chan does not offer registration for its users. All 4chan users must post as “Anonymous,” a design choice that exacerbates much of the site’s extreme imagery and discussion. An immense meme machine, 4chan hosts any kind of imagery, from saccharine pet photos to scatological pornography, with brief and unwelcome glimpses of child porn and real gore along the way. 4chan quickly became the Internet’s unappeasable id—a place one could go to find any image, imaginable or not. Improbably, in 2008, 4chan’s frantic culture gave rise to Anonymous, the hacker activist movement. From the Internet’s id, then, came its superego: a near-vigilante hacker movement that declared war on everyone from the Church of Scientology to George W. Bush. Like Occupy Wall Street after it, the movement was headless, moved by a tacit political cohesiveness that may or may not have existed. Anonymous deleted all proper names and bylines; if the movement was to be effective, it had to be masked, preferably with Guy Fawkes.
On 4chan, meanwhile, anonymity was a promise, not a guarantee. IP addresses were and are logged, and for a government agency surveilling 4chan, a user’s identity could be discovered with routine police work. (4chan has cooperated with many police investigations over the years.) It would take a politicized cryptography community, the cypherpunk movement, to deliver the material necessary for anonymity. With the invention of public-key encryption in the 1970s, military-grade encryption became available to the general public for the first time. By the early 1990s, a group of encryption experts gathered around a list hosted by a firm based in San Francisco, Cygnus Solutions. Their discussions went beyond protocols, however: driven by libertarian politics verging on digital survivalism, list members, later dubbed “Cypherpunks,” fantasized about a society based on decentralized cryptography, rather than human institutions. Though some members were on the political left, most were rightwing libertarians who found enemies in corporate America and the federal government alike. Central to their discussions was the concept of social trust: as a society, we put our trust in various institutions—we believe that a bank will keep our money in our checking accounts, and we know this bank will accurately update our balances when we make exchanges. But, according to cypherpunks, all institutions will eventually break our trust, as seen in the recent mortgage crisis. Since cryptography is based on strong mathematical proof, one could construct cryptographic systems to verify identities and transactions without any need of a human institution. We would not need a bank to tell us the balance of a checking account; a cryptographic system could do so more reliably. Through concepts like “proof of work,” one could prove, given a cryptographic algorithm, that a computer had done a certain amount of work, regardless of the purpose of that work.
It was a libertarian mathematician’s dreamland: a society run not by untrustworthy humanity, but by the uncorrupted reason of mathematics.
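The proof-of-work idea can itself be sketched in a few lines of Python. The example below is a hypothetical toy, not Bitcoin’s actual protocol: a computer searches for a number (a “nonce”) such that the SHA-256 hash of a message plus that nonce begins with a run of zeros. Finding the nonce requires many hash attempts; verifying it requires only one—work proven without any institution vouching for it.

```python
import hashlib
from itertools import count

def proof_of_work(message: str, difficulty: int) -> int:
    """Search for a nonce whose hash meets the difficulty target.
    On average this takes 16**difficulty hash attempts."""
    target = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{message}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify(message: str, nonce: int, difficulty: int) -> bool:
    """Check a claimed proof of work with a single hash."""
    digest = hashlib.sha256(f"{message}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

Raising `difficulty` by one multiplies the average search time by sixteen, while verification stays constant—the asymmetry that lets strangers trust each other’s computation.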

