In late 2006, I did a study to learn who wrote Wikipedia. I published my conclusions in the article Who Writes Wikipedia?. I am currently working on a larger replication of the study for publication.
To prepare for the study, I examined a random page (Alan Alda) using Wikipedia’s history feature to see how it was created, edit by edit. The changes fell into roughly three groups:
About 5 of the nearly 400 edits were what Wikipedia calls vandalism: confused or malicious people adding things that simply didn’t fit, followed by someone undoing their change.
The vast majority were small changes: people fixing typos, formatting, links, categories, and so on, making the article a little nicer but not adding much in the way of substance.
Finally, a much smaller number were genuine additions: a couple sentences or even paragraphs of new information added to the page.
For each substantive edit, I investigated that user’s other contributions. Almost none were active contributors. Usually, they’d made fewer than 50 edits (typically around 10), mostly on related pages. Most never even bothered to create an account.
To investigate the issue more formally, I decided to run an algorithm over the history to automatically calculate who contributed what.
The first question is what counts as a contribution. I didn’t want an algorithm that awarded points for vandalism, and I wanted one that was biased more towards genuine additions than towards small changes. I tried several things, but here’s the one I found most effective and eventually used:
For each page:
Set final to the current (i.e. latest) version of the page.
For each version of the page, moving from oldest to newest:
a. Use Python’s difflib.SequenceMatcher.find_longest_match to find passages of text shared between version and final.
b. Tag any untagged portion of the match in final as coming from version.
You should now have a final which is tagged with which version each character is from; you can now count the characters contributed by each user.
(Footnote: I used find_longest_match instead of get_matching_blocks because get_matching_blocks didn’t properly handle blocks being reordered.)
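The steps above can be sketched roughly as follows. This is one possible reading of the procedure, not the code actually used in the study. The function names, the (user, text) input format, and the min_match threshold are all assumptions made for illustration; the threshold is there only so that stray single-character matches do not get credited to early versions. The sketch searches each still-untagged stretch of final against the whole of each version, which is how it copes with reordered blocks.

```python
from collections import Counter
from difflib import SequenceMatcher


def untagged_runs(tags):
    """Return (start, end) ranges of consecutive untagged positions."""
    runs, start = [], None
    for i, tag in enumerate(tags):
        if tag is None and start is None:
            start = i
        elif tag is not None and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(tags)))
    return runs


def attribute_characters(revisions, min_match=4):
    """Credit each character of the latest text to the earliest matching version.

    revisions: list of (user, text) pairs, oldest first (hypothetical format).
    min_match: shortest shared passage worth tagging (an assumption, see above).
    Returns (tags, counts): tags[i] is the user credited with character i of
    the latest text (or None), counts maps user -> number of characters.
    """
    final = revisions[-1][1]
    tags = [None] * len(final)

    # autojunk=False stops difflib from ignoring very common characters.
    matcher = SequenceMatcher(None, autojunk=False)
    matcher.set_seq2(final)  # difflib caches its index of the second sequence

    for user, text in revisions:  # oldest to newest
        matcher.set_seq1(text)
        pending = untagged_runs(tags)
        while pending:
            lo, hi = pending.pop()
            if hi - lo < min_match:
                continue
            # Search the whole version against this untagged stretch of final.
            match = matcher.find_longest_match(0, len(text), lo, hi)
            if match.size < min_match:
                continue
            for i in range(match.b, match.b + match.size):
                tags[i] = user
            # Keep looking on either side of the passage just tagged.
            pending.append((lo, match.b))
            pending.append((match.b + match.size, hi))

    counts = Counter(tag for tag in tags if tag is not None)
    return tags, counts


# Toy history with made-up users: the last editor only reorders sentences.
history = [
    ("alice", "The cat sat."),
    ("bob", "The cat sat. It purred loudly."),
    ("carol", "It purred loudly. The cat sat."),
]
tags, counts = attribute_characters(history)
print(counts)
```

In this toy history, alice is credited with 12 characters and bob with 17, while carol, who only moved text around, is credited with none. The single space between the two sentences stays untagged because it is shorter than min_match.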
Once I had verified the algorithm (I ran it on one page and hand-checked the results), I grabbed a copy of enwiki-20060717-pages-meta-history.xml.bz2 and split it up into pages. I distributed the resulting pages across a cluster of machines and had each one run the algorithm on the pages, generating character counts for each user as output.
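For readers who want to try the splitting step, here is a minimal sketch of streaming a compressed dump one page at a time. It is not the script used for the study. It assumes the usual MediaWiki export layout (page elements containing a title and a series of revision elements, each with a contributor holding either a username or an ip, plus a text element), and the names iter_pages and local are made up for this example.

```python
import bz2
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace: '{namespace}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(path):
    """Yield (title, [(user, text), ...]) for each page, revisions in file order."""
    with bz2.open(path, "rb") as stream:
        title, revisions = None, []
        # Handle elements as they close so the whole dump never sits in memory.
        for _, elem in ET.iterparse(stream, events=("end",)):
            name = local(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "revision":
                user, text = None, ""
                for child in elem.iter():
                    child_name = local(child.tag)
                    if child_name in ("username", "ip"):
                        user = child.text  # account name, or IP if unregistered
                    elif child_name == "text":
                        text = child.text or ""
                revisions.append((user, text))
                elem.clear()  # drop the revision text once it has been copied
            elif name == "page":
                yield title, revisions
                title, revisions = None, []
                elem.clear()
```

Each yielded page could then be written to its own file and handed to a separate machine. The sketch trusts the order of revisions in the file; if the oldest-first order matters, it would be safer to sort by the revision timestamp.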
I then analyzed some of the files (e.g. Alan Alda) in detail, looking at the top contributors and their edit counts. For the remainder, I looked through to see if any of them had a particularly high percentage of the content written by any one user.
For the Alan Alda article, 8 out of the top 10 are unregistered and 6 out of the top 10 have made fewer than 25 edits on Wikipedia. #9 has made just the one edit.
For comparison, if you simply count edits, 7 out of the top 10 are registered users and 5 of those have made thousands of edits to the site. #4 has made 7,000 edits and #7 has made 25,000.
Other articles showed similar results. For example, the largest portion of “Anaconda” was made in two edits by a user who has only made 100 edits to Wikipedia, while the user with the largest number of edits contributed zero text to the final article (they only deleted and moved text contributed by others).
I ran the algorithm on 200 articles and found only a handful where significant portions were written by particular users. But even these cases turned out, upon inspection, to confirm the same pattern.
“Alkane” was largely contributed by Physchim62. Some have argued that while popular culture pages may be written by thousands of editors, Wikipedia’s more technical pages must be written by dedicated experts. This seemed to provide confirming evidence. But further investigation found that Physchim62 did not write the article themselves but simply translated the article from the German Wikipedia.
“Characters in Atlas Shrugged” was largely contributed by CatherineMunro. It seemed plausible that such a page could be written by a dedicated fan, but investigation found that in fact CatherineMunro simply merged text together from other pages.
“Anchorage, Alaska” was largely contributed by JeffreyAllen1975. Simple investigation found the contributions to be quite substantive and genuine with numerous edits, each contributing about a paragraph. The effort seemed to take its toll; his user page noted “I just got burned-out and tired of the online encyclopedia. My time is being taken away from me by being with Wikipedia.” He was an active contributor for only four months.
But I continued to investigate. The page contained a complaint noting that “The current version of the article or section reads like an advertisement.” Googling revealed why: JeffreyAllen1975’s contributions had been copied-and-pasted from other websites, like the Anchorage Chamber of Commerce (“Anchorage’s public school system is ranked among the best in the nation. … The district’s average SAT and ACT College entrance exam scores are consistently above the national average and Advanced Placement courses are offered at each of the district’s larger high schools.”).
(I suspect JeffreyAllen1975 didn’t know what he was doing; his writing style suggests he’s just a kid: “In my free time, I am very proud of my-self by how much I’ve learned by making good edits on Wikipedia articles.” I’m pretty sure he just thought he was helping the project: “Wikipedia is like the real encyclopedia books (A thru Z) that you see in the library, but better.” But his plagiarism will still have to be removed.)
None of the other articles in my sample appeared to have significant portions written by a single user.
I am currently working on a larger replication of the study for publication. Contact me if you’re interested in assisting. My name is Aaron Swartz and my email is me@aaronsw.com.