[ ← Back to the article ]

The details: I downloaded a copy of the enwiki-20060717-pages-meta-history.xml.bz2 archive, split it into pages, and iterated over each page's revisions. For each revision, I recursively applied Python's difflib.SequenceMatcher.find_longest_match to that revision and the page's latest revision. (I used find_longest_match rather than get_matching_blocks because get_matching_blocks didn't properly handle blocks being reordered.) For each revision, I counted only the characters that hadn't already been matched by an earlier revision.
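
To make the procedure concrete, here is a minimal sketch of the matching and counting step. It is an illustration, not the original script: the function names (matched_positions, credit_by_revision), the min_size cutoff for ignoring very short matches, and the choice to search each leftover piece of a revision against the whole latest text are all assumptions. An explicit stack stands in for recursion so that long pages don't hit Python's recursion limit.

```python
from difflib import SequenceMatcher


def matched_positions(old, latest, min_size=4):
    """Return the set of character positions in `latest` that also occur in `old`.

    Repeatedly takes the longest common block, then searches the leftover
    pieces of `old` on either side of it against ALL of `latest`, so blocks
    that were moved around are still found.
    min_size is an assumed cutoff (not from the article) to skip tiny matches.
    """
    sm = SequenceMatcher(None, old, latest, autojunk=False)
    covered = set()
    stack = [(0, len(old))]  # ranges of `old` still to be searched
    while stack:
        alo, ahi = stack.pop()
        if alo >= ahi:
            continue
        m = sm.find_longest_match(alo, ahi, 0, len(latest))
        if m.size < min_size:
            continue
        covered.update(range(m.b, m.b + m.size))
        stack.append((alo, m.a))           # piece of `old` left of the match
        stack.append((m.a + m.size, ahi))  # piece of `old` right of the match
    return covered


def credit_by_revision(revisions, min_size=4):
    """Count, per author, characters of the latest text first seen in their revision.

    `revisions` is a list of (author, text) pairs, oldest first; the last
    entry is treated as the latest revision.
    """
    latest = revisions[-1][1]
    claimed = set()  # positions already matched by an earlier revision
    credit = {}
    for author, text in revisions:
        new = matched_positions(text, latest, min_size) - claimed
        claimed |= new
        credit[author] = credit.get(author, 0) + len(new)
    return credit
```

Calling credit_by_revision with a list of (author, text) pairs, oldest first, returns a dictionary mapping each author to the number of characters in the latest revision that first appeared in one of their edits. A revision that merely reorders existing sentences earns credit only for the characters it actually introduced.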
