[ ← Back to the article ]
The details: I downloaded a copy of the enwiki-20060717-pages-meta-history.xml.bz2 archive, broke it up into pages, iterated over each page's revisions, and recursively applied Python's difflib.SequenceMatcher.find_longest_match to each revision and the latest revision. (I used find_longest_match instead of get_matching_blocks because get_matching_blocks didn't properly handle blocks being reordered.) For each revision, I only counted the characters which hadn't already been matched by an earlier revision.
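A minimal sketch of how this kind of matching might look, not the original script. The function names matched_spans and credit_revisions are hypothetical, and so are the choices to always search the whole of the older revision (so that moved blocks can still match), to use an explicit stack in place of recursion, and to count even single-character matches.

```python
from difflib import SequenceMatcher


def matched_spans(old, new):
    """Return sorted (start, end) spans of `new` that also occur in `old`.

    Assumption: every search covers all of `old` and only the part of
    `new` is split around each match, so reordered blocks are still found.
    """
    sm = SequenceMatcher(None, old, new, autojunk=False)
    spans = []
    pending = [(0, len(new))]  # unmatched ranges of `new` still to search
    while pending:
        lo, hi = pending.pop()
        if lo >= hi:
            continue
        m = sm.find_longest_match(0, len(old), lo, hi)
        if m.size == 0:
            continue
        spans.append((m.b, m.b + m.size))
        # Keep searching the pieces of `new` on either side of the match.
        pending.append((lo, m.b))
        pending.append((m.b + m.size, hi))
    return sorted(spans)


def credit_revisions(revisions):
    """For each revision (oldest first), count the characters of the latest
    revision that it matches and that no earlier revision already matched."""
    latest = revisions[-1]
    claimed = [False] * len(latest)
    credits = []
    for rev in revisions:
        newly_matched = 0
        for start, end in matched_spans(rev, latest):
            for i in range(start, end):
                if not claimed[i]:
                    claimed[i] = True
                    newly_matched += 1
        credits.append(newly_matched)
    return credits
```

For example, credit_revisions(["abc", "abcXYZ", "XYZabc"]) gives [3, 3, 0]: the first revision is credited with "abc", the second with "XYZ", and the last revision only reordered text that was already there.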