At the very end of last month (August 2007) I wrote an article about 'data leaking' for MacUser magazine. It covered subjects such as metadata stored in Word document files, out-of-date thumbnails stored in JPEG images and feeble attempts at censorship using PDF documents.
The article mentioned some well-publicised events where documents were published on the internet, only for readers to extract more data from them than was intended. These were quite old stories, but they illustrated the possibility that you can leak data, sometimes pretty scary data, unintentionally. But, just when you think that people have learned their lesson, someone repeats the old mistakes.
Today I learned of a new data leak incident that was caused by an old error. The Fédération Internationale de l'Automobile (FIA) has published transcripts from the World Motor Sport Council hearings into allegations that McLaren was spying on Ferrari. The organisation used the PDF format and redacted certain sensitive details, such as a mention of a "double-rear master cylinder with a spring." After all, it is bad enough that Ferrari had its secrets stolen by a rival - let's not publish these same secrets online for the world to see.
This redaction was accomplished by placing black rectangles over the most sensitive phrases. And guess what? As we've seen before, copying the text in a PDF and pasting it into a text file can pull in the text from underneath these black boxes. Just as happened when the New York Times published leaked CIA details, readers sussed out that the redaction was fluff and extracted the sensitive data. Probably no one (certainly not me) would have cared about a master cylinder, spring or no spring, had this transcript not turned into a data protection gaffe.
The redacted text under the black boxes can be extracted as easily as Ctrl-A, Ctrl-C, open Notepad, Ctrl-V.
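To see why the black boxes fail, here is a minimal sketch in Python. It is a toy model, not a real PDF parser: the Page, TextRun and Box classes and the cover_with_rectangle and redact functions are all names I made up for illustration, and the text and coordinates are placeholders. The idea it shows is simply that a page keeps its text and its drawn shapes as separate items, so painting a rectangle on top changes what you see but not what copy and paste picks up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle on the page (hypothetical units)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        # True when the two rectangles share any area.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class TextRun:
    """A piece of text and the area it occupies on the page."""
    text: str
    box: Box


@dataclass
class Page:
    """Toy page: text runs and filled rectangles are stored separately."""
    runs: list = field(default_factory=list)
    rects: list = field(default_factory=list)


def extract_text(page: Page) -> str:
    # What select-all and copy sees: every text run, whatever is drawn on top.
    return " ".join(run.text for run in page.runs)


def cover_with_rectangle(page: Page, area: Box) -> Page:
    # Cosmetic "redaction": add a black box, leave the text untouched.
    return Page(runs=list(page.runs), rects=page.rects + [area])


def redact(page: Page, area: Box) -> Page:
    # Real redaction: drop every text run under the box, then draw the box.
    kept = [run for run in page.runs if not run.box.overlaps(area)]
    return Page(runs=kept, rects=page.rects + [area])


# Placeholder content; "SECRET PART" stands in for the sensitive phrase.
page = Page(runs=[
    TextRun("The car used a", Box(0, 0, 100, 10)),
    TextRun("SECRET PART", Box(105, 0, 180, 10)),
    TextRun("in the braking system.", Box(185, 0, 320, 10)),
])
secret_area = Box(104, -1, 181, 11)

covered = cover_with_rectangle(page, secret_area)
redacted = redact(page, secret_area)

print(extract_text(covered))   # the secret is still in the extracted text
print(extract_text(redacted))  # the secret has been removed from the page
```

In this model the covered page still gives up "SECRET PART" when its text is extracted, while the properly redacted page does not. Real PDF files are far more complicated than this, but the lesson is the same: the sensitive text has to be removed from the file, not just hidden from view.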
Now, naturally the FIA addressed this privacy issue on its website when it discovered that its censorship technique was flawed. If you visit the site now and attempt to download the transcript you'll get a more effectively redacted version. But, what with the internet being full of personal websites and blogs, you would not expect the sensitive data to just disappear. It's pretty easy to find copies of the document.
I found some 'de-classified' copies by picking a random phrase from the censored version and using Google to look for pages and files containing it. At least a couple of the results were for copies of the unredacted document. Now I am not exactly in a position to build a record-breaking Formula One racing car, but this exercise does demonstrate that, once the cat is out of the bag, it's out to stay.