Recently, we collected users' reasons for killing HPC jobs. To publish the corresponding data set and data processing scripts, I made a copy of the scripts and data, assigned each username a unique id of the form “uX” (where X is a number), replaced every occurrence of each username in the data with its id, and reran the processing on the anonymized data for verification.
To my shock, the results did not match the results from the unanonymized data. WTH??
After a day of digging around (Yikes!), I figured out that I had baked some assumptions into my data processing Awk scripts (yes, Awk works great for many common data wrangling and processing tasks) while experimenting with the unanonymized data. The anonymization scheme broke these assumptions. Crap!!
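The post doesn't show the actual scripts, but here is one hypothetical example of the kind of structural assumption that a “uX” id can break: if a script slices fixed-width fields, replacing a 6-character username with a 3-character id shifts every column after it (the line contents and offsets below are illustrative, not from the real data).

```python
# Hypothetical illustration: a fixed-width slice that assumed the
# original username length breaks once the id has a different length.
line_orig = "2021-03-01 john23 OUT_OF_MEMORY"
line_anon = "2021-03-01 u17 OUT_OF_MEMORY"

print(line_orig[11:17])  # 'john23' -- the username, as the script assumed
print(line_anon[11:17])  # 'u17 OU' -- the fixed offsets no longer hold
```

A length-preserving anonymization scheme (like the one described below) sidesteps this entire class of breakage.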
So, I changed the anonymization scheme to respect the structure of usernames: I constructed each username's unique id by replacing every letter with a letter and every digit with a digit, keeping each character's position. For example, john23 was anonymized as oiwd85.
With this anonymization scheme, the results from processing the anonymized and the unanonymized data sets were identical. Woot!!
Assumptions: know the data and change the data but don’t break the data.
Here’s the script for this anonymization scheme.
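The original Awk script isn't reproduced here; in its place, a minimal Python sketch of the scheme under stated assumptions — the function name, the fixed seed, and the choice of lowercase letters are all illustrative, not from the original.

```python
import random
import string

def make_anonymizer(seed=42):
    """Return a function mapping each username to a stable pseudonym
    that replaces letters with letters and digits with digits,
    preserving each character's position (and hence the length)."""
    rng = random.Random(seed)  # fixed seed: an assumption, for repeatable runs
    mapping = {}               # same username -> same pseudonym, every time

    def anonymize(username):
        if username not in mapping:
            mapping[username] = "".join(
                rng.choice(string.ascii_lowercase) if ch.isalpha()
                else rng.choice(string.digits) if ch.isdigit()
                else ch  # keep any other character (e.g. '-') as-is
                for ch in username
            )
        return mapping[username]

    return anonymize

anon = make_anonymizer()
print(anon("john23"))  # four letters followed by two digits
```

Note that the pseudonyms are random, so two distinct usernames could in principle collide; for a real data set you would want to check the mapping for collisions before publishing.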
In this effort, data processing had two phases. At the end of the first phase, a human intervened to classify the data. So, I retained usernames until the end of the first phase and only then anonymized the data. This was wrong.
Instead, I should have anonymized the data before processing/analyzing it and built support to de-anonymize it when needed.
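That de-anonymization support could be as simple as persisting the username-to-pseudonym table and doing a reverse lookup during the human classification phase. A sketch under assumptions — the helper names and the JSON file format are mine, not from the post:

```python
import json

def save_mapping(mapping, path):
    """Persist the username -> pseudonym table built during anonymization."""
    with open(path, "w") as f:
        json.dump(mapping, f)

def deanonymize(pseudonym, path):
    """Reverse lookup: recover the username behind a pseudonym,
    or None if the pseudonym is unknown."""
    with open(path) as f:
        reverse = {v: k for k, v in json.load(f).items()}
    return reverse.get(pseudonym)
```

The mapping file is the one sensitive artifact in this setup, so it should stay access-controlled and never ship with the published data set.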
As for my data processing scripts, they could have been more flexible and permissive. That said, anonymizing at the beginning would have let me be productive even with my brittle scripts :)
So, the lesson for the day: “If you have to anonymize data, anonymize it before processing/analyzing it.”