Test data from production and data protection

Posted in testing on December 11, 2016 by Adrian Wyssmann ‐ 5 min read

Using production data for testing is tempting, but as mentioned in my previous post Test Data Management, it should be considered carefully for legal reasons: both the US and Europe have strict laws, with penalties for not complying with them. The recently strengthened EU data protection rules - Regulation (EU) 2016/679 and Directive (EU) 2016/680 - are meant to ensure that people’s personal information is protected no matter where it is sent, processed or stored, even outside the EU. The Regulation defines processing very broadly:

‘processing’ means any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organization, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction; Regulation 2016/679

Furthermore

… The personal data should be adequate, relevant and limited to what is necessary for the purposes for which they are processed … the period for which the personal data are stored is limited to a strict minimum. Personal data should be processed only if the purpose of the processing could not reasonably be fulfilled by other means. … Personal data should be processed in a manner that ensures appropriate security and confidentiality of the personal data, including for preventing unauthorised access to or use of personal data and the equipment used for the processing. Regulation 2016/679, Par. 39

The principles of data protection apply to any information concerning an identified or identifiable natural person, unless that information has been anonymized:

… The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes Regulation 2016/679, Par. 26

As you can see, EU data protection law distinguishes three categories of data:

  • Personal data means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person; Regulation 2016/679, Art. 4
  • Anonymous data means data which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is no longer identifiable. Regulation 2016/679, Par. 26
  • Pseudonymous data is the result of pseudonymisation, which means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organizational measures to ensure that the personal data are not attributed to an identified or identifiable natural person; Regulation 2016/679, Art. 4
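To make the pseudonymous category more concrete, here is a minimal sketch of how direct identifiers in a record could be replaced by keyed pseudonyms before the data reaches a test environment. It is an illustration under assumptions, not a method prescribed by the Regulation: the function names, the field names and the example key are hypothetical, and the key plays the role of the "additional information" that has to be kept separately.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable keyed pseudonym for value (HMAC-SHA256 as hex)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_record(record: dict, fields: tuple, secret_key: bytes) -> dict:
    """Copy the record, replacing the listed direct identifiers with pseudonyms."""
    return {
        key: pseudonymize(str(val), secret_key) if key in fields else val
        for key, val in record.items()
    }


# Hypothetical key: in practice it would be loaded from a secret store
# that the test environment cannot read.
SECRET_KEY = b"example-key-kept-outside-the-test-environment"

# Hypothetical record; which fields count as direct identifiers is an assumption.
customer = {"name": "Jane Doe", "email": "jane@example.com", "city": "Bern"}
masked = pseudonymize_record(customer, ("name", "email"), SECRET_KEY)
```

The same input always maps to the same pseudonym, so relations between tables survive. Note that data treated this way is still pseudonymous, not anonymous: whoever holds the key can re-link the records, and the untouched attributes may still identify a person.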

Data Anonymization

In the context of testing, and more specifically if you want to use production data for testing, you have to either follow these strict regulations or anonymize the data. The regulation encourages anonymization and pseudonymization, but they may not be possible in all circumstances because of the nature of the test data needed, e.g. demographic test data. If you decide to use production data, proper anonymization is very important but also very difficult. The Article 29 Data Protection Working Party considers three risks concerning data anonymization:

  • Singling out refers to the possibility of isolating some or all records which identify an individual in the dataset
  • Linkability refers to the ability to link (at least) two records concerning the same data subject or a group of data subjects
  • Inference refers to the possibility of deducing, with significant probability, the value of an attribute from the values of a set of other attributes
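The first of these risks can be checked mechanically on a candidate test data set. The sketch below counts how many records share each combination of indirectly identifying attributes and reports the records whose combination is rare. The function names, the sample rows and the choice of zip and birth_year as quasi-identifiers are assumptions made for illustration only.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def singled_out(records, quasi_identifiers, k=2):
    """Return the records whose quasi-identifier combination occurs fewer than k times."""
    sizes = group_sizes(records, quasi_identifiers)
    return [r for r in records if sizes[tuple(r[q] for q in quasi_identifiers)] < k]


# Hypothetical rows with the direct identifiers already removed.
rows = [
    {"zip": "8000", "birth_year": 1980, "diagnosis": "A"},
    {"zip": "8000", "birth_year": 1980, "diagnosis": "B"},
    {"zip": "3000", "birth_year": 1975, "diagnosis": "C"},
]

# The third row is the only one with its zip and birth year, so it can be isolated.
unique_rows = singled_out(rows, ("zip", "birth_year"))
```

A check like this only addresses singling out. A data set in which every combination occurs several times can still allow linkability or inference.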

A variety of anonymization techniques exist. The Article 29 Data Protection Working Party analyses the effectiveness and limits of existing anonymization techniques against the EU legal background. It also provides recommendations on how to handle the different techniques with regard to the residual risk of identification that remains after applying them. Its findings are published in Opinion 05/2014 on Anonymisation Techniques and should be considered for data anonymization: http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf

I will not go into the details of the different techniques, but I would like to mention the following table from the opinion, which makes clear that a given technique does not necessarily eliminate all risks of identification:

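| Technique            | Is singling out still a risk? | Is linkability still a risk? | Is inference still a risk? |
|----------------------|-------------------------------|------------------------------|----------------------------|
| Noise Addition       | Yes                           | May not                      | May not                    |
| Substitution         | Yes                           | Yes                          | May not                    |
| Aggregation          | No                            | Yes                          | Yes                        |
| L-diversity          | No                            | Yes                          | May not                    |
| Differential privacy | May not                       | May not                      | May not                    |
| Hashing/Tokenization | Yes                           | Yes                          | May not                    |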

Among the techniques discussed in the opinion, differential privacy is the only one in this table for which none of the three risks is marked as certain to remain, so it offers the strongest anonymization with respect to EU data protection law. However, depending on the risk to mitigate and on who will receive the data, other techniques may be used instead or in addition.
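As a rough illustration of the idea behind noise addition and differential privacy, the sketch below answers a counting query with Laplace noise instead of the exact value. It is a simplified example under assumptions, not the procedure described in the opinion: the function names, the sample data and the value of epsilon are hypothetical, and a real setup would also have to limit how many queries are answered.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace distribution centred on 0 with the given scale."""
    # The difference of two independent exponential samples with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Count the matching records and add Laplace noise calibrated to epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random()

# Hypothetical data: how many people are 60 or older?
ages = [{"age": a} for a in (23, 35, 41, 58, 62, 67)]
result = noisy_count(ages, lambda r: r["age"] >= 60, epsilon=0.5, rng=rng)
```

A smaller epsilon means more noise and therefore stronger protection but less accurate answers. This is also why the technique fits aggregate results better than record-level test data.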

Beyond Anonymization

Besides personal data, which relates to an identified or identifiable natural person, there may also be non-personal information that cannot simply be processed, i.e. used for testing. An example is Article 5(3) of the e-Privacy Directive, which does not allow anyone to “store information or to gain access to information stored in the terminal equipment of a subscriber” without a detailed explanation of the purpose to, and explicit permission from, the user. I am not a lawyer, nor do I know all the rules and laws, but there are most certainly more such rules that regulate the usage of data.

Conclusion

Using production data for testing your software may be necessary, as generating such data artificially can be cumbersome. However, be aware that, depending on which country you are in and where your production data comes from, there are data protection laws that you (and your company) have to follow to avoid legal problems. Check whether your company has guidance on how data has to be handled, and if in doubt, consult a legal expert.