Consumer Data Research Centre

Case Studies Demonstrating Techniques to Achieve Data Minimisation


On 25 May 2018 the General Data Protection Regulation (GDPR) comes into force, and all processing of personal data taking place on or after that date must comply with it. Below is part 4 of a report authored by Veale Wasbrough Vizards (VWV) in partnership with the CDRC on how the GDPR will affect social science research.


Case Study 1: Use of Key-Coded Data

  1. A research study is being carried out by CDRC using data provided by a retailer. The data consist of customer IDs, postcodes and transaction data. The retailer retains the ‘key’ to the customer ID but would provide it if asked.

Such use of key-coded data is a common pseudonymisation technique. All use of the pseudonymised data should be carried out by people who are trained on GDPR requirements, bound by obligations of confidentiality and subject to restrictions regarding re-use and re-identification. Pseudonymised data are treated under the same strict security protocols as all personal data, and at the CDRC analysis is only undertaken on secure servers.

  2. A clinical study is being carried out for a pharmaceutical company by clinical investigators. In this study only key-coded data are passed by the clinical investigators to the company sponsoring the research.

The decryption keys are stored at the study site by the clinical investigators, who are trained in their institutions' GDPR requirements and are bound by professional duties of good clinical practice and confidentiality. The sponsor company is not authorised to call for the decryption keys.

As in the first example, such use of key-coded data is a common pseudonymisation technique. All use of the pseudonymised data should be carried out by persons trained on GDPR requirements, bound by obligations of confidentiality and subject to restrictions regarding re-use and re-identification.

When using an encryption key it is important to ensure that:

  • the same key is not used for different datasets as this would increase the risk of different datasets being linked; and
  • the key is stored securely at all times (see the sketch below).
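For illustration, the sketch below shows in Python how key-coded pseudonymisation might be implemented; it is not CDRC's actual procedure, and the keys, identifiers and records are hypothetical.

```python
# A minimal sketch, not CDRC's actual procedure: key-coded
# pseudonymisation using HMAC-SHA256 with a per-dataset secret key.
# Identical customer IDs map to identical pseudonyms within one
# dataset, while a fresh key for another dataset keeps the two
# unlinkable. All names and values here are hypothetical.
import hashlib
import hmac
import secrets

# Each dataset gets its own key, stored securely and separately
# from the data (satisfying both bullet points above).
dataset_key = secrets.token_bytes(32)

def pseudonymise(customer_id: str, key: bytes) -> str:
    """Derive a stable pseudonym for a customer ID under a given key."""
    return hmac.new(key, customer_id.encode(), hashlib.sha256).hexdigest()

records = [
    {"customer_id": "C1001", "postcode": "N1 9GU", "spend": 42.50},
    {"customer_id": "C1002", "postcode": "E1 6AN", "spend": 17.20},
]

# Replace the direct identifier; whoever holds dataset_key (here,
# the retailer) retains the 'key' and could re-identify on request.
for record in records:
    record["customer_id"] = pseudonymise(record["customer_id"], dataset_key)

print(records)
```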

Case Study 2: Maintaining the link between data values attributable to the same individual

A research institution has been provided with data that were originally sourced from a mobile app that uses Global Positioning System (GPS) geo-referencing to infer the speed at which users of the app run. The research institution would like to analyse the data to derive information about the average running speed of each user when using the app.

The research institution intends to process the following data about each user featured in the dataset to carry out this research:

  • User ID;
  • number of runs made by the user in a particular month; and
  • the distances covered and the times taken to complete each run.

To minimise the amount of personal data in the dataset, the research institution would like to replace the User ID numbers with artificial values. Although the User ID numbers are to be replaced, the research institution does not want to lose the link between different runs made by the same individual. This could be achieved using one of the following techniques:

  1. encryption, e.g. using the Advanced Encryption Standard (AES) algorithm in a deterministic mode – this ensures that identical original values are always mapped to identical modified values and that non-identical original values are always mapped to non-identical modified values;
  2. tokenisation, e.g. using a mapping table (sketched below).
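For illustration, here is a minimal Python sketch of the tokenisation option; the User IDs and run data are hypothetical.

```python
# A minimal sketch of tokenisation via a mapping table; the User IDs
# and run data are hypothetical. Each User ID is replaced by a random
# token, and the same ID always receives the same token, so runs by
# one user stay linked. The mapping table itself acts as the key and
# must be stored securely (or deleted if re-identification is never
# to be possible).
import secrets

mapping_table: dict[str, str] = {}

def tokenise(user_id: str) -> str:
    """Return a stable random token for a given User ID."""
    if user_id not in mapping_table:
        mapping_table[user_id] = secrets.token_hex(8)
    return mapping_table[user_id]

runs = [
    {"user_id": "U42", "distance_km": 5.0, "time_min": 27.5},
    {"user_id": "U42", "distance_km": 10.0, "time_min": 58.1},
    {"user_id": "U7", "distance_km": 3.2, "time_min": 19.0},
]

for run in runs:
    run["user_id"] = tokenise(run["user_id"])

# Both U42 runs now carry the same token, so average running speed
# per user can still be computed without the original User ID.
print(runs)
```

A random token, unlike a hash of the User ID, cannot be recomputed by anyone who guesses the original ID, at the cost of having to protect the mapping table itself.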

Case Study 3: Hashing

  1. A research institution is carrying out research into the amount of coffee that is drunk by people of different ages. It is provided with data (an extract of which is shown in the table below). The data are supplied by a café chain and detail the number of coffee purchases made by loyalty scheme participants along with their dates of birth.
Loyalty Card No | Date of birth | Number of coffee purchases per month
15618           | 08/11/1992    | 18
14555           | 09/05/1971    | 3

The nature of the data means that hashing and data banding could be used to considerably reduce the likelihood of re-identification as demonstrated below.

Hashed loyalty card holder ID                                    | Age band | Number of coffees drunk per month
ea840312edaf4c00a97c5d89cbf11f8fa4c411d1b1be274415d5b64b55adf0c6 | 21-30    | 18
c3ec998eb12f3655104d47770a0d71b4a1f3ef287d6ccf032d60403864b11d2f | 41-50    | 3


Here the SHA-256 cryptographic hash function has been used.
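For illustration, the Python sketch below reproduces this transformation on the example rows (note that an unkeyed hash of a short card number is, in principle, vulnerable to brute-force reversal, so a keyed or salted hash would often be preferable in practice):

```python
# A minimal sketch of the transformation shown in the table above:
# loyalty card numbers are replaced by SHA-256 digests and dates of
# birth are coarsened into ten-year age bands (ages as at 25 May
# 2018, the date used here for illustration).
import hashlib
from datetime import date

def hash_id(card_no: str) -> str:
    """Return the SHA-256 digest of a loyalty card number."""
    return hashlib.sha256(card_no.encode()).hexdigest()

def age_band(dob: date, on: date = date(2018, 5, 25)) -> str:
    """Map a date of birth to a band such as 21-30 or 41-50."""
    age = on.year - dob.year - ((on.month, on.day) < (dob.month, dob.day))
    lower = (age - 1) // 10 * 10 + 1  # bands 21-30, 31-40, ...
    return f"{lower}-{lower + 9}"

rows = [
    ("15618", date(1992, 11, 8), 18),
    ("14555", date(1971, 5, 9), 3),
]

for card_no, dob, coffees in rows:
    print(hash_id(card_no), age_band(dob), coffees)
```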

  2. Footfall sensors detect and record WiFi probes from devices located nearby or passing in close proximity. The device identifiers (MAC addresses) contained in the probes are hashed at the point of collection, and these hashed codes are used to identify stationary devices such as printers during data cleaning. The individual hashed observations are then sent to the researchers' secure server, where they are cleaned to provide estimates of footfall.

Case Study 4: Ethnicity Estimation

Stage 1: Researchers obtain the names of a sample of company directors from the Companies House website. The names of all Kenyan students graduating in 2018 are obtained from the websites of institutions registered in that country.

Stage 2: Following a successful application, CDRC provides those researchers with software by secure file transfer. This makes it possible for those researchers to estimate the ethnicity of every company director and the gender of every Kenyan graduate. The results are used to research (a) the success of individuals of Kenyan origin in becoming company directors and (b) whether success is proportional to academic prowess as indicated by male and female success in obtaining academic qualifications in Kenya. Although the researchers make predictions at the level of the individual, these are kept on the researchers' secure site and results are only published in aggregate with suppression of very small counts.
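For illustration only, the sketch below shows the general shape of such name-based estimation followed by aggregate reporting with suppression; it is emphatically not the CDRC software, and all names, mappings and counts are hypothetical.

```python
# A toy illustration -- not the CDRC software, whose methods are not
# described here. A lookup table maps names to a predicted origin,
# individual predictions stay on the secure server, and only
# aggregate counts are released, with small counts suppressed (the
# threshold of 3 is taken from the discussion below). All names and
# mappings are hypothetical.
SUPPRESSION_THRESHOLD = 3

# Hypothetical name-to-origin reference data.
name_origin = {"Wanjiku": "Kenyan", "Otieno": "Kenyan", "Smith": "Other"}

directors = ["Wanjiku", "Smith", "Otieno", "Smith"]

counts: dict[str, int] = {}
for name in directors:
    origin = name_origin.get(name, "Unknown")
    counts[origin] = counts.get(origin, 0) + 1

# Publish only aggregates, suppressing small cells.
published = {origin: (n if n >= SUPPRESSION_THRESHOLD else "suppressed")
             for origin, n in counts.items()}
print(published)
```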

** If the Office for National Statistics were unhappy with individual-level profiling, would it instead be acceptable to conduct this analysis by reporting aggregated figures of probable Kenyan company directors (with suppression if the estimate is below 3)? (Similarly, the software would be adapted to provide only an aggregate estimate of educational attainment.)

** If, instead of supplying software, CDRC took delivery of a list of paired given and family names (without knowing whether or not they pertained to any living individuals), could it code up the names by predicted ethnicity and age and send the results back to the supplier? If CDRC has no means of obtaining or matching with further data that would identify those names as relating to living individuals, the list would not be personal data in CDRC's hands. However, it would be personal data in the hands of the supplier, who would have to comply fully with GDPR requirements. Would any secure data transfer or pseudonymisation procedures be required?

Case Study 5: A Synthetic Population

Stage 1: Data are collected for the Census of Population by the Office for National Statistics, and aggregated small area statistics on ethnicity, employment and age are published (subject to suppression of small counts in accordance with established anonymisation procedures).

Stage 2: A synthetic population model is applied to the Census data to model the ethnicity, employment status and age of every adult in Anytown.
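For concreteness, a deliberately simplified Python sketch of Stage 2 follows; real synthetic population models are considerably more sophisticated, and all figures below are hypothetical.

```python
# A deliberately simplified sketch of Stage 2. Real synthetic
# population models (e.g. iterative proportional fitting or spatial
# microsimulation) are considerably more sophisticated; this version
# simply draws each attribute independently from hypothetical
# published small-area marginal distributions.
import random

random.seed(0)

# Hypothetical published aggregates for one small area of Anytown.
marginals = {
    "ethnicity": {"White": 0.80, "Asian": 0.12, "Black": 0.08},
    "employment": {"Employed": 0.70, "Unemployed": 0.05, "Inactive": 0.25},
    "age_band": {"18-34": 0.35, "35-64": 0.45, "65+": 0.20},
}
adult_population = 1200  # hypothetical published count for the area

def draw(distribution: dict[str, float]) -> str:
    """Sample one category according to its published proportion."""
    return random.choices(list(distribution),
                          weights=list(distribution.values()))[0]

synthetic_adults = [
    {attribute: draw(dist) for attribute, dist in marginals.items()}
    for _ in range(adult_population)
]
print(synthetic_adults[:3])
```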

Stage 3: Names and addresses are obtained for eligible voters whose names appear on the contemporaneous public version of the Electoral Register (estimated as 80 per cent of all voters and 65 per cent of the resident adult population). These data are matched to the Census data at small area level. The ethnicity and age of each named individual are estimated based upon their forename and surname.

Stage 4: The names and addresses from the Electoral Register are matched with members of the synthetic population in order to add estimates of the likely employment status of those eligible to vote. The results are used to analyse differences in the employment status of voters relative to non-voters in Anytown. 

Stage 5: In order to extend the analysis and to better understand the employment characteristics of non-voters, names and addresses of individuals who are likely ineligible to vote are obtained from a data reseller. All the processing takes place on CDRC’s secure servers.  Consents for re-use of the data were obtained when the data were collected in 2011 and a notice advising of the study has been posted on the CDRC website. The modelling and matching procedures are used to create a further classification of the ethnicity and age characteristics of Anytown adults that are ineligible to vote, as well as individuals who opted out of inclusion in the public Electoral Register.

Note on synthetic population data

Neither CDRC nor its client will take any measures or decisions in relation to any particular individual. There are no other factors about the project that are likely to cause substantial damage or distress to any individual. The safeguarding conditions have therefore been met. Provision of transparency notices to individual data subjects would take so long and cost so much that the research project would not be viable. Since this would be a disproportionate effort, and given that the safeguarding conditions have been met, CDRC does not need to provide transparency information to each individual.

It is conceivable that the synthetic population model could allow information to be inferred about natural persons. For example: assume that only a minute proportion of Anytown's population belongs to a particular ethnic group, and that of this minority, only a tiny number of individuals are employed and above the age of 75. This could hypothetically allow inferences to be drawn about natural persons from the corresponding profiles in the synthetic population.

This risk is highlighted in the EU’s Article 29 Working Party paper on Big Data[1]:

“big data processing operations do not always involve personal data. Nevertheless, the retention and analysis of huge amounts of personal data in big data environments require particular attention and care. Patterns relating to specific individuals may be identified, also by means of the increased availability of computer processing power and data mining capabilities.”

Accordingly, the question of whether any data within a synthetic population could be personal data for GDPR purposes can only be answered by considering the risk that the data could enable information to be inferred about natural persons.

The data contained in a synthetic population can be considered to fall outside the scope of the GDPR if they meet the GDPR's description of anonymous information: data that do not relate to an identified or identifiable natural person, or personal data which have been rendered anonymous in such a manner that the data subject is not or is no longer identifiable.

If there is any possibility of the data comprised in the synthetic population allowing a data subject to be identified, then for GDPR purposes the data are more likely to be pseudonymised data than anonymous information, and they should be treated as falling within the scope of the GDPR.

Case Study 6: Data Perturbation using Micro-aggregation

A research institution is carrying out research into the relative financial standing of individuals living in different postcodes of North London.

The researchers hold a dataset showing the age, gender, postcode, income and average monthly expenditure of 1,000 people living in North London.

The researchers believe there is a risk of identification if actual incomes are made available.

They choose to use micro-aggregation to disguise the actual incomes. They will replace the actual income values in the dataset with average values for small groups of units.

Each group contains a predefined minimum number "k" of units; k is a threshold value, and each group is called a k-partition.

The researchers divide the 1,000 individuals in the dataset into fifty k-partitions of 20 individuals. The incomes of individuals falling into the same k-partition will be represented in the dataset using an identical value.
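A minimal Python sketch of this univariate procedure, using synthetic incomes, follows:

```python
# A minimal sketch of univariate micro-aggregation with k = 20:
# individuals are sorted by income, split into fifty k-partitions of
# 20, and each actual income is replaced by its group's mean. The
# incomes are synthetic.
import random
from statistics import mean

random.seed(0)
K = 20

# Synthetic incomes for 1,000 North London residents.
incomes = [random.gauss(32000, 9000) for _ in range(1000)]

# Sort indices by income so each k-partition groups similar values,
# which limits the information lost to the averaging.
order = sorted(range(len(incomes)), key=lambda i: incomes[i])
aggregated = incomes[:]  # copy to hold the perturbed values

for start in range(0, len(order), K):
    group = order[start:start + K]
    group_mean = mean(incomes[i] for i in group)
    for i in group:
        aggregated[i] = group_mean

# The population mean is preserved (up to floating-point error).
print(round(mean(incomes), 2), round(mean(aggregated), 2))
```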

The advantage of this technique is that the mean income value for the whole population remains unchanged. If the researchers wished to increase the amount of data perturbation, they could apply micro-aggregation to all of the variables in the dataset in order to achieve k-partitions representing average values based on all of the variables in the data set. This is known as multivariate micro-aggregation.

Case Study 7: Encryption and noise addition

An online sportswear retailer analyses its customers’ buying habits so that it can recommend certain products to them.

The sportswear retailer along with some of its competitors has been asked to take part in a research initiative organised by a third party research body. The research requires correlating shoppers’ buying habits with public health data about arthritis.

Each participating retailer uses a secret-keyed cryptographic hash to create unique reference numbers from customers' names and addresses. The GP practices use the same algorithm and key to generate unique reference numbers from patient details. Once their datasets are created, both the sportswear retailer and the GP practices delete the key used for hashing.

The datasets supplied by the sportswear retailer and the GP practices can then be matched together on these reference numbers. The research body could add another round of encryption to ensure that neither the GP surgeries nor the sportswear retailer could ever link the data back to individual patients' or shoppers' identities.
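A minimal Python sketch of the scheme follows; HMAC-SHA256 is assumed as the keyed hash since the case study does not name one, the research body's second round is shown as a further keyed hash standing in for the extra round of encryption, and all keys and records are hypothetical.

```python
# A minimal sketch of the linkage scheme, assuming HMAC-SHA256 as the
# keyed hash (the case study does not name a specific algorithm). The
# keys and records are hypothetical, and the research body's second
# round is shown as a further keyed hash standing in for the extra
# round of encryption described above.
import hashlib
import hmac

def keyed_ref(identity: str, key: bytes) -> str:
    """Derive a reference number from identity details under a key."""
    return hmac.new(key, identity.encode(), hashlib.sha256).hexdigest()

# Key agreed by all participants and deleted once the datasets exist.
shared_key = b"shared-secret-agreed-by-participants"

retailer_ref = keyed_ref("Jane Doe|1 High St, N1", shared_key)
gp_ref = keyed_ref("Jane Doe|1 High St, N1", shared_key)
assert retailer_ref == gp_ref  # the two datasets match on this value

# Second round applied by the research body with its own key, so the
# original sources can no longer recognise the reference numbers.
research_key = b"research-body-private-key"
final_ref = keyed_ref(retailer_ref, research_key)
print(final_ref)
```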

In addition, the randomisation technique of noise addition could be used to make it harder for a third party to identify an individual, should they be able to detect how the data have been modified.
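As a final sketch, under the assumption of simple Gaussian noise on the purchase counts:

```python
# A minimal sketch of noise addition: a small random perturbation is
# added to each numeric value (here, hypothetical monthly purchase
# counts) so that exact values cannot be matched against outside
# information, while aggregate statistics remain approximately right.
import random

random.seed(0)

purchases_per_month = [4, 7, 2, 11, 5]

# Gaussian noise with standard deviation 1; in practice the scale
# would be tuned to the sensitivity of the data and the analysis.
noisy = [max(0, round(x + random.gauss(0, 1))) for x in purchases_per_month]
print(purchases_per_month, noisy)
```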

 

 

[1] http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp221_en.pdf