Privacy regulations define anonymous information as information that does not relate to an individual, but they give little guidance on how to make this definition actionable. In Europe, the Article 29 Working Party published an opinion assessing anonymization techniques, but its approach focused mainly on legacy methods and failed to define a systematic framework applicable in all cases.
In this post, we propose a formal definition of anonymization based on the notion of re-identification risk and inspired by recent research on privacy. We then introduce an actionable methodology to assess anonymization.
Building blocks to study anonymization
Personal information
Personal information is information that relates to a natural person (e.g.: name, email address, location, online activity, bank statements, DNA, text messages, wishes, having responded to a survey or not…).
Identifying a user
Identifying someone means isolating an individual using some information unique to them, with some level of confidence. Arguably, no identification method succeeds with probability 1, as any information may contain errors.
Identifying someone can be based on any attribute or combination of attributes, as long as it is unique to the individual (e.g.: name, face, the fact that they walk past your front door every morning). Take the name, for instance. There are more than 44k John Smiths in the US, so the name alone may not be identifying. But with a little more information factored in, things are quite different. On a list of patients admitted to a hospital, the older John Smiths living nearby have a much higher chance of being identified with confidence. So {name + database_context} can already be identifying. We will use the term identification key to refer to the subset of information sufficient to identify someone with high enough probability.
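As a toy illustration, here is a minimal sketch (the dataset and column names are purely hypothetical) of how one might check whether a combination of attributes forms an identification key: if the attributes an attacker knows match a single record, that combination is identifying on this list.

```python
import pandas as pd

# Hypothetical admissions list, used only to illustrate the idea.
patients = pd.DataFrame({
    "name": ["John Smith", "John Smith", "Jane Doe"],
    "age":  [72, 34, 51],
    "zip":  ["02139", "94110", "02139"],
})

def matching_records(df: pd.DataFrame, key: dict) -> int:
    """Count how many records share all the attributes the attacker knows."""
    mask = pd.Series(True, index=df.index)
    for column, value in key.items():
        mask &= df[column] == value
    return int(mask.sum())

print(matching_records(patients, {"name": "John Smith"}))                  # 2: ambiguous
print(matching_records(patients, {"name": "John Smith", "zip": "02139"}))  # 1: identifying here
```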
Attackers and attacker model
We will call an attacker a person, a group, or a process that could try to extract personal information from a dataset. To paraphrase GDPR, the attacker model describes “all the means reasonably likely to be used” by the attacker. It includes technological means but also any knowledge, personal or not, possibly available to the attacker. It should even include information made available to the attacker in the future!
Re-identification
There is re-identification risk if an attacker can use the available data to extract new information related to an individual. The attacker can use all means available in the attacker model.
We make no formal distinction between social security numbers, GPS location, and names. All three can be identification keys. The important question is “can they be known to an attacker?”.
Note that a re-identification attack does not require identifying the user in the dataset itself. It is enough that combining the attacker's knowledge with the data allows extracting information about an identifiable individual. The fact that data is aggregated is, by itself, at best anecdotal evidence of anonymity.
What does it mean to “extract new personal information”?
The concept of “new personal information” needs some clarification. We want it to focus on information linked to the individual’s data and not to general knowledge applicable to many individuals.
Take the link between smoking and cancer. Assume that an attacker knows that some individual smokes. Our attacker reads a medical journal and learns that there is a relationship between cancer and smoking. They learn something new about the individual: they are more likely to develop cancer. Does it mean that the medical publication carries re-identification risk? That would make for a very ineffective definition and divert us from the goal of preserving privacy. We don't want to confuse privacy risk (revealing something specific about someone) with the nonetheless important question of scientific ethics (scientific progress with adverse effects on individuals).
To exclude general knowledge from our definition, we will say that there is a re-identification risk only if the new personal information could not have been extracted had this individual been removed from or added to the source data (since the absence of a user can also lead to re-identification, general knowledge should resist both adding and removing someone). The link between cancer and smoking holds when removing or adding any participant from the clinical trial: it is general knowledge.
We finally have a useful definition of re-identification risk: a dataset carries re-identification risk if an attacker, with all the means and information from the attacker model, can extract new personal information about an individual, excluding general knowledge.
Defining anonymous information
GDPR defines “anonymous information [as] information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable (GDPR recital 26)”. We can rephrase this definition with our definition of re-identification risk: anonymous information is information with no significant re-identification risk under a specific attacker model.
Globally anonymous information
Ideally, we would find a way to construct anonymous information that does not depend on the attacker model. This train of thought led to the introduction of Differential Privacy by Cynthia Dwork, which won her the Gödel Prize. She gave a formal mathematical meaning to the concept of anonymous information so that it can be rigorously studied without making arbitrary assumptions about the attacker. See here for a good introduction.
With differential privacy, we can design data-processing mechanisms that produce anonymous information irrespective of all attacker models. We will call data generated by this assumption-free approach globally anonymous.
One can show that the only way to be globally anonymous is to add randomness to the output of the mechanism. The noise should cover the impact of adding or removing any individual from the input data. The level of noise quantifies the level of privacy guarantees that can be reached.
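For completeness, here is the usual formal statement of ε-differential privacy, which makes this intuition precise (the notation is standard, not taken from the text above):

```latex
% A randomized mechanism M is \epsilon-differentially private if, for all
% datasets D and D' that differ by adding or removing one individual,
% and for every set S of possible outputs:
\Pr[\, M(D) \in S \,] \;\le\; e^{\epsilon} \, \Pr[\, M(D') \in S \,]
```

The smaller ε, the less the output can depend on any single individual, which is exactly how the level of noise translates into a level of privacy guarantee.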
A weaker definition: locally anonymous information
Being anonymous irrespective of the attacker model is very appealing, but it is also quite hard to achieve, and it protects against re-identification attacks that may be unrealistic in a real-world setup. We can introduce a relaxed definition by making additional assumptions about the attacker model.
We say that information is locally anonymous if there is no re-identification risk under a specific attacker model. Being locally anonymous requires making explicit assumptions with regard to the additional information and means that can be leveraged.
When considering a deterministic anonymization technique, the same processing applied to data with one less individual is likely to yield a different output. Someone who knows the output computed without this individual therefore learns something personal about them. The result is only locally anonymous, subject to the assumption that this slightly different output is not available to an attacker. This is called a differencing attack and is a common way of performing re-identification. Most traditional data masking methods (pseudonymization, substitution, aggregation with or without k-anonymity, l-diversity, or t-closeness) cannot reach the globally anonymous standard! One would need to list the assumptions they rely on before claiming the data is properly anonymized.
Examples
A video surveillance dataset containing footage of cars where the plates have been removed is never globally anonymous, because someone who knows the makes of the cars of some people commuting on that route around that time can identify them. From the video, they may learn a lot more about them too. It can be locally anonymous under the assumption that such information is never available to our attackers.
The exact count of patients with cancer at a hospital is not globally anonymous either. Someone who knows the same count computed before the last admission and knows who was the last to be admitted would learn whether that patient had cancer. It could be locally anonymous with the assumption that no attacker would possess such counts.
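To make the hospital example concrete, here is a minimal sketch of such a differencing attack; the counts are fabricated toy values, only the arithmetic matters.

```python
# Differencing attack on exact counts released by a hospital (toy numbers).
count_before_last_admission = 17  # exact cancer count the attacker already knows
count_after_last_admission = 18   # exact cancer count released afterwards

# The attacker knows who was admitted last, so the difference of the two
# exact counts reveals that patient's diagnosis.
last_patient_has_cancer = (count_after_last_admission - count_before_last_admission) == 1
print(last_patient_has_cancer)  # True
```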
A systematic framework to assess anonymization
Those definitions allow us to study all anonymization techniques.
1. Does the anonymization produce globally anonymous data?
Working with globally anonymous data removes the need to document the attacker model precisely. The overall compliance process becomes orders of magnitude lighter. Only variants of Differential Privacy can achieve the standard of globally anonymous data.
It does require proper parametrization, but best practices from regulators and practitioners are slowly emerging, making compliance analyses more scalable.
Provided that its parametrization is correct, such globally anonymous information could be deemed anonymous by design.
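As an illustration of the kind of parametrization involved, here is a minimal sketch of a standard noisy-count release (the Laplace mechanism); the value of epsilon below is illustrative only and is precisely the parameter that would need to be justified.

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float, rng: np.random.Generator) -> float:
    """Release a count with Laplace noise of scale 1/epsilon.

    Adding or removing one individual changes a count by at most 1, so this
    calibration gives epsilon-differential privacy for counting queries.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(seed=0)
print(noisy_count(18, epsilon=1.0, rng=rng))  # randomized output around 18
```

With noise calibrated to the impact of a single individual, the differencing attack sketched earlier no longer yields a definite answer about the last admitted patient.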
2. If not, did you document the attacker model in detail?
If the released data is not globally anonymous, it may still be locally anonymous. First, this requires listing precisely and justifying the assumptions about the attackers' means. From there, one needs to prove that, based on this attacker model, no re-identification attack can be built. This really resembles detective work: one should put themselves in the position of the attacker and look for all the ways the released data could be used to extract personal information. This self-evaluation process is highly case-specific and needs to be thoroughly documented, because the extent of the attacker's means can be daunting (it even includes information made public in the future). With more and more personal data being made public on social networks or through data leaks, there will never be a simple answer, and the evaluation may need revisiting over time.
If one does not perform this thorough analysis of what it takes to re-identify individuals and compare it to reasonable attacker models, they are probably just doing luck-based anonymization!
3. A possible alternative to long lists of assumptions: Safe Harbors
A way to make this process more manageable could be to propose Safe Harbors, as HIPAA does. For data that is aggregated at a large scale, differential privacy might be overkill. This is, however, a risky endeavor, and we will not make concrete recommendations: this will always be subject to debate, as there is no objective metric. In HIPAA, having removed the prescribed attributes is not sufficient; the practitioner still needs to declare that they do not have “actual knowledge” of a re-identification risk, which is a lightweight version of defining an attacker model.
Conclusion
We have introduced a series of concepts:
- The attacker model, which is all the additional information, general knowledge, and technical means available to the attacker.
- Re-identification risk, which exists if an attacker, with the means and additional information from the attacker model, can extract new personal information specific to some individuals.
- A dataset without significant re-identification risk is considered anonymous. If the absence of risk requires assumptions on the attacker model, we call it locally anonymous. If it is true irrespective of the attacker model, it is globally anonymous.
Each assessment of anonymous information is a matter of estimating the re-identification risk. Only globally anonymous information can withstand any attacker. When the released data is not globally anonymous, the burden of proving that attackers cannot carry out re-identification attacks becomes a lot more challenging.
We believe this framework is fully consistent with modern privacy regulations. Globally anonymous information is likely to emerge as the default way to protect personal data. Locally anonymous information works best in trusted environments where strong assumptions about attackers can be made.