The growing list of privacy-enhancing technologies (PETs) can leave privacy professionals puzzled. This post explains what each technology achieves and when each makes the most sense.
We will consider protecting privacy with respect to the processing of personal information to produce insights (analyses, models, statistics), which covers virtually all of data science. We will take it for granted that proper data governance and security standards are in place, so that only users with the proper rights have access to data. We propose a framework to select the technologies that make the most sense based on the objectives and risks of such projects.
The two types of risks to consider in a privacy strategy
Data is stored by an Input Party that agrees to let an Output Party learn insights from it, provided that personal information is sufficiently protected. The Output Party defines what they are interested in learning (the processing task). The task is executed in a staging environment and the output of the processing is sent to the Output Party.
During this process, many things can go wrong in a way that exposes private information (eavesdroppers, nefarious processing tasks, security breaches…). We define the Attacker Model as all the actions that may be undertaken by an attacker trying to get access to personal information. A privacy strategy is a response to a predefined attacker model.
Privacy risks are of two fundamentally different natures:
- Input privacy risks: this is the risk that a party gets undue access to some input data during staging or processing. It is linked to the architecture of the data flow and grows with the number of copies or users that can access input data.
- Output privacy risks: this is the risk that some personal information remains in the output data that is shared with the Output Party. It is linked to the types of processing tasks, how they are handled, and the definition of anonymous information under our attacker model.
Technologies to address input privacy risks
Data transits from the Input Party to the staging environment where it is processed, and on to the Output Party. Each environment increases the attack surface: more copies, more risks. Here are the main options to minimize this risk.
Local Processing
Local Processing is an architecture in which the input parties compute tasks for the output parties on their own local infrastructure. It is the simplest way to address input privacy risks.
Pros: Data does not leave the input party, control over processing tasks.
Cons: Requires computing power next to the data.
Trusted Third-party
If data is located across several collaborating parties that do not want to share data among themselves, one can use a trusted third party.
Pros: Simplest way of combining data from multiple parties
Cons: Data is still copied, which extends the attack surface to the trusted party.
Federated Learning (FL)
Federated Learning is an architecture in which the learning tasks are distributed across all the input parties' data infrastructures. It makes sense when moving data to a centralized agent is not an option and computing locally is possible. The term was coined by Google for training machine learning models, but the approach can be extended to any calculation across many parties. There are several FL flavors depending on whether all information from a given individual is located on a single node (horizontal FL) or sits across different nodes (vertical FL). Even though actual implementations of vertical FL exist, it is still very experimental and beyond the scope of this article (here is a good description of FL types).
Pros: No copies of the data are necessary
Cons: Requires computing power next to data. Added complexity in calculation (model biases, bandwidth, data availability)
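To make the idea concrete, here is a minimal sketch of horizontal federated aggregation in Python with plain NumPy; the party data and the statistic computed are purely illustrative, and production FL frameworks add secure aggregation, scheduling and fault tolerance on top of this pattern.

```python
# Minimal sketch of horizontal federated aggregation (illustrative data).
# Each input party computes a local update on its own infrastructure;
# only aggregates and sample counts ever leave the premises.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical local datasets held by three input parties.
local_datasets = [rng.normal(loc=5.0, scale=2.0, size=n) for n in (120, 80, 200)]

def local_update(data):
    # Runs on the input party's own infrastructure.
    return data.mean(), len(data)

# The coordinator only receives local aggregates, never raw records.
updates = [local_update(d) for d in local_datasets]
total_n = sum(n for _, n in updates)
global_mean = sum(m * n for m, n in updates) / total_n

print(f"Federated estimate of the mean: {global_mean:.2f}")
```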
Secure Multi-Party Computation (SMPC)
Secure Multi-Party Computation is a protocol that lets several parties split a computation among themselves so that no party sees the others' contributions. The output is eventually reconstructed for the output party. It protects data between collaborating but not fully trusted parties, at the cost of high computing and bandwidth overheads.
Pros: Same as horizontal FL, plus the protection of individual contributions
Cons: Large CPU and bandwidth impact that grows with task complexity. Same complexity in managing distributed learning tasks as in FL.
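Here is a minimal sketch of additive secret sharing, the building block behind many SMPC protocols, with illustrative values: each party splits its private number into random shares so that no single share reveals anything, yet the shares can be recombined to compute a sum.

```python
# Minimal sketch of additive secret sharing over a large prime field.
import secrets

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(value, n_parties):
    # Split `value` into n_parties random shares that sum to it modulo PRIME.
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

private_values = [42, 17, 99]  # one secret per party
n = len(private_values)

# Each party distributes one share to every participant...
all_shares = [share(v, n) for v in private_values]

# ...each participant sums the shares it received (this partial sum
# reveals nothing about individual inputs)...
partial_sums = [sum(all_shares[p][i] for p in range(n)) % PRIME
                for i in range(n)]

# ...and only the recombination of all partial sums yields the output.
print(sum(partial_sums) % PRIME)  # 158 == 42 + 17 + 99
```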
Trusted MPC Consortium
An interesting combination of SMPC and a trusted party is a trusted consortium using SMPC. The data is split across all the consortium participants, who cannot learn anything about the source data as long as all the pieces are not recombined.
Pros: Data from each input party is well protected, processing tasks can be hidden to the input parties
Cons: CPU and network impacts grow very fast with task complexity. Hiding the task makes it hard to control output privacy. If consortium members collude they can reconstruct sensitive information.
[Single-party Homomorphic Encryption (HE)]
Homomorphic Encryption is a cryptographic solution that allows computation on encrypted data. Its main advantage is to run a computation without revealing the processing task to the data owner, nor the data to the computing party. For instance, it allows protecting a neural network that cost millions to train. In our case, it makes little sense to let a third party run a secret computation on the original data and share the output with them (the output may be a simple extract of the data!).
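For completeness, here is what computing on encrypted data looks like with the additively homomorphic Paillier scheme, assuming the open-source python-paillier (`phe`) package; fully homomorphic schemes support richer operations but follow the same principle.

```python
# Minimal sketch of computing on encrypted data (additive homomorphism).
from phe import paillier  # assumes the python-paillier package is installed

public_key, private_key = paillier.generate_paillier_keypair()

# The input party encrypts its values and shares only ciphertexts.
salaries = [3200, 4100, 2900]
encrypted = [public_key.encrypt(s) for s in salaries]

# The computing party adds ciphertexts without ever seeing the data.
encrypted_total = sum(encrypted[1:], encrypted[0])

# Only the holder of the private key can read the result.
print(private_key.decrypt(encrypted_total) / len(salaries))  # 3400.0
```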
Technologies to address output privacy risks
Protecting input privacy says very little about whether the information that is eventually sent to the Output Parties contains personal data. This is all the more critical since output parties are often the ones defining the processing tasks.
Addressing output privacy risks means ensuring output data is anonymous with regard to our attacker model. The best definition of anonymous information at our disposal is Differential Privacy because it makes the smallest set of assumptions about the attacker, which makes it universally applicable. Weaker definitions (k-anonymity, absence of PII) can be acceptable for weaker attacker models. Their main blind spot is when Output Parties may combine output data with auxiliary information.
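For reference, a randomized mechanism M is ε-differentially private if, for any two datasets D and D′ that differ by a single individual's record, and for any set of outputs S:

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]$$

The smaller ε, the less the presence or absence of any single individual can change what an attacker may learn from the output, whatever auxiliary information they hold.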
Here is a list of solutions to address output privacy risks.
Differentially Private Mechanisms (DP-mechanisms)
DP-mechanisms are algorithms that implement the Differential Privacy formalism, in the sense that they give a measure of the privacy risk with regard to the differential privacy definition. They can guarantee that their outputs do not allow significant information leakage about individuals. This is the gold standard of output privacy protection, as it makes the smallest set of assumptions about the attacker. The theory requires the addition of some randomness to the output.
Pros: Privacy guarantees irrespective of processing tasks, control over the privacy/utility trade-off
Cons: Perfect accuracy is forbidden by theory; adds computational overhead
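As a concrete example, here is a minimal sketch of the Laplace mechanism, the simplest DP-mechanism: noise calibrated to the query's sensitivity and to the privacy budget ε is added to the true answer (the dataset and ε values are illustrative).

```python
# Minimal sketch of the Laplace mechanism for a counting query.
import numpy as np

rng = np.random.default_rng()
ages = np.array([34, 45, 29, 62, 51, 38, 44, 27])  # hypothetical records

def dp_count_over_40(data, epsilon):
    true_count = int((data > 40).sum())
    sensitivity = 1  # adding or removing one person changes a count by at most 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Smaller epsilon means more noise and a stronger privacy guarantee.
print(dp_count_over_40(ages, epsilon=1.0))
print(dp_count_over_40(ages, epsilon=0.1))
```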
Data Masking (DM)
Data Masking encompasses a set of techniques to blur, substitute or remove parts of each record in order to make individual re-identification harder. Values can be deleted, truncated, or replaced by more general ones (e.g. averages), which can also make aggregation easier. It does not implement a rigorous definition of anonymous information but can be sufficient under certain attacker models (for example with reasonable trust in the output parties or high levels of aggregation).
Pros: Easy way to remove trivially identifying information and mitigate risk
Cons: Altered data may lose utility (especially for AI models); no rigorous theory, which means compliance and legal may need to step in for any new project
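Here is a minimal sketch of typical masking operations with pandas (column names and values are illustrative): drop direct identifiers, truncate quasi-identifiers, bucket numeric values.

```python
# Minimal sketch of data masking on a toy table.
import pandas as pd

df = pd.DataFrame({
    "name":       ["Alice Martin", "Bob Durand"],
    "zip_code":   ["75011", "69003"],
    "birth_date": ["1987-04-12", "1992-11-30"],
    "salary":     [42300, 51800],
})

masked = (
    df.drop(columns=["name"])                                # remove direct identifier
      .assign(
          zip_code=lambda d: d["zip_code"].str[:2] + "xxx",  # truncate to region
          birth_date=lambda d: d["birth_date"].str[:4],      # keep only the year
          salary=lambda d: (d["salary"] // 10000) * 10000,   # bucket into 10k bands
      )
)
print(masked)
```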
Synthetic Data
Synthetic data generation is a methodology to generate a new dataset that mimics the statistical properties of the original one. If applied before staging, the output data is protected as well. Interestingly, the original data may even be deleted. Synthetic data itself may be generated by a DP-mechanism in order to follow a formal definition of anonymous information. If it has not, it is highly likely that some personal information persists, even though it can be hard to extract.
Pros: Original data is not copied and may even be deleted. Protection does not depend on processing task
Cons: Accuracy can be affected with no way to benchmark against original data
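Here is a deliberately simple sketch of the idea with illustrative data: resample each column from its fitted or empirical distribution. Real generators also capture correlations between columns, for example with copulas, Bayesian networks or GANs, and can be trained under a DP-mechanism.

```python
# Minimal sketch of synthetic data generation from per-column distributions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

original = pd.DataFrame({
    "age":    rng.integers(18, 90, size=1000),
    "city":   rng.choice(["Paris", "Lyon", "Lille"], size=1000, p=[0.6, 0.3, 0.1]),
    "income": rng.normal(35000, 8000, size=1000).round(2),
})

n = len(original)
synthetic = pd.DataFrame({
    # resample each column independently from its empirical or fitted distribution
    "age":    rng.choice(original["age"].to_numpy(), size=n),
    "city":   rng.choice(original["city"].to_numpy(), size=n),
    "income": rng.normal(original["income"].mean(), original["income"].std(), size=n).round(2),
})

# The synthetic table mimics marginal statistics without copying records.
print(original[["age", "income"]].describe().loc[["mean", "std"]])
print(synthetic[["age", "income"]].describe().loc[["mean", "std"]])
```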
A workflow to define the right privacy architecture
To find the best PET for your needs, we recommend following these steps.
1/ Define the learning objectives
Knowing what tasks should be supported is important to assess the protection to put in front of them. The more varied the tasks, the riskier. For this reason, it is natural not to allow more than what is expected (e.g. delete unnecessary data, forbid unexpected tasks). The list of tasks may include a finite list of questions (preset SQL queries) or much bigger families (all SQL queries, or all machine learning training jobs). The more diverse and unknown the tasks, the more risks we will need to address.
2/ Define your attacker model
We cannot protect against every possible breach for as long as data exists. But we can define a reasonable set of attack scenarios to protect against. It is intimately related to who the involved parties are and how much we trust them.
Since data will be shared with output parties, often based on processing tasks they define, it is essential to list the actions they may carry out (or the trust we have that they will not try them). Even a single aggregation query can leak personal information, and two queries can make it trivial to triangulate someone (see the sketch below). More complicated processing tasks, such as training a neural network, make it harder to assess risk and often lead to a lot more information being shared (think of the 175 billion weights of the GPT-3 model), which increases the risks dramatically. Do we assume Output Parties can combine outputs with information they already hold? That they may collude to triangulate users? Are we sure they will not submit nefarious learning tasks? Or do we refrain from making assumptions and want to be protected against any query combined with any auxiliary information?
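As a sketch of how little it takes, the differencing attack below recovers one person's exact salary from two innocuous-looking aggregate queries (names and figures are made up).

```python
# Minimal sketch of a differencing attack with two aggregate queries.
import pandas as pd

df = pd.DataFrame({
    "name":   ["Alice", "Bob", "Carol", "Dan"],
    "dept":   ["Sales", "Sales", "Sales", "HR"],
    "salary": [40000, 52000, 47000, 45000],
})

# Query 1: total payroll of the whole company.
q1 = df["salary"].sum()

# Query 2: total payroll of the Sales department only.
q2 = df.loc[df["dept"] == "Sales", "salary"].sum()

# An attacker who knows Dan is the only person outside Sales learns his salary.
print(q1 - q2)  # 45000
```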
3/ Combine solutions to defend against this attacker model
With a good idea of the objectives and of what could go wrong, we can review the solutions at our disposal to design the best privacy architecture. The full combinatorics of situations would not fit in this blog post, but let's give some high-level guidelines.
Implement standard practices first: Before looking into advanced attack scenarios, make sure to implement standard practices: data minimization (input only required data), data governance and access control (no unnecessary access rights), audit trails (ex-post checks can be a sufficient deterrent with honest-but-curious output parties).
Pick the simplest way to address the threats from your attacker model: make sure resorting to an advanced set-up is justified by the threats from the attacker model, as more complexity also brings in new risks. A centralized repository is the simplest way to go; make sure moving away from it is worth the complexity.
When in doubt about the attacker model, stay on the safe side: Many aspects of the attacker model are hard to define precisely. If in doubt, stay safe and opt for the robust approach. This principle is especially important for output privacy. It is often impossible to assess what kind of auxiliary data an output party may have access to. The gold standard to protect against those risks is Differential Privacy. If you fall back to traditional anonymization techniques, make sure you do so with a clear understanding of the remaining risks.
Conclusion
Different privacy-preserving technologies address different risks. When data is used for learning objectives (data science, analytics, statistics), privacy risks fall into two categories: input privacy risks (the input data is illegally accessed during processing) and output privacy risks (some personal data remains in the output that is shared with the learner). Input privacy risks are easy to address when data can be processed in a trusted central repository. They become very challenging when bits of sensitive data need to be joined from multiple locations in order to carry out the learning task.
Just like it does not make sense to put burglar bars on your windows if you leave the door open, it does not make sense to invest in fancy input privacy technologies if you let personal information leak out via the output. While illegal access to staged data is a possibility, access to output data is a certainty. Differential privacy is the gold standard to assess such risks but it is a very demanding framework. Weaker approaches may suffice for weaker threats.
Here are the steps you should follow when planning on using personal data for analysis or machine learning:
- Define your goals: list the learning objectives you want to achieve;
- Define the threats: have a clear understanding of the actions that may be undertaken and how such actions can lead to information exposure;
- Build your defence: make sure each threat meets the appropriate block.
Privacy-enhancing technologies have made significant progress, bringing more security to even the most sophisticated scenarios. Make sure you approach the process from the right angle.