Data is powerful because it reveals how people think or behave. Whether you analyze business trends or train an AI model, the bigger and deeper the data, the more valuable the results. But as the opportunities for innovation increase, so do the privacy risks: breaches are getting costlier and regulations stricter.
To reconcile innovation and data protection, a new field has emerged: Privacy-preserving Learning. It is a set of theoretical frameworks and technologies that aim to develop AI and analytics solutions while preserving privacy throughout the data science workflow.
We will explore some of these frameworks, explain how they protect privacy, and present how Sarus offers a novel approach to achieve privacy-preserving learning.
The goal of Privacy-preserving Learning
Privacy-preserving learning aims at two seemingly contradictory objectives: learning all sorts of insights from data without revealing the event-level information the data is made of. To understand how this is possible, we need to define more precisely what must be kept secret and what can be shared. It is helpful to think in terms of personal information, which relates to a single individual, and general knowledge, which holds true regardless of whether any particular individual is included in the data.
To preserve privacy, information that relates to individuals must be protected. General knowledge, on the other hand, can be extracted. Privacy-preserving learning is possible because machine learning is precisely about learning patterns that hold across individuals in general but are not specific to any one individual in particular. For example, one might want to learn whether there is a correlation between smoking and cancer without having any interest in whether a given participant smokes or has cancer.
The goal of Privacy-preserving Learning is to enable the acquisition of general knowledge while protecting personal information. These two objectives are not as contradictory as they might initially appear.
Protecting data throughout the data journey
When it comes to data protection, as with any security objective, one is only as protected as the weakest link, so personal data must be protected at every step. A simple learning workflow has two steps: staging the data and sharing results. Solutions have emerged to address data protection risks in both.
Staging phase: how to provide Input Privacy
The staging phase consists of creating an environment in which data practitioners can work on the data. The property of protecting personal data in the staging process is referred to as Input Privacy.
A common approach to securing this flow is to turn the original dataset into a less sensitive version using Data Masking techniques. However, data masking has well-known limitations: it offers no way to quantify how well the data is protected, it is hard to apply to rich or unstructured data, and expertise is needed to define masking rules for each use case.
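To make this concrete, here is a minimal sketch of what data masking rules might look like in practice: dropping a direct identifier, pseudonymizing another with a salted hash, and coarsening a numeric attribute. The dataset, column names, and rules are purely illustrative assumptions, not part of any real product.

```python
import hashlib

import pandas as pd

# Hypothetical records; column names are illustrative only.
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "email": ["alice@example.com", "bob@example.com"],
    "age": [34, 58],
    "smoker": [True, False],
})

def mask(df: pd.DataFrame) -> pd.DataFrame:
    """Apply simple masking rules: drop direct identifiers,
    pseudonymize email with a salted hash, and coarsen age into buckets."""
    out = df.drop(columns=["name"])
    salt = "static-salt"  # in practice the salt must be kept secret and rotated
    out["email"] = out["email"].map(
        lambda e: hashlib.sha256((salt + e).encode()).hexdigest()[:12]
    )
    out["age"] = pd.cut(out["age"], bins=[0, 30, 50, 120], labels=["<30", "30-50", "50+"])
    return out

print(mask(df))
```

Note how each rule has to be chosen by hand for this specific schema, which is exactly the expertise and maintenance burden mentioned above.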
To further reduce risk, the data owner can set up a remote execution environment so that data scientists work without accessing the data directly. This avoids copying the data and exposing the whole dataset to the data scientists, which is the source of many leaks. With a remote execution framework, only what is learned is shared with the data scientist. When the data is spread across multiple locations (e.g., personal devices or hospital servers), learning remotely is often referred to as Federated Learning.
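As an illustration of learning remotely, the sketch below simulates a simple form of federated averaging: each site fits a model on its own data and only the model weights travel to a coordinator. The sites, data, and model are hypothetical; real federated learning systems add secure aggregation and many more safeguards.

```python
import numpy as np

# Toy federated averaging: each site fits a local linear model on its own
# records and only shares model weights, never raw data.
rng = np.random.default_rng(0)

def local_fit(X, y):
    """Ordinary least squares on one site's local data."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Three hypothetical sites (e.g. hospitals), each holding its own records.
true_w = np.array([2.0, -1.0])
sites = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    sites.append((X, y))

# Only the weights travel to the coordinator; raw data stays put.
local_weights = [local_fit(X, y) for X, y in sites]
global_weights = np.mean(local_weights, axis=0)
print(global_weights)  # close to [2.0, -1.0]
```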
Following a similar logic, the staging phase can rely on cryptographic techniques that allow learning on encrypted data, so that data practitioners no longer see the original data used in their computations. A whole field of cryptography research focuses on computation over encrypted data, with techniques such as Homomorphic Encryption or Secure Multi-Party Computation. Encryption is especially useful when the computation cannot take place where the data was originally located: it allows data to move without the associated risk.
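To give a flavor of computing on data that no single party sees in the clear, here is a toy additive secret sharing scheme, one of the building blocks of Secure Multi-Party Computation. This is a didactic sketch only: the modulus, the parties, and the salary values are assumptions made for illustration, and production protocols involve far more machinery.

```python
import random

# Toy additive secret sharing: a value is split into random shares so that
# no single party learns it, yet the parties can sum shared values and
# reconstruct only the aggregate.
Q = 2**61 - 1  # a large prime modulus (illustrative choice)

def share(value: int, n_parties: int) -> list[int]:
    """Split `value` into n additive shares modulo Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % Q)
    return shares

def reconstruct(shares: list[int]) -> int:
    return sum(shares) % Q

# Three parties each hold a private salary and want only the total.
salaries = [52_000, 61_000, 47_000]
all_shares = [share(s, 3) for s in salaries]

# Party i sums the i-th share of every value: it sees only random-looking numbers.
partial_sums = [sum(col) % Q for col in zip(*all_shares)]
print(reconstruct(partial_sums))  # 160000, with no individual salary revealed
```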
At Sarus, we believe that the ability to learn remotely is a must-have to overcome the limitations of data masking. We will always make sure that the data stays safe and where it belongs. Because our clients have computing power in their data infrastructure, they do not need to move data and can compute in the clear without complicated encryption layers.
One thing to keep in mind is that learning remotely or on encrypted data does not, by itself, guarantee that no personal information leaks. The output of the computation can be as sensitive as the original data: it may focus on one individual or even be a copy of it. To address this risk, we need to focus on the second step of our flow.
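Here is a small hypothetical example of the problem: even with a remote setup where the analyst only receives results, a carefully targeted aggregate can reveal exactly one person's record. The table and query below are invented for illustration.

```python
# A remote query can still leak: its output may pin down a single individual.
records = [
    {"name": "Alice", "smoker": True},
    {"name": "Bob", "smoker": False},
]

def remote_smoker_rate(predicate) -> float:
    """Runs inside the data owner's environment; only the result is returned."""
    subset = [r for r in records if predicate(r)]
    return sum(r["smoker"] for r in subset) / len(subset)

# The analyst never sees the table, yet this aggregate targets one person
# and reveals exactly whether Alice smokes.
print(remote_smoker_rate(lambda r: r["name"] == "Alice"))  # 1.0
```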
Sharing results: how to provide Output Privacy
Whether the output of the computation is the response to a query or the result of a full machine learning training run, it may reveal personal information. Ensuring that the output protects privacy is referred to as Output Privacy.
Historically, researchers have resorted to simple heuristics to ensure that outputs no longer contain identifying information. Techniques like aggregating with a high enough threshold, applying k-anonymity, or l-diversity provide some benefits, but all have well-documented weaknesses. The main one is that they assume an attacker has limited background knowledge, an assumption that becomes less and less tenable given the profusion of public information available on individuals. They also become impractical when the output data is high-dimensional.
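For reference, here is a minimal sketch of a k-anonymity check over hypothetical quasi-identifiers. Even a table that passes such a check can leak through background knowledge or homogeneous sensitive values, which is what motivates l-diversity and, eventually, stronger guarantees.

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    """Check that every combination of quasi-identifiers appears at least k times."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Hypothetical published table; column names are illustrative only.
df = pd.DataFrame({
    "zip": ["75001", "75001", "75002", "75002"],
    "age_band": ["30-40", "30-40", "30-40", "40-50"],
    "diagnosis": ["flu", "cancer", "flu", "flu"],
})
print(is_k_anonymous(df, ["zip", "age_band"], k=2))  # False: some groups contain a single record
```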
Recent research in mathematics provides the right framework to address this: Differential Privacy. Implementing it guarantees that the output of a computation reveals only a strictly bounded amount of information about any single individual.
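As a flavor of how Differential Privacy is applied in practice, the sketch below uses the classic Laplace mechanism to answer a counting query: noise calibrated to the query's sensitivity and a privacy parameter epsilon is added to the true answer. The records and the chosen epsilon are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(values, predicate, epsilon: float) -> float:
    """Differentially private count of records satisfying `predicate`.
    A counting query has sensitivity 1: adding or removing one person
    changes the true count by at most 1, so Laplace noise with scale
    1/epsilon yields epsilon-differential privacy."""
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical per-person records: does this participant smoke?
smokes = [True, False, True, True, False, False, True]
print(dp_count(smokes, lambda s: s, epsilon=1.0))  # noisy answer around 4
```

The noisy answer stays useful for learning the general trend while masking whether any single participant is in the data.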
All prior protections may be worthless if the output is not safe. At Sarus, we built Differential Privacy into the core of the engine so that any result that is learned is privacy-protected, irrespective of the type of data or calculation.
Building the foundations for safer and more efficient data innovation
At Sarus, we believe that data science workflows should address privacy concerns across the entire data journey. Resorting to a robust mathematical framework is indispensable for both safety and scalability. Limiting the number of people who access sensitive data is also key to reducing risk. It is not so much a question of trusting an internal data science team; it is the only way to leverage data across business lines and countries, or to collaborate on data with external partners.
We also want to weave privacy preservation into existing data infrastructures and workflows without having to radically change how data is harvested, stored, and analyzed today.
This is why we are building a versatile Differentially-Private Remote Learning platform. It fully addresses both input privacy and output privacy in one go. Sarus is installed next to the original data and data practitioners can work on any dataset seamlessly through the Sarus API. They interact with remote data just like they would with local data, except that they do not have access to individual information. As a consequence, all models or insights are provably anonymous.
Sarus both accelerates innovation and improves data protection practices. Companies can work with external partners such as AI vendors or consultants without risking a leak of personal data. It creates new opportunities to collaborate on sensitive data for innovation or research, with numerous applications in industries where data is highly protected, such as healthcare.
The use of data for AI applications will grow exponentially if data can be leveraged across departments, borders, or companies. But to get there, it cannot be the data that travels; it must be the general knowledge, which is also where the real value lies.