Companies like Google or Facebook have fabulous user bases with tremendous depth on each individual. This allows them to build extremely powerful AI models that predict user intent (search, navigation, purchase intent…). They can even benefit from data’s increasing returns to scale (see the discussion by Stanford professors).
Such concentration of data raises monopoly concerns and compounds privacy risks. Smaller organizations and policymakers (like the European Commission) have looked to data sharing as the best way forward.
Can organizations lacking data depth and breadth achieve the same through data-sharing arrangements? How about in a privacy-preserving fashion?
Let’s find out by looking at the various forms of privacy-preserving data sharing, their main applications, and their limitations.
The various forms of data sharing
Data sharing is a transaction where a Data Owner sends information to a Learner. We’ll say it is privacy-preserving if this information is anonymous, in the sense that it can no longer be linked to individuals (for a less naive definition, the curious reader may refer to this post). The sharing of pseudonymized data is therefore not considered privacy-preserving.
Case 1: Sharing insights on owner’s data is the traditional way
The simplest form of data sharing is when the owner produces anonymous information to be sent to the learner. This covers most data sharing today and includes statistical aggregates, synthetic data, or even updates to a machine learning model’s parameters. But it is prone to misinterpretation and confounding, because insights derived on the owner’s side may not apply to the learner’s user base. Below, we’ll see how to mitigate those risks by combining data before sharing it.
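As a concrete illustration, here is a minimal sketch (in Python with pandas) of an owner releasing aggregates rather than records. The column names, the segmentation, and the minimum-group-size threshold are illustrative assumptions, not a prescription.

```python
# Minimal sketch of Case 1: the owner shares only aggregate
# statistics, never row-level records. Column names ("segment",
# "purchase_amount", "user_id") are hypothetical.
import pandas as pd

def build_shareable_insight(owner_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate the owner's data into coarse statistics per segment."""
    insight = (
        owner_df
        .groupby("segment")
        .agg(avg_purchase=("purchase_amount", "mean"),
             n_users=("user_id", "nunique"))
        .reset_index()
    )
    # Drop small segments that could single out individuals
    # (the threshold of 100 is an arbitrary illustration).
    return insight[insight["n_users"] >= 100]

# The learner receives the `insight` table, not the user records.
```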
ML potential and caveats: the value of one data point is sometimes considered to grow with the dataset size (see Weyl and Posner’s argument). But the model features need to be available in production, otherwise the learner will struggle to leverage the new knowledge. The learner will be most interested in datasets that have the same features as theirs, looking for a way to make their models more robust and more general. When the owner’s dataset and the learner’s own data are consistent, this is a powerful accelerator of AI innovation.
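To make this concrete, here is a minimal sketch of how a learner might pool a shared dataset (say, synthetic data with the same schema as their own) with local records to train a more robust model. The column names and the model choice are illustrative assumptions.

```python
# Sketch: a learner augments their training set with shared data
# that exposes the same feature columns and target. Names are
# hypothetical; any schema-compatible dataset would do.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def train_with_shared_data(local_df: pd.DataFrame,
                           shared_df: pd.DataFrame,
                           target: str = "churn") -> RandomForestClassifier:
    # Pooling only works because both datasets share the same schema,
    # including the target column.
    pooled = pd.concat([local_df, shared_df], ignore_index=True)
    X = pooled.drop(columns=[target])
    y = pooled[target]
    return RandomForestClassifier().fit(X, y)
```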
Case 2: Sharing insights on combined data unlocks a lot more value
If the insight is anonymous, the learner can no longer match it to individuals. But the owner’s and learner’s data can be matched first, so that the anonymous insights are built from the combined data. This is still anonymous data sharing, but with much greater potential.
For example, suppose one party has a survey on where people live and another on where people work. Sharing each survey’s anonymous insights separately yields little information on commuting patterns, whereas joining the source data first grants access to full-fidelity insights.
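Here is a minimal sketch of that example in Python with pandas. The `person_id` join key and the city values are made up for illustration; the point is that the identifier is used only inside the computation and never appears in the shared output.

```python
# Sketch of the commuting example: joining the two surveys on a
# shared identifier before aggregating reveals home -> work flows
# that neither anonymous survey could yield on its own.
import pandas as pd

residence = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "home_city": ["Lyon", "Lyon", "Paris", "Lille"],
})
workplace = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "work_city": ["Paris", "Lyon", "Paris", "Paris"],
})

# Join first, then aggregate: the shared output is an anonymous
# flow table; person_id never leaves the computation.
flows = (
    residence.merge(workplace, on="person_id")
    .groupby(["home_city", "work_city"])
    .size()
    .rename("commuters")
    .reset_index()
)
print(flows)
```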
Implementation: There are many possible implementations of anonymous data sharing on combined data. They range from one party hosting the data, to trusting a third party, all the way to sophisticated cryptographic schemes (e.g., secure multi-party computation, SMPC). However, it is very easy to craft a dataset with just one real individual and combine it with the owner’s data to extract that individual’s personal data. To make sure the insights are effectively anonymous, the owner will want to enforce output privacy protections (e.g., differential privacy, or at the very least filtering the queries that can be run).
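As an illustration of such an output protection, here is a minimal sketch of the Laplace mechanism from differential privacy applied to a count query. The `epsilon` value and the query itself are illustrative assumptions.

```python
# Sketch of an output-privacy protection: adding calibrated
# Laplace noise to a count before releasing it, the core idea of
# differential privacy.
import numpy as np

def dp_count(values, epsilon: float = 1.0) -> float:
    """Release a noisy count; the sensitivity of a count is 1."""
    true_count = len(values)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Even a query crafted to isolate one individual only shifts the
# released result by about 1/epsilon, bounding what it can reveal.
```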
ML potential and caveats: For machine learning, more features mean more powerful models. It is much easier to make a diagnosis given all patient information than from a handful of symptoms. But there is a catch: if the model takes joined data as input, it will probably require joined data in production as well. Querying the owner’s data for a given individual in production amounts to retrieving personal information, so this can be a strong limitation on the applicability of models trained on joined data. A notable exception is when the owner’s data provides the very target the model is trying to predict (for instance, joining patient data with an external system holding the results of the medical analyses to be predicted). Since the target is, by definition, not expected to be available in production, this introduces no limitation. It is a very powerful way of building strong ML systems on combined data.
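To illustrate this target-join pattern, here is a minimal sketch in Python. The `patient_id` key, the `diagnosis` label, and the scikit-learn model are illustrative assumptions; the point is that the owner’s data is only needed at training time.

```python
# Sketch of the "target join" pattern: the learner's features are
# joined with labels held by the owner only for training. In
# production the model needs just the learner's own features, so
# no privacy-sensitive lookup is required at inference time.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def train_on_joined_target(features: pd.DataFrame,
                           labels: pd.DataFrame) -> LogisticRegression:
    joined = features.merge(labels, on="patient_id")
    X = joined.drop(columns=["patient_id", "diagnosis"])
    y = joined["diagnosis"]
    return LogisticRegression().fit(X, y)

# At inference time, only the learner-side features are needed:
# model.predict(new_patient_features)  # no join with owner data
```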
Conclusion: Privacy-preserving data sharing is here, so make sure you don’t miss out on the opportunity!
Privacy-preserving data sharing is a powerful way to exchange value between organizations without revealing personal information. The potential is dramatically greater when data can be joined before the insights are extracted. Working off joined information is what gives the Googles and Facebooks of the world their strong advantage: their data is already combined.
To extract as much value from data as Big Tech does, organizations will have to leverage combined data, not just share anonymous insights. With modern privacy-preserving tools, this is becoming possible; it’s time to join the future!
—
Sarus makes it seamless to extract provably anonymous insights (statistics, model updates, or synthetic data) from datasets, standalone or after joining them. It is the perfect solution when data sharing is part of your strategy.