Introducing support for relational tables
This new feature lets data consumers work with a data source made of several relational tables. A data owner can now grant access to an entire database without having to prepare a flat extract or apply any kind of data masking strategy. Data consumers can use the full depth of the database, with privacy guarantees throughout.
Preserving privacy in relational data is a tough nut to crack
Intuitively, protecting privacy means that an individual’s information is kept secret. Differential privacy gives a more actionable definition of this, but the intuition remains. In a table where one row is one individual, the objective is clear: the information in any single row should not be leaked. But in relational databases, one row per individual is the exception, not the rule. To know how a row relates to an individual, one has to walk the graph of foreign keys between tables and link each row of a table to a row of the user table. This is additional complexity already, but it is not the end of our story: the more foreign keys there are, the more likely it is that the same individual corresponds to many rows in the table of interest. So it is no longer enough to keep one row hidden; we also need to protect the whole block of rows that belong to one individual. And as if things were not complex enough, we also want to protect blocks of possible rows that are not in the data but could have been and, had they been there, would have revealed user-level information.
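To make the idea of a "block of rows per individual" concrete, here is a minimal sketch in plain Python. The schema (users, orders, items), the table contents, and the helper `owner_of` are all hypothetical and chosen for illustration; this is not how Sarus represents a database internally.

```python
from collections import Counter

# Hypothetical toy schema: users <- orders <- items (arrows are foreign keys).
# Each table maps a primary key to a row; rows hold their foreign key values.
users = {1: {"name": "alice"}, 2: {"name": "bob"}}
orders = {
    10: {"user_id": 1},
    11: {"user_id": 1},
    12: {"user_id": 2},
}
items = {
    100: {"order_id": 10},
    101: {"order_id": 10},
    102: {"order_id": 11},
    103: {"order_id": 12},
}

# Foreign-key graph: table -> (foreign key column, referenced table).
FOREIGN_KEYS = {"items": ("order_id", "orders"), "orders": ("user_id", "users")}
TABLES = {"users": users, "orders": orders, "items": items}


def owner_of(table, key):
    """Follow foreign keys from a row until the user table is reached."""
    while table != "users":
        column, parent = FOREIGN_KEYS[table]
        key = TABLES[table][key][column]
        table = parent
    return key


# Group the rows of the table of interest into one block per individual.
rows_per_user = Counter(owner_of("items", item_id) for item_id in items)
print(dict(rows_per_user))
```

In this toy data, user 1 owns three rows of the items table and user 2 owns one, so hiding a single row would not be enough to hide user 1.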
So for Sarus, moving to multi-table support does not just mean that we are able to look up the names of the tables and adjust the SQL queries. It means that for every query or piece of machine learning code that runs, we are able to assess how sensitive the output is to adding an individual to the database or removing one from it, including all the related rows in all the tables!
Luckily, differential privacy theorists have built the right framework to handle such complexity; it was time to implement it and make it user-friendly.
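As an illustration of what "sensitivity to adding or removing an individual" means, here is a minimal sketch of a user-level private sum using the textbook Laplace mechanism. It assumes rows have already been linked to a user id, and the function names, the clipping bound, and the value of epsilon are assumptions made for this example. It is not the Sarus implementation.

```python
import random
from collections import defaultdict


def clipped_sum(rows, max_per_user):
    """Sum values after bounding each user's total contribution.

    rows: iterable of (user_id, value) pairs.
    """
    totals = defaultdict(float)
    for user_id, value in rows:
        totals[user_id] += value
    # Clip each user's total into [-max_per_user, max_per_user], so adding
    # or removing one user (with all their rows) moves the sum by at most
    # max_per_user.
    return sum(max(-max_per_user, min(max_per_user, t)) for t in totals.values())


def private_sum(rows, max_per_user, epsilon, rng):
    """Clipped sum plus Laplace noise calibrated to the user-level sensitivity."""
    scale = max_per_user / epsilon
    # The difference of two exponential draws with mean `scale` is Laplace(scale).
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return clipped_sum(rows, max_per_user) + noise


rows = [(1, 4.0), (1, 5.0), (1, 6.0), (2, 3.0)]  # user 1 owns three rows
print(clipped_sum(rows, max_per_user=10.0))
print(private_sum(rows, max_per_user=10.0, epsilon=1.0, rng=random.Random(0)))
```

The key design choice is that clipping is applied per user, not per row: user 1's three rows count as one block, which is what makes the noise scale valid at the level of an individual.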
What happens when onboarding a dataset with multiple tables with Sarus
When the data preparator onboards a dataset from a SQL database with multiple tables, the Sarus app retrieves the foreign key relationships. If the source does not support foreign keys (e.g. BigQuery, Redshift, flat files), the data owner can define them manually.
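For readers curious about what retrieving foreign keys can look like, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the real source. The helper `list_foreign_keys`, the toy schema, and the `MANUAL_KEYS` list are assumptions for illustration; the source does not describe how the Sarus app does this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY);
CREATE TABLE visit_occurrence (
    visit_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id)
);
-- This table has no declared constraint, like a source without foreign keys.
CREATE TABLE measurement (measurement_id INTEGER PRIMARY KEY, person_id INTEGER);
""")


def list_foreign_keys(conn):
    """Return sorted (child_table, child_column, parent_table, parent_column) tuples."""
    tables = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    ]
    found = []
    for table in tables:
        # PRAGMA foreign_key_list columns: id, seq, table, from, to, ...
        for row in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            found.append((table, row[3], row[2], row[4]))
    return sorted(found)


# Relationships the database cannot report are declared by hand.
MANUAL_KEYS = [("measurement", "person_id", "person", "person_id")]

all_keys = list_foreign_keys(conn) + MANUAL_KEYS
for key in all_keys:
    print(key)
```

The declared constraint is discovered automatically, while the `measurement` link has to be supplied manually, mirroring the two cases described above.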
During the onboarding process, Sarus automatically generates a synthetic version of the dataset. For multi-table datasets, this means both generating a fake table for each table and making sure the foreign keys are preserved, in the sense that the distribution of how many rows relate to each key is kept. This way, JOIN queries on the synthetic data behave much like the same queries on the real data. As a result, analysts and data scientists get a real sense of the relational source.
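To give an intuition of what preserving the distribution of relationship occurrences can mean, here is a deliberately naive sketch. It resamples the number of child rows per parent from the counts observed in the real data and emits foreign keys that always point to an existing synthetic parent. The function names are hypothetical, and this sketch offers no privacy protection on its own; the actual Sarus generator is not described in this article.

```python
import random
from collections import Counter


def children_per_parent(parent_ids, child_fks):
    """Number of child rows attached to each parent (0 if none)."""
    counts = Counter(child_fks)
    return [counts.get(pid, 0) for pid in parent_ids]


def synthesize_links(real_parent_ids, real_child_fks, n_parents, rng):
    """Build synthetic parent ids and child foreign keys with a similar fan-out."""
    fanouts = children_per_parent(real_parent_ids, real_child_fks)
    synthetic_parents = list(range(1, n_parents + 1))
    synthetic_child_fks = []
    for pid in synthetic_parents:
        # Draw this parent's number of children from the observed counts.
        synthetic_child_fks.extend([pid] * rng.choice(fanouts))
    return synthetic_parents, synthetic_child_fks


real_parents = [1, 2, 3]
real_child_fks = [1, 1, 2]  # parent 1 has 2 children, parent 2 has 1, parent 3 has 0
parents, child_fks = synthesize_links(real_parents, real_child_fks, 5, random.Random(0))
print(parents)
print(child_fks)
```

Because every synthetic foreign key refers to a synthetic parent, a JOIN between the two fake tables never produces dangling rows, and the number of matches per parent follows the same pattern as in the real data.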
Now the multi-table dataset is ready to be used. Let’s see it at work!
ML on relational tables with Sarus Private Learning SDK
Let’s use a (public) patient dataset (OMOP) that has 7 tables. The data source is stored in a PostgreSQL database and includes primary and foreign key constraints. We want to protect the privacy of each patient. Patient information is stored in the person table, and all other tables point to the primary key of this table, person_id.
Once onboarded with Sarus, we can manipulate this relational dataset with the Private Learning SDK (see this article for a full introduction to the SDK). You can find a Colab with the full analysis here.
Let’s extract the view we’re interested in:
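The exact SDK call is in the Colab. As a stand-in, here is a minimal sketch of the kind of JOIN such an extract relies on, run against an in-memory SQLite database. The two tables are named after OMOP tables, but the rows, the concept ids, and the choice of columns are made up for illustration, and plain `sqlite3` is used here instead of the Sarus SDK.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id),
    condition_concept_id INTEGER
);
-- Made-up rows: two patients, three recorded conditions.
INSERT INTO person VALUES (1, 1950), (2, 1984);
INSERT INTO condition_occurrence VALUES (10, 1, 1001), (11, 1, 1002), (12, 2, 1001);
""")

# One row per patient: a demographic column plus an aggregate over a child table.
EXTRACT_QUERY = """
SELECT p.person_id,
       p.year_of_birth,
       COUNT(c.condition_occurrence_id) AS n_conditions
FROM person AS p
LEFT JOIN condition_occurrence AS c ON c.person_id = p.person_id
GROUP BY p.person_id, p.year_of_birth
ORDER BY p.person_id
"""

extract = conn.execute(EXTRACT_QUERY).fetchall()
for row in extract:
    print(row)
```

Aggregating the child table down to one row per person_id keeps the extract aligned with the individual we want to protect.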
We could also have built the extract in pandas. The synthetic rows, which preserve the relationships, help a lot here!
Then we can preprocess the data and fit an ML model just as usual, except that the patient data is fully protected!
With this new relational data feature, data owners can grant access to entire databases and let the data consumers build the relevant extractions themselves. Data consumers get a real sense of the source data even without directly accessing it. This is a new step towards the Sarus mission: let data lovers work on any data asset with privacy guarantees.
Want to try this feature in just a few minutes? Reach out!