There is often some confusion about whether the processing of personal data yields personal data. Here, we propose a colorful guide that helps analyze most situations.
Rainbow data and magic wands
We will use rainbows for anything that does or may contain personal information. When we are sure there is no personal data in an object, we will use plain white.
Arrows represent data transformations. An arrow is rainbow-colored if its transformation encodes personal data. For instance, looking up a patient database and attaching the vaccination status to each row of the input data is definitely rainbow-colored.
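To make the idea concrete, here is a minimal sketch of such a rainbow-colored transformation, using a hypothetical patient database and invented names and fields:

```python
# Hypothetical patient database: this is personal data.
patients = {
    "alice": {"vaccinated": True},
    "bob": {"vaccinated": False},
}

def attach_vaccination_status(rows):
    """Attach each input row's vaccination status by looking it up
    in the patient database. The output encodes personal data, so
    this arrow is rainbow-colored."""
    return [
        {**row, "vaccinated": patients[row["name"]]["vaccinated"]}
        for row in rows
    ]

enriched = attach_vaccination_status([{"name": "alice"}, {"name": "bob"}])
```

Even if the input rows held nothing sensitive, the output now carries health information about identifiable individuals.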
Now we can build a very simple logic:
It’s not that rainbow outputs always contain personal information, but it is quite likely that traces of the input remain in the output. So unless one proves otherwise, treating the output as personal is the most reasonable assumption.
Luckily, some transforms can guarantee that, whatever the input, there won’t be personal data in the output. For more on this, you can check out the Differential Privacy literature (a good intro). For now, we just need to know that such transforms do exist. We represent them with a magic wand. We can then extend our logic with the following rule:
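As an illustration of what a magic-wand transform can look like, here is a minimal sketch (not from the original text) of an epsilon-differentially-private count using the Laplace mechanism; the records and the query are hypothetical:

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5          # uniform in [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon):
    """Differentially private count. A counting query has sensitivity 1,
    so adding Laplace(1/epsilon) noise satisfies epsilon-DP: the output
    reveals (almost) nothing about any single record."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical usage: how many patients are 30 or older?
ages = [25, 34, 41, 29, 67]
noisy_count = dp_count(ages, lambda a: a >= 30, epsilon=1.0)
```

Whatever the input records are, the noisy count can safely be painted white.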
Let’s use those simple tools to analyze a few common scenarios.
What color is an AI model?
An AI model is the result of a complicated transformation of input data during the training phase. By default, it should be considered rainbow-colored: there are many known examples of membership inference attacks on AI models.
If the AI model has been trained with a magic transformation (e.g. differentially private libraries like TensorFlow Privacy or Opacus), we have the guarantee that the model does not reveal personal information.
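The mechanism behind those libraries is DP-SGD. As a rough sketch (pure Python, not the actual TensorFlow Privacy or Opacus API), one training step clips each per-example gradient, averages, adds calibrated Gaussian noise, and descends:

```python
import math
import random

def dp_sgd_step(weights, per_example_grads, clip_norm, noise_multiplier, lr,
                rng=random):
    """One DP-SGD step: bound each individual's influence by clipping
    their gradient, then mask the remainder with Gaussian noise."""
    clipped = []
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        clipped.append([x * factor for x in g])
    n = len(clipped)
    avg = [sum(col) / n for col in zip(*clipped)]
    sigma = noise_multiplier * clip_norm / n
    noisy = [a + rng.gauss(0.0, sigma) for a in avg]
    return [w - lr * g for w, g in zip(weights, noisy)]
```

Because no single example can move the weights by more than the clip norm, and the noise drowns out what remains, the trained model earns its magic wand.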
This helps frame the question of whether AI models belong to the users whose data trained them. If the AI model was generated without privacy protection, that claim is very natural. But if the model was trained while making sure it does not depend on any one individual, the claim becomes harder to argue. In a sense, the model is “true” irrespective of each individual, just like medical research is “true” beyond the patients that were enrolled in the clinical trial.
What color is personalization?
Personalizing the user experience, whether it is tailoring a medical treatment or displaying invasive ads, is a process that takes personal data as input and outputs a recommendation, typically using historical data from individuals. The process can be an AI model or something much simpler, but either way the output is going to be personal: we’ll paint it rainbow! When a department store starts sending coupons for baby items, it clearly reveals what is known about the individual.
What color is synthetic data?
Synthetic data is typically generated from an AI model trained on personal data, so the model can be rainbow-colored or not depending on how the training is done. Generating data from this model then consists of feeding random values (clearly not personal) into the model to produce synthetic records.
The resulting synthetic data will therefore be the same color as our transform.
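The generation step can be sketched as follows; `trained_generator` is a hypothetical stand-in for a real generative model, and the record fields are invented:

```python
import random

def trained_generator(z):
    """Stand-in for a generative model trained on personal data.
    Maps a random value z in [0, 1) to one synthetic record. Whether
    its outputs are rainbow or white is inherited from how the model
    was trained, not from z."""
    return {"age": 18 + int(z * 60), "height_cm": round(150.0 + z * 40.0, 1)}

def sample_synthetic(n, seed=None):
    """Generate n synthetic records from random, non-personal inputs."""
    rng = random.Random(seed)
    return [trained_generator(rng.random()) for _ in range(n)]

records = sample_synthetic(5, seed=42)
```

Since the random inputs are white, the color of the synthetic records is entirely determined by the color of the model itself.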