Electricity distribution produces a lot of data that could be used to optimize the grid, assess the need for building insulation or renovation, or track the sustainability goals of new constructions. But this data often comes from household smart meters, is highly personal, and should be handled with extreme caution. Sarus partnered with utility company EDF to design innovative solutions that enable data analyses while protecting privacy. The joint work studied how Differential Privacy (DP) can be used to produce privacy-preserving insights while retaining the utility of the underlying smart meter data.
Our research consisted of designing an attack to assess the robustness of various DP parametrizations. The attack was built using a black-box model based on this research paper. The general idea is to train a machine learning model on a dataset and use the weights of this model to re-identify some data about a particular row of the training dataset. In our case, the attacker was allowed to add a custom column to the training dataset. Adding a custom column is very common in deep learning, because data must be preprocessed and adapted to specific training processes. But this ability also allowed the attacker to single out one household: all the attacker has to do is set the value of the new column to one for this household only, effectively creating an outlier. An outlier is a row that is very different from the others and is therefore more likely to have a large impact on the model weights, which makes it easier to extract information about that row from the weights. This makes it a good setup to assess the privacy benefits of using DP during training. The logic of the attack is described in the following schema.
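As an illustration, here is a minimal sketch of the preprocessing step the attacker controls. The column name, the household identifier, and the use of pandas are assumptions made for the example, not details from the actual study.

```python
# Hypothetical sketch: the attacker adds an indicator column that equals 1
# only for the household they want to single out, turning that row into an
# outlier. "household_id" and "attacker_indicator" are illustrative names.
import pandas as pd

def add_attack_column(df: pd.DataFrame, target_household: str) -> pd.DataFrame:
    df = df.copy()
    df["attacker_indicator"] = (df["household_id"] == target_household).astype(int)
    return df
```

The resulting column is 1 for exactly one row and 0 everywhere else, so any influence it has on the trained weights necessarily relates to that single household.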
In our work, we attacked several datasets. First, to obtain theoretical results, we built a dummy dataset with only two columns containing bits (0 or 1). With this simple dataset, we were able to derive precise theoretical results about the attacker's maximum accuracy, defined as the attacker's capacity to recover the right bit of information. The closer the accuracy is to 1, the better the attack. Then, we used common machine learning datasets such as Titanic, MNIST and Census, which contain various types of information (personal information about gender, income, race, images…). On these datasets, we developed inference models, mainly multilayer perceptrons and logistic regressions, trained with TensorFlow and the DP-SGD algorithm (as implemented in the TensorFlow Privacy library). Note that the models were trained to do usual machine learning jobs, in this case feature inference, and the attack did not prevent them from doing what they were built for. The following graph sums up the results on these datasets.
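For concreteness, the snippet below sketches what DP-SGD training of such an inference model can look like with TensorFlow Privacy. The architecture, hyperparameter values, and variable names are placeholders, and module paths may differ slightly between library versions.

```python
# Minimal DP-SGD training sketch with TensorFlow Privacy (illustrative values).
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import DPKerasSGDOptimizer

n_features = 10  # placeholder: number of columns after preprocessing

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary feature inference
])

optimizer = DPKerasSGDOptimizer(
    l2_norm_clip=1.0,       # clip each per-example gradient to this L2 norm
    noise_multiplier=1.1,   # Gaussian noise scale; determines the final epsilon
    num_microbatches=32,    # must divide the batch size
    learning_rate=0.1,
)

# Per-example losses are required so gradients can be clipped individually.
loss = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)

model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=32, epochs=10)
```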
Epsilon, on the x-axis, is the DP privacy parameter: it bounds how much information about any single row can leak from the training, with smaller values giving stronger privacy guarantees. The red curve is the accuracy of the trained model for various epsilon values applied to the training. It shows that the model is able to learn properly even in the presence of the new column added by the attacker. The larger the epsilon value, the less noise is added during training and the better the model accuracy.
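For reference, epsilon comes from the standard (ε, δ) definition of differential privacy: for any two training sets D and D′ that differ in a single row, and any set S of possible trained models, the training mechanism M must satisfy

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
```

The smaller epsilon is, the closer the two output distributions must be, and the less any single household can change what the attacker observes.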
The blue curve is the maximum accuracy the attacker can get by attacking the model. The larger the epsilon value, the more successful the attack becomes. We observed ranges of epsilon values where the accuracy of the model is preserved while the accuracy of the attack remains low, which shows that differential privacy can indeed prevent reconstruction attacks while maintaining an acceptable level of utility.
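One way to estimate the attacker's accuracy empirically is the Monte Carlo sketch below: the secret bit of the target row is drawn at random, a DP model is trained, and the attacker guesses the bit from the trained model. Here the guess is read from a black-box prediction on the target row rather than from the raw weights, and train_dp_model, x_train, y_train and target_idx are hypothetical stand-ins for the pipeline above.

```python
# Hypothetical sketch: estimate the attacker's accuracy over repeated trainings.
import numpy as np

def attack_accuracy(train_dp_model, x_train, y_train, target_idx, n_trials=100):
    correct = 0
    for _ in range(n_trials):
        secret_bit = np.random.randint(2)      # ground truth to recover
        y = y_train.copy()
        y[target_idx] = secret_bit             # plant the secret in the training data
        model = train_dp_model(x_train, y)     # DP training as sketched above
        # The indicator column singles out the target row, so the model's
        # prediction on that row tends to leak the planted bit.
        guess = int(model.predict(x_train[target_idx:target_idx + 1])[0, 0] > 0.5)
        correct += int(guess == secret_bit)
    return correct / n_trials
```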
In cases where an attacker can identify a single row, it is easy for them to extract information about that individual in the absence of differential privacy guarantees: all they have to do is preprocess the data to single out the row and train the model; the trained model then reveals the singleton's information. Using DP to protect personal data in this setting appears to be very effective, since the trade-off between model accuracy and privacy preservation is acceptable.
The attack set-up assumes a malicious actor knows a household's consumption habits well enough to identify a particular row, and that this person is given unconstrained access to the query interface. This may be hard when the data only includes consumption values, but it becomes much more plausible when the data includes additional fields (zip code, address, flat number…). In such a case, additional precautions are necessary before trained models can be considered safe. DP provides a powerful framework to protect privacy in this scenario, adding protection even against the most sophisticated attacks. This work shows that it can be an asset to advance research on building energy efficiency and the smart grid!