Language models have emerged as powerful tools, but they come with serious privacy concerns. Until recently, it was impractical to train them with strong privacy guarantees such as Differential Privacy. Pre-trained LLMs like Llama 2 are about to change the game.
In this blog post, we demonstrate that pre-trained models can be effectively fine-tuned with Differential Privacy under a reasonable privacy budget, i.e. ε < 10. Additionally, we show that ML models trained on synthetic samples generated by these fine-tuned models attain nearly the same performance as models trained on the original data. This opens up exciting new applications in privacy-sensitive industries such as healthcare, transportation, or energy, where it is critical to extract general knowledge from a dataset while protecting private information.
Efficient private learning with public knowledge
Differential Privacy (DP) is a mathematically grounded notion of privacy for statistics, first formalized by Cynthia Dwork in 2006 and increasingly recognized as a standard for enhancing privacy in Machine Learning (ML). Training an ML model with DP ensures that no individual record has a statistically noticeable impact on the fitted model. This property guarantees that the fitted model cannot leak much information about any single individual in the training data.
Training a language model with DP from scratch is extremely challenging. Without any prior information, the model would have to make sense of the English language from the private data alone, which is inefficient and totally impractical. In addition, it would require many rounds of gradient descent in which the data is accessed repeatedly, while, in the DP framework, the privacy guarantees degrade with every round of data access.
A common solution has been to use an LM pre-trained on public data and only fine-tune it on the private data. However, until recently, only small LMs pre-trained on relatively small amounts of data were available, which limited performance when fine-tuning with DP. Fortunately, the recent open-sourcing of pre-trained Large Language Models (LLMs) enriched with extensive public knowledge makes it possible to leverage LLMs with DP in an efficient manner.
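For reference, here is the standard formal definition (not stated in the original post but well established): a randomized mechanism M satisfies (ε, δ)-Differential Privacy if, for any two datasets D and D' differing in a single record, and any set of outputs S,

\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta

The smaller ε (and δ), the stronger the guarantee; the ε < 10 budget we target below is within the range commonly reported for deep learning with DP.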
Experimental Setup
In our experiments, we use the IMDb movie review dataset, where each entry consists of a positive/negative label and the corresponding movie review. This dataset is readily accessible from the Hugging Face Hub, complete with a predefined train/test split, each containing 25,000 entries.
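For reference, the dataset can be loaded in a couple of lines with the Hugging Face datasets library (a minimal sketch):

```python
from datasets import load_dataset

# Load the IMDb review dataset from the Hugging Face Hub:
# a DatasetDict with 25,000 train and 25,000 test examples,
# each with a "text" (the review) and a "label" (0 = negative, 1 = positive).
imdb = load_dataset("imdb")
print(imdb["train"][0])
```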
We consider two families of language models, each available in several sizes:
- GPT-2 models from Radford et al. 2019. We select the smallest and largest pretrained versions, which have 124M and 1.5B parameters, respectively.
- Llama 2 models from Touvron et al. 2023. We select the pretrained versions with 7B and 13B parameters.
Each model is fine-tuned on the training set. Each example in the dataset is pre-processed by concatenating the label and the review, separated by a special delimiter (a few characters), then tokenized and fed to the model (we limit the sequence length to 300 tokens).
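A minimal sketch of this pre-processing step is shown below; the delimiter string and the textual form of the label are illustrative assumptions, not the exact ones used in our runs:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default

DELIMITER = " || "  # hypothetical delimiter; any short, unambiguous string works

def preprocess(example):
    # Prepend the sentiment label to the review, separated by the delimiter,
    # then tokenize and truncate to 300 tokens.
    label_text = "positive" if example["label"] == 1 else "negative"
    return tokenizer(label_text + DELIMITER + example["text"],
                     truncation=True, max_length=300)

tokenized_train = imdb["train"].map(preprocess, remove_columns=["text", "label"])
```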
Differentially Private Fine-tuning
We use an A100 GPU and rely on gradient accumulation during training, working with batches of 1028 examples over 5 epochs (larger batches work better with DP-SGD because they decrease the noise-to-signal ratio). To apply DP, we use the Opacus library and train with DP-SGD. We set the clipping norm to 1e-2 and keep a constant noise multiplier, chosen so that the privacy budget stays below 10 at the end of training. We use the AdamW optimizer with an initial learning rate of 5.1e-4 for all models. Note that training with DP calls for a slightly higher learning rate because of the strong gradient clipping.
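The sketch below shows how such a DP-SGD training loop can be set up with Opacus; gradient accumulation is omitted for brevity, and the noise multiplier value is an illustrative assumption:

```python
import torch
from opacus import PrivacyEngine

# Sketch of DP-SGD with Opacus. The noise multiplier below is an illustrative
# assumption, kept constant and chosen so that the final epsilon stays under 10
# for this dataset size, batch size and number of epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=5.1e-4)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=0.6,  # assumed value
    max_grad_norm=1e-2,    # per-sample gradient clipping norm
)

for epoch in range(5):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # causal LM loss on the label+review sequences
        loss.backward()
        optimizer.step()
    # Privacy budget spent so far (the delta value here is also an assumption)
    print(f"epoch {epoch}: epsilon = {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```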
In the case of GPT-2 models, we update the whole model. For Llama 2, however, we adopt the Q-LoRA framework, which freezes and quantizes all model weights while exclusively training new LoRA parameters.
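A minimal Q-LoRA configuration sketch with the transformers, bitsandbytes and peft libraries might look like this; the rank, alpha and target modules are illustrative choices, and only the resulting LoRA parameters are then trained with DP-SGD as above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4 bits and attach trainable LoRA adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                  # illustrative LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only, as an example
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are updated
```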
In the graph presented above, we show the training loss as a function of epsilon (for each fine-tuning run, we compute the privacy budget spent at each gradient update; epsilon = 0 corresponds to the initial loss, before any gradient update). Notably, larger models exhibit consistently lower loss values at the very beginning of training, which can be attributed to their inherent ability to generate higher-quality samples. We also observe a larger loss decrease over the course of training for larger models, which suggests that larger models have the capacity to learn more from private datasets during training.
At the end of training, we sample 25,000 examples from each model, prompting with either a negative or a positive label. A cursory examination of these samples reveals that the smaller models have only marginally understood that the label pertains to the review: they struggle to establish a clear correlation between the input label and a positive/negative review. Furthermore, some examples diverge from movie references and venture into realms such as books or music. In contrast, samples generated by Llama 2 exhibit a greater degree of coherence and subtlety, with a more robust correlation between the input label and the review's sentiment.
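For illustration, conditional sampling can be done by prompting the fine-tuned model with a label followed by the delimiter and letting it complete the review; the sketch below reuses the tokenizer and delimiter from the pre-processing sketch, and the decoding parameters are assumptions:

```python
# Sketch of conditional sampling from a fine-tuned model. The decoding
# parameters (top_p, temperature, max_new_tokens) are illustrative assumptions.
def generate_reviews(label_text, n_samples):
    prompt = label_text + DELIMITER  # e.g. "positive || "
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.95,
        temperature=1.0,
        max_new_tokens=300,
        num_return_sequences=n_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Repeated calls with "positive" and "negative" prompts (e.g. half and half)
# yield the 25,000 synthetic reviews per model.
positive_samples = generate_reviews("positive", 8)
```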
Evaluation of the synthetic samples
To evaluate these observations more quantitatively, we assess whether the generated samples can be used to train a classifier that predicts the review sentiment (0 for negative, 1 for positive) from the text. We train one classifier on the original samples and a different one on each LLM's generated samples, then evaluate how each classifier performs on the same held-out test set of real data.
For the classifier, we use DistilBERT, a smaller, faster, and cheaper version of BERT: it is 40% smaller yet preserves over 95% of BERT's performance. We train the classifiers for two epochs with the AdamW optimizer, a learning rate of 2e-5, and a batch size of 64. Below is the evaluation at the end of training: the accuracy and F1 score of each classifier on the shared test set.
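A sketch of this downstream evaluation with the transformers Trainer is shown below; synthetic_dataset is a placeholder for whichever training set (real or generated) is being evaluated, and the output directory is hypothetical:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Train a DistilBERT sentiment classifier on a given training set (real or
# synthetic) and evaluate it on the held-out real IMDb test set.
clf_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    return clf_tokenizer(batch["text"], truncation=True)

train_ds = synthetic_dataset.map(tokenize, batched=True)  # placeholder training set
test_ds = imdb["test"].map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="distilbert-sentiment",  # hypothetical output directory
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=64,
)
trainer = Trainer(
    model=clf_model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=clf_tokenizer,  # enables dynamic padding of the batches
)
trainer.train()
print(trainer.evaluate())  # accuracy / F1 require a compute_metrics function in practice
```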
The result of just 50% accuracy on the test set for samples from the smallest GPT-2 confirms that, during fine-tuning, the model failed to establish a clear correlation between the label and the review. Remarkably, the most promising result is that the classifier trained on Llama-generated samples achieves nearly identical performance to the one trained on real data!
Conclusion
These results underscore one critical insight: recent progress in generative modelling and the availability of open-source pre-trained LLMs make DP synthetic data practical and usable for downstream tasks without significantly compromising performance. By sharing synthetic data generated by DP-fine-tuned LLMs, one can share population-level knowledge from a dataset while preserving individual privacy.
So, if you own private data and struggle to use or share it because of its sensitivity, you should definitely consider DP fine-tuning, either for synthetic data generation or directly for other NLP tasks. If you lack the expertise or time, Sarus Technologies builds tools to leverage DP easily and remotely, so you can finally exploit the knowledge in your datasets.