Recent advances in generative AI have generated a lot of enthusiasm in the community, as new open-source models such as Llama2 achieve much better performance than previously available ones.
The question of how much these models memorize the data they see during training is non-trivial and is actively being investigated in the literature.
The goal of this technical post is to provide some insight into how much LLMs like Llama2 memorize during training, and what the consequences are for privacy.
I. Experimental setup
To mimic a real-life situation, we choose a rather technical dataset on which we would expect fine-tuning an LLM to bring strong performance gains.
We use a dataset for pharmacovigilance comprising annotated events from medical case reports and biomedical literature. It is designed for biomedical event extraction tasks and has 2897 samples.
We train either GPT2 (124M parameters) or Llama2 (7B parameters) for a few epochs with the AdamW optimizer (lr=1e-5), using batches of 16 examples.
To study how LLMs memorize examples, we add an extra example that does not belong to the dataset. While the order of the other examples changes at each epoch, this one is always seen at the last step of each epoch.
We study the evolution of its perplexity throughout training. In short, perplexity measures how likely an LLM considers a sentence to be; see here if you are not familiar with this notion.
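As a rough illustration, here is a minimal sketch of how the perplexity of a single sentence can be computed with a Hugging Face causal language model (the choice of GPT2 and the absence of batching are simplifications):

```python
# Minimal sketch: perplexity = exp(average negative log-likelihood of the tokens)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def sentence_perplexity(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    # The Hugging Face causal LM loss is the mean cross-entropy over tokens
    outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()

print(sentence_perplexity("John Doe suffers from a severe case of pancreatic cancer"))
```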
In order to investigate memorization, we use two different examples across our experiments: one is semantically close to the dataset samples (“John Doe suffers from a severe case of pancreatic cancer”), which we will refer to as the `medical example`, and the other is a complete outlier (“I am going to watch Oppenheimer next to Bastille”), which we will refer to as the `outlier`.
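To make the protocol concrete, here is a hedged sketch of the training loop described above (GPT2 shown; dataset loading, tokenization details and the number of epochs are assumptions, not the exact experimental code):

```python
# Sketch: batches of 16, AdamW with lr=1e-5, regular samples reshuffled every
# epoch, the canary always trained on as the last step of each epoch, and its
# perplexity logged at every step.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # or a Llama2-7B checkpoint
CANARY = "I am going to watch Oppenheimer next to Bastille"  # the `outlier`
NUM_EPOCHS = 5  # illustrative

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

texts = ["..."]  # placeholder for the 2897 pharmacovigilance samples

def lm_loss(batch_texts):
    enc = tokenizer(batch_texts, return_tensors="pt", padding=True, truncation=True)
    labels = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    return model(**enc, labels=labels).loss

canary_perplexity = []
for epoch in range(NUM_EPOCHS):
    # The order of the regular examples changes at every epoch
    for batch in DataLoader(texts, batch_size=16, shuffle=True):
        loss = lm_loss(list(batch))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        with torch.no_grad():  # log the canary perplexity at every step
            canary_perplexity.append(torch.exp(lm_loss([CANARY])).item())
    # The canary is always the last training step of the epoch
    loss = lm_loss([CANARY])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```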
II. Some observations from GPT2
Let us first look at GPT2. In the following graph, we plot the perplexity of each example against the training step. The dashed lines correspond to the step before the model sees the example during training. Each experiment is repeated multiple times (~5 per setup) and, in all the graphs, we show the mean perplexity with 95% confidence intervals.
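For reference, a minimal sketch of how such curves can be aggregated (the variable holding the logged perplexities and the number of runs are placeholders):

```python
# Aggregate repeated runs: mean perplexity with a 95% confidence interval
import numpy as np
import matplotlib.pyplot as plt

# all_runs_perplexities: placeholder list of shape (n_runs, n_steps), one row
# of logged canary perplexities per repeated run of the same setup
runs = np.array(all_runs_perplexities)
steps = np.arange(runs.shape[1])

mean = runs.mean(axis=0)
# 95% confidence interval on the mean (normal approximation)
ci = 1.96 * runs.std(axis=0, ddof=1) / np.sqrt(runs.shape[0])

plt.plot(steps, mean, label="mean perplexity")
plt.fill_between(steps, mean - ci, mean + ci, alpha=0.3)
plt.xlabel("training step")
plt.ylabel("perplexity")
plt.legend()
plt.show()
```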
Key observations:
- The medical example has a much lower perplexity, as it is semantically related to the other examples in the dataset.
- Before the example is ever seen by the model (i.e. up until the first dashed line), the perplexity of the medical example decreases while that of the outlier increases: this is understandable because the model is specialising on the dataset samples, so it generalizes well in the first case but does not expect the outlier.
- When the example is seen in the training set, there is in both cases a strong decrease in perplexity. After that, the perplexity is pretty much constant until the example is seen again, so that we obtain a step curve.
Interesting remarks:
- The minimum perplexity is not reached at the very next step after the example is seen, but rather around 20 steps later.
- We might expect the model to forget the outlier as it sees more and more medical examples but this does not seem to be the case.
In fact, these two observations are due to the state of the AdamW optimizer: its momentum retains information about the example for more than a single step. To prove this, we plot in the next graph the same curves obtained when AdamW is replaced by plain SGD.
As expected with SGD, the perplexity is noisier, but we can still see that, for the outlier example, it decreases only at the step where the example is seen and then increases again.
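For completeness, the only change with respect to the earlier training sketch is the optimizer; a minimal illustration (the learning rate is kept only for illustration and may differ from the actual experiment):

```python
# Plain SGD without momentum: no optimizer state carries information about an
# example beyond the step where it is seen.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.0)
```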
Altogether, this reinforces the idea that language models learn specific examples from the training set, as already documented in the literature. Let us now focus on the hot topic: how does this change when we switch to Llama2?
III. Llama2 learns more and faster
We plot the same curves for Llama2 and GPT2:
- Llama2 only needs to see the example 4 times to reach a perplexity of 4 for the medical example and 9 for the outlier, which is almost 10 times lower than the corresponding GPT2 values.
- Llama2 specialises more than GPT2: at the beginning of training, the perplexity of the medical example slightly increases and then stays constant, suggesting that the model is already separating this example from the rest.
- For both models, the example is partly memorized, as the perplexity remains constant over the following steps, when different samples are seen.
Altogether, this shows that Llama2 is a powerful learner, capable of memorizing an example almost by heart after seeing it only a few times. This poses a serious privacy threat: anyone with access to the model weights could easily determine whether specific sentences were in the training set!
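To illustrate the risk, here is a hedged sketch of a simple perplexity-based membership check; the model path and the decision threshold are purely illustrative, not a description of an actual attack we ran:

```python
# Compare the perplexity of a candidate sentence under the fine-tuned model and
# under the original pretrained model: a large drop suggests memorization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
pretrained = AutoModelForCausalLM.from_pretrained("gpt2")
finetuned = AutoModelForCausalLM.from_pretrained("path/to/finetuned-model")  # hypothetical path

@torch.no_grad()
def ppl(model, text):
    enc = tokenizer(text, return_tensors="pt")
    return torch.exp(model(**enc, labels=enc["input_ids"]).loss).item()

candidate = "John Doe suffers from a severe case of pancreatic cancer"
ratio = ppl(finetuned, candidate) / ppl(pretrained, candidate)
# Illustrative threshold only
print("likely in the training set" if ratio < 0.2 else "probably not seen in training")
```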
At Sarus, we address this issue by training the model with differential privacy, which provides a strong mathematical guarantee that such privacy leaks will not happen. The good news is that with models as large as Llama2, it is possible to keep great performance! Check out our related blog post.
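As an illustration only, and not necessarily the approach used at Sarus, differentially private training (DP-SGD) can be sketched with the Opacus library; the noise multiplier, clipping norm and placeholder dataset are assumptions, and large transformer models may need extra adaptation to be fully compatible:

```python
# Sketch of DP-SGD fine-tuning with Opacus: per-sample gradient clipping plus
# calibrated noise gives a formal differential privacy guarantee.
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Placeholder dataset of tokenized samples; in practice this would be the
# tokenized fine-tuning corpus
train_dataset = TensorDataset(torch.randint(0, 50257, (64, 128)))
train_loader = DataLoader(train_dataset, batch_size=16)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,   # controls the privacy/utility trade-off
    max_grad_norm=1.0,      # per-sample gradient clipping norm
)
# Training then proceeds as usual; gradients are clipped per sample and noised.
```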