Yesterday, a friend mentioned that he’s organizing a hackathon for students focused on machine learning. He has some great data from a retailer and interesting questions about customer purchase patterns. The challenge, however, is that he cannot disclose any personal data, and unfortunately for him, high-dimensional time-series data cannot be anonymized so easily. Indeed, the simplest approaches, such as pseudonymization, are dangerously weak when it comes to time series, because a purchase trajectory is almost a signature.
Synthetic data can help. Modern generative models can generate records with the same probability distribution as a given dataset. In our case, we are talking about a high-dimensional distribution in a time-series space, which is a real challenge.
Because we are time-constrained, we tried to use a public LLM for the task.
In this post, we use a public time-series dataset to show that few-shot learning is not so easy to get right, but that fine-tuning, in contrast, is surprisingly powerful.
Few-Shot Learning from a Thousand Time Series
To test different approaches, we build a time-series dataset from a public dataset of hourly electricity consumption.
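As an illustration, here is a minimal sketch of the kind of preprocessing involved. The file name, column layout, and the 24-hour window length are assumptions for the example, not necessarily the exact setup of our notebooks:

```python
import pandas as pd
import numpy as np

# Assumed layout: a CSV with one timestamped row per hour and one column per meter.
raw = pd.read_csv("hourly_consumption.csv", index_col=0, parse_dates=True)

SERIES_LENGTH = 24  # one day of hourly readings per example

def make_series(df: pd.DataFrame, length: int = SERIES_LENGTH) -> np.ndarray:
    """Slice each meter's history into non-overlapping windows of `length` hours."""
    windows = []
    for col in df.columns:
        values = df[col].dropna().to_numpy()
        n_windows = len(values) // length
        windows.extend(values[: n_windows * length].reshape(n_windows, length))
    return np.array(windows)

series = make_series(raw)
print(series.shape)  # (number of series, SERIES_LENGTH)
```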
We then build a prompt with 10 examples:
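A sketch of what such a prompt might look like, reusing the `series` array from the previous snippet (the exact wording and number formatting are illustrative):

```python
rng = np.random.default_rng(0)

def format_series(s: np.ndarray) -> str:
    # Render one series as a comma-separated line of rounded values.
    return ", ".join(f"{x:.1f}" for x in s)

# Pick 10 example series at random; roughly what fits in the context window here.
examples = series[rng.choice(len(series), size=10, replace=False)]

prompt = (
    "Here are 10 examples of hourly electricity consumption time series, "
    "one per line:\n"
    + "\n".join(format_series(s) for s in examples)
    + "\n\nGenerate 10 new, different series drawn from the same distribution, "
    "in the same format, one per line."
)
```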
We submit this prompt and get 10 generated series. Unfortunately, although not exactly equal, the generated series are almost identical to the ones we input.
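For reference, submitting the prompt and checking the result looks roughly like the sketch below. The parsing assumes the model respects the requested format (which, as noted later, it did not always do) and that each generated series has the same length as the examples:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,
)

# Parse the answer back into arrays, one series per line.
generated = [
    np.array([float(x) for x in line.split(",")])
    for line in response.choices[0].message.content.strip().splitlines()
    if line.strip()
]

# Distance from each generated series to its closest example in the prompt:
# very small distances mean the model is essentially copying its inputs.
for g in generated:
    print(np.min(np.linalg.norm(examples - g, axis=1)))
```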
After several unsuccessful attempts (some where the output format was not respected, some with shorter-than-expected series, some with totally random values), and because the context length forces us to select only a few examples to teach GPT-4 from, we realized that few-shot learning may not be fit for the task.
It’s time to get back to good old gradient descent: we will test the OpenAI fine-tuning API.
Fine-Tuning GPT-3.5-turbo into a Specialized Time-Series Generator
Based on the same dataset (hourly electricity consumption), we generate a training split and a validation split suitable for the fine-tuning API.
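Each example goes into a JSONL file in the chat fine-tuning format, with the series as the assistant message. A minimal sketch, reusing the `series` and `format_series` helpers from the previous snippets (the system and user prompts and the 90/10 split are our assumptions here):

```python
import json

SYSTEM = "You generate hourly electricity consumption time series."
USER = "Generate one series of 24 hourly consumption values, comma-separated."

def to_example(s: np.ndarray) -> dict:
    # One training example: the completion the model should learn is the series itself.
    return {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": USER},
            {"role": "assistant", "content": format_series(s)},
        ]
    }

split = int(0.9 * len(series))
for path, subset in [("train.jsonl", series[:split]), ("valid.jsonl", series[split:])]:
    with open(path, "w") as f:
        for s in subset:
            f.write(json.dumps(to_example(s)) + "\n")
```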
We upload the files and launch the fine-tuning job.
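With the OpenAI Python client, this looks roughly like:

```python
# Upload both splits, then start a fine-tuning job on gpt-3.5-turbo.
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
valid_file = client.files.create(file=open("valid.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="gpt-3.5-turbo",
    training_file=train_file.id,
    validation_file=valid_file.id,
)
print(job.id, job.status)
```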
We can then generate new series:
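Once the job has succeeded, sampling from the fine-tuned model is just a matter of calling it with the same prompts used at training time. The sketch below assumes the variables from the previous snippets:

```python
# The fine-tuned model name is filled in by the API once the job succeeds.
job = client.fine_tuning.jobs.retrieve(job.id)
fine_tuned_model = job.fine_tuned_model

completions = client.chat.completions.create(
    model=fine_tuned_model,
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": USER},
    ],
    temperature=1.0,
    n=10,  # sample 10 independent series
)
synthetic = [
    np.array([float(x) for x in c.message.content.split(",")])
    for c in completions.choices
]
```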
The output looks rather good:
We can check that the generated series are all rather different from the series in the training set
By comparing each generated series to its closest element in the training dataset (closest in the l²-norm sense), we show that all the generated series are original (rather far from their closest element in the training set).
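A sketch of that check, assuming the synthetic series have the same length as the training windows:

```python
# Distance from each synthetic series to its nearest neighbour in the training set.
train = series[:split].astype(float)

def nearest_l2(s: np.ndarray, reference: np.ndarray) -> float:
    return float(np.min(np.linalg.norm(reference - s, axis=1)))

distances = [nearest_l2(s, train) for s in synthetic]
print(min(distances), np.mean(distances))  # 0 would mean an exact copy of a training series
```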
Nevertheless, the generated data mimics most of the statistics of the training dataset
By computing basic statistics (mean, median, some percentiles) and some cross-correlations, we show that the generated series have a distribution relatively close to that of the original data.
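A sketch of the kind of comparison we ran; the exact statistics shown here are illustrative:

```python
def describe(data: np.ndarray) -> pd.DataFrame:
    flat = np.asarray(data, dtype=float)
    return pd.DataFrame({
        "mean": [flat.mean()],
        "median": [np.median(flat)],
        "p10": [np.percentile(flat, 10)],
        "p90": [np.percentile(flat, 90)],
    })

print(describe(train))
print(describe(np.array(synthetic)))

# Hour-to-hour correlation structure, averaged over series, for both datasets.
def mean_autocorr(data: np.ndarray, lag: int = 1) -> float:
    data = np.asarray(data, dtype=float)
    return float(np.mean([np.corrcoef(s[:-lag], s[lag:])[0, 1] for s in data]))

print(mean_autocorr(train), mean_autocorr(np.array(synthetic)))
```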
Conclusion
Fine-tuning-based synthetic data seems to show good statistical properties
Although this assertion would require much more investigation, it seems from the first quick descriptive statistics we computed that the synthetic data we generated represents the original data distribution quite faithfully.
To go further, we should compare how typical downstream tasks (regression, classification, clustering) behave on both synthetic and original data to confirm this.
You can play with fine-tuning and few-shot learning yourself
The notebooks are available on GitHub.
Feel free to improve on this work and let us know what you found.
Our synthetic data does not contain any of the original confidential data points, but you need differential privacy for serious applications
We show, by a quick similarity analysis, that the generated samples are all different from those in the original dataset. This is rather reassuring and gives some confidence that privacy is relatively preserved.
Nevertheless, there is no formal guarantee that privacy is actually protected. There are countless examples of re-identification attacks in the academic literature.
For very sensitive data you should rely on formal guarantees of privacy such as Differential Privacy. Sarus Technologies provides a service for LLM fine-tuning with differential privacy.
This post is one in a series of posts on AI and privacy: how to use AI, and in particular commercial LLMs (for in-context learning, RAG, or fine-tuning), with some privacy guarantees, but also how AI and LLMs can help us solve privacy challenges. If you are interested in knowing more about existing AI-with-privacy solutions, contact us and try our open-source framework: Arena (WIP).
See also: