Generate Time-Series Data with Fine-Tuned Mistral 7B Instruct

Small open-weights vs OpenAI API for synthetic data

AI
Mistral
Fine Tuning
Synthetic Data
Privacy
GenAI
Technical Deep Dive
Nicolas Grislain

Following a recent and successful attempt at using the OpenAI fine-tuning API to generate synthetic data, we wondered how far we could get with a smaller open-weights model such as Mistral 7B Instruct.

Fine-tuning the model yourself is more involved than calling the OpenAI API, but it remains relatively simple thanks to Mistral’s open-source fine-tuning code.

Overall, the results we obtained were a bit less convincing than with GPT-3.5, but still impressive given the relative size of the model (7B parameters vs 175B).

Quick reminder of the goal

As described in our previous post, we want to generate synthetic time-series of hourly electricity consumption. Generating such synthetic data with a fine-tuned GPT-3.5 gave excellent results:

Percentiles (10%, 50% and 90%) across all 100 original series (blue) vs generated series (orange)

We want to perform the same task with Mistral 7B Instruct.

Setup of the experiment

The data is organized as JSON messages, in the same format as in the previous experiment:

{
  "messages": [
    {"role": "system", "content": "Given a meter ID, you return a series of hourly consumptions given as a json string."},
    {"role": "user", "content": "{\"item_id\": \"MT_001\"}"},
    {"role": "assistant", "content": "{\"consumption\": [-0.194, -0.089, -0.194, -0.089, -0.141, -0.141, -0.089, -0.194, -0.456, -0.036, 1.751, 0.7, 0.122, -0.562, 0.7, 1.068, -0.509, 1.015, 1.699, -0.194, 0.437, 2.172, 1.12, 0.7, -0.246, -0.194, -0.246, -0.194, -0.246, -0.299, -0.351, -0.351, 1.383, 1.909, 1.856, 2.434, 2.382, 1.699, 2.645, 1.856, 2.855, 0.647, 1.12, -0.246, 0.858, 0.963, 0.279, 0.91, -0.194, -0.246, -0.194, -0.246, -0.351, -0.299, -0.299, 0.174, 1.856, 1.804, 1.699, 2.014, 2.014, 0.7, 1.015, 2.014, 2.96, -0.036, 2.224, 0.332, 0.858, 2.066, 0.858, 1.331, -0.089, -0.036, -0.141, -0.089, -0.141, -0.194, -0.194, 0.174, 1.751, 2.224, 2.54, 2.172, 1.856, 2.277, 0.437, 0.332, 1.751, 3.328, 2.277, 0.174, 0.069, 2.277, 0.647, 0.91, 0.91, 0.91, 0.752, 2.066]}"}
  ]
}
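For reference, below is a minimal sketch of how such a JSONL training file could be assembled; the `series` dictionary of standardized hourly consumption values and the rounding to three decimals are assumptions, chosen to match the format above.

```python
import json

SYSTEM_PROMPT = (
    "Given a meter ID, you return a series of hourly consumptions given as a json string."
)

def make_example(item_id: str, values: list[float]) -> dict:
    """Wrap one meter's standardized hourly consumption values into a chat-style example."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": json.dumps({"item_id": item_id})},
            {"role": "assistant", "content": json.dumps({"consumption": [round(v, 3) for v in values]})},
        ]
    }

# `series` maps meter IDs to standardized hourly consumption values (assumption, truncated here).
series: dict[str, list[float]] = {"MT_001": [-0.194, -0.089, -0.194, -0.089]}

with open("mistral_finetuning_train.jsonl", "w") as f:
    for item_id, values in series.items():
        f.write(json.dumps(make_example(item_id, values)) + "\n")
```

The same loop, run over a held-out subset of meters, produces the mistral_finetuning_test.jsonl evaluation file referenced in the configuration below.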

To run the fine-tuning task, we provisioned a virtual machine on AWS with 4 A10 GPUs (g5.12xlarge instance), then followed the instructions from mistral-finetune with the following parameters:

batch_size: 1
ckpt_freq: 180
data:
  data: ''
  eval_instruct_data: mistral_finetuning_test.jsonl
  instruct_data: mistral_finetuning_train.jsonl
eval_freq: 9
log_freq: 1
lora:
  rank: 64
max_steps: 180
model_id_or_path: mistral_models/7B_instruct
no_eval: false
optim:
  lr: 6.0e-05
  pct_start: 0.05
  weight_decay: 0.1
run_dir: mistral_run-2024-06
save_adapters: false
seed: 0
seq_len: 2048
wandb:
  key: xxxxxx
  offline: false
  project: arena-tests
  run_name: run-2024-06
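
Once this configuration is saved to a YAML file, training is launched with torchrun as described in the mistral-finetune README. The snippet below is a sketch of that launch driven from Python; the config file name and the location of the cloned repository are assumptions.

```python
import subprocess

# Launch LoRA fine-tuning on the 4 A10 GPUs; equivalent to running
# `torchrun --nproc-per-node 4 -m train run-2024-06.yaml` from a shell
# inside the mistral-finetune repository (paths are assumptions).
subprocess.run(
    ["torchrun", "--nproc-per-node", "4", "-m", "train", "run-2024-06.yaml"],
    cwd="mistral-finetune",
    check=True,
)
```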

Finally, we used mistral-inference to generate 200 new series. Descriptive statistics (percentiles, cross-correlation, etc.) are shown below:

Percentiles (10%, 50% and 90%) across all 100 original series (blue) vs generated series (orange)
XY plot of consumption at hour 0 vs hour 24 for original data (blue) vs generated series (orange)

The distribution of values is strongly shifted toward lower values, but the cross-correlation seems to be relatively faithful.
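
For completeness, here is a rough sketch of how one series can be sampled from the fine-tuned model with mistral-inference; the checkpoint and tokenizer paths, sampling temperature and token budget are assumptions (with save_adapters set to false, mistral-finetune writes merged weights that can be loaded directly).

```python
import json

from mistral_common.protocol.instruct.messages import SystemMessage, UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_inference.generate import generate
from mistral_inference.transformer import Transformer

# Paths are assumptions: the tokenizer shipped with the base model and the
# last merged checkpoint written by mistral-finetune.
tokenizer = MistralTokenizer.from_file("mistral_models/7B_instruct/tokenizer.model.v3")
model = Transformer.from_folder("mistral_run-2024-06/checkpoints/checkpoint_000180/consolidated")

request = ChatCompletionRequest(
    messages=[
        SystemMessage(content="Given a meter ID, you return a series of hourly consumptions given as a json string."),
        UserMessage(content=json.dumps({"item_id": "MT_001"})),
    ]
)
tokens = tokenizer.encode_chat_completion(request).tokens

out_tokens, _ = generate(
    [tokens],
    model,
    max_tokens=1024,
    temperature=0.7,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
# The model is expected to answer with a JSON string such as {"consumption": [...]}.
consumption = json.loads(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))["consumption"]
```

Repeating the call over a list of meter IDs yields the 200 generated series compared above.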

Conclusion

Fine-tuning-based synthetic data seems to show acceptable statistical properties, even with a relatively small model.

Of course, although quick tests show that none of the original data appears in the generated dataset, this approach offers no privacy guarantee. For very sensitive data, you should rely on formal guarantees of privacy such as Differential Privacy for your synthetic data. Sarus Technologies provides a service for LLM fine-tuning with differential privacy.

If you want to play with this experiment, you can find the code there.

This post is one in a series on AI and privacy: how to use AI, and in particular commercial LLMs (for in-context learning, RAG or fine-tuning), with some privacy guarantees, but also how AI and LLMs can help us solve privacy challenges. If you are interested in knowing more about existing AI-with-privacy solutions, contact us and try our open-source framework: Arena (WIP).

About the author

Nicolas Grislain

Cofounder & CSO @ Sarus
