After a year of experimenting with Retrieval-Augmented Generation (RAG), organizations are now moving to deployment, bringing privacy and security to the forefront. RAG indeed introduces multiple avenues for data leakage. Let's look at what to pay attention to and at the best mitigation strategies.
Here are the critical points you need to consider.
1. Handling private information in prompts
What it is: A RAG application starts with input submitted by a user or an automated process. This input may contain anything, including sensitive information such as medical data or personally identifying information. It is then passed on to various services within the RAG application, where it is processed and may even be stored in unexpected places.
Why it matters: This information may fall under strict data protection regulations. Once it enters your systems, it can propagate across services, creating a spillover effect for privacy risk.
Recommendations:
- Mask any direct identifiers that are not required downstream (see the sketch after this list)
- Make sure the terms and conditions of the service reflect the expected sensitivity of input data, and that your users understand them
- Apply safeguards to detect unauthorized prompts
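As an illustration, here is a minimal masking sketch using toy regex patterns. A production system would more likely rely on a dedicated PII-detection library or an NER model; the patterns and placeholder format below are assumptions made for the example.

```python
import re

# Illustrative patterns only; real deployments need broader coverage
# (names, addresses, medical record numbers, ...) via NER or a PII library.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def mask_direct_identifiers(prompt: str) -> str:
    """Replace direct identifiers with typed placeholders before the
    prompt reaches the retriever, the LLM, or any log sink."""
    masked = prompt
    for label, pattern in PII_PATTERNS.items():
        masked = pattern.sub(f"[{label}]", masked)
    return masked

print(mask_direct_identifiers(
    "Patient john.doe@example.com, SSN 123-45-6789, asked about dosage."
))
# -> Patient [EMAIL], SSN [SSN], asked about dosage.
```

Masking at this entry point means every downstream component (vector store, LLM provider, logs) only ever sees the redacted text.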
2. Securing the knowledge base
What it is: A RAG application relies on a knowledge base from which information is retrieved on request. Access rights to the knowledge base may differ between users of the application. The knowledge base may also be updated over time, and controlling what confidential information it contains can prove challenging.
Why it matters: Information in the knowledge base is quite likely to end up in responses, and exactly how is hard to anticipate, since both the retrieval process and the generation process are highly non-deterministic. If unaddressed, the RAG application may funnel the entire knowledge base out through this channel!
Recommendations:
- Apply access control policies to all elements in the knowledge base (see the sketch after this list)
- Mask any sensitive information before computing its embedding for semantic indexing into the knowledge base
- Consider guardrails that filter out prompts showing suspicious intent
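To make the access-control recommendation concrete, here is a minimal sketch in which every chunk carries an ACL attached at ingestion time, and retrieval filters on it before anything enters the LLM context. The `search` method and the group-based ACL model are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    allowed_groups: set = field(default_factory=set)  # ACL attached at ingestion

def retrieve_for_user(query_embedding, index, user_groups: set, k: int = 5):
    """Over-fetch from the vector index, then drop every chunk the caller
    is not entitled to see, before anything enters the LLM context."""
    candidates = index.search(query_embedding, k * 4)  # assumed index API
    authorized = [c for c in candidates if c.allowed_groups & user_groups]
    return authorized[:k]

class FakeIndex:
    """Stand-in for a real vector store, for demonstration only."""
    def __init__(self, chunks):
        self.chunks = chunks

    def search(self, _query_embedding, k):
        return self.chunks[:k]

index = FakeIndex([
    Chunk("Public FAQ entry", {"everyone"}),
    Chunk("Board meeting minutes", {"executives"}),
])
print([c.text for c in retrieve_for_user(None, index, {"everyone"})])
# -> ['Public FAQ entry']
```

Filtering at retrieval time, rather than trusting the prompt or the model, keeps a user's answer from being grounded in documents they could never open directly.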
3. Protecting training data
What it is: At the core of a RAG-based application is a large language model. While off-the-shelf models should not contain your private information, they are often fine-tuned to adapt to new tasks or to improve performance (e.g., using user feedback). Fine-tuning feeds the model new data and adjusts its parameters, so the model ends up learning something from the data it has been exposed to.
Why it matters: Once data has been memorized by the model, there is no known method to remove it. Someone with access to the model may be able to extract large portions of the fine-tuning dataset, and this data may surface in unexpected ways in the output. If fine-tuning is done carelessly, meeting regulatory requirements such as the right to be forgotten may no longer be possible.
Recommendations:
- Remove directly identifying information from the fine-tuning dataset
- Apply differential privacy during the fine-tuning process, e.g., via DP-SGD (see the sketch after this list)
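As a sketch, differential privacy can be added to a standard PyTorch training loop with the Opacus library, which implements DP-SGD (per-sample gradient clipping plus calibrated Gaussian noise). The toy linear model and the hyperparameters below are placeholders; with an LLM you would typically apply this to the trainable adapter or head layers only.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-in for the model being fine-tuned.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()
loader = DataLoader(
    TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,))),
    batch_size=32,
)

# DP-SGD: clip each example's gradient, then add calibrated noise, so that
# no single fine-tuning record can dominate any parameter update.
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,  # more noise -> stronger privacy, lower utility
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

for features, labels in loader:  # one fine-tuning epoch
    optimizer.zero_grad()
    criterion(model(features), labels).backward()
    optimizer.step()

print(f"epsilon spent: {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```

The resulting epsilon quantifies how much any individual record can influence the released model, which is exactly the guarantee that limits memorization-based extraction.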
4. Overall compliance
What it is: A RAG application has a lot of moving pieces. It manipulates information from the prompt, from the private documents, and potentially from the fine-tuning stage.
Why it matters: The data flowing through a RAG-based architecture can serve a wide range of tasks. Every step along the way raises security and compliance questions that should be addressed individually. What is allowed may depend on the terms and conditions of the application itself, of the vector database, and of the LLM provider.
Recommendations:
- Address the data security and compliance questions of each individual step and of the application as a whole
- Apply guardrails that enforce each stakeholder's compliance constraints in a zero-trust approach (see the sketch after this list)
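As a sketch of what zero-trust guardrails might look like, the snippet below re-runs every stakeholder's policy at each stage of the pipeline instead of trusting upstream components to have checked already. The policies themselves are hypothetical placeholders for the example.

```python
from typing import Callable, List

Policy = Callable[[str], bool]

def enforce(payload: str, policies: List[Policy], stage: str) -> str:
    """Run every stakeholder's policy at this stage and fail closed."""
    for policy in policies:
        if not policy(payload):
            raise PermissionError(f"blocked at '{stage}' by {policy.__name__}")
    return payload

# Hypothetical policies, one per stakeholder.
def app_terms(text: str) -> bool:           # the application's own rules
    return "ssn" not in text.lower()

def llm_provider_terms(text: str) -> bool:  # stand-in for the provider's T&Cs
    return len(text) < 8_000

# The same checks run at every hop (ingestion, retrieval, generation)
# because no component is assumed to have vetted the data upstream.
prompt = enforce("What is our refund policy?",
                 [app_terms, llm_provider_terms], stage="ingestion")
print("prompt admitted:", prompt)
```

Failing closed at every stage means a compliance gap in one component cannot silently propagate through the rest of the pipeline.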
By taking these critical steps, you can significantly mitigate the privacy and security risks associated with deploying Retrieval-Augmented Generation (RAG) applications. Ensuring robust handling of private information, securing the knowledge base, protecting training data, and maintaining overall compliance will help safeguard your users' data and build trust in your system.
However, navigating the complexities of privacy and security in RAG implementations can be challenging. If you need expert guidance or have specific questions about securing your RAG applications, don't hesitate to reach out to us at Sarus. Our team of experts is ready to help you ensure your deployment is both secure and compliant.
Contact us today to learn more about how we can support your RAG projects and safeguard your data. Let's work together to create a safer digital environment for your organization and users.