Marketing segmentation challenge: create value while protecting highly sensitive data
Market segmentation is a crucial marketing strategy that involves dividing a heterogeneous market into homogeneous groups based on certain shared characteristics, such as age, gender, or income. By utilizing highly personal features, marketers can create clusters to target each group with personalized promotions, coupons, and other relevant information. However, accessing these personal features poses significant privacy risks, so they are often off-limits to the marketer.
This demo shows how, thanks to Sarus, a marketing team can activate a market segmentation for a cosmetics brand using data from a grocery retailer, without compromising the privacy of individuals.
Implementing market segmentation and activation with Sarus
What is Sarus?
Sarus is a solution which allows analysts to work on sensitive data without seeing it in the process. They can conduct statistical analysis and develop machine learning models to extract insights and make decisions without compromising the privacy of individuals.
We’ll see why this is a secure and efficient way to design a market segmentation strategy.
Retail dataset
Find the data on our GitHub here.
In this demo, we are using a subset of the retail dataset Completejourney (you can find a full description here). The original dataset contains household-level transactions over one year from a group of 2,469 households who are frequent shoppers at a grocery store. We will use only the data from the 800 households for which demographic information is available.
The demo uses three tables:
The demographics table provides detailed information about the socio-economic and demographic characteristics of households. The transactions and products tables describe the purchases made by households, including the amount spent, when and where they made the purchase, product category, and origin.
Data sensitivity
According to this scientific article, “99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes”. The retail dataset we will be using is therefore very sensitive. Moreover, it also contains insights about households’ consumption habits, which make re-identification even easier.
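To make the re-identification risk concrete, here is a minimal sketch that measures how many rows become unique as demographic attributes are combined. The table, column names and values are invented for illustration and are not taken from the retail dataset.

```python
import pandas as pd

# Toy demographic table; column names and values are illustrative only.
households = pd.DataFrame({
    "household_id": [1, 2, 3, 4, 5, 6],
    "age_band": ["25-34", "25-34", "35-44", "35-44", "45-54", "45-54"],
    "income_band": ["50-74K", "50-74K", "50-74K", "75-99K", "35-49K", "35-49K"],
    "household_size": [2, 2, 3, 3, 1, 4],
})


def unique_fraction(df, quasi_identifiers):
    """Share of rows whose combination of quasi-identifiers appears exactly once."""
    # keep=False flags every member of a duplicated group, so ~ keeps only singletons.
    is_unique = ~df.duplicated(subset=quasi_identifiers, keep=False)
    return float(is_unique.mean())


one_attr = unique_fraction(households, ["age_band"])
three_attrs = unique_fraction(
    households, ["age_band", "income_band", "household_size"]
)
print(f"unique with 1 attribute:  {one_attr:.2f}")
print(f"unique with 3 attributes: {three_attrs:.2f}")
```

In this toy table no household is unique on age band alone, but most become unique once three attributes are combined, which is the effect the quoted figure describes at population scale.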
Dataset preparation
To conduct this market segmentation, the data administrator from the retailer, who collected and owns the data, first onboards the dataset in Sarus. During this step, the application automatically generates Synthetic Data with a deep-learning generative model that preserves types and links between tables (learn more about the model in this article). A precise understanding of the data is very important for the analyst, and the synthetic data provides it by exposing the structure, types and links between tables.
The data administrator from the retailer also customizes the Privacy Policy applied to the data scientist from the marketing team, specifying the results they will receive. In our case, they will have access to the Synthetic Data, Differentially Private results and the right to activate IDs.
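For readers unfamiliar with Differentially Private results, the sketch below shows the general idea using the textbook Laplace mechanism on a simple count. It is an illustration only: the function name `dp_count` and the choice of epsilon are assumptions, and this is not a description of how Sarus computes its results.

```python
import numpy as np


def dp_count(values, epsilon, rng):
    """Return a noisy count using the Laplace mechanism.

    Adding or removing one household changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    scale = 1.0 / epsilon
    return len(values) + rng.laplace(loc=0.0, scale=scale)


rng = np.random.default_rng(seed=0)
household_ids = list(range(800))  # stand-in for the 800 households
noisy = dp_count(household_ids, epsilon=1.0, rng=rng)
print(f"true count: {len(household_ids)}, noisy count: {noisy:.1f}")
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon gives results closer to the true value.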
The data scientist now has access to the dataset called ‘retail_data’ and can conduct the market segmentation using Python.
Extract, explore, pre-process and segment the dataset
Notebook intro
You can find the notebook here!
First, the analyst connects to the instance where the dataset has been set up by the data administrator:
The analyst will be using pandas, numpy and scikit-learn to manipulate the remote dataset. For this, the respective APIs from the Sarus libraries must be imported.
Explore the tables
Once the dataset has been selected, the analyst can see all its tables and get samples from each of them.
Note the output message: ‘Evaluated from synthetic data only’. Indeed, the analyst’s Privacy Policy does not allow them to see raw data, therefore the Sarus API returns the best alternative output according to the Policy: rows of synthetic data.
Extract the relevant data with a SQL query and process them with pandas
After doing some exploration on the different tables, the data scientist can create the view they want to work on using a SQL query.
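As a rough idea of what such a view could look like, here is a self-contained sketch that runs a comparable query locally with SQLite and pandas. The three tables mirror the ones described above, but the column names, the `COSMETICS` department label and the sample rows are assumptions made for the example; in the demo the query runs against the remote dataset through Sarus, not against a local database.

```python
import sqlite3

import pandas as pd

# In-memory stand-ins for the three tables (hypothetical columns and rows).
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "household_id": [1, 2, 3],
    "age": ["25-34", "35-44", "45-54"],
    "income": ["50-74K", "35-49K", "75-99K"],
}).to_sql("demographics", conn, index=False)
pd.DataFrame({
    "household_id": [1, 1, 2, 3, 3],
    "product_id": [10, 11, 10, 11, 11],
    "sales_value": [5.0, 12.5, 3.0, 7.5, 2.5],
}).to_sql("transactions", conn, index=False)
pd.DataFrame({
    "product_id": [10, 11],
    "department": ["GROCERY", "COSMETICS"],
}).to_sql("products", conn, index=False)

# One row per household: demographics plus total and cosmetics spend.
query = """
SELECT d.household_id,
       d.age,
       d.income,
       SUM(t.sales_value) AS total_spend,
       SUM(CASE WHEN p.department = 'COSMETICS'
                THEN t.sales_value ELSE 0.0 END) AS cosmetics_spend
FROM demographics AS d
JOIN transactions AS t ON t.household_id = d.household_id
JOIN products AS p ON p.product_id = t.product_id
GROUP BY d.household_id, d.age, d.income
ORDER BY d.household_id
"""
view = pd.read_sql_query(query, conn)
print(view)
```

Aggregating to one row per household keeps the view aligned with the unit we want to segment.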
Now that the relevant data have been selected, the data scientist still has to correct the format in order to train a Machine Learning algorithm.
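A typical version of this step, sketched here with plain pandas and scikit-learn on a small made-up frame, is to set the identifier aside, one-hot encode categorical columns and standardize numeric ones. The column names are assumptions, and the exact transformations used in the notebook may differ.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical per-household view; columns are illustrative only.
view = pd.DataFrame({
    "household_id": [1, 2, 3, 4],
    "age": ["25-34", "35-44", "25-34", "45-54"],
    "total_spend": [120.0, 80.0, 200.0, 40.0],
    "cosmetics_spend": [30.0, 0.0, 55.0, 5.0],
})

# Keep the identifier aside: it must not be used as a feature.
ids = view["household_id"]
features = view.drop(columns=["household_id"])

# One-hot encode the categorical column so every feature is numeric.
features = pd.get_dummies(features, columns=["age"], dtype=float)

# Standardize spend columns so neither dominates distance computations.
numeric_cols = ["total_spend", "cosmetics_spend"]
features[numeric_cols] = StandardScaler().fit_transform(features[numeric_cols])
print(features.round(2))
```

Scaling matters for distance-based algorithms such as KMeans, where a column with large raw values would otherwise outweigh the others.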
Here we trained a KMeans model to create clusters and mapped each household to the corresponding group.
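The clustering step can be sketched as follows with scikit-learn on synthetic, well-separated points. The feature names, the generated data and the choice of two clusters are assumptions for illustration; the notebook applies the equivalent calls to the prepared remote data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Two artificial groups of households: low spenders and high spenders.
rng = np.random.default_rng(seed=0)
low = rng.normal(loc=[20.0, 2.0], scale=1.0, size=(20, 2))
high = rng.normal(loc=[80.0, 30.0], scale=1.0, size=(20, 2))
features = pd.DataFrame(
    np.vstack([low, high]), columns=["total_spend", "cosmetics_spend"]
)
household_ids = pd.Series(range(1, 41), name="household_id")

# Fit KMeans and assign each household to its nearest cluster centre.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(features)

# Map every household ID to its segment.
segments = pd.DataFrame({"household_id": household_ids, "segment": labels})
print(segments.groupby("segment").size())
```

The resulting `segments` frame is the mapping from household to group that later steps rely on.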
Activate this list of IDs
Now, we’re ready to send the list of IDs to a third-party tool.
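The exact call depends on the third-party tool, which is not specified here. As a purely hypothetical sketch, the segment assignments could be grouped into one audience per segment and serialized as JSON before being pushed; the payload shape and the names `build_audiences` and `segment_<n>` are invented for this example.

```python
import json

# Hypothetical household-to-segment assignments.
segments = [
    {"household_id": 101, "segment": 0},
    {"household_id": 102, "segment": 1},
    {"household_id": 103, "segment": 0},
]


def build_audiences(rows):
    """Group household IDs into one audience list per segment."""
    audiences = {}
    for row in rows:
        name = f"segment_{row['segment']}"
        audiences.setdefault(name, []).append(row["household_id"])
    return audiences


# Serialized body that a marketing tool's import endpoint might accept.
payload = json.dumps({"audiences": build_audiences(segments)}, sort_keys=True)
print(payload)
```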
And it’s done: all the insights have been pushed to the third-party tool, and the digital marketing team can instantly start using them!
Conclusion and benefits
The data scientist, armed with their trusty libraries, was able to carry out their analytical work in the usual way without ever accessing the raw data. They were able to define two specific audiences and use them in their marketing campaigns. The process adhered to the highest standards of data protection, ensuring that no personal information was exposed or leaked.