Why anonymizing connected car data is so hard!

The anonymization challenge in car data

As the automotive industry continues to evolve, cars generate ever more data, reaching 8Gb per car daily in 2023, according to Ekimetrics. Some of this data presents unprecedented opportunities for research and enhanced user experiences. It also holds significant monetization potential: $250 to $400 billion in additional revenue generated from services and data sales, as well as cost savings enabled by car data, according to McKinsey. It allows companies to understand user behaviors and optimize marketing strategies.

However, leveraging car data poses important privacy risks, as it encompasses highly identifying information such as GPS locations and trip details, which are almost impossible to anonymize. Exploring such data responsibly requires robust privacy measures.

In this post, we will explore why anonymizing car data is so hard and propose an alternative framework to get the best of privacy and data potential.

The endless quest for car data anonymization

A typical connected car dataset includes details about the car, its usage, and information about every trip (GPS traces, usage of the car OS…), and possibly about the car owner too. Before putting this data to work, data custodians usually start by trying to anonymize it… as best they can. Let’s take a look.

The obvious start is to remove direct identifiers: driver details, license plate numbers… This is clearly not enough, as some details about the car may be unique (e.g., the combination of car make, color, and options), especially when combined with even a small amount of information about the location of the car (e.g., the zip code of most trips). One needs to go a little further. An improvement is to seek k-anonymity for each of those combinations (i.e., at least k cars should share the same combination). With k typically greater than 10, this considerably limits the granularity of car-level information that can be included (details about the car options, the color, or the purchase date may have to go). Unfortunately, this is still far from enough.
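To make the k-anonymity check concrete, here is a minimal sketch in Python. It is only an illustration: the column names, the toy fleet, and the helper k_anonymity_violations are hypothetical and not taken from any real dataset or tool. It counts how many cars share each combination of quasi-identifiers and reports the combinations shared by fewer than k cars.

```python
from collections import Counter


def k_anonymity_violations(records, quasi_identifiers, k):
    """Return the quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical toy fleet: every column name and value is made up.
cars = [
    {"make": "A", "color": "grey", "zip": "75011"},
    {"make": "A", "color": "grey", "zip": "75011"},
    {"make": "A", "color": "grey", "zip": "75011"},
    {"make": "B", "color": "orange", "zip": "75011"},
]

violations = k_anonymity_violations(cars, ["make", "color", "zip"], k=3)
print(violations)  # only the orange car of make B is alone in its group
```

Dropping columns until the result is empty is what forces the loss of granularity described above: here, keeping only the zip code would satisfy k = 3, at the cost of losing make and color.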

The next obvious pitfall is trip details: they trivially reveal the home address of the car owner, probably their work address too. From there, re-identification is a piece of cake.

The data custodian may decide to remove the beginning and the end of each trip, which makes re-identification a bit harder. Yet many distinctive locations in the middle of a trip can still reveal who is behind the trace (e.g., dropping off kids at school or stopping by the bakery on the way back from work). This is unfortunate, because home and work locations are extremely valuable for designing the mobility master plan of a city, and they are now out of reach with our still-not-so-anonymous data.

To continue in the same direction, one may decide to unlink all the trips, reduce the spatio-temporal precision of each data point, or apply any other alteration technique to make the source dataset more innocuous. Unfortunately, none of those methods is bulletproof, as they are mostly based on intuition. Theory tells us that, with sufficient external knowledge, re-identification remains possible.
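As an illustration of what reducing spatio-temporal precision can look like, here is a small Python sketch. The function coarsen_point and its default settings are assumptions chosen for the example, not a recommended configuration: coordinates are rounded to two decimals (roughly a kilometer in latitude) and timestamps are truncated to the hour.

```python
from datetime import datetime


def coarsen_point(lat, lon, timestamp, decimals=2):
    """Round the coordinates and truncate the timestamp to the hour."""
    return (
        round(lat, decimals),
        round(lon, decimals),
        timestamp.replace(minute=0, second=0, microsecond=0),
    )


# Hypothetical GPS point, made up for the example.
point = coarsen_point(48.85837, 2.29448, datetime(2023, 5, 4, 8, 17, 42))
print(point)
```

The choice of two decimals and one hour is exactly the kind of intuition-based setting discussed above: it makes each point less precise, but it comes with no guarantee about what a sequence of such points reveals.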

Even if a trip is reduced to one point per zip code visited and per day, there will almost certainly still be combinations of zip codes that only one car has visited. Knowing that a car went to a few zip codes would then be enough to re-identify it. And this information may even be public on social media already!
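The effect is easy to reproduce on toy data. The Python sketch below uses made-up car identifiers and zip codes, and a hypothetical helper named candidates. It lists the cars compatible with what an outsider knows, and shows how each additional known zip code narrows the list.

```python
from collections import defaultdict


def candidates(visits, known_zip_codes):
    """Return the cars whose visited zip codes include all the known ones."""
    zips_by_car = defaultdict(set)
    for car_id, zip_code in visits:
        zips_by_car[car_id].add(zip_code)
    known = set(known_zip_codes)
    return sorted(car for car, zips in zips_by_car.items() if known <= zips)


# Hypothetical coarsened dataset: one (car, zip code) pair per visit.
visits = [
    ("car_1", "75011"), ("car_1", "92100"), ("car_1", "78000"),
    ("car_2", "75011"), ("car_2", "92100"), ("car_2", "94300"),
    ("car_3", "75011"), ("car_3", "69001"),
]

print(candidates(visits, ["75011"]))                    # ['car_1', 'car_2', 'car_3']
print(candidates(visits, ["75011", "92100"]))           # ['car_1', 'car_2']
print(candidates(visits, ["75011", "92100", "78000"]))  # ['car_1']
```

With three known zip codes, only one candidate remains, even though no precise location or timestamp was ever used.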

Those practices certainly don’t provide mathematical anonymity. The resulting data may still be considered anonymous under a specific data protection regulation. For this, the data custodian has to demonstrate that such external knowledge will never be available to someone having access to the dataset. This is certainly fallible and extremely tedious. Worst of all, a lot of the data has been lost along the way!

Just sharing anonymous insights is a lot easier!

It can be interesting to know how many cars went from A to B on consecutive days. There is no way to build an anonymous dataset for every possible location and every car, but the count itself is most likely fine to share. Can’t we just skip the hard part?

A powerful way to avoid this pitfall is to never share the data with anyone and focus on sharing anonymous insights. This turns out to be both a lot safer and more valuable: the full potential of the dataset can still be extracted!

Sarus offers a privacy-preserving layer that allows analysts to execute queries on data that they don’t have access to. All results are guaranteed to be anonymous with mathematical protection: Differential Privacy. Source data remains within the data owner’s IT environment where all processing occurs, reducing data breach risks.

This allows for the retrieval of insights equivalent to those from raw data, without the risk of re-identification, opening up broad possibilities for collaboration with external parties while ensuring utmost data protection.
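To give an intuition of how a differentially private answer can be produced, here is a textbook sketch of the Laplace mechanism in Python, applied to a simplified version of the count mentioned earlier (cars that went from A to B). This is not Sarus's API or implementation. The function names, the trip format, and the value of epsilon are assumptions made for the example, and a production system would need a hardened noise sampler and privacy budget accounting.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count_cars(trips, origin, destination, epsilon, rng=None):
    """Noisy count of distinct cars with at least one origin -> destination trip."""
    if rng is None:
        rng = random.Random()
    cars = {car for car, start, end in trips if (start, end) == (origin, destination)}
    # Adding or removing one car changes the count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return len(cars) + laplace_noise(1 / epsilon, rng)


# Hypothetical trips: (car, origin zone, destination zone).
trips = [
    ("car_1", "A", "B"), ("car_1", "A", "B"),
    ("car_2", "A", "B"),
    ("car_3", "A", "C"),
]

noisy = dp_count_cars(trips, "A", "B", epsilon=1.0, rng=random.Random(0))
print(noisy)  # the true count is 2; the published value is 2 plus random noise
```

The analyst only ever sees the noisy count. Because each car contributes at most 1 to the true count, the noise masks the presence or absence of any single car, while counts over many cars remain accurate in relative terms.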

Benefits of Using Sarus for Car Data

Ethical Data Utilization: privacy considerations do not hinder data utilization, enabling ethical data practices and strengthening customer trust.
Enhanced Data Partnerships: sharing insights instead of raw data facilitates new business models and partnerships.
On-demand Insights: data scientists can generate real-time, fine-grained insights.

Conclusion

Anonymizing connected car data is an almost impossible problem, and it gets in the way of tapping the full potential of a valuable resource. Instead of a high-risk, low-reward approach, data owners can opt to share only anonymous insights, on demand. Sarus’ privacy layer enables just this, while making the experience seamless for all parties.

Sarus is redefining how car data can be used for analytics and AI by ensuring that privacy and security are never compromised. With Sarus, companies unlock the full potential of their data, create new revenue streams, and maintain the trust of their users.

About the author

Maxime Agostini
