A GDPR-Compliant Method to Identify API Clients

When the General Data Protection Regulation (GDPR) became enforceable in May 2018, many analytics practices became questionable. For example, storing IP addresses for an extended period without user consent now seems out of the question, since an IP address is considered “personal data.” Such data is tricky to store if you can’t request permission when users first use your APIs. So, how can you track different API clients in a GDPR-compliant way and thereby honor your end users’ privacy?

Personal Data

To propose a solution, we must first look into what GDPR defines as “personal data.” Article 4 defines “personal data” as:

“personal data means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person”

This definition presents a problem for storing IP addresses. Most of the time, an IP address can be considered to identify a natural person, even when a company owns the address. You can’t know upfront who is behind an IP address; somebody could be working from home and initiating the API request from there, in which case the IP address identifies a natural person.

IP Addresses as Personal Data in APIs

Most IP addresses stored in logs and analytics tools are used to distinguish different API clients from each other. Let’s say you own a time-tracking Software-as-a-Service business and you provide an API for mobile app builders. Somebody builds a great app that communicates with the API, which helps your product take off. Then somebody else comes along and develops a different app, which becomes even more popular. However, it contains a bug that causes far too many API requests, resulting in scalability problems and server-side errors. Previously, you might have used IP addresses to track down users and discover which client they are using, but with the GDPR in place, you’re probably not allowed to do so without the user’s consent.

A Possible Solution

So how can we distinguish the different clients in the time-tracking company example above in a GDPR-compliant way? Well, why do we want to distinguish them in the first place? To identify the actual app or client the end user is running, so that we can contact the author of the app in case of problems. There is an existing technology for that already: the HTTP User-Agent header:

“The User-Agent request header is a characteristic string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent.”

To identify the API client implementation without damaging the end user’s privacy, you can require all API clients to send a unique User-Agent header. Because such a header describes the application rather than the person using it, you can then store the logs or your analytics data for as long as you want, and there is no need for user consent either. It might also be wise to require a version number in the header so that it is possible to distinguish the fixed clients from the old ones.

So where we would previously accept:

User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 14_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1

We now require something like this:

User-Agent: Time Tracking App by Tom (1.3.1)
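To make that requirement enforceable, the API could reject requests whose header doesn't follow the agreed shape. The snippet below is a minimal sketch in Python, not a definitive implementation. The exact format ("name (major.minor.patch)"), the allowed characters, and the names `CLIENT_UA`, `ClientInfo`, and `parse_client_user_agent` are my own assumptions for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed convention for this example: "<client name> (<major>.<minor>.<patch>)".
# The name may contain letters, digits, spaces, dots, underscores, and hyphens.
CLIENT_UA = re.compile(
    r"^(?P<name>[A-Za-z0-9][A-Za-z0-9 ._-]{2,63}) \((?P<version>\d+\.\d+\.\d+)\)$"
)


class ClientInfo(NamedTuple):
    name: str
    version: str


def parse_client_user_agent(header: Optional[str]) -> Optional[ClientInfo]:
    """Return the client name and version, or None if the header is not acceptable."""
    if not header:
        return None
    match = CLIENT_UA.match(header.strip())
    if match is None:
        # Browser-style or library-default headers end up here.
        return None
    return ClientInfo(match.group("name"), match.group("version"))
```

A request handler could answer with an HTTP 400 and a short explanation whenever this function returns None, so that app authors notice the requirement during development.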

If you replace the IP address in your access logs or your analytics products with the User-Agent header and there is no other personal data left, then you’ll be able to store those logs and that analytics data indefinitely. You’ll be able to compare longer periods of data than you previously could, and you can even start sending this data off to others for analysis without being afraid of a data leak or violating GDPR.
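As an illustration of what such a scrubbed log could look like, here is a rough Python sketch. It assumes your server writes the common "combined" access log format; the names `LOG_LINE`, `anonymize_line`, and `requests_per_client` are made up for this example. It blanks the client address and the user name fields and then counts requests per User-Agent. Note that it does not inspect the request URL or the referrer, which may still contain personal data in your setup.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed layout (combined log format):
# host ident authuser [time] "request" status bytes "referer" "user-agent"
# Simplification: quoted fields are assumed not to contain escaped quotes.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'(?P<rest>\[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)")$'
)


def anonymize_line(line: str) -> str:
    """Replace the client address and user name fields with '-'."""
    match = LOG_LINE.match(line.rstrip("\n"))
    if match is None:
        # Fail closed: never pass an unrecognized line through unscrubbed.
        raise ValueError("unrecognized log line")
    return f"- - - {match.group('rest')}"


def requests_per_client(lines: Iterable[str]) -> Counter:
    """Count requests per User-Agent value; works on raw or scrubbed lines."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is not None:
            counts[match.group("agent")] += 1
    return counts
```

Running `requests_per_client` over a day of scrubbed logs would show which client, and which version of it, is responsible for a spike in traffic, which is all the buggy-app scenario above really needs.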

Identifying Abusive Clients

For some API providers, this solution has one drawback compared to IP address storage. The User-Agent header can be manipulated more easily than an IP address, so it cannot be relied on to identify and block abusive users. If your API requires identification, then this shouldn’t be a problem: you’ll be able to rate limit and block based on user credentials. An API without identification will still need to capture and store IP addresses to implement rate limits and blocking, albeit for a shorter time. However, blocking based on an IP address seems like a failing solution as well, with the increase of IPv6 adoption and the possibility for everybody to allocate and use complete blocks of IP addresses…
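To show what "a shorter time" could mean in practice, here is a small sliding-window rate limiter sketched in Python. It is a simplified, single-process example under my own assumptions: the class name `SlidingWindowLimiter`, its parameters, and the in-memory storage are all hypothetical. The key can be a user credential or, for an API without identification, an IP address that is forgotten as soon as it falls outside the window.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per key within window_seconds, in memory only."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def _expire(self, hits: Deque[float], now: float) -> None:
        # Drop timestamps that have left the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()

    def allow(self, key: str) -> bool:
        """Record a request for key and report whether it is within the limit."""
        now = self.clock()
        hits = self._hits[key]
        self._expire(hits, now)
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

    def purge(self) -> None:
        """Forget keys with no recent requests, so idle addresses are not retained."""
        now = self.clock()
        for key in list(self._hits):
            self._expire(self._hits[key], now)
            if not self._hits[key]:
                del self._hits[key]
```

Calling `purge()` periodically keeps the retention time of an address close to the window length. Nothing is written to disk here, so the long-lived logs can remain free of IP addresses.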

Conclusion

It seems impossible to find a watertight solution, but GDPR requires us to modify existing solutions and start honoring end-user privacy, so we’ll have to settle for less. Using the User-Agent header seems like a good enough solution for distinguishing API clients, and I’ll surely start requiring good User-Agent headers for the APIs I provide. If you have solved this problem differently, be sure to comment below, as I’m always interested in solutions for problems that require some out-of-the-box thinking.