How To Find And Protect Sensitive Data In APIs

Data is the primary fuel that powers the modern internet. APIs require data to communicate and to deliver the benefits we have come to expect from connected services. However, this data is not just simple 1s and 0s. It often represents the identity of individuals and groups, reflecting their desires, qualities, and personal information.

Securing sensitive data is, accordingly, as much a business imperative as a moral one. But how can organizations find and protect this sensitive data, especially in an API-driven environment where data is as good as gold?

What is PII and Sensitive Data?

When we discuss finding and protecting sensitive data, this data falls into two broad categories: personally identifiable information (PII) and general sensitive data. What do these terms actually mean, and when are they applicable?

PII has been defined differently by many organizations and regulatory bodies. In the United States, the National Institute of Standards and Technology (NIST) defines PII as follows:

“Personally Identifiable Information: Any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means. (NIST SP 800-79-2)”

In Europe, PII can be broadly defined under GDPR as “personal data”:

“‘Personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person…”

Sensitive data, on the other hand, is less well-defined. While GDPR has a category of sensitive personal data, sensitive data in the context of APIs has no comparably precise definition. Broadly speaking, sensitive data covers a wider range of items than PII, and it includes PII itself.

UpGuard provides a straightforward definition of sensitive data:

“Sensitive data is confidential information that must be kept safe and out of reach from all outsiders unless they have permission to access it.”

Accordingly, it may be easiest to think about these two terms as a balance of specificity. PII is a well-defined group of data points, whereas sensitive data is a catch-all term for all data that should be protected from external access (including PII itself). In essence, sensitive data is any piece of data we want to keep private. If an API owner is uncomfortable printing the data on a posterboard and posting it on their lawn, then it’s sensitive data.

Regulatory Considerations

An important point is that protecting this data isn’t just a nice thing to do — in many cases, it’s the law. While some data points must obviously be protected, such as credit card payment information under PCI DSS, other data, such as someone’s gender, identity, location, and so forth, fall under different regulations that may not be as transparent at first glance.

In the European Union, the General Data Protection Regulation (GDPR) is the regulatory document governing this data. Not only is it very specific about what constitutes PII and sensitive data, but it also comes with enforcement mechanisms that are quite stark. For severe violations, fines can reach 4% of global annual turnover or 20 million euros, whichever is higher. For less severe infractions, fines can reach 2% of global annual turnover or 10 million euros, again whichever is higher.
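
To make the "whichever is higher" rule concrete, here is a minimal sketch of the arithmetic in Python. The function name and the tier labels are hypothetical and chosen only for illustration; the result is the ceiling of a possible fine, not a prediction of what a regulator would actually impose.

```python
# Illustrative only: upper bounds on GDPR fines as described above.
# The tier labels and function name are hypothetical, not legal terminology.

SEVERE = "severe"
BASIC = "basic"

# (percent of global annual turnover, fixed amount in euros) for each tier
FINE_TIERS = {
    SEVERE: (4, 20_000_000),
    BASIC: (2, 10_000_000),
}


def max_gdpr_fine_eur(global_annual_turnover_eur: int, tier: str) -> int:
    """Return the fine ceiling: the higher of the turnover share and the fixed amount."""
    percent, fixed_amount = FINE_TIERS[tier]
    turnover_based = global_annual_turnover_eur * percent // 100
    return max(turnover_based, fixed_amount)


# A company with 1 billion euros in turnover: 4% (40 million) exceeds 20 million.
print(max_gdpr_fine_eur(1_000_000_000, SEVERE))  # 40000000
# A company with 100 million euros in turnover: 4% is only 4 million, so 20 million applies.
print(max_gdpr_fine_eur(100_000_000, SEVERE))  # 20000000
```

Note that the fixed amount acts as a floor on the ceiling, so smaller organizations are not shielded by low turnover.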

The United States is markedly behind the curve regarding privacy regulations, but even it has some coverage and measures to enforce these standards. In California, the California Consumer Privacy Act ensures some coverage of PII data. Federally, sensitive data such as healthcare information is covered under the Health Insurance Portability and Accountability Act, with significant fines for breach of privacy.

Simply put, poorly securing personal information and sensitive data can have enormous regulatory and fiscal impacts. But that’s not the only thing organizations should consider.


Brand Trust and Safety

Even where data is not protected by regulation, organizations should consider the impacts on their brand and the safety of their platform. APIs that collect data on the edge of personal information, such as inferred economic status or identity information, nonetheless collect data that, if leaked, could leave a mark on their corporate record.

We’ve seen this before — stories of companies going out of business following a data breach are a dime a dozen. And even if they manage to stay in business, they often do so with a significant loss of customers and revenue.

If users can’t trust an organization to secure their data, they are less likely to give that data to the organization. This is a death knell for companies whose business is selling data. It could also hinder algorithmic content, internal advertising, or user support. After all, how do you support a user who refuses to give you their email, username, location, or any other information, all because they rightly don’t trust you to secure it?

Finding Sensitive Data in APIs: APIs are Leaky

With this in mind, it’s helpful to remember that APIs are verbose. Developers design APIs to connect systems and exchange information, and many of the most high-profile data breaches came not from outright illicit internal activity, but from basic negligence or misunderstanding of configurations. Something as simple as a misconfigured data storage solution could result in hundreds of millions of exposed records, undermining the reputation of your organization and exposing the private information of users worldwide.

Accordingly, half the battle is finding out what sensitive data is being exposed in the first place.

Automated Scanning and Discovery

With the broad adoption of LLM-driven AI security offerings, it has never been easier to scan and discover vulnerabilities. Vendors such as Salt Security offer automated tools for detecting exposed endpoints and weaknesses in your security posture.

It’s important to note that this process depends on open documentation from the developer and a willingness to find fault. Simply obfuscating an endpoint and considering it “secure” is not good enough. If your data exists, there’s a way to get to it, which may not be as clear as you think. Accordingly, complete testing is required to get a holistic view of the security posture.

Misconfiguration is a huge cause of data exposure in the wild, so doing your due diligence with internal automated scanning and discovery is going to pay huge and immediate dividends.

Once you have completed the internal review, you should look at what data is accessible externally. The best way to ensure proper security from this posture is to dive into the various endpoints on offer, enumerating all endpoints and scanning them for vulnerabilities.

These vulnerabilities might be obvious, such as misconfigured or absent security. However, in some cases, they may be less obvious, such as simple broken access control or privilege escalation. Enumerating the endpoints and scanning them for common vulnerabilities can help secure your posture, but it requires using a trusted partner.
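
As a rough illustration of what one step of such a scan might look like, the sketch below walks an already-decoded JSON response and flags values or field names that look sensitive. The patterns, the key names, and the sample response are all hypothetical and deliberately simple; commercial scanners use far richer rules, and this is not how any particular product works.

```python
import re
from typing import Any, Iterator, Tuple

# Hypothetical, deliberately simple value patterns.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Hypothetical field names that suggest sensitive content whatever the value is.
SUSPECT_KEYS = {"password", "ssn", "date_of_birth", "card_number"}


def find_sensitive(payload: Any, path: str = "$") -> Iterator[Tuple[str, str]]:
    """Walk a decoded JSON payload and yield (location, reason) findings."""
    if isinstance(payload, dict):
        for key, value in payload.items():
            child = f"{path}.{key}"
            if key.lower() in SUSPECT_KEYS:
                yield child, f"suspect key '{key}'"
            yield from find_sensitive(value, child)
    elif isinstance(payload, list):
        for index, item in enumerate(payload):
            yield from find_sensitive(item, f"{path}[{index}]")
    elif isinstance(payload, str):
        for name, pattern in PATTERNS.items():
            if pattern.search(payload):
                yield path, f"value matches {name} pattern"


# Made-up response body from an imaginary endpoint.
sample_response = {
    "id": 7,
    "profile": {"contact": "ada@example.com", "ssn": "123-45-6789"},
    "tags": ["beta"],
}

for location, reason in find_sensitive(sample_response):
    print(location, "->", reason)
```

Running a check like this against every enumerated endpoint gives a first inventory of where sensitive values leak into responses. It does not replace testing for access control flaws, which depend on who is asking rather than on what the payload contains.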

Data Classification

A big part of the privacy and security battle is knowing what data you are collecting and classifying it appropriately. Developers should know from day one what data they are collecting, and if this process is being applied to an extant product, a full review and audit of data collection is worth doing.

This data must then be classified and given consideration as to how it’s handled. Some data is not necessarily personally identifiable. For instance, you can’t reasonably infer something about a user based on a timestamp, except patterns of use that suggest a regional locale. However, this data may become PII in concert with other data. Accordingly, all data is worth securing with reasonable accommodation. Unless the data needs to be externally accessed, it’s better to treat it all as needing to be secured.

That being said, some data, such as financial or healthcare data, is clearly covered under more stringent security enforcement and should be segmented from other data and provided a higher level of scrutiny. In addition, it may require documentation that clarifies the security handling for regulatory purposes.

Consider all data being collected and sort it into categories of privileged access. Ensure that only what is necessary for public consumption is actually made public, and secure the rest appropriately!
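
One way to picture this sorting is a simple field inventory. The sketch below is a minimal example under stated assumptions: the sensitivity levels, the field names, and the helper functions are all hypothetical, and a real inventory would come from your own audit of what the API collects.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical classification levels, from least to most restricted."""
    PUBLIC = 0
    INTERNAL = 1
    PII = 2
    REGULATED = 3  # for example, financial or healthcare data


# Hypothetical inventory mapping each collected field to a level.
FIELD_CLASSIFICATION = {
    "display_name": Sensitivity.PUBLIC,
    "last_login": Sensitivity.INTERNAL,
    "email": Sensitivity.PII,
    "diagnosis_code": Sensitivity.REGULATED,
}


def classify(field: str) -> Sensitivity:
    """Treat any field missing from the inventory as the most restricted level."""
    return FIELD_CLASSIFICATION.get(field, Sensitivity.REGULATED)


def public_view(record: dict) -> dict:
    """Keep only the fields explicitly classified as public."""
    return {
        key: value
        for key, value in record.items()
        if classify(key) == Sensitivity.PUBLIC
    }


record = {"display_name": "Ada", "email": "ada@example.com", "shoe_size": 38}
print(public_view(record))  # {'display_name': 'Ada'}
```

Defaulting unknown fields to the most restricted level means a newly added field stays private until someone deliberately classifies it.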

Protecting Data

Implementing Proper Authentication and Authorization

Proper authentication and authorization are major factors in appropriate data security. In essence, authentication ensures that whoever is accessing the system is who they say they are, and authorization ensures they have the right to access what they want.

It is important to note that authentication and authorization are only as good as the underlying access model. If you have not correctly defined which roles can access which data, then a strong authentication and authorization scheme is like a paper lock: the ultimate illusion of security.

Once you have a proper security plan and the systems to implement it, you must ensure that this plan is adhered to in practice. So often, an admin account or privileged exploit is used to undermine core security. To avoid this, adhering to a security plan will require consistent security audits, rotating roles, role-based access control rooted in security-first principles, and more.
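
The split between the two checks can be sketched in a few lines. This is a minimal, hypothetical role-based access control example: the role names, resources, and the Principal type are assumptions made for illustration, and the authenticated flag stands in for whatever real identity verification your platform performs.

```python
from dataclasses import dataclass

# Hypothetical role definitions: each role maps to the (resource, action) pairs it may perform.
ROLE_PERMISSIONS = {
    "support_agent": {("profile", "read")},
    "billing_admin": {
        ("profile", "read"),
        ("payment_method", "read"),
        ("payment_method", "update"),
    },
}


@dataclass(frozen=True)
class Principal:
    """A caller whose identity has (or has not) been verified upstream."""
    user_id: str
    roles: frozenset
    authenticated: bool


def is_allowed(principal: Principal, resource: str, action: str) -> bool:
    """Deny by default: allow only authenticated callers holding a matching role."""
    if not principal.authenticated:
        return False  # authentication failed, so authorization is never considered
    return any(
        (resource, action) in ROLE_PERMISSIONS.get(role, set())
        for role in principal.roles
    )


agent = Principal("u-1", frozenset({"support_agent"}), authenticated=True)
print(is_allowed(agent, "profile", "read"))  # True
print(is_allowed(agent, "payment_method", "read"))  # False
```

Because unknown roles and unlisted actions fall through to a denial, a gap in the role definitions fails closed rather than open.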

Utilize Adequate Encryption

Even if you manage to secure the influx of requests coming into your system, you must still protect against simple data exfiltration. The reality is that data in transit can be seen any time it goes over visible transmission lines, such as across the internet. What’s more, data at rest could, in theory, be stolen should anyone get physical access to the hard drive or remote access to the server cluster.

Accordingly, you need to ensure this can’t happen. The best way to do this is to deploy encryption. Encryption is, in its most basic form, a method for scrambling data so that it only makes sense if you have a specific piece of information (the “key”) to unscramble it.

Organizations must be aware of two kinds of encryption. The first is encryption in transit. This kind of encryption allows senders and receivers of data to scramble and unscramble the data so that if anything is captured “in the middle,” it is rendered unusable. This helps to secure the data in transit, but once it’s stored, you must also encrypt it at rest. This ensures that, even if the data is exfiltrated, it will be useless to the thief due to the sheer economic and resource cost of cracking the encryption (or at least useless long enough to rotate passwords and change other exposed data).
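
For the in-transit half, the sketch below shows one way a Python client could insist on a verified, reasonably modern TLS connection using only the standard library. The function name is hypothetical, and the choice of TLS 1.2 as the minimum is an assumption made for this example rather than a requirement stated above. Encryption at rest is not shown, since it usually depends on the storage layer or a vetted third-party cryptography library.

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that verifies the server and rejects old protocols."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Assumption for this sketch: refuse anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = strict_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)  # True
```

A context like this can be passed to standard-library clients such as `http.client.HTTPSConnection` through its `context` argument.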


Choose Not to Collect

One major strategy for protecting data seems almost farcically simple: just don’t collect it in the first place. Organizations commonly collect massive amounts of data for potential use in the future, yet much of this data is not necessarily helpful, well-structured, or even potentially profitable.

To that end, the best strategy is to not collect anything except when necessary. Reducing the amount of data collected minimizes the amount of data that is potentially exposed. It also decreases the processing and encryption that must be done in the first place.
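
In code, this can be as simple as an allowlist applied at the point of ingestion. The sketch below is a hypothetical example: the field names and the function are assumptions, and the right allowlist depends entirely on what your API genuinely needs.

```python
# Hypothetical allowlist: the only fields this imaginary signup endpoint needs.
REQUIRED_FIELDS = frozenset({"email", "display_name"})


def minimize(submitted: dict) -> dict:
    """Drop every submitted field that is not explicitly required before storing it."""
    return {key: value for key, value in submitted.items() if key in REQUIRED_FIELDS}


submitted = {
    "email": "ada@example.com",
    "display_name": "Ada",
    "date_of_birth": "1990-01-01",
    "device_fingerprint": "abc123",
}
print(minimize(submitted))  # {'email': 'ada@example.com', 'display_name': 'Ada'}
```

Anything filtered out here never reaches storage, so it never needs to be classified, encrypted, or disclosed in a breach.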

For years, the tech industry has viewed data as being as good as gold. The reality is that encryption and other security efforts come with a cost. Holding onto data just because it “might” be worth something means absorbing cost and risk for an upside that may never be realized, especially considering that the data in question belongs to the users, not the developer.

Protecting Sensitive Data in APIs

Finding and securing PII and sensitive data is an incredibly important part of developing an API. With a few simple considerations and processes, any organization can adopt a better security posture, resulting in better user experience, higher organizational trust, and better outcomes at scale and over time.

What do you think of these considerations? What more can be done to find and protect sensitive data in APIs? Let us know in the comments below!