Analyzing Trends Across 200,000 OpenAPI Files

In today’s microservices climate, it’s no surprise that APIs remain central to software development. The continued rise of the OpenAPI standard (and the broader OpenAPI Initiative) has played an integral role in fostering vendor-agnostic API development and descriptions. Describing an API is key to documenting its underlying mechanisms. To do this, developers commonly turn to OpenAPI files, which contain information about endpoints, endpoint operations, parameters, authentication, and even licensing. These specifications are written in YAML or JSON.
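To make that description concrete, here is a minimal sketch of what such a file might contain, built as a Python dictionary and serialized to JSON. The API title, path, parameter, and key name are invented for illustration; only the field names (swagger, info, paths, securityDefinitions) come from the OpenAPI 2.0 format.

```python
import json

# Hypothetical OpenAPI 2.0 (Swagger) description; every value is illustrative.
minimal_spec = {
    "swagger": "2.0",
    "info": {
        "title": "Pet Store (example)",
        "version": "1.0.0",
        "license": {"name": "MIT"},
    },
    "paths": {
        "/pets": {
            "get": {
                "parameters": [
                    {"name": "limit", "in": "query", "type": "integer"}
                ],
                "responses": {"200": {"description": "A list of pets"}},
            }
        }
    },
    "securityDefinitions": {
        "api_key": {"type": "apiKey", "name": "X-API-Key", "in": "header"}
    },
}

# The same structure could equally be written out as YAML.
spec_json = json.dumps(minimal_spec, indent=2)
print(spec_json)
```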

We know that developers create these files out of necessity, but how are they doing it? An Australian information-security company named Assetnote posed the same question, then scanned the web and unearthed nearly 200,000 public OpenAPI files. Its goal was to uncover any sweeping API trends while exploring the relationship between specifications, API functionality, and the methods, parameters, headers, and values that make APIs interactive. Postman soon added its own analysis based on these findings. What did Assetnote and Postman determine? Follow along as we summarize the research below.

The TLDR

After discovering Swagger files on the web through various sources, brute-force scanning, and Assetnote’s Kiterunner tool, the team concluded that few developers faithfully follow the OpenAPI specification. Many resorted to ad-hoc coding methods and descriptions.

Both Assetnote’s and Postman’s analyses centered on OpenAPI 2.0 definitions, which admittedly skews some statistics, such as those on callbacks and links (features introduced in OpenAPI 3.0). Interestingly, 79% of API definitions were valid, meaning they accurately described the API’s role, function calls, data structures, standards of interaction, and resource usage. The numbers for newer OpenAPI 3.0 files remain unclear.

APIs are getting more complex. On average, each API has 37 paths and 51 endpoints, along with 38 distinct query parameters and 33 defined schemas. Most of an API’s surface is devoted to requesting resources from the server: GET was the most popular HTTP method by a large margin, followed by POST. Most APIs originated in Western Europe (for reasons that are unclear) and North America. Many more APIs exist that are difficult to access. On SwaggerHub alone, Assetnote could collect only about 2.3% of existing spec files due, ironically, to API fetching limitations.
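For a sense of how figures like these can be derived from a single file, here is a rough Python sketch that counts paths, endpoints, query parameters, schemas, and HTTP methods in one parsed OpenAPI 2.0 document. It assumes an "endpoint" means a path and method pair and that schemas live under the top-level definitions key; the function name and sample document are hypothetical, and this is not the script Postman used.

```python
from collections import Counter

# HTTP methods that OpenAPI 2.0 allows as keys of a path item.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch"}

def _as_list(value):
    # Malformed files may put anything here; treat non-lists as empty.
    return value if isinstance(value, list) else []

def summarize(spec):
    """Return simple counts for one parsed OpenAPI 2.0 document (a dict)."""
    paths = spec.get("paths")
    if not isinstance(paths, dict):
        paths = {}
    definitions = spec.get("definitions")
    methods = Counter()
    query_params = set()
    for path_item in paths.values():
        if not isinstance(path_item, dict):
            continue
        # Parameters declared on the path apply to each operation under it.
        shared = _as_list(path_item.get("parameters"))
        for key, operation in path_item.items():
            if key.lower() not in HTTP_METHODS or not isinstance(operation, dict):
                continue
            methods[key.lower()] += 1
            for param in shared + _as_list(operation.get("parameters")):
                if isinstance(param, dict) and param.get("in") == "query":
                    query_params.add(param.get("name"))
    return {
        "paths": len(paths),
        "endpoints": sum(methods.values()),  # assumed: path + method pairs
        "query_params": len(query_params),
        "schemas": len(definitions) if isinstance(definitions, dict) else 0,
        "methods": methods,
    }

# Hypothetical sample document.
example = {
    "paths": {
        "/pets": {
            "get": {"parameters": [{"name": "limit", "in": "query"}]},
            "post": {},
        },
        "/pets/{id}": {"get": {}},
    },
    "definitions": {"Pet": {}},
}
stats = summarize(example)
print(stats)
```

Averaging these per-file counts over a whole corpus would give numbers comparable in kind to those reported above.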

Some security methods, where any are employed at all, clearly dominate. Most APIs tend to favor basic authentication URLs or LDAP user registries to authenticate users, and apiKey methods are similarly popular. The third most popular option is providing no security at all, which is somewhat troubling. While Postman didn’t share percentages for each practice, Salt Security’s 2021 State of API Security report found that 27% of organizations have no API security strategy in place.
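A ranking like this could be produced by tallying the security scheme types each file declares. The sketch below assumes OpenAPI 2.0 documents, where schemes sit under securityDefinitions with a type of basic, apiKey, or oauth2; it treats a file with no declared schemes as "none". The function names and sample data are made up for illustration.

```python
from collections import Counter

def security_types(spec):
    """Return the set of security scheme types one OpenAPI 2.0 document declares."""
    schemes = spec.get("securityDefinitions")
    if not isinstance(schemes, dict):
        return {"none"}
    found = {
        scheme["type"]
        for scheme in schemes.values()
        if isinstance(scheme, dict) and isinstance(scheme.get("type"), str)
    }
    # A file that declares nothing usable is counted as having no security.
    return found or {"none"}

def tally_security(specs):
    """Count how many documents declare each security scheme type."""
    counts = Counter()
    for spec in specs:
        counts.update(security_types(spec))
    return counts

# Hypothetical corpus of four parsed documents.
corpus = [
    {"securityDefinitions": {"a": {"type": "basic"}}},
    {"securityDefinitions": {"k": {"type": "apiKey", "name": "X", "in": "header"}}},
    {},
    {"securityDefinitions": {"a": {"type": "basic"}, "b": {"type": "apiKey"}}},
]
security_counts = tally_security(corpus)
print(security_counts.most_common())
```

Note that a file declaring two scheme types is counted once under each, so the totals can exceed the number of files.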

That’s the gist of things. How did Postman and Assetnote conduct their research?

Research and Analysis Practices

Assetnote gathered its data using a variety of sources and tools, starting with the BigQuery data warehouse. The team pulled from BigQuery’s public datasets, and specifically GitHub’s subset, to find up-to-date Swagger files. These were extracted from datasets containing many terabytes of data.

From there, 11,000 Swagger files were pulled simply by matching on likely file paths. Next, Assetnote’s researchers obtained approximately 3,000 OpenAPI files from the Wikipedia-style APIs.guru directory, which a public REST API made possible. SwaggerHub then accounted for an additional 10,000. Assetnote also set up its own web server to assist in scanning the internet, a venture that amassed almost 45,000 additional files using a custom Zgrab2 HTTP module. Finally, tooling-backed brute-forcing helped uncover the remainder of Assetnote’s 200,000-file haul.

Assetnote made this information available as two public tarballs. This is where Postman comes in. The API development platform downloaded and inspected the data for notable trends, using an AWS EC2 instance and a Node.js script. A standard JSON parser, two YAML parsers, and a more forgiving home-grown JSON parser handled data ingestion. Postman ignored any HTML pages and 404 pages.
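The general idea of a parser fallback chain can be sketched in a few lines. The version below is Python rather than Postman’s Node.js, covers JSON only (YAML parsing would need a third-party library), and uses an invented notion of "forgiving": it strips a byte-order mark and trailing commas before retrying. None of these details are taken from Postman’s actual script.

```python
import json
import re

def looks_like_html(text):
    """Cheap check for HTML responses, such as error pages saved as specs."""
    head = text.lstrip()[:100].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")

def parse_strict(text):
    return json.loads(text)

def parse_lenient(text):
    # Assumed leniency rules: drop a leading BOM and remove trailing commas.
    # The regex is naive and could also alter a string value containing ",}".
    cleaned = text.lstrip("\ufeff")
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    return json.loads(cleaned)

def ingest(text):
    """Try each parser in turn; return a dict, or None if nothing works."""
    if looks_like_html(text):
        return None
    for parser in (parse_strict, parse_lenient):
        try:
            doc = parser(text)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
        if isinstance(doc, dict):
            return doc
    return None

print(ingest('{"swagger": "2.0",}'))
print(ingest("<html><body>Not Found</body></html>"))
```

Trying the strict parser first keeps well-formed files on the fast, predictable path and reserves the riskier cleanup for files that would otherwise be discarded.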

The results were automatically organized into named columns and rows. Duplicates were removed, and extra work was done to confirm that API definitions were indeed valid. From there, Postman was able to compile a list of statistics.
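As a rough illustration of those two steps, the sketch below removes duplicates by hashing a canonical JSON form of each document and applies a very shallow validity check. The check only looks for the fields OpenAPI 2.0 requires at the top level (swagger, info with title and version, and paths); real validation against the full schema is much stricter. How Postman actually deduplicated and validated is not described here, so the function names and approach are assumptions.

```python
import hashlib
import json

def fingerprint(spec):
    """Hash a canonical serialization so key order does not affect the result."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def is_plausibly_valid(spec):
    """Shallow check for the top-level fields OpenAPI 2.0 requires."""
    info = spec.get("info")
    return (
        spec.get("swagger") == "2.0"
        and isinstance(info, dict)
        and "title" in info
        and "version" in info
        and isinstance(spec.get("paths"), dict)
    )

def dedupe_and_filter(specs):
    """Keep the first copy of each distinct document that passes the check."""
    seen = set()
    kept = []
    for spec in specs:
        digest = fingerprint(spec)
        if digest in seen:
            continue
        seen.add(digest)
        if is_plausibly_valid(spec):
            kept.append(spec)
    return kept

# Hypothetical inputs: two copies of one document and one incomplete file.
first = {"swagger": "2.0", "info": {"title": "A", "version": "1"}, "paths": {}}
copy = {"paths": {}, "info": {"version": "1", "title": "A"}, "swagger": "2.0"}
broken = {"swagger": "2.0"}
result = dedupe_and_filter([first, copy, broken])
print(len(result))
```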

Conclusion

Given that OpenAPI 3.0 files were ignored, it remains to be seen how well developers abide by the official specification in light of any guideline changes. Should the standard grow stricter, descriptions and definitions could become more compliant, although a certain degree of “creativity” seems ingrained in the API community. Whatever the case, an immense amount of data is still out there to be discovered. Trends remain just that, and their fluidity will help determine the trajectory of OpenAPI development to come.