What Data Formats Should My API Support?

When designing, implementing, and maintaining APIs, one of the most important factors to consider is data formats: how the API handles the interaction between data generation and data request. This path between generation and manipulation, typically between server and client, is the crux of the API ecosystem.

Accordingly, the types of data formatting that are supported in an API ecosystem are incredibly important, and could be the difference between an effective and powerful API and an underutilized piece of kit. Adopting the correct format makes an API functional and useful.

Today, we’ll survey the data formats (JSON, XML, YAML, etc.) that APIs commonly deal with. We’ll walk through the differences between them and outline why a certain format may be better than others for your API’s specific situation.

Why Data Format Support is Important

An API is a bridge that connects disparate users, services, and sources to other users, services, and sources. That is the most inclusive definition of an API — regardless of what an API does, how it does it, or why it does it, an API is a bridge between two points. The issue at hand, however, is how that bridge is traversed. The connection to the resources in an API must be rendered in a way that is both usable to the requesting party, and recognizable in a way relevant to the type of data being presented.

This is why data format support choices are so important: an otherwise amazing API, with wonderful architecture, implementation, and marketing strategy, will be dead in the water if the data format support is incorrect. The initial choice of API data format will determine how effective the API is, affecting use rates, the success of routine or specific calls, and the long-term adoption and retention curve over the duration of its lifecycle.

Common Industry Formats

Now we’ll look at the various data format types. We’re going to separate the most common industry languages of the day into four general categories:

  • Direct Data Formats
  • Feed Data Formats
  • Manipulation Data Formats
  • Database Data Formats

As with many attempts at categorization, there is a certain amount of overlap between these distinctions. To make this conversation easier, however, these languages will be grouped by their specialty rather than their broad purpose support.

Direct Data Formats

Direct Data Formats are designed to handle data directly between machines. These languages are often called machine readable, as they tend to be dense and compact. This means they are great for machine-to-machine integration and for manipulation by other APIs.

Direct data formats are best used when additional APIs or services require a data stream from your API in order to function. The three most common formats in this category are JSON, XML, and YAML.

JSON

JSON (JavaScript Object Notation) is a wonderful format when it comes to handling client-side scripting, and is a generally faster counterpoint to other options such as XML. In many ways, it might not be as powerful or widely used as other choices, but it supports many features that make its adoption a powerful choice.

For example, JSON supports a distinction between number, string, and boolean that XML lacks (as seen below in an “age gate” example); as a counterargument, however, XML handles mixed content far better than JSON, especially in the case of mixed node arrays requiring detailed expressions.

{
	"title": "Age Gate",
	"type": "object",
	"properties": {
		"firstName": {
			"type": "string"
		},
		"knownValue": {
			"type": "boolean"
		},
		"age": {
			"description": "Age in years",
			"type": "integer",
			"minimum": 18
		}
	},
	"required": ["firstName", "lastName"]
}
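
The type distinction is easy to see from the consuming side. The following is a minimal Python sketch rather than part of the schema above: the sample payload and the `passes_age_gate` helper are hypothetical, and the check only mirrors the schema's string and integer rules by hand instead of running a real JSON Schema validator.

```python
import json

# Hypothetical payload shaped like the "Age Gate" schema above
payload = '{"firstName": "Ada", "knownValue": true, "age": 21}'
record = json.loads(payload)

# JSON types survive parsing: string -> str, boolean -> bool, integer -> int
parsed_types = {key: type(value).__name__ for key, value in record.items()}


def passes_age_gate(record, minimum_age=18):
    """Hand-rolled check mirroring the schema's string and integer rules."""
    age = record.get("age")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(age, bool) or not isinstance(age, int):
        return False
    return isinstance(record.get("firstName"), str) and age >= minimum_age


print(parsed_types)
print(passes_age_gate(record))
```

Because the parser hands back typed values, an age sent as the string "21" fails the check instead of being silently coerced.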

XML

XML (Extensible Markup Language) likewise has some large advantages. While it is all but ubiquitous, with a large install base, its greatest strength may be its greatest weakness. Its greatest strength? It’s the “kitchen sink” of formats, throwing in everything it possibly can. Its main negative? It’s the “kitchen sink” of formats, throwing in everything it possibly can.

While having all the tools you could ever need at your disposal is a great boon in concept, it makes XML heavy and slow, and with already inefficient handling of certain computations (especially in the realm of XML transformation), developers are often faced with a choice between more functionality with slower speed (XML) or less functionality with more efficiency (JSON).

A mixed-content XML schema can mirror and extend the age gate from the previous JSON example.
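
To make XML’s mixed content and text-only values concrete, here is a minimal Python sketch using the standard library’s `xml.etree.ElementTree`. The `visitor` document, its element names, and the `verified` attribute are all hypothetical, invented for this illustration; it is not a reconstruction of any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical age-gate record; every name here is an assumption
document = """
<visitor verified="true">
  <firstName>Ada</firstName>
  <age>21</age>
  <note>Entered via <source>landing page</source> on first visit.</note>
</visitor>
"""

root = ET.fromstring(document)

# XML carries every value as text, so the reader must convert types itself
age = int(root.findtext("age"))
verified = root.get("verified") == "true"

# Mixed content: plain text and a child element interleaved inside <note>
note = root.find("note")
note_text = "".join(note.itertext())

print(age >= 18, verified, note_text)
```

The `note` element is the kind of mixed content that has no direct JSON equivalent, while the `int(...)` and string comparison show the conversions JSON would have made unnecessary.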

YAML

YAML (YAML Ain’t Markup Language) is “a human friendly data serialization standard for all programming languages.” Where JSON is lightweight with a somewhat lax feature set, and XML is verbose and often a bit cumbersome, YAML is easy to read, lightweight, and generally “middle of the road”. While it’s classified as a direct data format due to how it’s used to parse configuration settings and relational queries, many systems use it as a basic flat database.

Below is a simple age-gate implementation which calls files on the server given certain conditions.

# failover url
url_403: /
#url_403: https://example.org/underage

# snippet definition
snippet_enter: /templates/verified.html.twig
snippet_exit:  /templates/underageexit.html.twig
snippet_403: /templates/validate.html.twig

# minimum age config
min_age: 18

# container variable
width: 300
height: 300

# container bg
overlay: '#ffffff'
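
As a rough illustration of how a server might consume a flat configuration like this, here is a minimal Python sketch. The `parse_flat_config` and `is_old_enough` helpers are hypothetical, and the parser deliberately handles only flat `key: value` lines with full-line comments; it is not a YAML parser, and a real project would normally rely on an established YAML library instead.

```python
def parse_flat_config(text):
    """Parse flat 'key: value' lines (no nesting, lists, or anchors)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        key, _, value = line.partition(":")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]  # quoted scalars stay strings
        elif value.isdigit():
            value = int(value)  # bare digits become integers
        settings[key.strip()] = value
    return settings


config = parse_flat_config("""
# minimum age config
min_age: 18
url_403: /
overlay: '#ffffff'
""")


def is_old_enough(age, settings):
    """Apply the configured minimum age to a visitor's age."""
    return age >= settings["min_age"]


print(config, is_old_enough(21, config))
```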

Technically a superset of JSON, YAML is embraced by many for its readability, the ability to easily reveal hierarchies, and a minimal amount of cruft. It also has a broad base of well known and understood tools — RAML is a YAML-based language for describing APIs, for example. YAML is widely used in gaming servers as well, with services like Bukkit famously using it to unify client modifications with Minecraft servers in order to augment the official API calls into new and wild forms.

Feed Data Formats

Feed Data Formats are an entirely separate beast. While still considered machine-centric, Feed formats are tied more to user utility than to machine usability. These formats are designed to alert users to changes in a codebase or webpage, and to manage assorted modifications to domains and services. Formats in this category include RSS, Atom, and SUP.

Feed Data Formats are typically used to serialize updates from various servers, sites, or front-end interfaces, and alert users to these changes. These changes can be automatically imported and integrated, marked as an update, or modified for usage.

RSS

RSS (Rich Site Summary) is the most widely used Feed Data Format. RSS has become solidified as the “official” feed methodology by WordPress and other blogging platforms. It’s a very simple format to use, but it has significant issues in spite of its simplicity. RSS excludes well-formed XML markup, favoring plain text and escaped HTML, and lacks support for a lot of fields that could make it a far more powerful system than it currently is.

Atom

Atom, on the other hand, was designed to rectify these issues. Built from the ground up to support auto discovery and identification as well as more complex markup and media, Atom is the big brother of RSS. Unfortunately, this integration comes with a cost. While more media and markup is a great thing, it makes the system more prone to intrusion through malicious code, and the system is overall heavier, resulting in a decrease in performance. This reduction in security can possibly be mitigated through a robust internal culture of security, but at a certain point, might be too far a leap to accept for many developers.

Both RSS and Atom are typically handled via simple, industry-standard feed protocols; because of this, implementing RSS and Atom simply comes down to tying the feed URLs and their relevant resources to specific behaviors in the API. Additionally, a number of third-party APIs have made adding these feeds extremely easy. For instance, the Google Feed API (since retired) implemented this with simple JavaScript calls.
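
Consuming either feed type on the client side can be similarly short. The following is a minimal Python sketch using only the standard library; the two sample feeds and the `entry_titles` helper are hypothetical, and a production reader would need to handle many more elements and malformed input.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace; RSS 2.0 elements have none
ATOM_NS = "{http://www.w3.org/2005/Atom}"

RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>First post</title><link>https://example.org/1</link></item>
    <item><title>Second post</title><link>https://example.org/2</link></item>
  </channel>
</rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry><title>First post</title><link href="https://example.org/1"/></entry>
</feed>"""


def entry_titles(feed_xml):
    """Return entry titles from an RSS 2.0 or Atom 1.0 document."""
    root = ET.fromstring(feed_xml)
    if root.tag == "rss":
        return [item.findtext("title") for item in root.iter("item")]
    if root.tag == ATOM_NS + "feed":
        return [
            entry.findtext(ATOM_NS + "title")
            for entry in root.iter(ATOM_NS + "entry")
        ]
    raise ValueError("unrecognized feed format: " + root.tag)


print(entry_titles(RSS_SAMPLE))
print(entry_titles(ATOM_SAMPLE))
```

Dispatching on the root element keeps one code path per format, which is usually enough for an API that only needs to expose or ingest titles and links.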

SUP

SUP (Simple Update Protocol) is the epitome of “jack of all trades, master of none”. While it takes the best of both RSS and Atom, being both faster than Atom and more verbose than RSS, it comes with the negatives of both formats. It’s faster than Atom, but supports fewer media formats. It’s more verbose than RSS, but is not nearly as extensible as Atom. It’s a good “middle of the road” approach, but is really a middling choice averaging speed with usefulness.

{ "updated_time": "2009-04-28T21:29:20Z",
  "since_time": "2009-04-28T21:24:19Z",
  "period": 300,
  "available_periods": {
    "300": "https://gdata.youtube.com/sup?seconds=300",
    "600": "https://gdata.youtube.com/sup?seconds=600",
    "900": "https://gdata.youtube.com/sup?seconds=900"
  },
  "updates": [
    ["159aa827", "6e19"],
    ["9559d1d", "6e19"],
    ["159aa827", "6f22"]
  ]
}

The above example is taken from the official SUP specification guide.
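
A client can use a SUP document like the one above to decide which feeds are worth re-fetching. The Python sketch below is a minimal illustration under stated assumptions: it reads each entry in `updates` as a pair of a feed’s SUP-ID and an update identifier, and the `subscriptions` mapping, its URLs, and the `feeds_to_refetch` helper are hypothetical.

```python
import json

# Trimmed copy of the SUP document above
sup_document = json.loads("""
{
  "updated_time": "2009-04-28T21:29:20Z",
  "since_time": "2009-04-28T21:24:19Z",
  "period": 300,
  "updates": [
    ["159aa827", "6e19"],
    ["9559d1d", "6e19"],
    ["159aa827", "6f22"]
  ]
}
""")

# Hypothetical mapping from SUP-ID to the feed URL a client subscribed to
subscriptions = {
    "159aa827": "https://example.org/feeds/alice.atom",
    "0f3c2a11": "https://example.org/feeds/bob.atom",
}


def feeds_to_refetch(sup_document, subscriptions):
    """Return subscribed feed URLs whose SUP-ID appears in the update list."""
    updated_ids = {sup_id for sup_id, _update_id in sup_document["updates"]}
    return sorted(
        url for sup_id, url in subscriptions.items() if sup_id in updated_ids
    )


print(feeds_to_refetch(sup_document, subscriptions))
```

Only the feeds named in the update list are fetched again, which is the saving SUP offers over polling every feed on a timer.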

Manipulation Data Formats

Manipulation Data Formats should be considered more a “wrapper” than a “format”. These formats are meant to be digested by applications on a client machine, rather than digested by a server or service. In this case, the wrapper is simply a delivery service, not a parsing or translation service. Formats in this category include PDF and KML.

This difference between wrapper and format is paramount to understanding this type. Manipulation Data Formats are transformative in nature. Someone sending a PDF isn’t sending code to simply be digested by a machine; they’re sending a PDF to be opened by a client, signed, manipulated, or presented in browser through extensions. Someone sending a KML doesn’t want their file to be interpreted in any other form than as a wrapper for geocaching or Google Maps coordinates.

This makes the format very niche, of course. If an API handles locations, especially in terms of geography on a map service, KML must be supported. If an API handles signature authentication services for documents, PDF support is required. If an API never handles this sort of data, then there’s really no use case for the format at all.
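
For an API that does handle locations, producing KML can be a small step. Here is a minimal Python sketch that wraps a single coordinate in a KML Placemark using the standard library; the `placemark_kml` helper and the sample place are hypothetical, and real KML output usually carries more structure (documents, styles, descriptions).

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def placemark_kml(name, longitude, latitude):
    """Wrap one point in a minimal KML document and return it as a string."""
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element("{%s}kml" % KML_NS)
    placemark = ET.SubElement(kml, "{%s}Placemark" % KML_NS)
    ET.SubElement(placemark, "{%s}name" % KML_NS).text = name
    point = ET.SubElement(placemark, "{%s}Point" % KML_NS)
    # KML orders coordinates as longitude,latitude
    coordinates = ET.SubElement(point, "{%s}coordinates" % KML_NS)
    coordinates.text = "%s,%s" % (longitude, latitude)
    return ET.tostring(kml, encoding="unicode")


document = placemark_kml("Stockholm", 18.0686, 59.3293)
print(document)
```

The API only wraps the data; interpreting the coordinates is left to whichever mapping client opens the file.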

Database Data Formats

Finally, Database Data Formats are those formats that are typically used to handle communication between databases and other databases or users, an ever-more common function in a world increasingly affected by the Cloud stack. Whereas Direct Data Formats form data upon request and handle that data from service to service, Database Data Formats take generated data and archive it for later use. Formats in this category include CSV and SQL. Support in this realm has to do with how data is stored locally, and what services you intend to tie into.

CSV

CSV (Comma Separated Values) is a very common format, though it has been largely made obsolete by XML. It still has use however, especially when considered within the context of data scraping from MediaWiki platforms, handling the transformation of data from one form to another. It is starkly contrasted to SQL, however, in that it is more often used to transform, rather than to store.

Below is a simple implementation of a CSV parser:

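// Assumption: the League\Csv Reader (7.x/8.x API); the file path is hypothetical
$csv = League\Csv\Reader::createFromPath('/path/to/file.csv');

//get the first row, usually the CSV header
$headers = $csv->fetchOne();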

//get 25 rows starting from the 11th row
$res = $csv->setOffset(10)->setLimit(25)->fetchAll();
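
The same offset-and-limit pattern can be sketched in Python with only the standard library. This is a minimal illustration, not a port of the PHP library: the in-memory sample data is hypothetical, and here the offset counts data rows after the header has been read.

```python
import csv
import io
from itertools import islice

# Hypothetical in-memory CSV: a header row followed by 40 data rows
raw = "id,name\n" + "".join("%d,user%d\n" % (i, i) for i in range(1, 41))

reader = csv.reader(io.StringIO(raw))
header = next(reader)  # first row, usually the CSV header

# Skip 10 data rows, then take the next 25
rows = list(islice(reader, 10, 35))

print(header, len(rows), rows[0], rows[-1])
```

Because `csv.reader` is an iterator, `islice` streams past the skipped rows without loading the whole file into memory.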

SQL

Storing data is more often handled by the SQL format, largely due to the file and relational structure, as well as the widespread enterprise adoption of SQL and the related MySQL implementation.

This parsed-query output from the code.google.com php-sql-parser project page shows how easily SQL can be integrated into a system utilizing PHP:

Array
(
    [OPTIONS] => Array
        (
            [0] => STRAIGHT_JOIN
        )      
       
    [SELECT] => Array
        (
            [0] => Array
                (
                    [expr_type] => colref
                    [base_expr] => a
                    [sub_tree] =>
                    [alias] => `a`
                )

            [1] => Array
                (
                    [expr_type] => colref
                    [base_expr] => b
                    [sub_tree] =>
                    [alias] => `b`
                )

            [2] => Array
                (
                    [expr_type] => colref
                    [base_expr] => c
                    [sub_tree] =>
                    [alias] => `c`
                )

        )

    [FROM] => Array
        (
            [0] => Array
                (
                    [table] => some_table
                    [alias] => an_alias
                    [join_type] => JOIN
                    [ref_type] =>
                    [ref_clause] =>
                    [base_expr] =>
                    [sub_tree] =>
                )

        )

    [WHERE] => Array
        (
            [0] => Array
                (
                    [expr_type] => colref
                    [base_expr] => d
                    [sub_tree] =>
                )

            [1] => Array
                (
                    [expr_type] => operator
                    [base_expr] => >
                    [sub_tree] =>
                )

            [2] => Array
                (
                    [expr_type] => const
                    [base_expr] => 5
                    [sub_tree] =>
                )

        )

)

Functionally, the difference is one of motive. When handling multiple databases that all speak to each other, SQL is the format of choice, as SQL is used to organize data and handle calls between that data. When handling a single database or flat database file utilizing something like YAML that needs to be translated into another format, CSV is the format of choice.
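
The split between storing and transforming can be sketched in a few lines of Python. This is a minimal illustration using the standard library’s `sqlite3` and `csv` modules; the `visitors` table and its rows are hypothetical, and SQLite stands in here for whichever SQL database an API actually uses.

```python
import csv
import io
import sqlite3

# Store: SQL organizes the data and answers queries against it
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visitors (first_name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO visitors (first_name, age) VALUES (?, ?)",
    [("Ada", 21), ("Bo", 17), ("Cy", 34)],
)
cursor = conn.execute(
    "SELECT first_name, age FROM visitors WHERE age >= ? ORDER BY first_name",
    (18,),
)

# Transform: CSV carries the query result to another system
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow([column[0] for column in cursor.description])
writer.writerows(cursor)
csv_export = buffer.getvalue()
conn.close()

print(csv_export)
```

The database keeps the relational structure and filtering logic, while the CSV export is a flat hand-off that nearly any downstream tool can read.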

How Does Hypermedia Affect APIs?

The sea change in API formats over the last ten years has altered the API landscape. Since the wide adoption of the World Wide Web, data has shifted ever-increasingly away from static, descriptive content, and towards manipulative, interactive data, termed hypermedia. An extension of “hypertext”, hypermedia links objects, video, music, and other media to other such sources.

There is a strict line between “media” and “hypermedia”. Whereas media is simply a collection of text, graphics, audio, or video, hypermedia is the joining of interactivity with these other media formats. Accordingly, formats have had to change to support this revolution. Thus, codec handling, file formats, tagging, object-orientation, database handling, and more have become incredibly important.

As we discuss data formats, keep this in mind — while data formats are incredibly important, how they are utilized has a lot to do with what kind of media you intend on handling, and whether it is hypermedia in nature. Your API might handle video data like a dream, but without an RSS feed to notify users of new content, you might as well have no audience.

Conclusion

Which data formats to support has everything to do with your API’s distinct and particular use case. While you could theoretically spend the entirety of your development budget expanding the format support and translation of your API in order to support anything and everything that’s thrown at it, you would ultimately end up with a bloated API requiring more support, with a greater chance of failure. If you draw down too hard, however, and support only one or two formats, and rare formats at that, you risk being agile and lean but ultimately useless.

So the question that must be asked comes down to this — “can anyone reasonably expect this format in their use of the API?” If you were to use your own API, would you expect support for the format you are considering? If so, implement it! If not, consider whether there’s a better choice, and support that!

Simply understanding what each format does goes a long way toward deciding whether or not you want to support it. In summary:

  • Direct Data Formats (JSON, XML, YAML): formats that support the sharing of data directly for use in other systems; best used in B2B or public-facing API implementations
  • Feed Data Formats (RSS, Atom, SUP): formats that serialize changes and alert users to these changes; best used in subscription industries such as blogs, video sharing, and social media
  • Manipulation Data Formats (KML, PDF): formats in which data is wrapped for sharing in document form; best used in design and communication oriented industries
  • Database Data Formats (CSV, SQL): formats in which data is categorized and stored in database formats for interpretation; best used in analytics-dependent or long-term data utilization implementations

Ultimately, format choice will come down to what is best given your specific implementation. Additionally, the age-old architectural and developmental argument of lean and fast vs. slow and expansive should be a serious consideration; balance these features with the requirements and expectations of the average user, and you’re sure to find API success.