6 Techniques for 99.999% Uptime


A system is only useful if it can be used. If a system can’t be accessed, it might as well not exist. As far as APIs are concerned, this principle of availability is referred to as uptime.

Uptime is the state of a server or service being available, and a measure of how reliably it stays that way. Increasing uptime has several benefits, most of which are a boon to usability. Add to this a reputation for stability, increased security, a better developer experience, and easier customer service, and the benefits of high uptime become clear.

Having consistent and strong uptime is the lifeblood of an effective API. Luckily, there are some very effective solutions to implement, maintain, and secure high uptime.

Six Techniques for Increased Uptime

Technique 1 – Be Aware of Your Physical Assets

One of the easiest ways to ensure high uptime is to be aware of the physical assets in a system. By knowing what the points of potential failure are, liabilities can be monitored, and developers can react quickly when an issue does arise.

The level of awareness required varies with the scale of the given service. For instance, in the early days of the API space, when APIs were reserved for large companies with physical server rooms, these concerns would focus almost entirely on immediate environmental and internal threats.

This still holds true for many corporate clients — an API developer utilizing corporate resources needs to know the age of the servers handling the code, their particular uptime numbers, any temperature issues in the server stack, and so forth.

However, for the average developer in the modern space, this isn’t really a concern. Many developers utilize cloud computing, PaaS, SaaS, and IaaS for hosting their API functionalities.

The concerns from the corporate space are just as valid as they once were, but have moved into the virtual space on a much larger, more general scale. You don’t need to worry so much about the heat lanes of your servers, sure, but be aware that a cloud provider based in an earthquake-prone country has a higher chance of uptime failure than a provider in a country with relatively stable geological conditions.

Likewise, be aware of the distance from your customer to your service. While great strides can be made to optimize APIs and decrease ping, the fact is that data transmission takes time, and many systems may time out on request if the request is given to a distant, heavy-load server.

This comes down to knowing where your end API customers are and what they require, and matching the servers you choose to those requirements. Userbase size, average request volume, proximity to data centers, server capacity, and the location of the physical assets all need to be considered so that projected requirements match actual assets. Understand the limitations of your physical hardware.
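As a rough illustration of matching assets to requirements, the sketch below picks the lowest-latency region that still has room for the expected peak load. The region names, latencies, and capacities are invented for the example, and `pickRegion` is a hypothetical helper rather than part of any provider's API.

```javascript
// Hypothetical region data: names, latencies, and capacities are made up.
const regions = [
  { name: 'eu-north', medianLatencyMs: 35, capacityRps: 800 },
  { name: 'us-east', medianLatencyMs: 110, capacityRps: 5000 },
  { name: 'ap-south', medianLatencyMs: 240, capacityRps: 3000 },
];

// Return the lowest-latency region that can absorb the expected peak load,
// or null when no region has enough capacity.
function pickRegion(candidates, expectedPeakRps) {
  const viable = candidates.filter((r) => r.capacityRps >= expectedPeakRps);
  if (viable.length === 0) return null;
  return viable.reduce((best, r) =>
    r.medianLatencyMs < best.medianLatencyMs ? r : best
  );
}

console.log(pickRegion(regions, 500).name);  // eu-north
console.log(pickRegion(regions, 2000).name); // us-east
console.log(pickRegion(regions, 9000));      // null
```

In practice the latency figures would come from measurements taken near your actual customers, not from a static table.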

Technique 2 – Overestimate Server Capacity Limits

One of the worst things an API developer can do is to set server capacity limits too low.

A great real-world example of why is called the “Reddit Hug of Death.” Anyone who frequents Reddit knows how this works — a website with an awesome picture, a great game, or even a passing blog post attracts the attention of hundreds of thousands of users who turn the content viral. Before long, the site hosting this content is brought to its knees, first with slow traffic, and then failure to connect.

This is exactly what happens when a projected user base isn’t matched with the resources on hand. This sort of failure contributes to low uptime, and is something that must be considered for any “X-as-a-service” on the web, where high traffic is always a possibility.

To avoid a dreaded HTTP 503 error, make your limitations scalable from the start — adopt cloud servers or API management solutions that offer flexible scaling plans that automatically respond to increased traffic.
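Managed platforms usually make the scaling decision for you, but the underlying rule is simple enough to sketch. The example below assumes a hypothetical policy object: the per-instance throughput, target utilization, and instance limits are illustrative numbers, not recommendations.

```javascript
// Illustrative autoscaling rule; every number in the policy is an assumption.
const policy = {
  perInstanceRps: 200,    // requests per second one instance handles comfortably
  targetUtilization: 0.5, // keep instances at about half load to leave headroom
  minInstances: 2,
  maxInstances: 50,
};

// Number of instances needed to serve currentRps at the target utilization,
// clamped to the policy's minimum and maximum.
function desiredInstances(currentRps, p) {
  const needed = Math.ceil(currentRps / (p.perInstanceRps * p.targetUtilization));
  return Math.min(p.maxInstances, Math.max(p.minInstances, needed));
}

console.log(desiredInstances(50, policy));    // 2  (never below the minimum)
console.log(desiredInstances(250, policy));   // 3
console.log(desiredInstances(12000, policy)); // 50 (capped at the maximum)
```

Setting the target utilization well below 1 is the "overestimate" part: each instance keeps spare capacity for the moment traffic spikes faster than new instances can start.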

Technique 3 – Create Backup Logical Routes

Imagine you’re driving to work, using your typical daily route. As you start to merge onto a familiar road, you see it — road construction and a bright detour sign with an arrow.

Where would you be without that sign? Without a reroute in place? What if large construction projects didn’t reroute, and rather told you that “this is no longer a road — good luck finding the right way!”

That’s exactly what will happen with your API if logical routes aren’t established. No system is perfect, and no path is permanent. As paths get taken down for repairs, updates, and upgrades, users get their traffic unceremoniously denied and dropped unless an additional path is there to take up the slack.

The way to solve this is simple: maintain failover. Failover is the idea that data needs alternative paths so that there is always a conduit for exchange. Implementing more network interfaces, spreading the server load using various load balancing techniques, and establishing redundancy and abstraction through implementations such as an API Gateway or Docker containers can go a long way to inoculate a service against failures.

Luckily, the modern API development space provides many failover solutions for the modern developer. Because data centers are the primary source of data handling in the modern environment, many of these solutions are cheap, quick, and simple to implement and maintain.

Take for example the concept of a backup server. In the old days, implementing a backup server carried a huge cost for each physical unit, in addition to the increased administrative and personnel cost and the added security burden of one more node in the network.

In the data center approach, a server can be virtualized, removing the physical constraint. Virtualized servers function exactly the same as physical servers, but can be spun up independently of other nodes at the whim of the operator for low to no cost.

With such a cheap solution in the form of virtual servers, API developers can create redundant servers behind an API Gateway which is itself cloned over multiple virtual servers on independent nodes at the data center. This means that if a physical server crashes, the multiple virtual backups can kick into gear, distributing the traffic to server clones and making for a seamless client experience.
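The redundancy described above can be sketched as a small pool that hands out healthy server clones in turn and skips any that are marked as down. `FailoverPool` and the node names are hypothetical; a real gateway would also run health checks to decide when to call `setHealth`.

```javascript
// Minimal failover pool: round-robin over nodes, skipping unhealthy ones.
class FailoverPool {
  constructor(nodeNames) {
    this.nodes = nodeNames.map((name) => ({ name, healthy: true }));
    this.cursor = 0;
  }

  // Record the result of a health check for one node.
  setHealth(name, healthy) {
    const node = this.nodes.find((n) => n.name === name);
    if (node) node.healthy = healthy;
  }

  // Return the next healthy node name, or null when every node is down.
  next() {
    for (let i = 0; i < this.nodes.length; i++) {
      const index = (this.cursor + i) % this.nodes.length;
      if (this.nodes[index].healthy) {
        this.cursor = (index + 1) % this.nodes.length;
        return this.nodes[index].name;
      }
    }
    return null;
  }
}

const pool = new FailoverPool(['api-1', 'api-2', 'api-3']);
console.log(pool.next());        // api-1
pool.setHealth('api-2', false);  // simulate a crashed clone
console.log(pool.next());        // api-3 (api-2 is skipped)
console.log(pool.next());        // api-1
```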


Technique 4 – Code Proactively

The server space is only half the battle when it comes to what could cause uptime failures. The code itself is a major factor. Creating a system that catches errors before they enter production builds is just as important as making sure those builds are accessible, since a single such error could cause the system to fail. Likewise, exceptions, memory faults, and even buffer overruns have a huge role to play in uptime, and should be treated as such. Doing “just enough” to get by is highly ineffective and unproductive.

There are a wide variety of solutions that could be put into place. One of the most effective ways of intelligently handling exceptions is to utilize domains as a method of catching errors. In Node.js, the following piece of code can do just that:

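var domain = require('domain'); // note: this module is deprecated in current Node.js
var d = domain.create();

// Errors thrown inside the domain arrive here instead of crashing the process
d.on('error', function (err) {
  console.log("domain caught", err);
});

// Bind the function to the domain so anything it throws is routed to d
var f = d.bind(function() {
  console.log(domain.active === d); // <-- true
  throw new Error("uh-oh");
});

setTimeout(f, 100);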

While the code is different in other languages, the concept is the same — isolating error domains and providing for an intelligent response aids in the availability of the service in general. As part of this, flood prevention, specifically flood limiting, can be implemented as a solution as well, reducing the impact of unexpected traffic and requests.
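One common way to implement the flood limiting mentioned above is a token bucket, sketched here under simple assumptions: one bucket per client, timestamps in milliseconds, and a hypothetical `TokenBucket` class that is not taken from any particular library. The clock is passed in explicitly so the behavior is easy to follow.

```javascript
// Token bucket: each request spends one token; tokens refill at a steady rate.
class TokenBucket {
  constructor(capacity, refillPerSecond, nowMs = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so short bursts are allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request may proceed, false if it should be rejected
  // (for example with an HTTP 429 response).
  tryRemove(nowMs = Date.now()) {
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

const bucket = new TokenBucket(2, 1, 0); // burst of 2, refill 1 token per second
console.log(bucket.tryRemove(0));    // true
console.log(bucket.tryRemove(0));    // true
console.log(bucket.tryRemove(0));    // false (bucket is empty)
console.log(bucket.tryRemove(1000)); // true  (one token refilled after 1 s)
```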

Not having adaptive code is a death knell. Failing to adapt to failure situations and memory overflow means that even one badly malformed request could take down an entire system. Additionally, API security and functionality are solely the responsibility of the coder, and coding proactively prevents the shifting of blame that so often happens with larger APIs.

Technique 5 – Be the Enemy

One of the best ways to ensure your API has the highest uptime possible is to “be the enemy”. Become a white hat, and consider what your enemy would exploit using the API in question.

To find these vulnerabilities, abuse the system as hard as you can. Those who plan on attacking your system certainly aren’t going to be friendly to the API in the real world. Test your database integrity and input sanitization, and attempt to break everything. Some things to try:

  • Check the API for bugs
  • Review how the API behaves in overflow conditions
  • Consider how the API would deal with high traffic
  • Understand how your errors during initial writing and testing occurred, and frankly look at whether you properly patched them, or simply did a “good enough” job
  • Run unrealistic tests. Bombard your API with two to three times the traffic you ever expect it to get
  • Send malformed requests with both simple errors and complex errors
  • Abuse!

It’s tough love, but the harder you are on the API now, the more likely you’ll be to catch a potentially fatal error before it destroys your uptime.
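A small part of the abuse suggested above, sending malformed requests, can be automated. The sketch below assumes a hypothetical `parseOrder` function standing in for whatever your API does with a raw request body; the point is that every hostile input should come back as a clean 400 instead of an unhandled exception.

```javascript
// Hypothetical request parser; a stand-in for your API's own input handling.
function parseOrder(rawBody) {
  let data;
  try {
    data = JSON.parse(rawBody);
  } catch (err) {
    return { status: 400, error: 'body is not valid JSON' };
  }
  if (data === null || typeof data !== 'object' || Array.isArray(data)) {
    return { status: 400, error: 'body must be a JSON object' };
  }
  if (!Number.isInteger(data.quantity) || data.quantity < 1 || data.quantity > 1000) {
    return { status: 400, error: 'quantity must be an integer from 1 to 1000' };
  }
  return { status: 200, quantity: data.quantity };
}

// Simple and not-so-simple malformed bodies.
const hostileBodies = [
  '', '{', 'null', '[]', '"text"',
  '{"quantity":"1; DROP TABLE orders"}',
  '{"quantity":-5}', '{"quantity":1e99}', '{"quantity":1.5}',
  'x'.repeat(100000),
];

const statuses = hostileBodies.map((body) => parseOrder(body).status);
console.log(statuses.every((status) => status === 400)); // true
console.log(parseOrder('{"quantity":3}').status);        // 200
```

A list like `hostileBodies` tends to grow over time: each bug found in production is a good candidate for a new entry.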

Technique 6 – Adopt a Two-Pronged Development Strategy

Finally, one of the most effective things an API provider can do to secure high uptime is to segregate and isolate potential issues between builds. When we discuss these builds, we often use the term “beta” to refer to content that is not yet in “usable state”, or at the very least content that is unstable.

The problem here is a realm of extremes. If we prefer stability in a single-pronged development strategy, we would stick with proven solutions and stable implementations. Accordingly, our API would have great uptime, but would be fundamentally non-progressive.

On the other hand, we could have a single-pronged development strategy wherein we are constantly revising, constantly updating, and constantly upgrading. The problem with this approach is that it forgoes stability for experimentation, resulting in lower uptime.

So what’s the solution? Simple — implement a two-pronged strategy rather than a single-pronged. This type of development is called Two-Speed IT, and it relies on the discrepancies between the two approaches to create an effective development ecosystem.

By creating a channel for “beta” building and feature testing secondary to a primary, “stable release” channel, you allow for the integration of new features while allowing for stable usability and increased uptime. This also has the added benefit of allowing the real-time monitoring and testing of potential fault points as features are developed.

More specifically, this approach has the added benefit of resulting in an ecosystem for developers where the weaknesses of a given feature can be tested in a scaling nature. Because APIs can whitelist users for certain build features, these features can be tested on small user groups first, and then whitelisted for additional users as each level is cleared.

By doing so, developers can limit the number of users exposed to uptime failures in a beta build, and when failure inevitably occurs, reroute that traffic through the channels established earlier, allowing for useful and transparent testing.
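The staged whitelisting and rerouting described above might look something like the following sketch. The user IDs, the stage list, and the `channelFor` helper are all hypothetical; a real system would load the whitelist from configuration and derive `betaHealthy` from monitoring.

```javascript
// Each stage widens the beta audience; user IDs are made up for illustration.
const rolloutStages = [
  ['alice'],
  ['alice', 'bob', 'carol'],
  ['alice', 'bob', 'carol', 'dave', 'erin'],
];

// Decide which channel serves a user. When the beta build is unhealthy,
// everyone is rerouted to the stable channel.
function channelFor(userId, stageIndex, betaHealthy) {
  if (!betaHealthy) return 'stable';
  const whitelist = new Set(rolloutStages[stageIndex] || []);
  return whitelist.has(userId) ? 'beta' : 'stable';
}

console.log(channelFor('bob', 0, true));  // stable (not whitelisted yet)
console.log(channelFor('bob', 1, true));  // beta
console.log(channelFor('bob', 1, false)); // stable (beta is failing, reroute)
```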

Uptime as a Security Component

High uptime is inherently good, and falling short of the highest uptime your budget and hardware allow brings a variety of drawbacks. One of them is especially glaring: security.

Security in the information technology space often references the concept of “CIA”, that is, Confidentiality, Integrity, and Availability. While confidentiality (keeping information private) and integrity (making sure information can’t be changed unless the developer allows it) are very important aspects of API security, availability (the ability to access a service) is just as vital.

While low uptime can have disastrous economic impacts, the security component is perhaps the most important of uptime concerns. Uptime issues can be broadly separated into two topics: “ability to accept traffic” and “ability to successfully route traffic”. These two concepts have a great deal of crossover, but are distinct in their impact on the security of an API.

The ability to accept traffic is what first comes to mind when talking about uptime, and rightfully so, since accessibility and usability go hand in hand with high uptime. This ability can be cut off in a variety of ways, from DDoS (distributed denial-of-service) attacks to hardware and gateway failures.

This is largely an “extrinsic” discussion — that is, availability when it comes to accepting traffic is decided upon almost entirely by factors extrinsic to the API itself, whether it be the server hardware, coordinated attacks, or environmental factors.

The second topic is largely “intrinsic”, however — the ability to route traffic. While routing is arguably concerned with hardware routing utilizing switches and hubs, this topic discusses the issue in a more internal way, focusing on the code function and directives designed by the API developer.

By being able to recognize, limit, redirect, and route traffic internally on a server, an API developer is able to “stem the flood” so to speak, reducing the flow of data to a manageable level, and leveling out the rate across multiple servers. Additionally, improving routing allows for the ability to host certain resources and functions on one server, and others on another, creating de facto load balancing and ensuring that services do not affect one another in high-volume situations.
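Hosting different resources on different servers can be as simple as a prefix table consulted before a request is forwarded. The paths and server names below are hypothetical, and `upstreamFor` is only a sketch of the lookup, not a full proxy.

```javascript
// Hypothetical routing table: heavy resources get their own servers.
const routes = [
  { prefix: '/images', upstream: 'media-server' },
  { prefix: '/reports', upstream: 'batch-server' },
];

// Pick the upstream for a request path. A prefix matches either the exact
// path or anything below it, so '/imagesx' does not match '/images'.
function upstreamFor(path) {
  const match = routes.find(
    (r) => path === r.prefix || path.startsWith(r.prefix + '/')
  );
  return match ? match.upstream : 'default-server';
}

console.log(upstreamFor('/images/cat.png')); // media-server
console.log(upstreamFor('/reports'));        // batch-server
console.log(upstreamFor('/users/42'));       // default-server
```

With this split, a flood of image requests slows down only the media server, while the rest of the API keeps responding.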

Economic Cost of Poor Uptime

From an economic standpoint, having poor availability and uptime is incredibly damaging. In 1998, a report by IBM Global Services estimated that in 1996 alone, American businesses lost $5.54 billion in revenue and productivity due to poor availability. This was during an era when API usage was limited compared to today, when the world wide web all but depends on external APIs.

For a more recent example, we can look at the September 2010 reservation desk failures at Virgin Blue airlines. The failure, which cancelled 130 flights and delayed more than 60,000 passengers, resulted in between $15 million and $20 million in damages.

Virgin Blue’s losses weren’t just monetary, either — their reputation was hit dramatically. As with Virgin Blue, having routine availability failures and a low overall uptime can have disastrous effects on the reputation of your API.

It’s incredibly hard to get a userbase, but it’s just as hard to sustain a userbase. Having a poor uptime history can make users migrate away from what they perceive as an unstable system prone to failures. If an API developer can’t even assure the most basic of connections, what’s to say they’re keeping my data safe, or the data of my customers?

Simply put, all of this can be cheaply rectified before it ever becomes a production problem; when it does enter the production realm, however, the problem becomes exponentially more expensive to repair, especially when it begins to affect users.

Conclusion

Ensuring uptime is a complicated matter. There are entire companies whose goal is to sell developers their latest and greatest solutions for high uptime, and one could spend thousands of dollars pursuing a multitude of nines. 99% uptime may seem great, but that’s roughly 14 minutes offline each day!
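The arithmetic behind each "nine" is easy to check. The sketch below assumes a 365-day year and uses a hypothetical helper, `downtimeMinutes`, to convert an availability percentage into expected downtime.

```javascript
const MINUTES_PER_DAY = 24 * 60;
const MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY;

// Expected downtime, in minutes, for a given availability over a period.
function downtimeMinutes(availabilityPercent, periodMinutes) {
  return (1 - availabilityPercent / 100) * periodMinutes;
}

// Print daily and yearly downtime for two through five nines.
for (const pct of [99, 99.9, 99.99, 99.999]) {
  const perDay = downtimeMinutes(pct, MINUTES_PER_DAY).toFixed(2);
  const perYear = downtimeMinutes(pct, MINUTES_PER_YEAR).toFixed(2);
  console.log(`${pct}% uptime: ${perDay} min/day, ${perYear} min/year`);
}
```

At 99% this works out to 14.4 minutes per day, while 99.999% allows a little over five minutes in an entire year.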

For corporations, seconds offline can equate to millions lost. But for most API providers, reaching “nine nines” is a bit extreme. The techniques suggested herein are designed specifically to increase uptime with the resources at hand. By being aware of the underlying physical systems, setting high capacity thresholds, creating variable logical routes, coding for possible points of failure, brutalizing the API, and adopting two-pronged IT solutions, API developers can ensure a high base level of uptime.