Chaos Engineering For APIs: Review of Gremlin

We review Gremlin, a tool for API testing based on a chaos engineering ethos

Disclaimer: This is NOT a sponsored post. Our coverage is part of our effort to highlight new, interesting tools in the API space.

The internet is an extremely complex place. This complexity is, of course, what has made the technology so disruptive. However, for all the great things it enables, the same complexity produces chaos, and with it some fundamental concerns that lead to service failures.

Gremlin is an offering that is focused on managing this chaos, or at the very least understanding it and approaching it from a usability and engineering perspective. Unlike other solutions, which aim to fix the chaos, Gremlin seems more concerned with managing the reality of the chaos, and engineering the situation into the most positive of configurations.

For API developers, Gremlin can be utilized for testing purposes by creating the exact granular conditions in which failure often occurs. In other words, you can subject your API to real-world attacks in a non-real-world situation. For this reason, it bears a closer look.

Today, we’re going to look at Gremlin, and find out the ethos driving their approach. We’ll look at some history behind chaos engineering, and understand how Gremlin itself manages the chaos underlying the modern web.

Engineering the Chaos

Gremlin’s approach is unique in that it not only recognizes the chaos of underlying systems but also seems content to work within its bounds. While other solutions seem to want to organize the chaos and thereby eliminate it, Gremlin aims to leverage the chaos for a benefit.

So how does it do this? Put simply, Gremlin tries to simulate a variety of environments and situations in which the solution at hand is put through the chaos gamut – it’s akin to the old “move fast and break things” adage, except more along the lines of “move fast, break things, and learn lessons”. By simulating the environment in which an API may fail and pushing the API to the limit, you should be able to find these vulnerabilities and issues ahead of time before the stakes are real.

To quote the Gremlin documentation:

Chaos Engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news.

Chaos engineering lets you compare what you think will happen to what actually happens in your systems. You literally “break things on purpose” to learn how to build more resilient systems.

The Process of Engineering

Roughly speaking, Gremlin’s approach follows three basic steps:

  • Plan an Experiment: This stage of the process is where the framework for chaos engineering actually starts. Gremlin wants you to consider what could go wrong for your API – what situation could cause failure? Once that is known, everything else can fall into place.
  • Contain the Blast Radius: This is essentially a testing stage of least impact. Here, you run the smallest version of the test that will still give you actionable data. It builds on the groundwork laid while planning your experiment without pushing the API into the “noise output” stage of testing.
  • Scale or squash: In the final step, Gremlin advises that you either identify and fix (squash) the issue or scale the test up until it covers the full scale of your API. In theory, if you’re able to scale up to your maximum theoretical API load, you can be relatively sure that your results will hold in real life.
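
The three steps above can be pictured as a simple loop. The sketch below is not Gremlin’s API: `run_experiment`, `fake_measurement`, the step sizes, and the 5% error budget are all hypothetical, chosen only to show the idea of starting with a tiny blast radius and either scaling up or stopping.

```python
from typing import Callable, List, Optional


def run_experiment(
    measure_error_rate: Callable[[float], float],
    steps: List[float],
    max_error_rate: float,
) -> Optional[float]:
    """Ramp the blast radius step by step.

    Returns the first traffic percentage at which the error rate exceeds
    the budget (squash: stop and fix), or None if every step passed (scale).
    """
    for percent in steps:
        error_rate = measure_error_rate(percent)
        print(f"impact on {percent:>6.2f}% of traffic -> error rate {error_rate:.3f}")
        if error_rate > max_error_rate:
            return percent
    return None


# Hypothetical stand-in for a real measurement: errors only appear once
# more than 10% of traffic is affected.
def fake_measurement(percent: float) -> float:
    return 0.0 if percent <= 10 else 0.2


breaking_point = run_experiment(
    fake_measurement, [0.01, 1, 10, 50, 100], max_error_rate=0.05
)
print("squash at:", breaking_point)
```

In a real experiment, `measure_error_rate` would start an attack at the given scope and read the error rate from your monitoring, which is the part this sketch deliberately leaves out.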

Gremlin is an interesting approach in that it ultimately provides a sort of “Failure-as-a-Service” model. While this has its own caveats and considerations, notably the issue that fixating on particular issues could result in being “bogged down” in minutiae, the approach is still valid as a general testing solution.

What is Gremlin?

Before we go further, there is a question as to what Gremlin actually is. On the one hand, much of what we’re about to talk about seems almost like it requires the codebase to be integrated into the API, and yet, the service is not an architecture or a framework. On the other hand, the attacks do damage akin to being internal. The question then becomes – “what is Gremlin in relation to my API?”

The simplest answer is that Gremlin is a component service for testing. Don’t think of Gremlin so much as an architecture or framework, as this implies more heavy integration than is really occurring. Don’t think of it as a penetration testing service or an external provider, either – what’s happening is much more local. At the end of it all, Gremlin bills itself as a “full suite of enterprise-grade failure testing modules”, and that is as apt a description as one could hope for.

Attack Types and Testing

Gremlin has a ton of specific attacks. These attacks operate in stages, which Gremlin lists in order of importance rather than in the order they occur:

Stage | Description
--- | ---
Running | Attack running on the host
Halt | Attack told to halt
RollbackStarted | Code to rollback has started
RollbackTriggered | Daemon started a rollback of client
InterruptTriggered | Daemon issued an interrupt to the client
HaltDistributed | Distributed to the host but not yet halted
Initializing | Attack is creating the desired impact
Distributed | Distributed to the host but not yet running
Pending | Created but not yet distributed
Failed | Client reported unexpected failure
HaltFailed | Halt on client did not complete
InitializationFailed | Creating the impact failed
LostCommunication | Client never reported finishing/receiving execution
ClientAborted | Something on the client/daemon side stopped the Gremlin and it was aborted without user intervention
UserHalted | User issued a halt
Successful | Completed running on the Host
TargetNotFound | Attack not scoped to any current targets

When an attack is set up in Gremlin, it’s given one or more executions, each of which is the attack running against a specific process. Each execution then progresses through the stages that apply to it, outputting useful data along the way.
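
If you were watching an execution from a script, it would help to know which stages mean “still going” and which mean “done”. The grouping below is our own assumption, read off the stage descriptions above. It is not an official Gremlin classification, and `classify` is a hypothetical helper.

```python
# Assumption: grouping inferred from the stage descriptions in the table,
# not taken from Gremlin's documentation.
IN_PROGRESS = {
    "Pending", "Distributed", "Initializing", "Running", "Halt",
    "HaltDistributed", "InterruptTriggered", "RollbackTriggered",
    "RollbackStarted",
}
FINISHED_CLEANLY = {"Successful", "UserHalted"}
FINISHED_WITH_PROBLEM = {
    "Failed", "HaltFailed", "InitializationFailed", "LostCommunication",
    "ClientAborted", "TargetNotFound",
}


def classify(stage: str) -> str:
    """Map a stage name to a coarse status a monitoring script could act on."""
    if stage in IN_PROGRESS:
        return "in progress"
    if stage in FINISHED_CLEANLY:
        return "finished cleanly"
    if stage in FINISHED_WITH_PROBLEM:
        return "finished with a problem"
    raise ValueError(f"unknown stage: {stage}")


for stage in ("Pending", "Running", "Successful", "LostCommunication"):
    print(f"{stage}: {classify(stage)}")
```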

Resource Gremlins

Resource Gremlins are those specific to resource issues for the API. Each call on an API takes a specific amount of resources, and while we often consider the optimum balance to be “good enough”, imbalanced functions and applications on the API can behave very differently once the service faces real load in the real world, causing real failure. Gremlin notes the following resource Gremlins as part of their documentation:

  • CPU: Creates data streams to simulate high load for one or more of the CPU cores. This is meant to simulate extreme load on the processors, such as would be the case with a sudden surge of transform, encoding, or other data manipulation requests.
  • Memory: Takes RAM from the process, simulating a massive memory burst. This could mimic both extreme memory demands and a runaway memory leak to see how the API would identify and rectify such an issue.
  • IO: I/O stands for input/output, and this is exactly what this Gremlin stresses. This would be akin to a hard disk being subjected to an incredible amount of read/write cycles above and beyond the normal projected amounts.
  • Disk: Similar to I/O but distinct in methodology, the disk test just writes content to disk to a certain percentage to simulate real-world limitations of storage mediums and random data retrieval.

All of these tests are specific to resources, and as such, are great for testing optimum retrieval, path generation, and other resource-specific calls. Many of these can also help magnify existing conditions that may not be readily apparent, such as identifying minimal memory leaks that would otherwise be considered statistical anomalies – it is much harder to ignore this memory leak when the data is extremely high and the percentage quickly creeps to untenable levels.
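
To get a feel for what this kind of pressure looks like, here is a toy version of the CPU and memory cases. This is not how Gremlin implements its attacks. `burn_cpu` and `hold_memory` are hypothetical helpers that only show the shape of the stress, and a real attack would run them alongside your API while you watch its latency and error rate.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Keep one core busy for roughly `seconds`; returns the loop count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def hold_memory(mebibytes: int, seconds: float) -> int:
    """Allocate `mebibytes` MiB of zeroed buffers, hold them, return bytes held."""
    chunk = 1024 * 1024
    blocks = [bytearray(chunk) for _ in range(mebibytes)]
    time.sleep(seconds)
    return sum(len(block) for block in blocks)


print("busy-loop iterations:", burn_cpu(0.1))
print("bytes held:", hold_memory(16, 0.1))
```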

Network Gremlins

While resource Gremlins attack local resources, the network Gremlins attack remote resources. These tests allow you to subject the API to changes in the network status or its operational health, testing both recovery and detection systems. Gremlin defines the following Gremlins in their documentation:

  • Blackhole: This Gremlin drops all network traffic of a specific matching pattern, i.e. “all incoming traffic from a specific machine using this specific protocol”. This is great to test identification systems and is best used in combination with specific IP ranges or hostnames to simulate failure of parts of your greater network. This can be used to great effect to test your network redundancy and rerouting systems, as it can effectively simulate the loss of a switch or router on the network – in this way, it can also help test an isolated DDoS on a specific network component.
  • Latency: Latency is the time between a request and its fulfillment. This Gremlin injects latency into the egress network traffic, letting you add an arbitrary delay and see how your code copes. This is more useful in simulating network degradation than anything else, as it basically slows everything down in an isolated network branch.
  • Packet loss: In essence, this simply drops packets from matching egress network traffic. This is best used when testing packet recovery and resending policies, and when paired with Latency, can effectively simulate a network-wide DDoS on all network-connected appliances.
  • DNS: Blocks DNS server access. Testing local resolution and system backups using this can help to identify some major failures in the internal systems and ensure network functionality can be maintained even if fundamental parts of the network drop out.
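
The latency and packet loss Gremlins are easiest to reason about from the client’s side: what does your retry logic do when calls get slower and some of them vanish? The sketch below simulates that in-process. It is an illustration only, with hypothetical names (`flaky_network`, `call_with_retries`, `PacketDropped`), and it does not touch real network traffic the way Gremlin does.

```python
import random
import time
from typing import Any, Callable


class PacketDropped(Exception):
    """Raised when the simulated network loses a request."""


def flaky_network(
    call: Callable[[], Any], latency_s: float, loss_rate: float, rng: random.Random
) -> Callable[[], Any]:
    """Wrap `call` so every invocation is delayed and some are dropped."""
    def wrapped() -> Any:
        time.sleep(latency_s)              # injected latency
        if rng.random() < loss_rate:       # injected packet loss
            raise PacketDropped("simulated loss")
        return call()
    return wrapped


def call_with_retries(call: Callable[[], Any], attempts: int) -> Any:
    """Retry a call up to `attempts` times, re-raising the last loss."""
    last_error: Exception = PacketDropped("no attempts made")
    for _ in range(attempts):
        try:
            return call()
        except PacketDropped as err:
            last_error = err
    raise last_error


degraded = flaky_network(lambda: "200 OK", latency_s=0.01, loss_rate=0.3,
                         rng=random.Random(42))
try:
    print("result:", call_with_retries(degraded, attempts=5))
except PacketDropped:
    print("all attempts were lost")
```

Raising `loss_rate` or lowering `attempts` is a quick way to find the point where the retry policy stops hiding the degradation.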

State Gremlins

These Gremlins introduce chaos into your system and are probably the most chaotic of all the testing elements in this suite.

  • Shutdown: This causes a reboot, or what the Gremlin documentation calls a “halt”. The idea here is to simulate the loss of a server or a cluster, but you can also mimic something like a power outage. The principal use of something like this is to test recovery and backup planning. This function models a shutdown, so the cause of said shutdown is really moot, be it a thunderstorm, flood, earthquake, power failure, etc. There’s a bunch of additional value in this sort of command as well, as you can use this to specifically test your memory redundancy approach, especially if the shutdown is unexpected and without signs of stress on the underlying architecture.
  • Time travel: As much as I wish this would take me back to the ’50s in a DeLorean, this function just changes the host system’s time. While this seems trivial, changing the time on a host can have some pretty significant impacts. Everything from daylight saving time errors, which can cause memory and write record issues, all the way to the historic Y2K bug has its roots in an issue of this kind. Accordingly, being able to test against it is very helpful.
  • Process killer: This does what it says on the tin and kills a specific process. This can help test against dependency crashes, function failures, and poor code routing. This is perhaps the most damaging of attacks in many ways, and should be considered the crown jewel of the state attack pattern category. Note that, like shutdown, it doesn’t really matter what theoretical reason there is for the process to be killed. The end result of a dead process is the same whatever the cause, so this fits into a wide variety of attack patterns and threat vectors.
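
To see why time travel in particular can hurt, consider anything that compares a stored timestamp against the current clock. The sketch below covers only that case. `SessionToken` is a hypothetical class of our own, and the clock skew is passed in by hand rather than by changing the host’s time as Gremlin would.

```python
from datetime import datetime, timedelta, timezone


class SessionToken:
    """A token that is valid from its issue time until its TTL runs out."""

    def __init__(self, issued_at: datetime, ttl: timedelta) -> None:
        self.issued_at = issued_at
        self.expires_at = issued_at + ttl

    def is_valid(self, now: datetime) -> bool:
        # A clock that jumps backward makes the token look "not yet issued";
        # a clock that jumps forward makes it look expired.
        return self.issued_at <= now < self.expires_at


real_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
token = SessionToken(issued_at=real_now, ttl=timedelta(hours=1))

for label, skew in [
    ("no skew", timedelta(0)),
    ("clock +2h", timedelta(hours=2)),
    ("clock -1d", timedelta(days=-1)),
]:
    print(label, "->", token.is_valid(real_now + skew))
```

Only the unskewed clock accepts the token, so a host whose time has shifted in either direction would suddenly reject every active session.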

Application Attacks

Application Attacks are specific attacks against the application, typically by leveraging some sort of injection. Gremlin allows you to create attacks by using the ALFI (Application-Level Fault Injection) library and a web UI against specific application and traffic types. Because these attacks are so granular, they can simulate very specific failure scenarios.

For instance, if you want to specifically attack the call that fetches a customer address record, you can specify both the type of application-level fault and the specific matching type of traffic, as well as the duration of the attack and the situation in which this attack lives. You could simulate a call that slowly fails, for instance, or a call that fails every nth time.

Gremlin provides the following three examples of this kind of attack in their documentation:

  • Simulate an outage in production by creating an attack on your customer ID only. Then you can look for signs of problems when logged in as yourself, while no other users are even aware an attack is occurring.
  • Simulate a problem with a specific endpoint. Partial failure in distributed systems is quite common – some endpoints may be unavailable while others are working perfectly. In order to simulate such a scenario, you can create an attack targeted to some endpoints only and then determine how your system reacts.
  • Always-on failure testing. If you limit an attack to a set of devices you control, then you can run tests against those devices on a regular basis and evaluate how the user experience works when the system is degraded.
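
The first scenario, failing only for your own customer ID, combines nicely with the “every nth call” idea mentioned earlier. The sketch below shows the general technique with a plain wrapper function. It is not the ALFI library’s API, and `inject_fault`, `InjectedFault`, and `fetch_address` are hypothetical names.

```python
from typing import Callable


class InjectedFault(Exception):
    """Raised in place of a real response when a fault is injected."""


def inject_fault(
    handler: Callable[[str], str], target_customer: str, every_nth: int
) -> Callable[[str], str]:
    """Fail every `every_nth` call, but only for `target_customer`."""
    matching_calls = 0

    def wrapped(customer_id: str) -> str:
        nonlocal matching_calls
        if customer_id == target_customer:
            matching_calls += 1
            if matching_calls % every_nth == 0:
                raise InjectedFault(f"injected failure for {customer_id}")
        return handler(customer_id)

    return wrapped


def fetch_address(customer_id: str) -> str:
    # Hypothetical stand-in for the real address lookup.
    return f"address of {customer_id}"


attacked = inject_fault(fetch_address, target_customer="me", every_nth=3)

for customer in ["me", "alice", "me", "bob", "me"]:
    try:
        print(customer, "->", attacked(customer))
    except InjectedFault as err:
        print(customer, "->", err)
```

Every other customer gets a normal response on every call, while the targeted customer’s third call fails.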

Gremlin and Your API

With all of the discussion surrounding Gremlin’s attacks, a single question comes to mind – what is Gremlin’s relationship to your API? Principally, there is an obvious concern as to securing Gremlin so that its attacks cannot be levied by external users against your API, and of course, there is the concern as to how much of your API is actually exposed.

First of all, it should be noted that Gremlin is built from the ground up to not require any root privileges on the hosts which it tests. As the documentation makes very clear, each “gremlin” user has default Linux privileges, and is in no way a “super user”. That being said, the Gremlin user is granted four Linux capabilities that give it control over specific parts of the host:

  • cap_sys_boot – This is used for all shut down and reboot attacks.
  • cap_sys_time – This is used for all time travel attacks.
  • cap_net_admin – This is used for all network attacks.
  • cap_kill – This is used to kill a process.
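
If you want to check which of these four capabilities a given process actually holds, Linux exposes the effective set as a hex bitmask on the `CapEff:` line of `/proc/<pid>/status`. The sketch below decodes such a mask. The bit positions come from the Linux capability definitions. `decode_capabilities` and `read_cap_eff` are our own hypothetical helpers rather than part of Gremlin, and the sample mask is made up for the demonstration.

```python
from typing import Dict, Optional

# Bit positions as defined in the Linux kernel's capability header.
CAPABILITY_BITS = {
    "cap_kill": 5,
    "cap_net_admin": 12,
    "cap_sys_boot": 22,
    "cap_sys_time": 25,
}


def decode_capabilities(cap_eff_hex: str) -> Dict[str, bool]:
    """Report which of the four capabilities are set in a CapEff hex mask."""
    mask = int(cap_eff_hex, 16)
    return {name: bool((mask >> bit) & 1) for name, bit in CAPABILITY_BITS.items()}


def read_cap_eff(status_path: str = "/proc/self/status") -> Optional[str]:
    """Return the CapEff value from a /proc status file (Linux only)."""
    with open(status_path) as status_file:
        for line in status_file:
            if line.startswith("CapEff:"):
                return line.split()[1]
    return None


# Made-up mask with exactly the four capabilities above set.
print(decode_capabilities("0000000002401020"))
```

On a Linux host, `decode_capabilities(read_cap_eff("/proc/<pid>/status"))` with the Gremlin process’s PID substituted in would show what that process can really do.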

Having this much exposure can be a concern – while the Gremlin user is created with default rights, privilege escalation attacks are not unheard of. Accordingly, Gremlin offers several security systems to reduce the attack vectors for these particular capabilities.

First of all, Gremlin does routine security auditing. This audit is done against the entire codebase, both for its web services and the API, and both the most recent auditor findings and any remediation activities are codified in a Letter of Assessment that can be provided upon request. This is honestly the biggest security solution of their entire offering – while patching issues is great and giving advice on isolation can be effective, actually testing the system and providing documentation for each stage of both testing and remediation is hugely valuable.

To go further, Gremlin does offer some security implementations. First, they offer Two Factor Authentication, which is very useful when preventing unauthorized access to Gremlin accounts. Second, they offer SAML SSO, which is a very powerful way of integrating a federated security system, such as those found on many API backends, into the local Gremlin installation.

Setting Up Gremlin

If you want to see how Gremlin works for yourself, you can follow these steps to install it locally. Gremlin can work on a ton of different platforms, and while the process is very simple to get running, there are small differences between each. Gremlin needs to be installed on each host machine that is being attacked, and due to this, the actual process for installation can vary somewhat between platforms.

To make it simple, we’re going to look at how to install Gremlin in Debian. The first step of this process is to add the Gremlin repository. By adding the repository source into Debian, we’ll be able to ask for the correct packages in later steps. Add the repository using the following code:

echo "deb https://deb.gremlin.com/ release non-free" | sudo tee /etc/apt/sources.list.d/gremlin.list

Next, we’ll import Gremlin’s GPG keys, which let apt verify that the packages it downloads were really signed by Gremlin:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C81FC2F43A48B25808F9583BDFF170F324D41134 9CDB294B29A5B1E2E00C24C022E8EF3461A50EF6

From here, installing Gremlin is super easy:

sudo apt-get update && sudo apt-get install -y gremlin gremlind

Simplicity and Granularity

While we’ve spent a lot of time discussing how Gremlin actually works, we should briefly discuss the experience of using Gremlin. Two terms come to mind when working through the documentation – simplicity, and granularity.

Installation on multiple systems, from Ubuntu to Docker, is extremely simple, often taking only a handful of commands. The configuration is often equally simple, and the excellent documentation stands head and shoulders above others as a great example of simple, informative, and complete documentation.

Setting up the attacks is also very simple, and can easily and quickly be set up through the Gremlin web app. As an aside, just having a web app to set up these attacks is helpful, as the testing experience is separated between what you would expect to see in the CLI versus what you are specifically crafting as an attack – it might be a subtle psychological differentiation, but it is ultimately more helpful than the opposite.

Another thing to consider here is the granularity of it all. Each attack can be scoped by the type of application, the type of network traffic, the schedule of the attack, and more. Gremlin boasts granularity ranging from 0.01% of traffic to 100%. Having that wide a range allows for scaling tests, which is highly important when dealing with production APIs.
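
One common way to get this kind of percentage-based targeting is to hash a stable request or user identifier into a fixed number of buckets. The sketch below uses 10,000 buckets so that each one represents 0.01% of traffic. This is our own illustration of the idea, not a description of how Gremlin selects traffic, and `in_blast_radius` is a hypothetical helper.

```python
import hashlib


def in_blast_radius(request_id: str, percent: float) -> bool:
    """Decide deterministically whether a request falls inside the attack scope.

    `percent` ranges from 0 to 100 in steps of 0.01.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 10,000 buckets of 0.01%
    return bucket < round(percent * 100)


sample = [f"request-{i}" for i in range(20_000)]
for percent in (0.01, 1, 10, 100):
    hit = sum(in_blast_radius(request_id, percent) for request_id in sample)
    print(f"{percent:>6.2f}% target -> {hit} of {len(sample)} requests affected")
```

Because the decision depends only on the identifier, a request that is affected at 1% stays affected as the scope grows to 10% and beyond, which keeps results comparable while you scale a test up.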

Conclusion

Ultimately, Gremlin fits a niche – scalable, granular testing against a real-world API codebase. Whether or not this is the optimal choice in every situation is highly dependent on the API itself and the types of attacks expected. That being said, most APIs in most situations could benefit from Gremlin, even if only to test the resources underlying the API system.

Have you ever used Gremlin? What are your thoughts? Let us know in the comments below.