How to Build Zero-Trust Event-Based Architectures

Many companies break down complex backend flows into microservices. This makes it easier to delegate responsibilities and extend code for each business area. Microservices can communicate via point-to-point HTTP requests, but an alternative architecture is to use event messages that are processed asynchronously. This involves services calling each other via a message broker, where the caller publishes an event message and one or more consumers receive it. Here is an example setup involving web and mobile clients:

There are several architectural advantages to this type of flow. Firstly, it can improve performance and scalability since clients only need to call an entry point API, after which each microservice processes events at its own pace. Secondly, there are extensibility benefits since new consumers can be added without changing the existing code. Finally, event-based architectures can improve resilience and data integrity, since, if temporary failures occur, the message broker will retry delivery until all event messages are processed. All of this requires only simple code in the microservices.
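To make the extensibility point concrete, here is a minimal sketch using a hypothetical in-memory broker. The `InMemoryBroker` class, the `OrderSubmitted` type, and the topic name are assumptions for illustration only. A real message broker would also persist messages and deliver them asynchronously.

```typescript
// Hypothetical in-memory broker, for illustration only.
// A real broker persists messages and delivers them asynchronously.
type OrderSubmitted = { orderId: string; amount: number };
type Consumer = (message: OrderSubmitted) => void;

export class InMemoryBroker {
    private readonly consumers = new Map<string, Consumer[]>();

    // Register a consumer for a topic without touching publisher code
    subscribe(topic: string, consumer: Consumer): void {
        const existing = this.consumers.get(topic) ?? [];
        this.consumers.set(topic, [...existing, consumer]);
    }

    // Deliver the message to every consumer and return how many received it
    publish(topic: string, message: OrderSubmitted): number {
        const targets = this.consumers.get(topic) ?? [];
        targets.forEach((consumer) => consumer(message));
        return targets.length;
    }
}

const broker = new InMemoryBroker();
const billed: string[] = [];
const shipped: string[] = [];
broker.subscribe('OrderSubmitted', (m) => billed.push(m.orderId));
// A second consumer is added later; the publishing call below is unchanged
broker.subscribe('OrderSubmitted', (m) => shipped.push(m.orderId));
const delivered = broker.publish('OrderSubmitted', { orderId: 'order-1', amount: 100 });
```

The publisher only knows the topic name, so adding the second consumer requires no change to the code that publishes the event.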

One of the challenges of event-based architectures is that failure conditions can be more complex than point-to-point HTTP flows. You will need to handle problems that occur after the initial request has been accepted, for example after a user has successfully submitted an order, and you may then need to inform the user asynchronously. You will also need to design how to manage ‘poison’ messages, which fail permanently.
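One common way to deal with both temporary failures and poison messages is to retry a bounded number of times and then park the message for investigation. The sketch below assumes a hypothetical `deliverWithRetry` helper and an in-memory dead-letter list. Real brokers usually provide their own retry and dead-letter features, so treat this only as an outline of the idea.

```typescript
type DeliveryResult = 'processed' | 'dead-lettered';

// Hypothetical helper: try a handler up to maxAttempts times, then park the message
export function deliverWithRetry<T>(
    message: T,
    handler: (message: T) => void,
    deadLetters: T[],
    maxAttempts = 3
): DeliveryResult {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            handler(message);
            return 'processed';
        } catch {
            // A temporary failure may succeed on a later attempt
        }
    }
    // Permanent failure: keep the poison message so that it can be investigated
    deadLetters.push(message);
    return 'dead-lettered';
}

const deadLetters: string[] = [];
let calls = 0;
// Fails once and then succeeds, simulating a temporary outage
const flaky = (_: string): void => {
    calls++;
    if (calls < 2) throw new Error('temporarily unavailable');
};
// Always fails, simulating a message that can never be processed
const poison = (_: string): void => {
    throw new Error('malformed payload');
};

const first = deliverWithRetry('order-1', flaky, deadLetters);
const second = deliverWithRetry('order-2', poison, deadLetters);
```

Here the first message is processed on its second attempt, while the second message exhausts its attempts and lands in the dead-letter list.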

A more subtle problem is security. When microservices receive asynchronous event messages, they can easily lose security and identity guarantees, which they need for authorizing and auditing access to data. In some systems, it may even be possible for a malicious party to send messages, or to edit secure values stored in event messages, without recipients being aware.

In this article, I will describe the basics of implementing zero trust for event-based architectures. The pattern described can apply to any system that uses event messages and asynchronous flows. This includes end-of-day batch jobs or workflow systems that trigger continuations based on human actions. To best describe the threats, the following sections will first drill into the evolution of security best practices for HTTP requests.

Perimeter Security

It used to be common to apply the primary security only at the perimeter of a corporate network, after which calls inside the network were trusted. For applications using point-to-point HTTP requests, this might be represented as follows, where web and mobile apps authenticate users, then call APIs with a message credential. The main credential verification then occurs in an entry point API, after which headers containing secure values are forwarded to upstream microservices.

These days, this type of security architecture is insufficient due to the use of cloud providers for hosting, more sophisticated cyberattacks, and a greater number of potential threats inside the network. A malicious party inside the network could initiate calls to microservices or potentially intercept messages and replace unprotected secure values, such as those sent in headers. It might even be possible to change money amounts in some cases. Modern systems must therefore adopt a more secure model.

OAuth Architectures

The modern way to secure applications is to use the OAuth family of specifications. These standards provide up-to-date options for authenticating users with one or more proofs of their identity and protecting data in APIs according to business rules. There are many RFC documents that map to company use cases. An authorization server implements these standards so that you can integrate the desired security behaviors into your apps and scale them with only simple code. A key implementation detail is how microservices use JSON Web Tokens (JWT) as access tokens. In the next section, I will describe how this solves the perimeter security problem.

Zero-Trust HTTP Requests

A modern OAuth architecture is shown next, for the case where point-to-point HTTP requests are used. Web clients follow the current best practice of using the most secure HTTP-only cookies when calling APIs. Mobile clients instead use opaque access tokens. Neither of these API message credentials reveals sensitive data such as Personally Identifiable Information (PII). The API gateway then deals with translating received credentials and forwarding JWTs as access tokens to APIs. For more on these topics, see the articles on the phantom token pattern and the token handler pattern.

To implement OAuth correctly, every microservice must then validate the JWT received on every request. This involves cryptographically verifying the digital signature of the JWT. To ensure this is done correctly, JWT best practices must be followed. An attacker then cannot successfully call APIs, since they are unable to provide a valid JWT as an access token. The only party that can create such a JWT is the authorization server, which owns the token signing private key.

Each microservice then checks that the required scopes are present, then implements its detailed authorization based on claims in the JWT access token. Claims include secure values such as user IDs or permissions. If an attacker intercepts an API request and edits the JWT to use a different claim value, the JWT will no longer have a valid signature and will be rejected by the microservice.
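The following sketch illustrates why an edited JWT is rejected. It uses Node.js's built-in `crypto` module with a locally generated RSA key pair standing in for the authorization server's signing key. The helper names (`issueJwt`, `validateJwt`, `toBase64Url`) and the claim values are assumptions for illustration. A production service would normally use a vetted JWT library and would also check the issuer and audience.

```typescript
import { createSign, createVerify, generateKeyPairSync } from 'crypto';

type Claims = { sub: string; scope: string; exp: number };

const toBase64Url = (data: Buffer): string =>
    data.toString('base64').replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');

// Stand-in for the authorization server, which alone holds the private key
const { privateKey, publicKey } = generateKeyPairSync('rsa', { modulusLength: 2048 });

function issueJwt(claims: Claims): string {
    const header = toBase64Url(Buffer.from(JSON.stringify({ alg: 'RS256', typ: 'JWT' })));
    const payload = toBase64Url(Buffer.from(JSON.stringify(claims)));
    const signature = createSign('RSA-SHA256').update(`${header}.${payload}`).sign(privateKey);
    return `${header}.${payload}.${toBase64Url(signature)}`;
}

// What each microservice does: verify the signature, then the expiry, then the scope
export function validateJwt(jwt: string, requiredScope: string): Claims | null {
    const [header, payload, signature, ...rest] = jwt.split('.');
    if (!header || !payload || !signature || rest.length > 0) return null;
    const valid = createVerify('RSA-SHA256')
        .update(`${header}.${payload}`)
        .verify(publicKey, Buffer.from(signature, 'base64'));
    if (!valid) return null;
    const claims = JSON.parse(Buffer.from(payload, 'base64').toString('utf8')) as Claims;
    if (claims.exp * 1000 <= Date.now()) return null;
    return claims.scope.split(' ').includes(requiredScope) ? claims : null;
}

const expiry = Math.floor(Date.now() / 1000) + 900;
const token = issueJwt({ sub: 'user-1', scope: 'orders', exp: expiry });

// An attacker swaps in different claims but cannot produce a matching signature
const [h, , s] = token.split('.');
const forged = toBase64Url(Buffer.from(JSON.stringify({ sub: 'admin', scope: 'orders', exp: expiry })));
const tampered = `${h}.${forged}.${s}`;

const accepted = validateJwt(token, 'orders');
const rejected = validateJwt(tampered, 'orders');
```

The untouched token is accepted, whereas the tampered token fails signature verification because only the holder of the private key can sign the new payload.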

When a company builds multiple microservices that are all part of the same business solution, the access token can be forwarded to upstream microservices. This is a secure option since it maintains the user identity of the original client, which can then be used and audited by the recipient. In special cases, such as when the upstream microservice is less trusted, the API can itself act as an OAuth client, and get a new access token from the authorization server for the same user by using one of the token sharing approaches.

Finally, the access token may occasionally expire in the middle of the flow. A common option for handling this process is for the client to refresh the access token and then initiate a retry. Microservices then use standard patterns, such as Request IDs, to ensure that no data is duplicated.

Zero-Trust Events

To implement zero trust for event-based messages, you can follow an approach equivalent to that for HTTP requests. A first attempt might simply forward the JWT in the event message. Depending on the message broker, the JWT might be sent as a header or as part of the event payload. Each microservice can then apply the same JWT verification, authorization, and auditing that it does for HTTP requests. An attacker who lacks a valid JWT can then no longer send their own event messages, and secure values carried in the JWT of a stored event message cannot be altered without invalidating its signature.

Token Exchange

Although receiving a JWT in event messages achieves the main goal of sharing identity, it also introduces some problems. Firstly, if an attacker somehow gains access to the JWT in the stored event data, they could send it to any API endpoint that accepts any of the sales, orders, billing, or delivery scopes. Secondly, the lifetime of the JWT is a problem, since, in some asynchronous use cases, the consumer may not process the event for hours. By then, the JWT will have expired, and the original client is no longer available to refresh it.

To solve these problems, a different token must be forwarded in the event message, with reduced permissions and an extended lifetime. This is managed by the publishing API acting as an OAuth client and initiating a token exchange flow using its client credential. The token issued must also be bound to the exact event message in which it is used. An example token exchange is shown below, where the entry point API sends a request with the original JWT access token and some custom fields, then receives a new JWT as the access token:

In this example, scopes for asynchronous continuations have been named with a convention of a resume prefix. The token exchange removes all other scopes to reduce privileges. By default, this should result in a single scope, though when consumers also have upstream consumers, additional resume scopes can remain in the exchanged token. The lifetime has also changed from a short-lived 15-minute token to a long life of one year. Finally, the new token also contains a hash of the event data, which becomes a claim asserted by the authorization server. Additional identifiers could also be included in the long-lived JWT.
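As a rough sketch, the exchange described above might be requested with the standard OAuth token exchange parameters from RFC 8693. The `event_payload_hash` field name is a hypothetical custom field, since the exact custom fields depend on your authorization server, and the token and hash values are placeholders.

```typescript
// Builds the form body for an RFC 8693 token exchange request.
// 'event_payload_hash' is a hypothetical custom field, not part of the RFC.
export function buildTokenExchangeBody(
    originalAccessToken: string,
    resumeScope: string,
    eventPayloadHash: string
): URLSearchParams {
    return new URLSearchParams({
        grant_type: 'urn:ietf:params:oauth:grant-type:token-exchange',
        subject_token: originalAccessToken,
        subject_token_type: 'urn:ietf:params:oauth:token-type:access_token',
        // Request only the scope needed to resume the flow, to reduce privileges
        scope: resumeScope,
        event_payload_hash: eventPayloadHash,
    });
}

// Placeholder values for illustration
const exchangeBody = buildTokenExchangeBody('original-access-token', 'resume_orders', 'a1b2c3');
```

The publishing API would POST this body to the authorization server's token endpoint, authenticating with its own client credential, and then attach the returned JWT to the event message.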

Resuming Event-Based Flows

Scopes beginning with the resume prefix can be considered “meta-scopes” that are treated differently from normal scopes. An orders scope might allow access to most API operations, whereas a long-lived token containing a resume_orders scope must restrict usage to one or a few event processing operations. Authorization logic in microservices must reject the JWT if it is used at any HTTP endpoint. For this type of scope, microservices must also verify that the event data matches that referenced in the JWT. This pseudo-code demonstrates the extra checks:

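export function authorizeOrderSubmittedEventMessage(event: OrderSubmittedEvent, accessToken: string) {
    // validateJwt is expected to verify the digital signature before any claims are trusted
    const claims = validateJwt(accessToken);
    if (!claims) {
        throw new ServiceError(401, 'jwt_error', 'The JWT failed verification');
    }
    // Match whole scope names, whether the scope claim is a space-delimited string or an array
    const scopes: string[] = Array.isArray(claims.scope) ? claims.scope : String(claims.scope ?? '').split(' ');
    if (!scopes.includes('resume_orders')) {
        throw new ServiceError(403, 'authorization_error', 'The token has insufficient scope');
    }
    // Recompute the hash with the same serialization as the publisher, then compare it to the token's claim
    const eventPayloadHash = hash.sha256(JSON.stringify(event.payload));
    if (claims.eventPayloadHash !== eventPayloadHash) {
        throw new ServiceError(403, 'invalid_event_message', 'The event message contains an unexpected payload');
    }
}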

Once these checks are made, the event data can be considered digitally verified and can be trusted by receiving microservices. The access token also conveys the user identity and claims to the microservice, which can be used for authorization and auditing.
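The hash comparison only works if the publisher and consumer serialize the payload identically, and `JSON.stringify` output depends on property insertion order. One possible approach, sketched below with Node.js's built-in `crypto` module, is to serialize with sorted keys before hashing. The `canonicalize`, `hashEventPayload`, and `payloadMatchesToken` names and the sample payload are assumptions for illustration.

```typescript
import { createHash } from 'crypto';

// Serialize with sorted keys so that publisher and consumer hash identical bytes
function canonicalize(value: unknown): string {
    if (Array.isArray(value)) {
        return `[${value.map((item) => canonicalize(item)).join(',')}]`;
    }
    if (value !== null && typeof value === 'object') {
        const record = value as Record<string, unknown>;
        const entries = Object.keys(record)
            .sort()
            .map((key) => `${JSON.stringify(key)}:${canonicalize(record[key])}`);
        return `{${entries.join(',')}}`;
    }
    return JSON.stringify(value);
}

export function hashEventPayload(payload: unknown): string {
    return createHash('sha256').update(canonicalize(payload)).digest('hex');
}

// Compare the recomputed hash with the hash claim from the verified JWT
export function payloadMatchesToken(payload: unknown, claimedHash: string): boolean {
    return hashEventPayload(payload) === claimedHash;
}

// The publisher computes the hash that the authorization server asserts as a claim
const published = { orderId: 'order-1', userId: 'user-1', amount: 100 };
const claimedHash = hashEventPayload(published);

// Same data in a different property order still matches
const reordered = { amount: 100, userId: 'user-1', orderId: 'order-1' };
// A changed money value no longer matches
const tamperedPayload = { ...published, amount: 1 };

const reorderedMatches = payloadMatchesToken(reordered, claimedHash);
const tamperedMatches = payloadMatchesToken(tamperedPayload, claimedHash);
```

With this approach a consumer accepts the same data regardless of property order, but rejects a payload in which any value has been edited.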

A malicious party who gains access to the long-lived JWT cannot use it unexpectedly. If an attacker creates their own event message and attaches the JWT to it, the message will be rejected. If a user ID or money value is somehow changed in an event message, it will also be rejected. The worst an attacker can do is replay the event message to valid recipients, which microservices are expected to handle idempotently, treating the duplicate as a no-op.
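Replay handling can be sketched as follows, assuming each event message carries a unique identifier. The `OrderEventProcessor` class and the `eventId` field are hypothetical, and a real service would record processed identifiers in durable storage instead of in memory.

```typescript
type OrderSubmittedMessage = { eventId: string; orderId: string };

export class OrderEventProcessor {
    // In-memory stand-in for a durable store of processed event IDs
    private readonly processedEventIds = new Set<string>();
    readonly createdOrders: string[] = [];

    // Returns true when work was done, false when the message was a replay
    handle(message: OrderSubmittedMessage): boolean {
        if (this.processedEventIds.has(message.eventId)) {
            return false;
        }
        this.processedEventIds.add(message.eventId);
        this.createdOrders.push(message.orderId);
        return true;
    }
}

const processor = new OrderEventProcessor();
const message = { eventId: 'event-1', orderId: 'order-1' };
const firstDelivery = processor.handle(message);
// The same message is replayed and results in no additional work
const replayedDelivery = processor.handle(message);
```

If the identifier is part of the hashed event payload, an attacker cannot change it to bypass the duplicate check without the message being rejected.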

Further Reading

In this article, I have explained only the main ingredients for implementing zero-trust events. For further information on this design pattern and how to scale it to many microservices, take a look at the Zero Trust API Events architect article. To run a working solution on a development computer, using Apache Kafka as the message broker, see the Securing API Events using JWTs code example.