Checkpointing with Azure Event Hubs

Here are some tips for you to create a checkpointing solution with Event Hubs provisioned as part of your subscription.

Although the actual implementation and terminology might differ slightly between different platforms and languages, the logic behind checkpointing with Azure Event Hubs is largely similar.

This is the official explanation for checkpointing, according to Microsoft's documentation:

Checkpointing is a process by which readers mark or commit their position within a partition event sequence. Checkpointing is the responsibility of the consumer and occurs on a per-partition basis within a consumer group. This responsibility means that for each consumer group, each partition reader must keep track of its current position in the event stream, and can inform the service when it considers the data stream complete.

To find out more about checkpointing in general please refer to this page.

To understand checkpointing, there are several key terms that is important to take note of within the context of Flight Info Alerts. When retrieving a flight change event from the event hub, you will also obtain the following metadata: Partition ID, Offset, Sequence Number, and the EnqueueTime of the event. The following sections would describe those values and their importance.

Partitions

Within the Event Hubs provisioned by OAG as part of Flight Info Alerts, the flight change events streamed into the Event Hubs will be split into multiple partitions. In Flight Info Alerts, events are allocated to the partitions according to the Flight Number of the event, irrespective of other details such as the carrier code or departure date. Each event you consume therefore will have a partition ID attached as metadata.

This also means that the true chronological sequence of all events streamed by OAG is split into multiple partitions. However, the chronological sequence of a specific flight number will still be contiguous within each partition. Partitions are allocated to increase throughput and stability when extracting data by our customers, since each partition handles receipt of events separately from one another.

Before starting to consume events from the event hub, it is a good idea to inspect the event hub to get an idea of how many partitions there are so you can organise the extraction of events in your application.

Offsets

This is Microsoft's definition of an offset within the event hub/partition context:

An offset is the position of an event within a partition. You can think of an offset as a client-side cursor. The offset is a byte numbering of the event. This offset enables an event consumer (reader) to specify a point in the event stream from which they want to begin reading events. You can specify the offset as a timestamp or as an offset value. Consumers are responsible for storing their own offset values outside of the Event Hubs service. Within a partition, each event includes an offset.

Offsets are one of the methods in which a consumer client can use to determine the starting position of receiving events from a specific partition. Offsets are unique within the partition only.

Sequence Number

The Sequence Number of an event is another value that is part of the metadata of each event within the event hub. Sequence numbers are 64-bit integers that increments by 1 for each event that goes into each partition. For example, the first event in each partition would have a sequence number of 0. The next event that goes into the partition would have a sequence number of 1, and so on. Sequence numbers will increment until all available numbers within the 64bit limit is exhausted and will then revert back to 0.

Sequence numbers are unique within each partition and is also one of the values customers can use to determine the starting position of a consumer client.

Enqueue Time

The enqueue time property of an event is the datetime value in UTC of when an event was pushed into the partition within the event hub. This value can also be used to select a starting position of a consumer client. However, its value as a checkpoint for restarting or continuing a consumption process is less than offsets or sequence numbers because at high throughputs datetime values may not be sensitive enough to capture the actual event.

Checkpointing Concepts

Because flight change events are sent into different partitions and Offsets/Sequence numbers being unique to each partition, the recommendation is that consumer clients store checkpoints for each partition as events are consumed.

For each checkpoint, customers should store either the offset, sequence number, or enqueue times as well as the partition ID of the event in a location accessible by your application. Preferably you would store all of the values above.

In order to increase performance, instead of creating a checkpoint for each event, an alternative method is to do checkpoints for every x number of events.

When you do restart your application or continue consumption, all you will need to do is to use the partition ID and one of the above values (offset, sequence number or enqueue time) to begin from where you left off.

Checkpointing as suggested by Microsoft Azure

If you have looked through Microsoft's documentation, the default method (because of pre-built code packages) to implement checkpointing will be using Azure Blob Storage, which requires a separate Azure subscription.

It is not necessary to have a blob storage account to do checkpointing with event hubs. The only requirement is that you are able to store the checkpoint information as stated above in a location accessible by your application, in a way that fits the volume and latency of data flowing in.

As an alternative, Microsoft's SDK now also includes Redis as an option for checkpointing. Based on Redis' website:

Redis is an in-memory data store used by millions of developers as a cache, vector database, document database, streaming engine, and message broker. Redis has built-in replication and different levels of on-disk persistence. It supports complex data types (e.g., strings, hashes, lists, sets, sorted sets, and JSON), with atomic operations defined on those data types.

Redis offers competitive pricing and similar, if not better performance for checkpointing than utilizing blob storage and is able to connect to major cloud services such as Azure, AWS and GCS.

Do note that checkpointing, while useful, is entirely optional. It can be thought of as insurance in case something goes wrong that will help you quickly get back on your feet, but does not affect the actual consumption of events, which is able to proceed without a checkpointing process.