How can I measure the propagation latency of DynamoDB Streams?

I'm using DynamoDB Streams + Kinesis Client Library (KCL).
How can I measure latency between when an event was created in a stream and when it was processed on KCL side?
As far as I know, KCL's MillisBehindLatest metric is specific to Kinesis Streams (not DynamoDB Streams).
The approximateCreationDateTime record attribute has a minute-level approximation, which is not acceptable for monitoring systems with sub-second latency.
Could you please suggest some useful metrics for monitoring DynamoDB Streams latency?

You can change the way you do writes in your application to allow your application to track the propagation delay of mutations in the table's stream. For example, you could always update a 'last_updated=' timestamp attribute when you create and update items. That way, when your creations and updates appear in the stream, you can estimate the propagation delay by subtracting last_updated in the NEW_IMAGE of the stream record from the current time.
Because deletions do not have a NEW_IMAGE in stream records, your deletes would need to take place in two steps:
1. logical deletion, where you write the 'logically_deleted=' timestamp to the item, and
2. physical deletion, where you actually call DeleteItem immediately following step 1.
Then, you would use the same math as for creations and updates, the only differences being that you would use the OLD_IMAGE when processing deletions and that you would need to subtract at least around 10ms to account for the time it takes to perform the logical delete (step 1).
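The calculation above could look roughly like the following minimal sketch. It assumes the record is available in the JSON shape used by the DynamoDB Streams API (eventName, dynamodb.NewImage, dynamodb.OldImage) and that both timestamp attributes are stored as epoch milliseconds in a Number attribute; the function name and the allowance parameter are hypothetical. With the KCL adapter you would read the same fields from the adapter's record object instead.

```python
import time


def propagation_delay_ms(record, now_ms=None, logical_delete_allowance_ms=10):
    """Estimate the stream propagation delay for one stream record.

    Assumes 'last_updated' and 'logically_deleted' hold epoch milliseconds
    (an assumption of this sketch, not something DynamoDB enforces).
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    data = record["dynamodb"]
    if record["eventName"] == "REMOVE":
        # Deletions carry no NewImage, so use the logical-delete timestamp
        # from the OldImage and allow for the time the logical delete took.
        written_ms = int(data["OldImage"]["logically_deleted"]["N"])
        return now_ms - written_ms - logical_delete_allowance_ms
    # INSERT and MODIFY records carry the timestamp in the NewImage.
    written_ms = int(data["NewImage"]["last_updated"]["N"])
    return now_ms - written_ms
```

Note that the estimate also includes any clock skew between the writer and the consumer, so it is only as good as the clock synchronization between those hosts.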


Scale read from change streams in documentdb

I am using DocumentDB change streams to read time-ordered events with a Lambda; EventBridge fires an event every 10 minutes to invoke the Lambda, which archives the data to S3. Is there a way to scale reads from the change stream using a resume token and a polling model? With a single Lambda reading from the change stream, my archival process falls far behind: our application writes a couple of million records during peak periods, but the archival process can archive at most 500k records to S3. Is there a way to scale this process? Running Lambdas in parallel might not work, as it would lead to race conditions.
Can't you use Step Functions? Your EventBridge event fires the Lambda, which is a step function, and it can then keep the state while archiving the records.
I am not certain about DocumentDB, but I believe in MongoDB you can create a change stream with a filter. In this way, you can have multiple change streams, each acting on a portion (filter) of the data. This allows multiple change streams to work concurrently on one cluster.
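As a rough sketch of the filtered-change-stream idea, the helper below builds one aggregation pipeline per worker, each matching a disjoint set of values of some field. The helper name, the field and the grouping are hypothetical, and whether DocumentDB accepts this $match stage on a change stream is an assumption you would need to verify. Also note that delete events carry no fullDocument, so a filter like this would not see them.

```python
def partition_pipelines(field, value_groups):
    """Build one change-stream pipeline per group of field values.

    Each pipeline matches only events whose full document has `field`
    set to one of the values in its group, so the groups must be disjoint
    for the workers not to overlap.
    """
    return [
        [{"$match": {f"fullDocument.{field}": {"$in": list(values)}}}]
        for values in value_groups
    ]


# Each worker would pass its own pipeline (and its own resume token) to the
# driver, e.g. with PyMongo: collection.watch(pipeline, resume_after=token)
pipelines = partition_pipelines("region", [["eu", "us"], ["ap"]])
```

Because every worker keeps its own resume token and sees a disjoint slice of the data, the workers do not race each other for the same events.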

How does Google Dataflow determine the watermark for various sources?

I was just reviewing the documentation to understand how Google Dataflow handles watermarks, and it just mentions the very vague:
The data source determines the watermark
It seems you can add more flexibility through withAllowedLateness but what will happen if we do not configure this?
Thoughts so far
I found something indicating that if your source is Google PubSub it already has a watermark which will get taken, but what if the source is something else? For example a Kafka topic (which I believe does not inherently have a watermark, so I don't see how something like this would apply).
Is it always 10 seconds, or just 0? Is it looking at the last few minutes to determine the max lag and if so how many (surely not since forever as that would get distorted by the initial start of processing which might see giant lag)? I could not find anything on the topic.
I also searched outside the context of Google DataFlow for Apache Beam documentation but did not find anything explaining this either.
When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).
In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.
For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending timestamps watermark generator will result in perfect overall watermarks. Note that a TimestampAssigner is not provided in the example; the timestamps of the Kafka records themselves will be used instead.
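To make the merging rule concrete, here is a minimal plain-Python sketch (not the Flink API; the class and method names are hypothetical). It assumes timestamps are ascending within each partition, tracks the highest timestamp seen per partition, and takes the minimum across partitions as the overall watermark, which is how watermarks are combined when several inputs are merged.

```python
class PartitionWatermarks:
    """Tracks a watermark per partition and merges them with min()."""

    def __init__(self, partitions):
        # No data seen yet: every partition holds the watermark back.
        self._max_ts = {p: float("-inf") for p in partitions}

    def observe(self, partition, timestamp):
        # With ascending timestamps per partition, the highest timestamp
        # seen so far is that partition's watermark.
        self._max_ts[partition] = max(self._max_ts[partition], timestamp)

    def watermark(self):
        # The overall watermark cannot run ahead of the slowest partition.
        return min(self._max_ts.values())


wm = PartitionWatermarks([0, 1])
wm.observe(0, 10)
wm.observe(1, 5)
wm.observe(0, 20)
overall = wm.watermark()  # held back by partition 1
```

A consequence of the min rule is that one idle partition stalls the overall watermark until it receives data again.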
In any data processing system, there is a certain amount of lag between the time a data event occurs (the “event time”, determined by the timestamp on the data element itself) and the time the actual data element gets processed at any stage in your pipeline (the “processing time”, determined by the clock on the system processing the element). In addition, there are no guarantees that data events will appear in your pipeline in the same order that they were generated.
For example, let’s say we have a PCollection that’s using fixed-time windowing, with windows that are five minutes long. For each window, Beam must collect all the data with an event time timestamp in the given window range (between 0:00 and 4:59 in the first window, for instance). Data with timestamps outside that range (data from 5:00 or later) belong to a different window.
However, data isn’t always guaranteed to arrive in a pipeline in time order, or to always arrive at predictable intervals. Beam tracks a watermark, which is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline. Once the watermark progresses past the end of a window, any further element that arrives with a timestamp in that window is considered late data.
From our example, suppose we have a simple watermark that assumes approximately 30s of lag time between the data timestamps (the event time) and the time the data appears in the pipeline (the processing time), then Beam would close the first window at 5:30. If a data record arrives at 5:34, but with a timestamp that would put it in the 0:00-4:59 window (say, 3:38), then that record is late data.
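The example can be worked through in a few lines of plain Python (this is an illustration of the arithmetic, not Beam code). Times are seconds since the start of the example, and the watermark is assumed to trail the processing time by a fixed 30s, as in the text; the function names are hypothetical.

```python
WINDOW_SECONDS = 5 * 60      # five-minute fixed windows
ASSUMED_LAG_SECONDS = 30     # the simple watermark heuristic from the example


def window_start(event_time):
    """Start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def is_late(event_time, arrival_time):
    """True if the watermark has already passed the end of the element's window."""
    watermark = arrival_time - ASSUMED_LAG_SECONDS
    window_end = window_start(event_time) + WINDOW_SECONDS
    return watermark >= window_end


# A record stamped 3:38 that arrives at 5:34 misses the 0:00-4:59 window.
late = is_late(3 * 60 + 38, 5 * 60 + 34)
# The same record arriving at 5:20 would still be on time.
on_time = not is_late(3 * 60 + 38, 5 * 60 + 20)
```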

How to replay in a deterministic way in CQRS / event-sourcing?

In CQRS / ES based systems, you store events in an event-store. These events refer to an aggregate, and they have an order with respect to the aggregate they belong to. Furthermore, aggregates are consistency / transactional boundaries, which means that any transactional guarantees are only given on a per-aggregate level.
Now, suppose I have a read model which consumes events from multiple aggregates (which is perfectly fine, AFAIK). To be able to replay the read model in a deterministic way, the events need some kind of global ordering, across aggregates – otherwise you wouldn't know whether to replay events for aggregate A before or after the ones for B, or how to intermix them.
The simplest solution to achieve this is by using a timestamp on the events, but typically timestamps are not fine-granular enough (or, to put it another way, not all databases are created equal). Another option is to use a global sequence, but this is bad performance-wise and hinders scaling.
How do you solve this issue? Or is my basic assumption, that replays of read models should be deterministic, wrong?
I see these options:
Global sequence
If your database allows it, you can use timestamp+aggregateId+aggregateVersion as an index. This usually doesn't work well in the distributed database case.
In a distributed database you can use a vector clock to get a global sequence without a lock.
Event sequence inside each read model. You can literally store all events in the read model and sort them as you want before applying a projection function.
Allow non-determinism and deal with it. For instance, in your example, if there is no group when add_user event arrives - just create an empty group record to the read model and add a user. And when create_group event arrives - update that group record.
After all, you have checked in the UI and/or command handler that there is a group with this aggregateId, right?
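The last option, tolerating non-determinism, might be sketched as below. The event names add_user and create_group come from the example above; the field names and the in-memory dictionary standing in for the read model are hypothetical. The point is that the projection produces the same result whichever of the two events arrives first.

```python
def apply_event(read_model, event):
    """Apply one event to the read model, creating the group record on demand."""
    # If the group is unknown (its create_group has not arrived yet),
    # start from an empty placeholder instead of failing.
    group = read_model.setdefault(event["group_id"], {"name": None, "users": set()})
    if event["type"] == "create_group":
        group["name"] = event["name"]
    elif event["type"] == "add_user":
        group["users"].add(event["user_id"])
    return read_model


create = {"type": "create_group", "group_id": "g1", "name": "admins"}
add = {"type": "add_user", "group_id": "g1", "user_id": "u1"}

in_order = apply_event(apply_event({}, create), add)
out_of_order = apply_event(apply_event({}, add), create)
```

This only works when the projection's operations commute like this; a projection whose result depends on the order (for example "last writer wins" on the same field) still needs a stable ordering.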
How do you solve this issue?
It's a known issue, and of course neither simple timestamps, nor a global sequence, nor naive event-ordering methods will help.
Use a vector clock with weak timestamps to enumerate your events, and a vector cursor to read them. That guarantees a stable, deterministic order in which to intermix events between aggregates. This works even if each thread has a clock synchronization gap, which is the regular case for database clusters, because perfect timestamp synchronization is impossible.
This also automatically makes it possible to seamlessly mix reading events from the event store and, later, from the event bus, and it avoids any database locks between events of different aggregates.
Algorithm draft:
1) Determine the real number of simultaneous transactions in your database, e.g. the maximum number of workers in the cluster.
Since every event is written in exactly one transaction on one thread, you can define its unique id as the tuple (thread number, thread counter), where thread counter is the number of transactions processed on the current thread.
Calculate the event's weak timestamp as MAX(thread timestamp, aggregate timestamp), where aggregate timestamp is the timestamp of the last event for the current aggregate.
2) Prepare a vector cursor for reading events, bounded by thread number. Read events from each thread sequentially until the timestamp gap exceeds the allowed value. The allowed weak timestamp gap is a trade-off between event reading performance and preserving the native event order.
The minimal value is the cluster's thread synchronization time delta, in which case events arrive in their native aggregate intermix order. The maximum value is infinity, in which case events are split by aggregate. When using an RDBMS like Postgres, that value can be determined automatically via a smart SQL query.
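A simplified, in-memory sketch of step 1 is shown below; it is not the reference implementation. It assigns the (thread number, thread counter) id and the weak timestamp as described, and then, instead of a real vector cursor with a bounded gap, simply sorts all events. The class name, the per-aggregate version used as a tie-breaker and the sort key are assumptions of this sketch.

```python
class EventStore:
    """Assigns ids and weak timestamps to events as they are appended."""

    def __init__(self):
        self.events = []
        self._thread_counter = {}
        self._aggregate_state = {}  # aggregate_id -> (last weak_ts, last version)

    def append(self, thread, aggregate_id, thread_clock):
        counter = self._thread_counter.get(thread, 0) + 1
        self._thread_counter[thread] = counter
        last_ts, last_version = self._aggregate_state.get(aggregate_id, (thread_clock, 0))
        # Weak timestamp: never earlier than the aggregate's previous event,
        # even if this thread's clock lags behind the previous writer's.
        weak_ts = max(thread_clock, last_ts)
        version = last_version + 1
        self._aggregate_state[aggregate_id] = (weak_ts, version)
        event = {"id": (thread, counter), "aggregate": aggregate_id,
                 "version": version, "weak_ts": weak_ts}
        self.events.append(event)
        return event


def replay_order(events):
    """Deterministic order that keeps each aggregate's events in sequence."""
    return sorted(events, key=lambda e: (e["weak_ts"], e["aggregate"], e["version"]))


store = EventStore()
store.append(thread=2, aggregate_id="A", thread_clock=50)  # clock runs ahead
store.append(thread=1, aggregate_id="A", thread_clock=10)  # lagging clock
store.append(thread=1, aggregate_id="B", thread_clock=11)
ordered_ids = [e["id"] for e in replay_order(store.events)]
```

Because the weak timestamp never decreases within an aggregate, sorting by it keeps each aggregate's own events in order while giving a repeatable interleaving across aggregates, regardless of the order in which the events are read back.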
You can see the reference implementation for a PostgreSQL database for saving events and loading events. Event-saving performance is about 10000 events per second on a 4GB RAM RDS Postgres cluster.

Category projections using kafka and cassandra for event-sourcing

I'm using Cassandra and Kafka for event-sourcing, and it works quite well. But I've just recently discovered a potentially major flaw in the design/set-up. A brief intro to how it is done:
The aggregate command handler is basically a kafka consumer, which consumes messages of interest on a topic:
1.1 When it receives a command, it loads all events for the aggregate, and replays the aggregate event handler for each event to get the aggregate up to current state.
1.2 Based on the command and business logic, it then applies one or more events to the event store. This involves inserting the new event(s) into the event store table in Cassandra. The events are stamped with a version number for the aggregate, starting at version 0 for a new aggregate, which makes projections possible. In addition, it sends the event to another topic (for projection purposes).
1.3 A Kafka consumer listens on the topic to which these events are published. This consumer acts as a projector. When it receives an event of interest, it loads the current read model for the aggregate. It checks that the version of the event it has received is the expected version, and then updates the read model.
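For readers unfamiliar with the pattern, steps 1.1 and 1.3 can be sketched as follows. This is a minimal illustration with hypothetical names, in which plain dictionaries stand in for the Cassandra rows and Kafka messages; versions start at 0, so an empty read model is represented with version -1.

```python
def load_aggregate(events, apply, initial=None):
    """Step 1.1: rebuild the aggregate by replaying its events in version order."""
    state = initial
    for event in sorted(events, key=lambda e: e["version"]):
        state = apply(state, event)
    return state


def project(read_model, event, apply):
    """Step 1.3: update the read model only if the event has the expected version."""
    expected = read_model["version"] + 1
    if event["version"] != expected:
        raise ValueError(f"expected version {expected}, got {event['version']}")
    return {"version": event["version"], "state": apply(read_model["state"], event)}


def add_amount(state, event):
    # Example event handler: keep a running total.
    return (state or 0) + event["amount"]


total = load_aggregate(
    [{"version": 1, "amount": 5}, {"version": 0, "amount": 2}], add_amount
)
model = project({"version": -1, "state": None}, {"version": 0, "amount": 2}, add_amount)
```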
This seems to work very well. The problem is when I want to have what EventStore calls category projections. Let's take the Order aggregate as an example. I can easily project one or more read models per Order. But if I want, for example, a projection which contains a customer's last 30 orders, then I would need a category projection.
I'm just scratching my head over how to accomplish this. I'm curious to know if anyone else is using Cassandra and Kafka for event sourcing. I've read in a couple of places that some people discourage it. Maybe this is the reason.
I know EventStore has support for this built in. Maybe using Kafka as event store would be a better solution.
With this kind of architecture, you have to choose between:
Global event stream per type - simple
Partitioned event stream per type - scalable
Unless your system is fairly high throughput (say at least 10s or 100s of events per second for sustained periods to the stream type in question), the global stream is the simpler approach. Some systems (such as Event Store) give you the best of both worlds, by having very fine-grained streams (such as per aggregate instance) but with the ability to combine them into larger streams (per stream type/category/partition, per multiple stream types, etc.) in a performant and predictable way out of the box, while still being simple by only requiring you to keep track of a single global event position.
If you go partitioned with Kafka:
Your projection code will need to handle concurrent consumer groups accessing the same read models when processing events for different partitions that need to go into the same models. Depending on your target store for the projection, there are lots of ways to handle this (transactions, optimistic concurrency, atomic operations, etc.), but it would be a problem for some target stores.
Your projection code will need to keep track of the stream position of each partition, not just a single position. If your projection reads from multiple streams, it has to keep track of lots of positions.
Using a global stream removes both of those concerns, and performance is usually good enough.
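To illustrate what tracking a position per partition involves, here is a minimal in-memory sketch with hypothetical names. In practice the positions would have to be persisted, ideally in the same transaction as the read model update.

```python
class PartitionPositions:
    """Remembers the last processed offset for every (topic, partition)."""

    def __init__(self):
        self._positions = {}

    def should_process(self, topic, partition, offset):
        # Skip anything at or before the last processed offset (redelivery).
        return offset > self._positions.get((topic, partition), -1)

    def mark_processed(self, topic, partition, offset):
        self._positions[(topic, partition)] = offset


positions = PartitionPositions()
positions.mark_processed("orders", 0, 41)
positions.mark_processed("orders", 1, 7)
```

With a global stream the same bookkeeping collapses to a single number.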
In either case, you'll likely also want to get the stream position into the long term event storage (i.e. Cassandra) - you could do this by having a dedicated process reading from the event stream (partitioned or global) and just updating the events in Cassandra with the global or partition position of each event. (I have a similar thing with MongoDB - I have a process reading the 'oplog' and copying oplog timestamps into events, since oplog timestamps are totally ordered).
Another option is to drop Cassandra from the initial command processing and use Kafka Streams instead:
Partitioned command stream is processed by joining with a partitioned KTable of aggregates
Command result and events are computed
Atomically, KTable is updated with changed aggregate, events are written to event stream and command response is written to command response stream.
You would then have a downstream event processor that copies the events into Cassandra for easier querying etc. (and which can add the Kafka stream position to each event as it does it to give the category ordering). This can help with catch up subscriptions, etc. if you don't want to use Kafka for long term event storage. (To catch up, you'd just read as far as you can from Cassandra and then switch to streaming from Kafka from the position of the last Cassandra event). On the other hand, Kafka itself can store events for ever, so this isn't always necessary.
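The catch-up switch described above can be sketched like this. Plain iterables stand in for the Cassandra query and the Kafka consumer, and the position field is a hypothetical name for the stream position copied onto each event.

```python
def catch_up(stored_events, live_events):
    """Yield events from long-term storage first, then continue from the live stream."""
    last_position = -1
    for event in stored_events:
        last_position = event["position"]
        yield event
    for event in live_events:
        # Skip anything already served from storage so nothing is seen twice.
        if event["position"] > last_position:
            yield event


stored = [{"position": 0}, {"position": 1}]
live = [{"position": 1}, {"position": 2}]
seen = [e["position"] for e in catch_up(stored, live)]
```

A real consumer would more likely seek the Kafka consumer to the position after the last stored event rather than filter, but the ordering guarantee it relies on is the same.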
I hope this helps a bit with understanding the tradeoffs and problems you might encounter.

GCP Dataflow: System Lag for streaming from Pub/Sub IO

We use "System Lag" to check the health of our Dataflow jobs. For example, if we see an increase in system lag, we try to work out how to bring this metric down. There are a few questions regarding this metric.
1) What exactly does system lag mean?
The maximum time that an item of data has been awaiting processing
The above is what we see in the GCP Console when we hit the information icon. What does an item of data mean in this case? Stream processing has the concepts of windowing, event time vs. processing time, watermarks, etc. When is an item considered to be awaiting processing? For example, is it simply when the message arrives, regardless of its state?
2) What is the optimum threshold for this metric?
We try to keep this metric as low as possible, but we don't have any recommendation on how low we should keep it. For example, is there a recommendation such as keeping system lag between 20s and 30s?
3) How does system lag affect sinks?
How does system lag affect latency of the event itself?
Depending on the pipeline being executed there are a number of places that elements may be queued up awaiting processing. This is typically when the elements are passed between machines, such as within a GroupByKey, although the PubSub source also reflects the oldest unacked element.
For a given step (sinks included) "System Lag" measures the age of the oldest element in the closest input queue to that step.
It is not unusual for there to be spikes in this measure -- elements are pulled off the queue after they are processed, so if many new elements are delivered it may take a while before the queue is back to a manageable size. What is important is that the system lag goes back down after these spikes.
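As an illustration of the definition only (not of how Dataflow computes the metric internally), the sketch below models an input queue as the enqueue times of the elements still waiting, and reports the age of the oldest one. The function name is hypothetical.

```python
from collections import deque


def system_lag(queue, now):
    """Age of the oldest element still waiting in the queue (0 if it is empty)."""
    return now - queue[0] if queue else 0


# Enqueue times of elements awaiting processing, oldest first.
waiting = deque([100, 130, 150])
lag_before = system_lag(waiting, now=160)
waiting.popleft()  # the oldest element has been processed
lag_after = system_lag(waiting, now=160)
```

This also shows why a burst of new elements produces a spike that then decays: the lag only drops once the oldest elements have been worked off.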
The latency of a sink depends on several factors:
The rate at which elements arrive in the pipeline limits the rate at which the input watermark advances.
The configuration of windowing and triggers affects how long the pipeline must wait before emitting a given window.
System lag is a measure of how much added delay is currently being introduced by code executing within the pipeline.
It is likely easier to look at the "Data Watermark" of the sink, which reports up to what point in (event) time the sink has been processed.