Scale read from change streams in documentdb - mongodb

I am utilizing change streams from documentDB to read timely sequenced events using lambda, event bridge to trigger event every 10min to invoke lambda and to archive the data to S3. Is there a way to scale the read from change stream using resume token and polling model? If a single lambda tries to read from change stream to archive then my process is falling way behind. As our application writes couple of millions during peak period my archival process is able to archive atmost 500k records to S3. Is there a way to scale this process? Running parallel lambda might not work as this will lead to racing condition.

can't you use step-functions? your event bridge fires the lambda which is a step-function, then it can keep the state while archiving the records.

I am not certain about documentDB, but I believe in MongoDB you can create a change stream with a filter. In this way, you can have multiple change streams, each acting on a portion (filter) of data. This allows multiple change streams to work concurrently on one cluster.

Related

Backfill Beam pipeline with historical data

I have a Google Cloud Dataflow pipeline (written with the Apache Beam SDK) that, in its normal mode of operation, handles event data published to Cloud Pub/Sub.
In order to bring the pipeline state up to date, and to create the correct outputs, there is a significant amount of historical event data which must be processed first. This historical data is available via JDBC. In testing, I am able to use the JdbcIO.Read PTransform to read and handle all historical state, but I'd like to initialize my production pipeline using this JDBC event data, and then cleanly transition to reading events from Pub/Sub. This same process may happen again in the future if the pipeline logic is ever altered in a backward incompatible way.
Note that while this historical read is happening, new events are continuing to arrive into Pub/Sub (and these end up in the database also), so there should be a clean cutover from only historical events read from JDBC, and only newer events read from Pub/Sub.
Some approaches I have considered:
Have a pipeline that reads from both inputs, but filters data from JDBC before a certain timestamp, and from pub/sub after a certain timestamp. Once the pipeline is caught up deploy an update removing the JDBC input.
I don't think this will work because removal of an I/O transform is not backward compatible. Alternately, the JDBC part of the pipeline must stay there forever, burning CPU cycles for no good reason.
Write a one-time job that populates pub/sub with the entirety of the historical data, and then starts the main pipeline reading only from pub/sub.
This seems to use more pub/sub resources than necessary, AND I think newer data interleaved in the pipeline with much older data will cause watermarks to be advanced too early.
Variation of option #2 -- stop creating new events until the historical data is handled, to avoid messing up watermarks.
This requires downtime.
It seems like it would be a common requirement to backfill historical data into a pipeline, but I haven't been able to find a good approach to this.
Your first option, reading from a Bounded source (filtered to timestamp <= cutoff) and PubSub (filtered to timestamp > cutoff) should work well.
Because JDBC.Read() is a bounded source, it will be read all the data and then "finish" i.e. never produce any more data, advance its watermark to +infinity, and not be invoked again (so there's no concern about it consuming CPU cycles, even if it's present in your graph).

Category projections using kafka and cassandra for event-sourcing

I'm using Cassandra and Kafka for event-sourcing, and it works quite well. But I've just recently discovered a potentially major flaw in the design/set-up. A brief intro to how it is done:
The aggregate command handler is basically a kafka consumer, which consumes messages of interest on a topic:
1.1 When it receives a command, it loads all events for the aggregate, and replays the aggregate event handler for each event to get the aggregate up to current state.
1.2 Based on the command and businiss logic it then applies one or more events to the event store. This involves inserting the new event(s) to the event store table in cassandra. The events are stamped with a version number for the aggregate - starting at version 0 for a new aggregate, making projections possible. In addition it sends the event to another topic (for projection purposes).
1.3 A kafka consumer will listen on the topic upon these events are published. This consumer will act as a projector. When it receives an event of interest, it loads the current read model for the aggregate. It checks that the version of the event it has received is the expected version, and then updates the read model.
This seems to work very well. The problem is when I want to have what EventStore calls category projections. Let's take Order aggregate as an example. I can easily project one or more read models pr Order. But if I want to for example have a projection which contains a customers 30 last orders, then I would need a category projection.
I'm just scratching my head how to accomplish this. I'm curious to know if any other are using Cassandra and Kafka for event sourcing. I've read a couple of places that some people discourage it. Maybe this is the reason.
I know EventStore has support for this built in. Maybe using Kafka as event store would be a better solution.
With this kind of architecture, you have to choose between:
Global event stream per type - simple
Partitioned event stream per type - scalable
Unless your system is fairly high throughput (say at least 10s or 100s of events per second for sustained periods to the stream type in question), the global stream is the simpler approach. Some systems (such as Event Store) give you the best of both worlds, by having very fine-grained streams (such as per aggregate instance) but with the ability to combine them into larger streams (per stream type/category/partition, per multiple stream types, etc.) in a performant and predictable way out of the box, while still being simple by only requiring you to keep track of a single global event position.
If you go partitioned with Kafka:
Your projection code will need to handle concurrent consumer groups accessing the same read models when processing events for different partitions that need to go into the same models. Depending on your target store for the projection, there are lots of ways to handle this (transactions, optimistic concurrency, atomic operations, etc.) but it would be a problem for some target stores
Your projection code will need to keep track of the stream position of each partition, not just a single position. If your projection reads from multiple streams, it has to keep track of lots of positions.
Using a global stream removes both of those concerns - performance is usually likely to be good enough.
In either case, you'll likely also want to get the stream position into the long term event storage (i.e. Cassandra) - you could do this by having a dedicated process reading from the event stream (partitioned or global) and just updating the events in Cassandra with the global or partition position of each event. (I have a similar thing with MongoDB - I have a process reading the 'oplog' and copying oplog timestamps into events, since oplog timestamps are totally ordered).
Another option is to drop Cassandra from the initial command processing and use Kafka Streams instead:
Partitioned command stream is processed by joining with a partitioned KTable of aggregates
Command result and events are computed
Atomically, KTable is updated with changed aggregate, events are written to event stream and command response is written to command response stream.
You would then have a downstream event processor that copies the events into Cassandra for easier querying etc. (and which can add the Kafka stream position to each event as it does it to give the category ordering). This can help with catch up subscriptions, etc. if you don't want to use Kafka for long term event storage. (To catch up, you'd just read as far as you can from Cassandra and then switch to streaming from Kafka from the position of the last Cassandra event). On the other hand, Kafka itself can store events for ever, so this isn't always necessary.
I hope this helps a bit with understanding the tradeoffs and problems you might encounter.

Event sourcing - why a dedicated event store?

I am trying to implement event sourcing/CQRS/DDD for the first time, mostly for learning purposes, where there is the idea of an event store and a message queue such as Apache Kafka, and you have events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka.
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself with its main features and log compaction or configuring log retention for permanent storage. Should I store my events in a dedicated store like RDBMS to feed into Kafka or should I feed them straight into Kafka?
Much of the literature on event-sourcing and cqrs comes from the [domain driven design] community; in its earliest form, CQRS was called DDDD... Distributed domain driven design.
One of the common patterns in domain driven design is to have a domain model ensuring the integrity of the data in your durable storage, which is to say, ensuring that there are no internal contradictions...
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself with its main features and log compaction or configuring log retention for permanent storage.
So if we want an event stream with no internal contradictions, how do we achieve that? One way is to ensure that only a single process has permission to modify the stream. Unfortunately, that leaves you with a single point of failure -- the process dies, and everything comes to an end.
On the other hand, if you have multiple processes updating the same stream, then you have risk of concurrent writes, and data races, and contradictions being introduced because one writer couldn't yet see what the other one did.
With an RDBMS or an Event Store, we can solve this problem by using transactions, or compare and swap semantics; and attempt to extend the stream with new events is rejected if there has been a concurrent modification.
Furthermore, because of its DDD heritage, it is common for the durable store to be divided into many very fine grained partitions (aka "aggregates"). One single shopping cart might reasonably have four streams dedicated to it.
If Kafka lacks those capabilities, then it is going to be a lousy replacement for an event store. KAFKA-2260 has been open for more than four years now, so we seem to be lacking the first. From what I've been able to discern from the Kakfa literature, it isn't happy about fine grained streams either (although its been a while since I checked, perhaps things have changed).
See also: Jesper Hammarbäck writing about this 18 months ago, and reaching similar conclusions to those expressed here.
Kafka can be used as a DDD event store, but there are some complications if you do so due to the features it is missing.
Two key features that people use with event sourcing of aggregates are:
Load an aggregate, by reading the events for just that aggregate
When concurrently writing new events for an aggregate, ensure only one writer succeeds, to avoid corrupting the aggregate and breaking its invariants.
Kafka can't do either of these currently, since 1 fails since you generally need to have one stream per aggregate type (it doesn't scale to one stream per aggregate, and this wouldn't necessarily be desirable anyway), so there's no way to load just the events for one aggregate, and 2 fails since https://issues.apache.org/jira/browse/KAFKA-2260 has not been implemented.
So you have to write the system in such as way that capabilities 1 and 2 aren't needed. This can be done as follows:
Rather than invoking command handlers directly, write them to
streams. Have a command stream per aggregate type, sharded by
aggregate id (these don't need permanent retention). This ensures that you only ever process a single
command for a particular aggregate at a time.
Write snapshotting code for all your aggregate types
When processing a command message, do the following:
Load the aggregate snapshot
Validate the command against it
Write the new events (or return failure)
Apply the events to the aggregate
Save a new aggregate snapshot, including the current stream offset for the event stream
Return success to the client (via a reply message perhaps)
The only other problem is handling failures (such as the snapshotting failing). This can be handled during startup of a particular command processing partition - it simply needs to replay any events since the last snapshot succeeded, and update the corresponding snapshots before resuming command processing.
Kafka Streams appears to have the features to make this very simple - you have a KStream of commands that you transform into a KTable (containing snapshots, keyed by aggregate id) and a KStream of events (and possibly another stream containing responses). Kafka allows all this to work transactionally, so there is no risk of failing to update the snapshot. It will also handle migrating partitions to new servers, etc. (automatically loading the snapshot KTable into a local RocksDB when this happens).
there is the idea of an event store and a message queue such as Apache Kafka, and you have events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka
In the essence of DDD-flavoured event sourcing, there's no place for message queues as such. One of the DDD tactical patterns is the aggregate pattern, which serves as a transactional boundary. DDD doesn't care how the aggregate state is persisted, and usually, people use state-based persistence with relational or document databases. When applying events-based persistence, we need to store new events as one transaction to the event store in a way that we can retrieve those events later in order to reconstruct the aggregate state. Thus, to support DDD-style event sourcing, the store needs to be able to index events by the aggregate id and we usually refer to the concept of the event stream, where such a stream is uniquely identified by the aggregate identifier, and where all events are stored in order, so the stream represents a single aggregate.
Because we rarely can live with a database that only allows us to retrieve a single entity by its id, we need to have some place where we can project those events into, so we can have a queryable store. That is what your diagram shows on the right side, as materialised views. More often, it is called the read side and models there are called read-models. That kind of store doesn't have to keep snapshots of aggregates. Quite the opposite, read-models serve the purpose to represent the system state in a way that can be directly consumed by the UI/API and often it doesn't match with the domain model as such.
As mentioned in one of the answers here, the typical command handler flow is:
Load one aggregate state by id, by reading all events for that aggregate. It already requires for the event store to support that kind of load, which Kafka cannot do.
Call the domain model (aggregate root method) to perform some action.
Store new events to the aggregate stream, all or none.
If you now start to write events to the store and publish them somewhere else, you get a two-phase commit issue, which is hard to solve. So, we usually prefer using products like EventStore, which has the ability to create a catch-up subscription for all written events. Kafka supports that too. It is also beneficial to have the ability to create new event indexes in the store, linking to existing events, especially if you have several systems using one store. In EventStore it can be done using internal projections, you can also do it with Kafka streams.
I would argue that indeed you don't need any messaging system between write and read sides. The write side should allow you to subscribe to the event feed, starting from any position in the event log, so you can build your read-models.
However, Kafka only works in systems that don't use the aggregate pattern, because it is essential to be able to use events, not a snapshot, as the source of truth, although it is of course discussable. I would look at the possibility to change the way how events are changing the entity state (fixing a bug, for example) and when you use events to reconstruct the entity state, you will be just fine, snapshots will stay the same and you'll need to apply correction events to fix all the snapshots.
I personally also prefer not to be tightly coupled to any infrastructure in my domain model. In fact, my domain models have zero dependencies on the infrastructure. By bringing the snapshotting logic to Kafka streams builder, I would be immediately coupled and from my point of view it is not the best solution.
Theoretically you can use Kafka for Event Store but as many people mentioned above that you will have several restrictions, biggest of those, only able to read event with the offset in the Kafka but no other criteria.
For this reason they are Frameworks there dealing with the Event Sourcing and CQRS part of the problem.
Kafka is only part of the toolchain which provides you the capability of replaying events and back pressure mechanism that are protecting you from overload.
If you want to see how all fits together, I have a blog about it

Is Event Sourcing applicable for batch inputs?

I have a use case where the inputs to the application comes in batches of XML files. For example, a nightly batch of bank transactions. I am trying to see if I can use event sourcing to create a log of events. Based on what I read so far, the examples seems to be based on user driven input (click streams, updates from a user interface etc.,). Is event sourcing using a distributed log mechanism(like Kafka) a valid approach for batch/file based inputs?
Below is the approach I would like to take:
Accept input as a batch in file/xml
Run some basic validations in the memory.
Convert the batch input into a series of events
Write the event log to a Kafka topic(s).
Use the event log to store the data into the database, send the events
to a search engine, update caches, run spark jobs to do aggregations
etc.,
Repeat the process for other incoming batches.
If this approach is not efficient, what other options are available for distributed processing of such inputs?
Are your inputs coming from something that looks like an event storage? I.e. a database that acts as an immutable source of truth, of append only events.
If that is the case, you have the foundation to use event sourcing, and additionally CQRS. (They're not the same thing)
What you would have to realize is that the so called write side / command side... has already been done for you.
The incoming batch of XML files with transactions... each transaction is an event already. It doesn't sound like you need to convert these to events, to then put these into Kafka. You can just map these to something you can put into Kafka, and then all subscribers of the topics can do stuff accordingly.
Effectively you would be implementing the read side of Event Sourcing + CQRS.
In practical terms, unless you are going to be doing things on the write side (where the xml files are generated / where user input is received)... I wouldn't worry too much about the subtleties of event sourcing as it relates to DDD and CQRS. I would simply think of what you're doing as a way to distribute your data to multiple services.
And make sure to consider how caches, search engines, etc. will only be updated whenever you get those XML files.
If each individual event in these xml files has a timestamp then you can think of the output to Kafka as just a steam of late arriving events. Kafka allows you to set the event time on these messages to be the timestamp of the event rather than the time it was ingested to Kafka. In that way, any downstream processing apps like Kafka Streams can put the event into the right temporal context and aggregate into the proper time windows or session windows or even join with other realtime inputs

How can I measure the propagation latency of DynamoDB Streams?

I'm using DynamoDB Streams + Kinesis Client Library (KCL).
How can I measure latency between when an event was created in a stream and when it was processed on KCL side?
As I know, KCL's MillisBehindLatest metric is specific to Kinesis Streams(not DynamoDB streams).
approximateCreationDateTime record attribute has a minute-level approximation, which is not acceptable for monitoring in sub-second latency systems.
Could you please help with some useful metrics for monitoringDynamoDB Streams latency?
You can change the way you do writes in your application to allow your application to track the propagation delay of mutations in the table's stream. For example, you could always update a 'last_updated=' timestamp attribute when you create and update items. That way, when your creations and updates appear in the stream, you can estimate the propagation delay by subtracting the current time from last_updated in the NEW_IMAGE of the stream record.
Because deletions do not have a NEW_IMAGE in stream records, your deletes would need to take place in two steps:
logical deletion where you write the 'logically_deleted='
timestamp to the item and
physical deletion where you actually call DeleteItem immediately following 1.
Then, you would use the same math as for creations and updates, only differences being that you would use the OLD_IMAGE when processing deletions and you would need to subtract at least around 10ms to account for the time it takes to perform the logical delete (step 1).