Backfill Beam pipeline with historical data - apache-beam

I have a Google Cloud Dataflow pipeline (written with the Apache Beam SDK) that, in its normal mode of operation, handles event data published to Cloud Pub/Sub.
In order to bring the pipeline state up to date, and to create the correct outputs, there is a significant amount of historical event data which must be processed first. This historical data is available via JDBC. In testing, I am able to use the JdbcIO.Read PTransform to read and handle all historical state, but I'd like to initialize my production pipeline using this JDBC event data, and then cleanly transition to reading events from Pub/Sub. This same process may happen again in the future if the pipeline logic is ever altered in a backward incompatible way.
Note that while this historical read is happening, new events are continuing to arrive into Pub/Sub (and these end up in the database also), so there should be a clean cutover from only historical events read from JDBC to only newer events read from Pub/Sub.
Some approaches I have considered:
1. Have a pipeline that reads from both inputs, but filters data from JDBC before a certain timestamp, and from pub/sub after a certain timestamp. Once the pipeline is caught up deploy an update removing the JDBC input.
   I don't think this will work because removal of an I/O transform is not backward compatible. Alternately, the JDBC part of the pipeline must stay there forever, burning CPU cycles for no good reason.
2. Write a one-time job that populates pub/sub with the entirety of the historical data, and then starts the main pipeline reading only from pub/sub.
   This seems to use more pub/sub resources than necessary, AND I think newer data interleaved in the pipeline with much older data will cause watermarks to be advanced too early.
3. Variation of option #2 -- stop creating new events until the historical data is handled, to avoid messing up watermarks.
   This requires downtime.
It seems like it would be a common requirement to backfill historical data into a pipeline, but I haven't been able to find a good approach to this.

Your first option, reading from a Bounded source (filtered to timestamp <= cutoff) and PubSub (filtered to timestamp > cutoff) should work well.
Because JdbcIO.Read is a bounded source, it will read all the data and then "finish": it will never produce any more data, it will advance its watermark to +infinity, and it will not be invoked again (so there's no concern about it consuming CPU cycles, even if it remains in your graph).
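A minimal sketch of the cutoff routing described above, written in plain Python rather than with the Beam SDK. The `Event` type, the `CUTOFF` value and all function names are hypothetical; in a real pipeline the two predicates would be applied as filter steps on the JDBC branch and the Pub/Sub branch before the two collections are flattened together.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical cutoff: events at or before it come from JDBC, later ones from Pub/Sub.
CUTOFF = datetime(2020, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class Event:
    event_id: str
    timestamp: datetime


def keep_historical(event, cutoff=CUTOFF):
    """Predicate for the bounded (JDBC) branch: timestamp <= cutoff."""
    return event.timestamp <= cutoff


def keep_live(event, cutoff=CUTOFF):
    """Predicate for the unbounded (Pub/Sub) branch: timestamp > cutoff."""
    return event.timestamp > cutoff


def merge_sources(jdbc_events, pubsub_events, cutoff=CUTOFF):
    """Apply each predicate to its own source, then combine the results."""
    historical = [e for e in jdbc_events if keep_historical(e, cutoff)]
    live = [e for e in pubsub_events if keep_live(e, cutoff)]
    return historical + live
```

Because the two predicates are complementary, an event that is present in both the database and Pub/Sub is kept from exactly one of them.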

Related

Category projections using kafka and cassandra for event-sourcing

I'm using Cassandra and Kafka for event-sourcing, and it works quite well. But I've just recently discovered a potentially major flaw in the design/set-up. A brief intro to how it is done:
The aggregate command handler is basically a kafka consumer, which consumes messages of interest on a topic:
1.1 When it receives a command, it loads all events for the aggregate, and replays the aggregate event handler for each event to get the aggregate up to current state.
1.2 Based on the command and business logic it then applies one or more events to the event store. This involves inserting the new event(s) into the event store table in Cassandra. The events are stamped with a version number for the aggregate, starting at version 0 for a new aggregate, which makes projections possible. In addition it sends the event to another topic (for projection purposes).
1.3 A Kafka consumer listens on the topic to which these events are published. This consumer acts as a projector. When it receives an event of interest, it loads the current read model for the aggregate. It checks that the version of the event it has received is the expected version, and then updates the read model.
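The version check in step 1.3 could look roughly like the sketch below. It is plain Python with no Kafka or Cassandra client involved; the dictionary shapes and the names `project`, `new_read_model` and `VersionMismatch` are hypothetical.

```python
class VersionMismatch(Exception):
    """Raised when an event arrives out of order for its aggregate."""


def new_read_model():
    # Version -1 means "nothing applied yet", so the first event must be version 0.
    return {"version": -1, "events_applied": []}


def project(read_model, event):
    """Apply one event to an aggregate's read model if its version is the expected one."""
    expected = read_model["version"] + 1
    if event["version"] != expected:
        raise VersionMismatch(f"expected {expected}, got {event['version']}")
    updated = dict(read_model)
    updated["version"] = event["version"]
    updated["events_applied"] = read_model["events_applied"] + [event["type"]]
    return updated
```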
This seems to work very well. The problem is when I want to have what EventStore calls category projections. Let's take the Order aggregate as an example. I can easily project one or more read models per Order. But if I want, for example, a projection which contains a customer's last 30 orders, then I would need a category projection.
I'm just scratching my head over how to accomplish this. I'm curious to know if anyone else is using Cassandra and Kafka for event sourcing. I've read in a couple of places that some people discourage it. Maybe this is the reason.
I know EventStore has support for this built in. Maybe using Kafka as event store would be a better solution.
With this kind of architecture, you have to choose between:
Global event stream per type - simple
Partitioned event stream per type - scalable
Unless your system is fairly high throughput (say at least 10s or 100s of events per second for sustained periods to the stream type in question), the global stream is the simpler approach. Some systems (such as Event Store) give you the best of both worlds, by having very fine-grained streams (such as per aggregate instance) but with the ability to combine them into larger streams (per stream type/category/partition, per multiple stream types, etc.) in a performant and predictable way out of the box, while still being simple by only requiring you to keep track of a single global event position.
If you go partitioned with Kafka:
Your projection code will need to handle concurrent consumer groups accessing the same read models when processing events for different partitions that need to go into the same models. Depending on your target store for the projection, there are lots of ways to handle this (transactions, optimistic concurrency, atomic operations, etc.) but it would be a problem for some target stores
Your projection code will need to keep track of the stream position of each partition, not just a single position. If your projection reads from multiple streams, it has to keep track of lots of positions.
Using a global stream removes both of those concerns, and its performance is usually good enough.
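To illustrate the difference in bookkeeping, here is a small sketch of the two kinds of checkpoint a projection would have to maintain. Both classes and their method names are hypothetical and do not correspond to any Kafka client API.

```python
class GlobalCheckpoint:
    """A single position is enough when there is one global stream."""

    def __init__(self):
        self.position = -1

    def should_process(self, position):
        return position > self.position

    def advance(self, position):
        self.position = position


class PartitionedCheckpoint:
    """One position per (stream, partition) pair the projection reads from."""

    def __init__(self):
        self.positions = {}

    def should_process(self, stream, partition, offset):
        return offset > self.positions.get((stream, partition), -1)

    def advance(self, stream, partition, offset):
        self.positions[(stream, partition)] = offset
```

With the partitioned variant, the number of stored positions grows with the number of streams and partitions the projection consumes.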
In either case, you'll likely also want to get the stream position into the long term event storage (i.e. Cassandra) - you could do this by having a dedicated process reading from the event stream (partitioned or global) and just updating the events in Cassandra with the global or partition position of each event. (I have a similar thing with MongoDB - I have a process reading the 'oplog' and copying oplog timestamps into events, since oplog timestamps are totally ordered).
Another option is to drop Cassandra from the initial command processing and use Kafka Streams instead:
Partitioned command stream is processed by joining with a partitioned KTable of aggregates
Command result and events are computed
Atomically, KTable is updated with changed aggregate, events are written to event stream and command response is written to command response stream.
You would then have a downstream event processor that copies the events into Cassandra for easier querying etc. (and which can add the Kafka stream position to each event as it does so, to give the category ordering). This can help with catch-up subscriptions, etc. if you don't want to use Kafka for long-term event storage. (To catch up, you'd just read as far as you can from Cassandra and then switch to streaming from Kafka from the position of the last Cassandra event). On the other hand, Kafka itself can store events forever, so this isn't always necessary.
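The catch-up step in the parenthesis above can be sketched as follows. Plain lists of `(position, payload)` tuples stand in for Cassandra and Kafka here, and the function name is hypothetical.

```python
def catch_up_then_stream(stored_events, live_log):
    """Yield events from long-term storage first, then from the live log.

    stored_events: (position, payload) tuples from long-term storage, ordered by position.
    live_log: (position, payload) tuples from the stream, ordered by position.
    """
    last_position = -1
    for position, payload in stored_events:
        last_position = position
        yield position, payload
    # Switch to the live log, skipping everything already read from storage.
    for position, payload in live_log:
        if position > last_position:
            yield position, payload
```

A real consumer would seek directly to the position after the last stored event instead of filtering, but the ordering guarantee is the same.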
I hope this helps a bit with understanding the tradeoffs and problems you might encounter.

Event sourcing - why a dedicated event store?

I am trying to implement event sourcing/CQRS/DDD for the first time, mostly for learning purposes, where there is the idea of an event store and a message queue such as Apache Kafka, and you have events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka.
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself with its main features and log compaction or configuring log retention for permanent storage. Should I store my events in a dedicated store like RDBMS to feed into Kafka or should I feed them straight into Kafka?
Much of the literature on event-sourcing and cqrs comes from the [domain driven design] community; in its earliest form, CQRS was called DDDD... Distributed domain driven design.
One of the common patterns in domain driven design is to have a domain model ensuring the integrity of the data in your durable storage, which is to say, ensuring that there are no internal contradictions...
I am wondering why there needs to be a separate event store when it sounds like its purpose can be fulfilled by Kafka itself with its main features and log compaction or configuring log retention for permanent storage.
So if we want an event stream with no internal contradictions, how do we achieve that? One way is to ensure that only a single process has permission to modify the stream. Unfortunately, that leaves you with a single point of failure -- the process dies, and everything comes to an end.
On the other hand, if you have multiple processes updating the same stream, then you have risk of concurrent writes, and data races, and contradictions being introduced because one writer couldn't yet see what the other one did.
With an RDBMS or an Event Store, we can solve this problem by using transactions, or compare-and-swap semantics: an attempt to extend the stream with new events is rejected if there has been a concurrent modification.
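A minimal in-memory sketch of the compare-and-swap idea, assuming the writer passes the stream version it last saw. `InMemoryEventStore` and `ConcurrencyError` are hypothetical names and this is not the API of any particular event store; the lock stands in for whatever atomicity the real storage engine provides.

```python
import threading


class ConcurrencyError(Exception):
    """Raised when the stream changed since the writer last read it."""


class InMemoryEventStore:
    def __init__(self):
        self._streams = {}
        self._lock = threading.Lock()

    def read(self, stream_id):
        """Return a copy of the events in the stream (empty if it does not exist)."""
        with self._lock:
            return list(self._streams.get(stream_id, []))

    def append(self, stream_id, expected_version, new_events):
        """Append only if the stream is still at expected_version (-1 means empty)."""
        with self._lock:
            stream = self._streams.setdefault(stream_id, [])
            current_version = len(stream) - 1
            if current_version != expected_version:
                raise ConcurrencyError(
                    f"expected version {expected_version}, found {current_version}"
                )
            stream.extend(new_events)
            return len(stream) - 1
```

Two writers that both read an empty stream will both try to append with `expected_version=-1`; the first succeeds and the second is rejected, so it has to reload and decide again.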
Furthermore, because of its DDD heritage, it is common for the durable store to be divided into many very fine grained partitions (aka "aggregates"). One single shopping cart might reasonably have four streams dedicated to it.
If Kafka lacks those capabilities, then it is going to be a lousy replacement for an event store. KAFKA-2260 has been open for more than four years now, so we seem to be lacking the first. From what I've been able to discern from the Kafka literature, it isn't happy about fine-grained streams either (although it's been a while since I checked, perhaps things have changed).
See also: Jesper Hammarbäck writing about this 18 months ago, and reaching similar conclusions to those expressed here.
Kafka can be used as a DDD event store, but there are some complications if you do so due to the features it is missing.
Two key features that people use with event sourcing of aggregates are:
Load an aggregate, by reading the events for just that aggregate
When concurrently writing new events for an aggregate, ensure only one writer succeeds, to avoid corrupting the aggregate and breaking its invariants.
Kafka can't do either of these currently. Capability 1 fails because you generally need one stream per aggregate type (Kafka doesn't scale to one stream per aggregate, and this wouldn't necessarily be desirable anyway), so there's no way to load just the events for one aggregate. Capability 2 fails because https://issues.apache.org/jira/browse/KAFKA-2260 has not been implemented.
So you have to write the system in such a way that capabilities 1 and 2 aren't needed. This can be done as follows:
Rather than invoking command handlers directly, write the commands to streams. Have a command stream per aggregate type, sharded by aggregate id (these don't need permanent retention). This ensures that you only ever process a single command for a particular aggregate at a time.
Write snapshotting code for all your aggregate types
When processing a command message, do the following:
Load the aggregate snapshot
Validate the command against it
Write the new events (or return failure)
Apply the events to the aggregate
Save a new aggregate snapshot, including the current stream offset for the event stream
Return success to the client (via a reply message perhaps)
The only other problem is handling failures (such as the snapshotting failing). This can be handled during startup of a particular command processing partition - it simply needs to replay any events since the last snapshot succeeded, and update the corresponding snapshots before resuming command processing.
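The command-processing steps and the recovery step above can be sketched as follows. This is a single-process illustration with an in-memory list standing in for the event stream; the bank-balance domain, the `CommandProcessor` class and the `save_snapshot` flag (used only to simulate a failed snapshot write) are all hypothetical.

```python
class CommandProcessor:
    def __init__(self):
        self.event_log = []   # stand-in for the event stream; list index = offset
        self.snapshots = {}   # aggregate_id -> {"balance": int, "offset": int}

    def _load(self, aggregate_id):
        # Load the aggregate snapshot, or an empty one for a new aggregate.
        return dict(self.snapshots.get(aggregate_id, {"balance": 0, "offset": -1}))

    def handle(self, aggregate_id, command, save_snapshot=True):
        state = self._load(aggregate_id)
        # Validate the command against the snapshot.
        if command["type"] == "withdraw":
            if command["amount"] > state["balance"]:
                return False
            delta = -command["amount"]
        else:
            delta = command["amount"]
        # Write the new event.
        self.event_log.append({"aggregate_id": aggregate_id, "delta": delta})
        # Apply the event and record the stream offset in the snapshot.
        state["balance"] += delta
        state["offset"] = len(self.event_log) - 1
        if save_snapshot:
            self.snapshots[aggregate_id] = state
        return True

    def recover(self):
        """On startup, replay any events newer than each aggregate's snapshot."""
        for offset, event in enumerate(self.event_log):
            state = self._load(event["aggregate_id"])
            if offset > state["offset"]:
                state["balance"] += event["delta"]
                state["offset"] = offset
                self.snapshots[event["aggregate_id"]] = state
```

Because each snapshot records the offset it was taken at, recovery can tell which events are already reflected in it and apply only the missing ones.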
Kafka Streams appears to have the features to make this very simple - you have a KStream of commands that you transform into a KTable (containing snapshots, keyed by aggregate id) and a KStream of events (and possibly another stream containing responses). Kafka allows all this to work transactionally, so there is no risk of failing to update the snapshot. It will also handle migrating partitions to new servers, etc. (automatically loading the snapshot KTable into a local RocksDB when this happens).
there is the idea of an event store and a message queue such as Apache Kafka, and you have events flowing from event store => Kafka Connect JDBC/Debezium CDC => Kafka
In the essence of DDD-flavoured event sourcing, there's no place for message queues as such. One of the DDD tactical patterns is the aggregate pattern, which serves as a transactional boundary. DDD doesn't care how the aggregate state is persisted, and usually, people use state-based persistence with relational or document databases. When applying events-based persistence, we need to store new events as one transaction to the event store in a way that we can retrieve those events later in order to reconstruct the aggregate state. Thus, to support DDD-style event sourcing, the store needs to be able to index events by the aggregate id and we usually refer to the concept of the event stream, where such a stream is uniquely identified by the aggregate identifier, and where all events are stored in order, so the stream represents a single aggregate.
Because we rarely can live with a database that only allows us to retrieve a single entity by its id, we need to have some place where we can project those events into, so we can have a queryable store. That is what your diagram shows on the right side, as materialised views. More often, it is called the read side and models there are called read-models. That kind of store doesn't have to keep snapshots of aggregates. Quite the opposite, read-models serve the purpose to represent the system state in a way that can be directly consumed by the UI/API and often it doesn't match with the domain model as such.
As mentioned in one of the answers here, the typical command handler flow is:
Load one aggregate state by id, by reading all events for that aggregate. This already requires the event store to support that kind of load, which Kafka cannot do.
Call the domain model (aggregate root method) to perform some action.
Store new events to the aggregate stream, all or none.
If you now start to write events to the store and publish them somewhere else, you get a two-phase commit issue, which is hard to solve. So, we usually prefer using products like EventStore, which has the ability to create a catch-up subscription for all written events. Kafka supports that too. It is also beneficial to have the ability to create new event indexes in the store, linking to existing events, especially if you have several systems using one store. In EventStore it can be done using internal projections, you can also do it with Kafka streams.
I would argue that indeed you don't need any messaging system between write and read sides. The write side should allow you to subscribe to the event feed, starting from any position in the event log, so you can build your read-models.
However, Kafka only works in systems that don't use the aggregate pattern, because it is essential to be able to use events, not a snapshot, as the source of truth, although this is of course debatable. Consider what happens when you change the way events alter the entity state (fixing a bug, for example): if you use events to reconstruct the entity state, you will be just fine, whereas snapshots will stay the same and you'll need to apply correction events to fix all of them.
I personally also prefer not to be tightly coupled to any infrastructure in my domain model. In fact, my domain models have zero dependencies on the infrastructure. By bringing the snapshotting logic to Kafka streams builder, I would be immediately coupled and from my point of view it is not the best solution.
Theoretically you can use Kafka as an event store, but as many people mentioned above, you will face several restrictions. The biggest of these is that you can only read events by their offset in Kafka, and not by any other criteria.
For this reason there are frameworks out there that deal with the Event Sourcing and CQRS part of the problem.
Kafka is only one part of the toolchain: it gives you the ability to replay events and a back-pressure mechanism that protects you from overload.
If you want to see how it all fits together, I have a blog about it.

Scheduling job with Apache NiFi by passing dynamic property values

I have created a NiFi workflow as shown below:
GenerateFlowFile --> Custom Processor --> LogAttribute
My custom processor has a start date property. But the start date should change in each scheduled run, based on the maximum end date from the previous run. Basically, I'm looking for an incremental data fetch from the server.
Could you please explain how this can be achieved in Apache NiFi?
Processor scheduling is usually left to the data flow manager configuring the processor into their flow. I recommend you let them schedule the processor, expecting it to run on a periodic basis.
But you can use Apache NiFi's State Manager feature to store data that tracks your incremental progress. You could then decide what action to take, if any, when the processor is triggered. If there is nothing to do, don't do anything.
The best examples of this are List* processors like ListFile. These processors typically store a timestamp of the file they last read, then use that timestamp to determine which newer files should be acted on, regardless of how frequently they are asked to check. It is likely that most executions of a List* processor will result in no output.
There are some examples of reading and persisting state data in the AbstractListProcessor class.
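The pattern itself, independent of NiFi's Java API, can be sketched like this. The `IncrementalFetcher` class, its `state` dictionary (a stand-in for the processor's persisted state) and the `fetch` callback are all hypothetical; a real custom processor would read and write the value through NiFi's State Manager instead.

```python
from datetime import datetime


class IncrementalFetcher:
    def __init__(self, fetch, initial_start):
        # fetch(start) is assumed to return records whose end_date is after start.
        self.fetch = fetch
        self.state = {"start": initial_start}

    def on_trigger(self):
        """Run once per scheduled trigger; do nothing if there is no new data."""
        records = self.fetch(self.state["start"])
        if not records:
            return []
        # The next run starts from the maximum end date seen in this run.
        self.state["start"] = max(r["end_date"] for r in records)
        return records
```

The scheduler can then trigger the processor as often as it likes, since a run with no new records leaves the stored start date unchanged.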

How to use DataFrames within SparkListener?

I've written a CustomListener (deriving from SparkListener, etc...) and it works fine, I can intercept the metrics.
The question is about using DataFrames within the listener itself, as that assumes the usage of the same Spark Context; however, as of 2.1.x, only one context is allowed per JVM.
Suppose I want to write to disk some metrics in json. Doing it at ApplicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible/feasible???
I'm trying to measure the performance of jobs/stages/tasks, record that and then analyze it programmatically. Maybe that is not the best way?! The Web UI is good, but I need to make things presentable.
I can force the creation of DataFrames upon the endJob event, however a few errors are thrown (basically they refer to not being able to propagate events to the listener) and in general I would like to avoid unnecessary manipulations. I want to have a clean set of measurements that I can record and write to disk.
SparkListeners should be as fast as possible, as a slow SparkListener would block others from receiving events. You could use separate threads to release the main event dispatcher thread, but you're still bound to the limitation of having a single SparkContext per JVM.
That limitation is however easy to overcome since you could ask for the current SparkContext using SparkContext.getOrCreate.
I'd however not recommend the architecture. That puts too much pressure on the driver's JVM that should rather "focus" on the application processing (not collecting events, which it probably already does for the web UI and/or Spark History Server).
I'd rather use Kafka or Cassandra or some other persistent storage to store events to and have some other processing application to consume them (just like Spark History Server works).
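As an illustration of keeping the listener callback fast, here is a sketch in plain Python that only enqueues events on the dispatcher thread and writes them out as JSON lines on a separate worker thread. It does not use any Spark API; `QueueingListener`, `on_event` and the file-based sink are hypothetical, and the file could just as well be replaced by a Kafka producer or a database client.

```python
import json
import queue
import threading


class QueueingListener:
    def __init__(self, path):
        self._queue = queue.Queue()
        self._path = path
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def on_event(self, event):
        # Called on the dispatcher thread: only enqueue, never do I/O here.
        self._queue.put(event)

    def close(self):
        # A None sentinel tells the worker to stop after draining earlier events.
        self._queue.put(None)
        self._worker.join()

    def _drain(self):
        with open(self._path, "a", encoding="utf-8") as out:
            while True:
                event = self._queue.get()
                if event is None:
                    break
                out.write(json.dumps(event) + "\n")
```

Another application can then read the file (or topic, or table) and do the analysis without adding load to the driver.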

How can running event handlers on production be done?

In production environments event numbers scale massively. In case of an emergency, how can you re-run all the handlers when it could take days if there are too many events?
Depends on which sort of emergency you are describing.
If the nature of your emergency is that your event handlers have fallen massively behind the writers (eg: your message consumers blocked, and you now have 48 hours of backlog waiting for you) -- not much. If your consumer is parallelizable, you may be able to speed things up by using a data structure like LMAX Disruptor to support parallel recovery.
(Analog: you decide to introduce a new read model, which requires processing a huge backlog of data to achieve the correct state. There isn't any "answer", except chewing through them all. In some cases, you may be able to create an approximation based on some manageable number of events, while waiting for the real answer to complete, but there's no shortcut to processing all events).
On the other hand, in cases where the history is large, but the backlog is manageable (ie - the write model wasn't producing new events), you can usually avoid needing a full replay.
In the write model: most event sourced solutions leverage an event store that supports multiple event streams - each aggregate in the write model has a dedicated stream. Massive event numbers usually mean massive numbers of manageable streams. Where that's true, you can just leave the write model alone -- load the entire history on demand.
In cases where that assumption doesn't hold -- a part of the write model that has an extremely large stream, or pieces of the read model that compose events from multiple streams -- the usual answer is snapshotting.
Which is to say, in the healthy system, the handlers persist their state on some schedule, and include in the meta data an identifier that tracks where in the history that snapshot was taken.
To recover, you reload the snapshot, and the identifier. You then start the replay from that point (this assumes you've got an event store that allows you to start the replay from an arbitrary point in the history).
So managing recovery time is simply a matter of tuning the snapshotting interval so that you are never more than the recovery SLA behind "latest". The creation of the snapshots can happen in a completely separate process. (In truth, your persistent snapshot store looks a lot like a persisted read model).
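As a rough back-of-the-envelope aid for that tuning, assume events are written at a steady rate and replayed at a steady, faster rate. In the worst case a full interval's worth of events must be replayed after the snapshot is loaded, so the interval is bounded by the SLA times the ratio of replay rate to write rate. The function and parameter names below are hypothetical, and the model ignores events that arrive while the replay is running.

```python
def max_snapshot_interval(recovery_sla_s, replay_rate, write_rate, snapshot_load_s=0.0):
    """Longest snapshot interval (seconds) that keeps worst-case recovery within the SLA.

    recovery_sla_s: allowed recovery time in seconds
    replay_rate: events per second the handler can replay
    write_rate: events per second written during normal operation
    snapshot_load_s: time needed to load the snapshot itself
    """
    if replay_rate <= 0 or write_rate <= 0:
        raise ValueError("rates must be positive")
    budget = recovery_sla_s - snapshot_load_s
    if budget <= 0:
        raise ValueError("snapshot load time already exceeds the SLA")
    # Worst case: interval * write_rate events to replay, at replay_rate per second.
    return budget * replay_rate / write_rate
```

For example, with a 60 second SLA, a replay rate of 5000 events per second and a write rate of 100 events per second, the snapshot interval can be up to 3000 seconds under these assumptions.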