So I need to have a GlobalKTable containing the aggregation of several messages across many instances. Right now, my single instance KTable setup looks something like this:
final KTable<String, Double> aggregatedMetrics = eventStream
    .groupByKey(Serdes.String(), jsonSerde)
    .aggregate(
        () -> 0d,
        new MetricsAggregator(),
        Serdes.Double(),
        LOCAL_METRICS_STORE_NAME);
Obviously, this doesn't scale since each instance only has the updated metrics for the messages it has received, not for all of the messages received by all the other instances. I was thinking of using this:
final KStreamBuilder builder = new KStreamBuilder();
builder.globalTable(METRIC_CHANGES_TOPIC, METRICS_STORE_NAME);
and then just streaming updates to my aggregatedMetrics KTable to the METRIC_CHANGES_TOPIC, which would update the global table. However, each instance would just be overwriting the other instances' aggregations on each update to the global table.
Is there any way I can do a global aggregation?
The solution sounds correct to me.
This does not sound correct:
However, each instance would just be overwriting the other instances' aggregations on each update to the global table.
Note that aggregation is done per key. Thus, different instances will aggregate different keys, and each instance will only update its own keys in the GlobalKTable.
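A sketch of that wiring, staying with the older KStreamBuilder API the question uses; the serdes and the running KafkaStreams instance (streams) are assumptions, and METRIC_CHANGES_TOPIC must be keyed by the same metric key as the local aggregation:

// Each instance publishes its locally aggregated values, keyed by metric name.
aggregatedMetrics.to(Serdes.String(), Serdes.Double(), METRIC_CHANGES_TOPIC);

// Every instance then materializes the union of all instances' keys from that topic.
final GlobalKTable<String, Double> allMetrics =
    builder.globalTable(METRIC_CHANGES_TOPIC, METRICS_STORE_NAME);

// The global store can be queried locally on any instance.
ReadOnlyKeyValueStore<String, Double> store =
    streams.store(METRICS_STORE_NAME, QueryableStoreTypes.keyValueStore());
Double metric = store.get("some-metric-key");

On newer Kafka Streams versions, aggregatedMetrics.toStream().to(...) replaces the deprecated KTable#to().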
I have an architecture where I would like to query a ksqlDB table derived from a Kafka stream A (created by ksqlDB). On startup, Service A will load all the data from this table into a hashmap, and afterward it will start consuming from Kafka Stream A and act on any events to update this hashmap. I want to avoid any race condition in which I would miss events that were propagated to Kafka Stream A in the time between when I queried the table and when I started consuming off Kafka Stream A. Is there a way I can retrieve the latest offset that my query to the table is populated by, so that I can use that offset to start consuming from Kafka Stream A?
Another thing to mention is that we have hundreds of instances of our app going up and down, so reading directly off the Kafka stream is not an option: reading an entire stream's worth of data every time our apps come up is not a scalable solution. Reading the event stream's data into a hashmap on the service is a hard requirement. This is why the ksqlDB table seems like a good option, since we can get the latest state of the data in the format needed and then just update it based on events from the stream. Kafka Stream A is essentially a CDC stream off of a MySQL table that has been enriched with other data.
You used "materialized view" but I'm going to pretend I
heard "table". I have often used materialized views
in a historical reporting context, but not with live updates.
I assume that yours will behave similarly to a "table".
I assume that all events, and DB rows, have timestamps.
Hopefully they are "mostly monotonic", so applying a
small safety window lets us efficiently process just
the relevant recent ones.
The crux of the matter is racing updates.
We need to prohibit races.
Each time an instance of a writer, such as your app,
comes up, assign it a new name.
Rolling a guid is often the most convenient way to do that,
or perhaps prepend it with a timestamp if sort order matters.
Ensure that each DB row mentions that "owning" name.
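For example, a minimal sketch of how that writer name might be generated (pure illustration):

// A timestamp prefix keeps the names sortable; the GUID keeps them unique.
String writerName = System.currentTimeMillis() + "-" + UUID.randomUUID();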
want to avoid any race condition in which I would miss any events that were propagated to Kafka Stream A in the time between I queried the materialized view, and when I started consuming off Kafka Stream A.
We will need a guaranteed monotonic column with an integer ID
or a timestamp. Let's call it ts.
1. Query m = max(ts).
2. Do a big query of records < m, slowly filling your hashmap.
3. Start consuming Stream A.
4. Do a small query of records >= m, updating the hashmap.
5. Continue to loop through subsequently arriving Stream A records.
Now you're caught up, and can maintain the hashmap in sync with DB.
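A minimal sketch of that sequence, assuming a JDBC Connection db to the table behind the view and a plain KafkaConsumer for Stream A; my_view, stream-a, the Event value type, consumerProps, loadInto() and applyEvent() are hypothetical placeholders:

Map<String, MyRow> cache = new HashMap<>();   // the in-memory hashmap; MyRow is a placeholder

// 1. Pin the cut-off point before the bulk load.
long m;
try (Statement st = db.createStatement();
     ResultSet rs = st.executeQuery("SELECT MAX(ts) FROM my_view")) {
    rs.next();
    m = rs.getLong(1);
}

// 2. Big query: everything strictly older than the cut-off.
try (PreparedStatement ps = db.prepareStatement("SELECT * FROM my_view WHERE ts < ?")) {
    ps.setLong(1, m);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            loadInto(cache, rs);
        }
    }
}

// 3. Start consuming Stream A before closing the gap.
KafkaConsumer<String, Event> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(Collections.singletonList("stream-a"));

// 4. Small query: rows at or after the cut-off that the bulk load skipped.
try (PreparedStatement ps = db.prepareStatement("SELECT * FROM my_view WHERE ts >= ?")) {
    ps.setLong(1, m);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            loadInto(cache, rs);
        }
    }
}

// 5. From here on, Stream A keeps the hashmap current (apply events idempotently).
while (true) {
    for (ConsumerRecord<String, Event> record : consumer.poll(Duration.ofMillis(500))) {
        applyEvent(cache, record.value());
    }
}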
Your business logic probably requires that you
treat DB rows mentioning the "self" guid
in a different way from rows that existed
prior to startup.
Think of it as de-dup, or ignoring replayed rows.
You may find offsetsForTimes() useful.
There's also listOffsets().
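For instance, a sketch reusing the consumer and the cut-off m from above, but with manual partition assignment instead of subscribe() so that seeking is possible:

// Ask the broker for the earliest offset whose record timestamp is >= m,
// then start Stream A consumption exactly there.
List<TopicPartition> partitions = consumer.partitionsFor("stream-a").stream()
    .map(p -> new TopicPartition(p.topic(), p.partition()))
    .collect(Collectors.toList());
consumer.assign(partitions);

Map<TopicPartition, Long> query = new HashMap<>();
partitions.forEach(tp -> query.put(tp, m));   // m = cut-off timestamp in epoch millis

Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
offsets.forEach((tp, oat) -> {
    if (oat != null) {
        consumer.seek(tp, oat.offset());                        // start exactly at the cut-off
    } else {
        consumer.seekToEnd(Collections.singletonList(tp));      // no records at or after m
    }
});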
I'm using Cassandra and Kafka for event-sourcing, and it works quite well. But I've just recently discovered a potentially major flaw in the design/set-up. A brief intro to how it is done:
The aggregate command handler is basically a Kafka consumer, which consumes messages of interest on a topic:
1.1 When it receives a command, it loads all events for the aggregate, and replays the aggregate event handler for each event to get the aggregate up to current state.
1.2 Based on the command and business logic, it then applies one or more events to the event store. This involves inserting the new event(s) into the event store table in Cassandra. The events are stamped with a version number for the aggregate - starting at version 0 for a new aggregate - which makes projections possible. In addition, it sends the event to another topic (for projection purposes).
1.3 A Kafka consumer listens on the topic to which these events are published. This consumer acts as a projector. When it receives an event of interest, it loads the current read model for the aggregate, checks that the version of the event it has received is the expected version, and then updates the read model.
This seems to work very well. The problem is when I want to have what EventStore calls category projections. Let's take an Order aggregate as an example. I can easily project one or more read models per Order. But if, for example, I want to have a projection which contains a customer's 30 last orders, then I would need a category projection.
I'm just scratching my head over how to accomplish this. I'm curious to know if any others are using Cassandra and Kafka for event sourcing. I've read in a couple of places that some people discourage it. Maybe this is the reason.
I know EventStore has support for this built in. Maybe using Kafka as the event store would be a better solution.
With this kind of architecture, you have to choose between:
Global event stream per type - simple
Partitioned event stream per type - scalable
Unless your system is fairly high throughput (say at least 10s or 100s of events per second for sustained periods to the stream type in question), the global stream is the simpler approach. Some systems (such as Event Store) give you the best of both worlds, by having very fine-grained streams (such as per aggregate instance) but with the ability to combine them into larger streams (per stream type/category/partition, per multiple stream types, etc.) in a performant and predictable way out of the box, while still being simple by only requiring you to keep track of a single global event position.
If you go partitioned with Kafka:
Your projection code will need to handle concurrent consumer groups accessing the same read models when processing events for different partitions that need to go into the same models. Depending on your target store for the projection, there are lots of ways to handle this (transactions, optimistic concurrency, atomic operations, etc.), but it would be a problem for some target stores.
Your projection code will need to keep track of the stream position of each partition, not just a single position. If your projection reads from multiple streams, it has to keep track of lots of positions.
Using a global stream removes both of those concerns - performance is usually good enough.
In either case, you'll likely also want to get the stream position into the long term event storage (i.e. Cassandra) - you could do this by having a dedicated process reading from the event stream (partitioned or global) and just updating the events in Cassandra with the global or partition position of each event. (I have a similar thing with MongoDB - I have a process reading the 'oplog' and copying oplog timestamps into events, since oplog timestamps are totally ordered).
Another option is to drop Cassandra from the initial command processing and use Kafka Streams instead:
Partitioned command stream is processed by joining with a partitioned KTable of aggregates
Command result and events are computed
Atomically, the KTable is updated with the changed aggregate, the events are written to the event stream, and the command response is written to the command response stream.
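One way to sketch that Kafka Streams variant, using a state store updated by a stateful transformer rather than a literal KTable join; the Command/Event/Aggregate types, serdes and the decide()/apply() helpers are hypothetical, command responses are omitted for brevity, and the atomicity relies on processing.guarantee=exactly_once:

StreamsBuilder builder = new StreamsBuilder();

// State store holding the current aggregate per key (store name and serdes are placeholders).
builder.addStateStore(Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("aggregate-store"),
    Serdes.String(), aggregateSerde));

KStream<String, Command> commands =
    builder.stream("commands", Consumed.with(Serdes.String(), commandSerde));

// For each command: load the aggregate, run the business logic, persist the new state,
// and emit the resulting events.
ValueTransformerWithKeySupplier<String, Command, Iterable<Event>> commandHandler =
    () -> new ValueTransformerWithKey<String, Command, Iterable<Event>>() {
        private KeyValueStore<String, Aggregate> store;

        @SuppressWarnings("unchecked")
        @Override
        public void init(ProcessorContext context) {
            store = (KeyValueStore<String, Aggregate>) context.getStateStore("aggregate-store");
        }

        @Override
        public Iterable<Event> transform(String aggregateId, Command command) {
            Aggregate current = store.get(aggregateId);        // null for a new aggregate
            List<Event> newEvents = decide(current, command);  // hypothetical business logic
            store.put(aggregateId, apply(current, newEvents)); // hypothetical event application
            return newEvents;
        }

        @Override
        public void close() { }
    };

KStream<String, Event> events =
    commands.flatTransformValues(commandHandler, "aggregate-store");
events.to("events", Produced.with(Serdes.String(), eventSerde));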
You would then have a downstream event processor that copies the events into Cassandra for easier querying etc. (and which can add the Kafka stream position to each event as it does it to give the category ordering). This can help with catch up subscriptions, etc. if you don't want to use Kafka for long term event storage. (To catch up, you'd just read as far as you can from Cassandra and then switch to streaming from Kafka from the position of the last Cassandra event). On the other hand, Kafka itself can store events for ever, so this isn't always necessary.
I hope this helps a bit with understanding the tradeoffs and problems you might encounter.
We have a streams topology that will work on multiple machines. We are storing time-windowed aggregation results into state stores.
Since state stores only hold local data, I think the overall aggregation should be done on another topic.
But it seems like I am missing something because none of the examples do the overall aggregations on another KStream or Processor.
Do we need to use the groupBy logic for storing the overall aggregation, or use a GlobalKTable, or just implement our own merger code somewhere?
What is the correct architecture for this?
In the code below, I have tried to group all the messages coming to the processor under a constant key, so that the overall aggregation is stored on just one machine, but I think that would lose the parallelism that Kafka supplies.
dashboardItemProcessor = streamsBuilder.stream("Topic25", Consumed.with(Serdes.String(), eventSerde))
    .filter((key, event) -> event != null && event.getClientCreationDate() != null);

dashboardItemProcessor.map((key, event) -> KeyValue.pair(key, event.getClientCreationDate().toInstant().toEpochMilli()))
    .groupBy((key, event) -> "count", Serialized.with(Serdes.String(), Serdes.Long()))
    .windowedBy(timeWindow)
    .count(Materialized.as(dashboardItemUtil.getStoreName(itemId, timeWindow)));
In the code below, I have tried to group all the messages coming to the processor under a constant key, so that the overall aggregation is stored on just one machine, but I think that would lose the parallelism that Kafka supplies.
This seems to be the right approach. And yes, you lose parallelism, but that is how a global aggregation works. In the end, one machine must compute it...
What you could improve, though, is to use a two-step approach: i.e., first aggregate by "random" keys in parallel, and then use a second step with only one key to "merge" the partial aggregates into a single one. This way, parts of the computation are parallelized and only the final step (on a hopefully reduced data volume) is non-parallel. With Kafka Streams, you need to implement this approach "manually".
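A sketch of that two-step pattern for a plain (non-windowed) count, using the newer Grouped API (Kafka 2.1+); the topic, streamsBuilder and eventSerde are taken from the question, the store names are placeholders:

// Step 1: count per original key; this runs in parallel across all instances.
KTable<String, Long> perKeyCounts = streamsBuilder
    .stream("Topic25", Consumed.with(Serdes.String(), eventSerde))
    .groupByKey(Grouped.with(Serdes.String(), eventSerde))
    .count(Materialized.as("per-key-counts"));

// Step 2: re-key every partial count to one constant key and merge the partials.
// Only this (hopefully much smaller) step runs on a single task.
KTable<String, Long> globalCount = perKeyCounts
    .groupBy((key, count) -> KeyValue.pair("all", count),
             Grouped.with(Serdes.String(), Serdes.Long()))
    .aggregate(
        () -> 0L,
        (key, added, total) -> total + added,       // adder
        (key, removed, total) -> total - removed,   // subtractor
        Materialized.with(Serdes.String(), Serdes.Long()));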
Under what circumstances does a Kafka Streams program need to serialize/deserialize? Suppose we have the following simple program:
KStream<K, V> stream = ...;
KStream<K, V> stream2 = stream.filter(predicateA);
KStream<K, V> stream3 = stream2.filter(predicateB);
stream3.to(topic);
Very specifically, between the two invocations of filter, do the keys and values get serialized/deserialized, or do individual records get passed around as plain Java objects?
Kafka Streams passes Java objects around whenever possible to avoid de/serialization overhead.
Data is only de/serialized when it is read from or written to a topic or a state store.
All operators that might need to de/serialize data let you specify a key and value Serde, which is a good indicator of which operators might de/serialize data and which don't.
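For example, a sketch assuming a StreamsBuilder builder plus a hypothetical Order type and orderSerde: the two filters pass objects in memory, while topic reads/writes and the repartition before an aggregation do de/serialize:

KStream<String, Order> orders = builder.stream("orders",
    Consumed.with(Serdes.String(), orderSerde));         // deserialize: topic read

KStream<String, Order> valid = orders
    .filter((k, v) -> v != null)                         // no serde: plain Java objects
    .filter((k, v) -> v.isValid());                      // no serde: plain Java objects

valid.to("valid-orders",
    Produced.with(Serdes.String(), orderSerde));         // serialize: topic write

// A key-changing operation followed by an aggregation forces a repartition topic,
// so records are serialized to it and deserialized from it again:
valid.selectKey((k, v) -> v.getCustomerId())
    .groupByKey(Grouped.with(Serdes.String(), orderSerde))
    .count();                                            // also writes to a state store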
While processing a stream of Avro messages through Kafka and Spark, I am saving the processed data as documents in an Elasticsearch index.
Here's the code (simplified):
directKafkaStream.foreachRDD(rdd -> {
    rdd.foreach(avroRecord -> {
        byte[] encodedAvroData = avroRecord._2;
        MyType t = deserialize(encodedAvroData);

        // Creating the Elasticsearch Transport client
        Settings settings = Settings.builder()
            .put("client.transport.ping_timeout", 5, TimeUnit.SECONDS).build();
        TransportClient client = new PreBuiltTransportClient(settings)
            .addTransportAddress(new TransportAddress(InetAddress.getByName("localhost"), 9300));

        IndexRequest indexRequest = new IndexRequest("index", "item", id)
            .source(jsonBuilder()
                .startObject()
                .field("name", name)
                .field("timestamp", new Timestamp(System.currentTimeMillis()))
                .endObject());

        UpdateRequest updateRequest = new UpdateRequest("index", "item", id)
            .doc(jsonBuilder()
                .startObject()
                .field("name", name)
                .field("timestamp", new Timestamp(System.currentTimeMillis()))
                .endObject())
            .upsert(indexRequest);

        client.update(updateRequest).get();
        client.close();
    });
});
Everything works as expected; the only problem is performance: saving to ES takes some time, and I suppose this is due to the fact that I open/close an ES Transport client for each RDD. The Spark documentation suggests that this approach is quite correct; as far as I understand, the only possible optimisation is using rdd.foreachPartition, but I only have one partition, so I am not sure that this would be beneficial.
Any other solution to achieve better performance?
That is because you create a new connection every time you process a record of the RDD.
So I think using foreachPartition will give better performance even though you only have one partition, because it lets you create the ES connection instance outside the loop and reuse it.
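A sketch of that change, keeping the request-building from the question unchanged (buildUpdateRequest() is a hypothetical helper wrapping exactly that code); only the client lifecycle moves out of the per-record loop:

directKafkaStream.foreachRDD(rdd -> {
    rdd.foreachPartition(partition -> {
        // One Transport client per partition (i.e. per task), instead of one per record.
        Settings settings = Settings.builder()
            .put("client.transport.ping_timeout", 5, TimeUnit.SECONDS).build();
        TransportClient client = new PreBuiltTransportClient(settings)
            .addTransportAddress(new TransportAddress(InetAddress.getByName("localhost"), 9300));
        try {
            while (partition.hasNext()) {
                byte[] encodedAvroData = partition.next()._2;
                MyType t = deserialize(encodedAvroData);
                UpdateRequest updateRequest = buildUpdateRequest(t);  // hypothetical helper with the question's request-building code
                client.update(updateRequest).get();
            }
        } finally {
            client.close();   // closed once per partition, after all records are indexed
        }
    });
});

Batching the writes into a single bulk request per partition would typically help further, but the biggest win is simply reusing the client.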
I would stream the processed messages back onto a separate Kafka topic, and then use Kafka Connect to land them to Elasticsearch. This decouples your Spark-specific processing from getting the data into Elasticsearch.
Example of it in action: https://www.confluent.io/blog/the-simplest-useful-kafka-connect-data-pipeline-in-the-world-or-thereabouts-part-2/