Is it possible to detect and drop duplicate data using ksql

Is it possible to detect and drop duplicate data using ksql - apache-kafka

I have a simple question whether can we detect and drop duplicates in streaming data on kafka topic using KSQL.

By default, tables are de-duped on keys. A new record for the same key will overwrite old events. If you need to "detect" and "process" the old data, as new events come in, then KSQL cannot do this.
If you need distinct values rather than by-key, you can create a table against some stream of events and filtering on HAVING COUNT(field) = 1 over a time window, which is the best you can do there. Ref - https://kafka-tutorials.confluent.io/finding-distinct-events/ksql.html
If you need indefinite time windows to ensure you only process a certain field once, then you'll want to use an external database, and optionally an internal cache, to perform lookups against. This would need to be done with a regular consumer, or Kafka Streams.

Related

How do you get the latest offset from a remote query to a Table in ksqlDB?

I have an architecture where I would like to query a ksqlDB Table from a Kafka stream A (created by ksqlDB). On startup, Service A will load in all the data from this table into a hashmap, and then afterward it will start consuming from Kafka Stream A and act off any events to update this hashmap. I want to avoid any race condition in which I would miss any events that were propagated to Kafka Stream A in the time between I queried the table, and when I started consuming off Kafka Stream A. Is there a way that I can retrieve the latest offset that my query to the table is populated by so that I can use that offset to start consuming from Kafka Stream A?
Another thing to mention is that we have hundreds of instances of our app going up and down so reading directly off the Kafka stream is not an option. Reading an entire stream worth of data every time our apps come up is not a scalable solution. Reading in the event streams data into a hashmap on the service is a hard requirement. This is why the ksqlDB table seems like a good option since we can get the latest state of data in the format needed and then just update based off of events from the stream. Kafka Stream A is essentially a CDC stream off of a MySQL table that has been enriched with other data.

You used "materialized view" but I'm going to pretend I
heard "table". I have often used materialized views
in a historical reporting context, but not with live updates.
I assume that yours will behave similar to a "table".
I assume that all events, and DB rows, have timestamps.
Hopefully they are "mostly monotonic", so applying a
small safety window lets us efficiently process just
the relevant recent ones.
The crux of the matter is racing updates.
We need to prohibit races.
Each time an instance of a writer, such as your app,
comes up, assign it a new name.
Rolling a guid is often the most convenient way to do that,
or perhaps prepend it with a timestamp if sort order matters.
Ensure that each DB row mentions that "owning" name.
want to avoid any race condition in which I would miss any events that were propagated to Kafka Stream A in the time between I queried the materialized view, and when I started consuming off Kafka Stream A.
We will need a guaranteed monotonic column with an integer ID
or a timestamp. Let's call it ts.
Query m = max(ts).
Do a big query of records < m, slowly filling your hashmap.
Start consuming Stream A.
Do a small query of records >= m, updating the hashmap.
Continue to loop through subsequently arriving Stream A records.
Now you're caught up, and can maintain the hashmap in sync with DB.
Your business logic probably requires that you
treat DB rows mentioning the "self" guid
in a different way from rows that existed
prior to startup.
Think of it as de-dup, or ignoring replayed rows.
You may find offsetsForTimes() useful.
There's also listOffsets().

Initial load of Kafka stream data with windowed join

I am using a Windowed Join between two streams, let's say a 7 day window.
On initial load, all records in the DB (via kafka connect source connector) are being loaded to the streams. It seems then that ALL records end up in the window state store for those first 7 days as the producer/ingested timestamps are all in current time vs. a field (like create_time) that might be in the message value.
Is there a recommended way to balance the initial load against the Windows of the join?

Well, the question is what records do you want to join to each other? And what timestamp the source connector sets as record timestamp (might also depend on the topic configuration, [log.]message.timestamp.type.
The join is executed based on whatever the TimestampExtractor returns. By default, that is the record timestamp. If you want to base the join on some other timestamp, a custom timestampe extractor is the way to go.
If you want to get processing time semantics, you may want to use the WallclockTimestampExtractor though.

Category projections using kafka and cassandra for event-sourcing

I'm using Cassandra and Kafka for event-sourcing, and it works quite well. But I've just recently discovered a potentially major flaw in the design/set-up. A brief intro to how it is done:
The aggregate command handler is basically a kafka consumer, which consumes messages of interest on a topic:
1.1 When it receives a command, it loads all events for the aggregate, and replays the aggregate event handler for each event to get the aggregate up to current state.
1.2 Based on the command and businiss logic it then applies one or more events to the event store. This involves inserting the new event(s) to the event store table in cassandra. The events are stamped with a version number for the aggregate - starting at version 0 for a new aggregate, making projections possible. In addition it sends the event to another topic (for projection purposes).
1.3 A kafka consumer will listen on the topic upon these events are published. This consumer will act as a projector. When it receives an event of interest, it loads the current read model for the aggregate. It checks that the version of the event it has received is the expected version, and then updates the read model.
This seems to work very well. The problem is when I want to have what EventStore calls category projections. Let's take Order aggregate as an example. I can easily project one or more read models pr Order. But if I want to for example have a projection which contains a customers 30 last orders, then I would need a category projection.
I'm just scratching my head how to accomplish this. I'm curious to know if any other are using Cassandra and Kafka for event sourcing. I've read a couple of places that some people discourage it. Maybe this is the reason.
I know EventStore has support for this built in. Maybe using Kafka as event store would be a better solution.

With this kind of architecture, you have to choose between:
Global event stream per type - simple
Partitioned event stream per type - scalable
Unless your system is fairly high throughput (say at least 10s or 100s of events per second for sustained periods to the stream type in question), the global stream is the simpler approach. Some systems (such as Event Store) give you the best of both worlds, by having very fine-grained streams (such as per aggregate instance) but with the ability to combine them into larger streams (per stream type/category/partition, per multiple stream types, etc.) in a performant and predictable way out of the box, while still being simple by only requiring you to keep track of a single global event position.
If you go partitioned with Kafka:
Your projection code will need to handle concurrent consumer groups accessing the same read models when processing events for different partitions that need to go into the same models. Depending on your target store for the projection, there are lots of ways to handle this (transactions, optimistic concurrency, atomic operations, etc.) but it would be a problem for some target stores
Your projection code will need to keep track of the stream position of each partition, not just a single position. If your projection reads from multiple streams, it has to keep track of lots of positions.
Using a global stream removes both of those concerns - performance is usually likely to be good enough.
In either case, you'll likely also want to get the stream position into the long term event storage (i.e. Cassandra) - you could do this by having a dedicated process reading from the event stream (partitioned or global) and just updating the events in Cassandra with the global or partition position of each event. (I have a similar thing with MongoDB - I have a process reading the 'oplog' and copying oplog timestamps into events, since oplog timestamps are totally ordered).
Another option is to drop Cassandra from the initial command processing and use Kafka Streams instead:
Partitioned command stream is processed by joining with a partitioned KTable of aggregates
Command result and events are computed
Atomically, KTable is updated with changed aggregate, events are written to event stream and command response is written to command response stream.
You would then have a downstream event processor that copies the events into Cassandra for easier querying etc. (and which can add the Kafka stream position to each event as it does it to give the category ordering). This can help with catch up subscriptions, etc. if you don't want to use Kafka for long term event storage. (To catch up, you'd just read as far as you can from Cassandra and then switch to streaming from Kafka from the position of the last Cassandra event). On the other hand, Kafka itself can store events for ever, so this isn't always necessary.
I hope this helps a bit with understanding the tradeoffs and problems you might encounter.

Filter read access events in Debezium

We are using Debezium + PostgreSQL.
Notice that we get 4 types of events for create, read, update and delete - c, r, u and d.
The read type of event is unused for our application. Actually, I could not think of an use case for the 'r' events unless we are doing auditing or mirroring the activities of a transaction.
We are facing difficulties scaling & I suspect it is because of network getting hogged by read type of events.
How do we filter out those events in postgreSQL itself?
I got a clue from one of the contributors to use snapshot.mode. I guess something that has to be done when Debezium creates a snapshot. I am unable to figure out how to do that.

It is likely that your database has existed for some time and contains data and changes that have been purged from the logical decoding logs. If you then start using the Debezium PostgreSQL connector to start capturing changes into Kafka, the question becomes what a consumer of the events in Kafka should be able to see.
One scenario is that a consumer should be able to see events for all rows in the database, even those that existed prior to the start of CDC. For example, this allows a consumer to completely reproduce/replicate all of the existing data and keep that data in sync over time. To accomplish this, the Debezium PostgreSQL connector starts up can begin by creating a snapshot of the database contents before it starts capturing the changes. This is done atomically, so that even if the snapshot process takes a while to run, the connector will still see all of the events that occurred since the snapshot process was started. These events are represented as "read" events, since in effect the connector is simply reading the existing rows. However, they are identical to "insert" events, so any application could treat reads and inserts in the same way.
On the other hand, if consumers of the events in Kafka do not need to see events for all existing rows, then the connector can be configured to avoid the snapshot and to instead begin by capturing the changes. This may be useful in some scenarios where the entire database state need not be found in Kafka, but instead the goal is to simply capture the changes that are occurring.
The Debezium PostgreSQL connector will work either way, so you should use the approach that works for how you're consuming the events.

Oracle change-data-capture with Kafka best practices

I'm working on a project where we need to stream real-time updates from Oracle to a bunch of systems (Cassandra, Hadoop, real-time processing, etc). We are planing to use Golden Gate to capture the changes from Oracle, write them to Kafka, and then let different target systems read the event from Kafka.
There are quite a few design decisions that need to be made:
What data to write into Kafka on updates?
GoldenGate emits updates in a form of record ID, and updated field. These changes can be writing into Kafka in one of 3 ways:
Full rows: For every field change, emit the full row. This gives a full representation of the 'object', but probably requires making a query to get the full row.
Only updated fields: The easiest, but it's kind of a weird to work with as you never have a full representation of an object easily accessible. How would one write this to Hadoop?
Events: Probably the cleanest format ( and the best fit for Kafka), but it requires a lot of work to translate db field updates into events.
Where to perform data transformation and cleanup?
The schema in the Oracle DB is generated by a 3rd party CRM tool, and is hence not very easy to consume - there are weird field names, translation tables, etc. This data can be cleaned in one of (a) source system, (b) Kafka using stream processing, (c) each target system.
How to ensure in-order processing for parallel consumers?
Kafka allows each consumer to read a different partition, where each partition is guaranteed to be in order. Topics and partitions need to be picked in a way that guarantees that messages in each partition are completely independent. If we pick a topic per table, and hash record to partitions based on record_id, this should work most of the time. However what happens when a new child object is added? We need to make sure it gets processed before the parent uses it's foreign_id

One solution I have implemented is to publish only the record id into Kafka and in the Consumer, use a lookup to the origin DB to get the complete record. I would think that in a scenario like the one described in the question, you may want to use the CRM tool API to lookup that particular record and not reverse engineer the record lookup in your code.
How did you end up implementing the solution ?

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse