Druid: how to drop duplicates in the Kafka indexing service

I am using Druid with the Kafka indexing service. I am trying to understand how it handles duplicate messages.
Example
Consider that I have the following message in a Kafka topic [1 partition only]:
[Offset=100]
{
  "ID": 4,
  "POINTS": 1005,
  "CREATED_AT": 1616258354000000,
  "UPDATED_AT": 1616304119000000
}
Now consider that, after 24 hours, the same message is somehow pushed to the topic again.
[Offset=101]
{
  "ID": 4,
  "POINTS": 1005,
  "CREATED_AT": 1616258354000000,
  "UPDATED_AT": 1616304119000000
}
Note: the payload has not changed.
Actual: In Druid, I now see the same message again.
Expected: Since the payload has not changed, the message should be ignored.
My timestamp column is CREATED_AT.

Can you be sure that there will never be two unique events with the same timestamp, other than duplicates? If so, you can try using rollup to eliminate the duplicates.
You set that in the granularitySpec: the queryGranularity truncates all timestamps to that granularity, and if ALL dimensions are identical, the rows get combined using the aggregation functions you set in the spec.
For the aggregation functions, you will want to use something like MAX or MIN, because SUM would add the values up.
This will fail if you have multiple Kafka partitions, but that could be fixed with reindexing.
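For illustration, the relevant part of a supervisor spec might look roughly like the sketch below. This is only a hedged example, assuming a recent Druid spec layout, CREATED_AT in microseconds, ID as the only dimension, and POINTS/UPDATED_AT kept as longMax metrics; adapt the names and granularities to your actual spec.

"dataSchema": {
  "timestampSpec": { "column": "CREATED_AT", "format": "micro" },
  "dimensionsSpec": { "dimensions": ["ID"] },
  "metricsSpec": [
    { "type": "longMax", "name": "POINTS", "fieldName": "POINTS" },
    { "type": "longMax", "name": "UPDATED_AT", "fieldName": "UPDATED_AT" }
  ],
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "SECOND",
    "rollup": true
  }
}

With a spec along these lines, the two example messages land in the same (truncated) timestamp bucket with identical dimensions, so they roll up into a single row instead of appearing twice.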

Related

How do you get the latest offset from a remote query to a Table in ksqlDB?

I have an architecture where I would like to query a ksqlDB table built from a Kafka stream A (the table is created by ksqlDB). On startup, Service A will load all the data from this table into a hashmap, and afterwards it will start consuming from Kafka Stream A and act on any events to update this hashmap. I want to avoid any race condition in which I would miss events that were propagated to Kafka Stream A in the time between when I queried the table and when I started consuming off Kafka Stream A. Is there a way I can retrieve the latest offset that my query to the table is populated by, so that I can use that offset to start consuming from Kafka Stream A?
Another thing to mention is that we have hundreds of instances of our app going up and down, so reading directly off the Kafka stream is not an option: reading an entire stream's worth of data every time our apps come up is not a scalable solution. Reading the event stream's data into a hashmap on the service is a hard requirement. This is why the ksqlDB table seems like a good option, since we can get the latest state of the data in the format we need and then just update it based on events from the stream. Kafka Stream A is essentially a CDC stream off of a MySQL table that has been enriched with other data.
You used "materialized view" but I'm going to pretend I
heard "table". I have often used materialized views
in a historical reporting context, but not with live updates.
I assume that yours will behave similar to a "table".
I assume that all events, and DB rows, have timestamps.
Hopefully they are "mostly monotonic", so applying a
small safety window lets us efficiently process just
the relevant recent ones.
The crux of the matter is racing updates.
We need to prohibit races.
Each time an instance of a writer, such as your app,
comes up, assign it a new name.
Rolling a guid is often the most convenient way to do that,
or perhaps prepend it with a timestamp if sort order matters.
Ensure that each DB row mentions that "owning" name.
"I want to avoid any race condition in which I would miss any events that were propagated to Kafka Stream A in the time between I queried the materialized view, and when I started consuming off Kafka Stream A."
We will need a guaranteed monotonic column, either an integer ID or a timestamp; let's call it ts. Then (see the sketch after this list):
1. Query m = max(ts).
2. Do a big query of records < m, slowly filling your hashmap.
3. Start consuming Stream A.
4. Do a small query of records >= m, updating the hashmap.
5. Continue to loop through subsequently arriving Stream A records.
Now you're caught up and can keep the hashmap in sync with the DB.
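A rough sketch of that catch-up sequence in Java is below. The Dao interface, the "stream-a" topic name, and the String key/value types are hypothetical placeholders for whatever DB access layer and payloads you actually have; only the ordering of the steps matters here.

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CatchUp {

    // Placeholder for your own DB access layer.
    interface Dao {
        long maxTs();                                          // max(ts) visible in the table right now
        List<Map.Entry<String, String>> rowsBefore(long ts);   // rows with ts <  m
        List<Map.Entry<String, String>> rowsFrom(long ts);     // rows with ts >= m
    }

    static void catchUp(Dao dao, KafkaConsumer<String, String> consumer) {
        Map<String, String> cache = new HashMap<>();

        long m = dao.maxTs();                                       // 1. query m = max(ts)
        for (Map.Entry<String, String> r : dao.rowsBefore(m)) {     // 2. big query of records < m
            cache.put(r.getKey(), r.getValue());
        }
        consumer.subscribe(Collections.singletonList("stream-a"));  // 3. start consuming Stream A
        for (Map.Entry<String, String> r : dao.rowsFrom(m)) {       // 4. small query of records >= m
            cache.put(r.getKey(), r.getValue());
        }
        while (true) {                                              // 5. keep applying Stream A updates
            for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                cache.put(rec.key(), rec.value());
            }
        }
    }
}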
Your business logic probably requires that you treat DB rows mentioning the "self" GUID differently from rows that existed prior to startup. Think of it as de-dup, or ignoring replayed rows.
You may find offsetsForTimes() useful. There's also listOffsets().
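For example, offsetsForTimes() maps a timestamp to the earliest offset whose record timestamp is equal or greater, per partition, so you can seed the consumer at roughly the point your snapshot query covered. A hedged fragment, assuming consumer is an already-configured KafkaConsumer, "stream-a" is the topic, and snapshotTs is the millisecond timestamp of your table snapshot (imports from org.apache.kafka.clients.consumer and org.apache.kafka.common omitted):

List<TopicPartition> partitions = new ArrayList<>();
for (PartitionInfo p : consumer.partitionsFor("stream-a")) {
    partitions.add(new TopicPartition(p.topic(), p.partition()));
}
consumer.assign(partitions);

Map<TopicPartition, Long> query = new HashMap<>();
for (TopicPartition tp : partitions) {
    query.put(tp, snapshotTs);                         // "first offset at or after this time"
}
Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : offsets.entrySet()) {
    if (e.getValue() != null) {                        // null if no record at or after snapshotTs
        consumer.seek(e.getKey(), e.getValue().offset());
    }
}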

Why does ksqlDB consume all messages from Kafka even when I add LIMIT 1 to the query?

I run this query:
select * from USER_EVENTS emit changes limit 1;
USER_EVENTS is a stream.
Before this I set auto.offset.reset to earliest.
This query runs slowly and I don't know why.
I then ran SHOW QUERIES to check the consumer id of the above query and searched for it in Kafka Connect.
I found out that the query fetches all messages in the topic, although I only need one row.
Is that true, and why does it need to fetch all of them? I thought fetching one would be enough because I added LIMIT 1 to the query.
The topic behind USER_EVENTS has ~1M messages.
I use ksqlDB Server 6.1.0 and the same version of the ksqlDB CLI.
This is what ksqlDB is supposed to do: consume the entire stream and materialize a table from it. Your query even says
emit changes
which means it will go through your messages one by one and update the result in near real time. LIMIT 1 only means that it will show a single message (and keep updating it) instead of showing a growing table, but it consumes the stream either way.
The alternative would be
emit final
which would only show the final result, but would still go through the entire stream.
At least to my knowledge, this is not possible with ksqlDB.
If you just need to look at one message interactively, I recommend using a CLI tool like kcat or https://github.com/birdayz/kaf, which both have a config option to consume only a single message.
If you need it programmatically, I would probably write a consumer by hand and simply call poll() once instead of the standard poll loop, for example:
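A minimal hand-rolled sketch along those lines is below; the broker address, group id, and the assumption that the backing topic is also named USER_EVENTS are placeholders you would need to adjust. Note that a single poll() can come back empty if the consumer group hasn't finished joining yet, so a generous timeout helps.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PeekOneMessage {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "peek-one-message");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("USER_EVENTS"));       // placeholder topic name
            // One poll instead of the usual poll loop; print the first record and stop.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
            for (ConsumerRecord<String, String> r : records) {
                System.out.println(r.value());
                break;
            }
        }
    }
}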
If you want "hacky" quickfix, you could also try to set
SET 'auto.offset.reset'='earliest';
for your query in ksqldb. This will still go through the entire stream, but start with the newest available message. So it would ignore everything that is in the topic.

Is it possible to detect and drop duplicate data using ksql?

I have a simple question: can we detect and drop duplicates in streaming data on a Kafka topic using KSQL?
By default, tables are de-duplicated on keys: a new record for the same key will overwrite the old event. If you need to "detect" and "process" the old data as new events come in, then KSQL cannot do this.
If you need distinct values rather than distinct keys, you can create a table against some stream of events and filter with HAVING COUNT(field) = 1 over a time window, which is the best you can do there. Ref - https://kafka-tutorials.confluent.io/finding-distinct-events/ksql.html
If you need indefinite time windows to ensure you only process a certain field once, then you'll want to use an external database, and optionally an internal cache, to perform lookups against. This would need to be done with a regular consumer or Kafka Streams, for example as sketched below.
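A hedged fragment of the consumer-plus-lookup variant: the "events" topic name, extractId(), process(), and seenStore are illustrative placeholders, with seenStore standing in for whatever external database (plus optional local cache) you use; none of this is a ksqlDB feature.

// De-duplicate on a field with an unbounded horizon by checking an external store.
Set<String> localCache = new HashSet<>();                 // optional in-memory shortcut

consumer.subscribe(Collections.singletonList("events"));  // placeholder topic
while (true) {
    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
        String id = extractId(rec.value());               // the field that must be unique
        if (localCache.contains(id) || seenStore.seenBefore(id)) {
            continue;                                     // duplicate -> drop it
        }
        seenStore.markSeen(id);                           // remember it in the external store
        localCache.add(id);
        process(rec);                                     // first occurrence -> handle it
    }
}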

Kafka Streams topology with windowing doesn't trigger state changes

I am building the following Kafka Streams topology (pseudo code):
gK = builder.stream().groupByKey();
g1 = gK.windowedBy(TimeWindows.of("PT1H")).reduce().mapValues().toStream().mapValues().selectKey();
g2 = gK.reduce().mapValues();
g1.leftJoin(g2).to();
If you notice, this is a rhomb-like (diamond) topology that starts at a single input topic and ends in a single output topic, with messages flowing through two parallel flows that eventually get joined together at the end. One flow applies (tumbling?) windowing, the other does not. Both parts of the flow work on the same key (apart from the WindowedKey intermediately introduced by the windowing).
The timestamp for my messages is event-time; that is, timestamps are picked from the message body by my custom TimestampExtractor implementation. The actual timestamps in my messages are several years in the past.
That all works well at first sight in my unit tests with a couple of input/output messages and in the runtime environment (with real Kafka).
The problem seems to come when the number of messages starts being significant (e.g. 40K).
My failing scenario is the following:
1. ~40K records with the same key get uploaded into the input topic first.
2. ~40K updates come out of the output topic, as expected.
3. Another ~40K records with the same key (but a different key than in step 1) get uploaded into the input topic.
4. Only ~100 updates come out of the output topic, instead of the expected new ~40K updates. There is nothing special to see in those ~100 updates; their contents seem to be right, but only for certain time windows. For other time windows there are no updates, even though the flow logic and input data should definitely generate 40K records. In fact, when I swap the datasets in steps 1) and 3), I get exactly the same situation, with ~40K updates coming from the second dataset and the same ~100 from the first.
I can easily reproduce this issue in the unit tests using TopologyTestDriver locally (but only on bigger numbers of input records).
In my tests, I've tried disabling caching with StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG. Unfortunately, that didn't make any difference.
UPDATE
I tried both reduce() calls and aggregate() calls instead. The issue persists in both cases.
Something else I noticed is that, with StreamsConfig.TOPOLOGY_OPTIMIZATION set to StreamsConfig.OPTIMIZE and without it, the mapValues() handler gets called in the debugger before the preceding reduce() (or aggregate()) handlers, at least the first time. I didn't expect that.
I tried both join() and leftJoin(), unfortunately with the same result.
In the debugger, the second portion of the data doesn't trigger the reduce() handler in the "left" flow at all, but it does trigger the reduce() handler in the "right" flow.
With my configuration, if the number of records in both datasets is 100 each, the problem doesn't manifest itself and I get the 200 output messages I expect. When I raise the number to 200 in each dataset, I get fewer than the expected 400 messages out.
So it seems at the moment that something like "old" windows get dropped and the new records for those old windows get ignored by the stream.
There is a window retention setting that can be set, but with the default value that I use I was expecting windows to retain their state and stay active for at least 12 hours (which significantly exceeds the duration of my unit test run).
I tried to amend the left reducer with the following window store config:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        Duration.ofHours(1),
        false))
still no difference in results.
The same issue persists even with only the single "left" flow, without the "right" flow and without the join(). It seems that the problem is in the window retention settings of my setup. The timestamps (event-time) of my input records span 2 years, and the second dataset starts from the beginning of those 2 years again. This place in Kafka Streams makes sure that the second dataset's records get ignored:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/InMemoryWindowStore.java#L125
The Kafka Streams version is 2.4.0. I am also using Confluent dependencies version 5.4.0.
My questions are:
What could be the reason for such behaviour?
Did I miss anything in my stream topology?
Is such a topology expected to work at all?
After some debugging I found the reason for my problem.
My input datasets contain records with timestamps that span 2 years. When I load the first dataset, the "observed" stream time gets set to the maximum timestamp from that input dataset.
The upload of the second dataset, which starts with records whose timestamps are 2 years before the new observed time, causes the stream internals to drop those messages. This can be seen if you set the Kafka logging to TRACE level.
So, to fix my problem I had to configure the retention and grace period for my windows:
instead of
.windowedBy(TimeWindows.of(windowSize))
I had to specify
.windowedBy(TimeWindows.of(windowSize).grace(Duration.ofDays(5 * 365)))
Also, I had to explicitly configure the reducer storage settings as:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        windowSize,
        false))
That's it, the output is as expected.
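For reference, a hedged sketch of how the grace period and the store configuration fit together on the left-side reduce. The String key/value types and the reducer body are placeholders, gK is the grouped stream from the pseudo code above, and the store retention is bumped to window size plus grace so the in-memory store keeps every window the grace period still accepts:

Duration windowSize = Duration.ofHours(1);
Duration grace = Duration.ofDays(5 * 365);         // accept records arriving up to 5 years late

KTable<Windowed<String>, String> left = gK
    .windowedBy(TimeWindows.of(windowSize).grace(grace))
    .reduce(
        (current, next) -> next,                   // placeholder reducer
        Materialized.as(
            Stores.inMemoryWindowStore(
                "rollup-left-reduce",
                grace.plus(windowSize),            // store retention >= window size + grace
                windowSize,
                false)));                          // retainDuplicates = false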

Initial load of Kafka stream data with windowed join

I am using a windowed join between two streams, let's say with a 7-day window.
On the initial load, all records in the DB (via a Kafka Connect source connector) are loaded into the streams. It then seems that ALL records end up in the window state store for those first 7 days, because the producer/ingestion timestamps are all in current time, as opposed to a field (like create_time) that might be in the message value.
Is there a recommended way to balance the initial load against the windows of the join?
Well, the question is which records you want to join with each other, and what timestamp the source connector sets as the record timestamp (that might also depend on the topic configuration, [log.]message.timestamp.type).
The join is executed based on whatever the TimestampExtractor returns. By default, that is the record timestamp. If you want to base the join on some other timestamp, a custom timestamp extractor is the way to go.
If you want processing-time semantics instead, you may want to use the WallclockTimestampExtractor.
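For instance, a custom extractor that reads the event time from the message value might look like the sketch below. The Event value type, its getCreateTime() accessor (returning epoch milliseconds), the topic name, and eventSerde are assumptions about your setup rather than an existing API; the extractor is then attached per source via Consumed.

// Pull the event time from a (hypothetical) create_time field of the value.
public class CreateTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(final ConsumerRecord<Object, Object> record, final long partitionTime) {
        final Object value = record.value();
        if (value instanceof Event) {                  // Event is your own value type
            return ((Event) value).getCreateTime();    // epoch millis taken from the payload
        }
        return record.timestamp();                     // fall back to the record timestamp
    }
}

// Wire it into the sources that feed the join:
KStream<String, Event> left = builder.stream("left-topic",
        Consumed.with(Serdes.String(), eventSerde)
                .withTimestampExtractor(new CreateTimeExtractor()));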