Why does ksql consume all messages from Kafka even when I add LIMIT 1 to the query? - apache-kafka

I run this query
select * from USER_EVENTS emit changes limit 1;
USER_EVENTS is a stream.
Before this, I set auto.offset.reset to earliest.
The query runs slowly, and I don't know why.
I then ran SHOW QUERIES to find the consumer ID of the query above and looked it up in Kafka Connect.
I found that the query needs to fetch every message in the topic, although I only need one row.
Is that true, and why does it need to fetch them all? I think fetching one should be enough, because I added LIMIT 1 to the query.
The topic behind USER_EVENTS has ~1 million messages.
I use ksqlServer 6.1.0, and the same version of ksqlCli.

This is what ksqlDB is supposed to do: consume the entire stream and materialize a table from it. Your query even says
emit changes
which means it will go through your messages one by one and update the table in near real time. LIMIT 1 only means that it will show a single message (and update that) instead of showing a growing table, but it consumes the stream either way.
The alternative would be
emit final
which would only show the final result, but still go through the entire stream.
Reading just a single message without consuming the stream is, at least to my knowledge, not possible with ksqlDB.
If you just need to look at one message interactively, I recommend using a CLI tool like kcat or https://github.com/birdayz/kaf, both of which have an option to consume only a single message.
If you need it programmatically, I would probably try to write a consumer by hand and simply call poll() once instead of running the standard poll loop.
If you want a "hacky" quick fix, you could also try to set
SET 'auto.offset.reset'='latest';
for your query in ksqlDB. The query will still run as a streaming query, but it starts from the newest available message, so it ignores everything that is already in the topic.

Related

How do you get the latest offset from a remote query to a Table in ksqlDB?

I have an architecture where I would like to query a ksqlDB Table from a Kafka stream A (created by ksqlDB). On startup, Service A will load in all the data from this table into a hashmap, and then afterward it will start consuming from Kafka Stream A and act off any events to update this hashmap. I want to avoid any race condition in which I would miss any events that were propagated to Kafka Stream A in the time between I queried the table, and when I started consuming off Kafka Stream A. Is there a way that I can retrieve the latest offset that my query to the table is populated by so that I can use that offset to start consuming from Kafka Stream A?
Another thing to mention is that we have hundreds of instances of our app going up and down so reading directly off the Kafka stream is not an option. Reading an entire stream worth of data every time our apps come up is not a scalable solution. Reading in the event streams data into a hashmap on the service is a hard requirement. This is why the ksqlDB table seems like a good option since we can get the latest state of data in the format needed and then just update based off of events from the stream. Kafka Stream A is essentially a CDC stream off of a MySQL table that has been enriched with other data.
You used "materialized view" but I'm going to pretend I
heard "table". I have often used materialized views
in a historical reporting context, but not with live updates.
I assume that yours will behave similar to a "table".
I assume that all events, and DB rows, have timestamps.
Hopefully they are "mostly monotonic", so applying a
small safety window lets us efficiently process just
the relevant recent ones.
The crux of the matter is racing updates.
We need to prohibit races.
Each time an instance of a writer, such as your app,
comes up, assign it a new name.
Rolling a guid is often the most convenient way to do that,
or perhaps prepend it with a timestamp if sort order matters.
Ensure that each DB row mentions that "owning" name.
You wrote that you "want to avoid any race condition in which I would miss any events that were propagated to Kafka Stream A in the time between I queried the materialized view, and when I started consuming off Kafka Stream A."
We will need a guaranteed monotonic column with an integer ID
or a timestamp. Let's call it ts.
1) Query m = max(ts).
2) Do a big query of records < m, slowly filling your hashmap.
3) Start consuming Stream A.
4) Do a small query of records >= m, updating the hashmap.
5) Continue to loop through subsequently arriving Stream A records.
Now you're caught up, and can maintain the hashmap in sync with DB.
Your business logic probably requires that you
treat DB rows mentioning the "self" guid
in a different way from rows that existed
prior to startup.
Think of it as de-dup, or ignoring replayed rows.
You may find offsetsForTimes() useful.
There's also listOffsets().
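The load-then-catch-up steps above can be sketched in plain Java. This is only a minimal in-memory illustration, not ksqlDB or Kafka client code: the Row class, the table list and the streamRecords list are hypothetical stand-ins for the real table queries and for the consumer on Stream A, and it assumes ts is monotonic as described.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SnapshotCatchUp {
    /** Hypothetical row shape: key, monotonic ts, payload. */
    static final class Row {
        final String key;
        final long ts;
        final String value;

        Row(String key, long ts, String value) {
            this.key = key;
            this.ts = ts;
            this.value = value;
        }
    }

    /** Applies a row unless the map already holds a newer version of that key. */
    static void apply(Map<String, Row> state, Row row) {
        Row current = state.get(row.key);
        if (current == null || row.ts >= current.ts) {
            state.put(row.key, row);
        }
    }

    /** table stands in for the DB/table queries, streamRecords for Stream A. */
    static Map<String, Row> load(List<Row> table, List<Row> streamRecords) {
        // m = max(ts) at the moment the load starts
        long m = Long.MIN_VALUE;
        for (Row r : table) {
            m = Math.max(m, r.ts);
        }
        Map<String, Row> state = new HashMap<>();
        // Big query: everything strictly older than m
        for (Row r : table) {
            if (r.ts < m) {
                apply(state, r);
            }
        }
        // (A real service would start consuming Stream A here.)
        // Small query: rows at or after m
        for (Row r : table) {
            if (r.ts >= m) {
                apply(state, r);
            }
        }
        // Stream records, including replays of rows already loaded
        for (Row r : streamRecords) {
            apply(state, r);
        }
        return state;
    }
}
```

Because apply() compares ts before overwriting, a stream record that merely replays a row already loaded from the table leaves the map unchanged, which is the de-dup behaviour mentioned above.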

Is it possible to detect and drop duplicate data using ksql

I have a simple question: can we detect and drop duplicates in streaming data on a Kafka topic using KSQL?
By default, tables are de-duped on keys. A new record for the same key will overwrite old events. If you need to "detect" and "process" the old data, as new events come in, then KSQL cannot do this.
If you need distinct values rather than by-key, you can create a table against some stream of events and filter on HAVING COUNT(field) = 1 over a time window, which is the best you can do there. Ref - https://kafka-tutorials.confluent.io/finding-distinct-events/ksql.html
If you need indefinite time windows to ensure you only process a certain field once, then you'll want to use an external database, and optionally an internal cache, to perform lookups against. This would need to be done with a regular consumer, or Kafka Streams.
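For the indefinite-window case, the lookup described above might look roughly like this inside a regular consumer. It is a sketch under stated assumptions: the Set passed to the constructor stands in for the external database, the LinkedHashMap acts as a small LRU cache in front of it, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class Deduplicator {
    private final Set<String> store;            // stand-in for the external database
    private final Map<String, Boolean> cache;   // bounded LRU cache of recent ids

    Deduplicator(Set<String> store, final int cacheSize) {
        this.store = store;
        // Access-ordered LinkedHashMap that evicts the least recently used entry
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns true only the first time an id is seen, across cache and store. */
    boolean firstSeen(String id) {
        if (cache.get(id) != null) {
            return false;                       // cache hit: certainly a duplicate
        }
        cache.put(id, Boolean.TRUE);
        return store.add(id);                   // false if the store already knew the id
    }
}
```

A consumer loop would call firstSeen(field) for each record and skip the record when it returns false. The cache only saves round trips for recent values; the store is what keeps the answer correct over an unbounded time range.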

Is the mongo timestamp type atomic with the reads?

I guess the title is confusing, but I could not find a better one.
I have an event stream in MongoDB with multiple producers and one consumer. To ensure that I read each event exactly once in the correct order, I use the MongoDB timestamp type as an incrementing value, populated by the server. In the SQL world I would probably use an auto-incremented integer.
My consumer just polls MongoDB and asks for all events since the last timestamp it has seen. In one of our environments we have noticed that the consumer sometimes does not handle all events. It does not happen very often, roughly one in 50,000 events is missed, but ideally it should not happen at all.
My assumption is that MongoDB does something like this internally.
ParseDocument(doc);
lock
{
    SetTimestamp(doc);
}
WriteDocument(doc);
UpdateIndex(doc);
So it could happen that, for a very short period of time, a document is not available when the consumer queries the events, because only events #1, #2 and #4 have been written so far and event #3 is written a fraction of a millisecond later.
I have seen this with a C# client and MongoDB 4.2 running in Docker, but I guess the client does not matter here.
Is this assumption correct and, if yes, what can I do about it?
My idea is to change my consumer to ask for all events since the last timestamp minus a few seconds and then filter out the already received events in the consumer.
But is there a more elegant solution? Perhaps some way to enforce collection level write locks or could transactions help?
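The overlap idea from this question can be sketched in plain Java as follows. It is an in-memory illustration only, not MongoDB driver code: the Event class, the collection list and the overlap value are hypothetical, and the loop stands in for a query of the form "ts >= lastSeen - overlap".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OverlapPoller {
    /** Hypothetical event shape: unique id plus server-assigned timestamp. */
    static final class Event {
        final String id;
        final long ts;

        Event(String id, long ts) {
            this.id = id;
            this.ts = ts;
        }
    }

    private final long overlap;
    private long lastSeenTs = 0;
    private final Map<String, Long> recentlySeen = new HashMap<>(); // id -> ts

    OverlapPoller(long overlap) {
        this.overlap = overlap;
    }

    /** Returns events not handled before, re-reading the last `overlap` time units. */
    List<Event> poll(List<Event> collection) {
        long from = lastSeenTs - overlap;
        List<Event> fresh = new ArrayList<>();
        for (Event e : collection) {
            if (e.ts >= from && !recentlySeen.containsKey(e.id)) {
                fresh.add(e);
                recentlySeen.put(e.id, e.ts);
                lastSeenTs = Math.max(lastSeenTs, e.ts);
            }
        }
        // Ids older than the next query's lower bound can be forgotten
        final long cutoff = lastSeenTs - overlap;
        recentlySeen.values().removeIf(ts -> ts < cutoff);
        return fresh;
    }
}
```

Two caveats of this sketch: an event that becomes visible later than the overlap allows is still missed, and a late event is handed to the consumer after events with a higher timestamp, so strict ordering is lost.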
Since you said "consumer" - singular, I suggest:
Use a change stream to be notified of events. Change stream, if correctly iterated, will not skip changes nor will it return the same change twice.
Whenever a document is returned from change stream, when it is processed by the singular consumer, add a counter to it. Since there is only one consumer it is relatively easy to implement the counter without race conditions and such.
Also write the current resume token into each event being processed.
If you wish, you can use the counter to uniquely identify the events.
To iterate events again, use the counter to look up events in the past. Given that each event has both a counter and a resume token, once you get to the most recent event you can seamlessly transition from iterating based on the counter to iterating based on the resume token.

Kafka Streams topology with windowing doesn't trigger state changes

I am building the following Kafka Streams topology (pseudo code):
gK = builder.stream().groupByKey();
g1 = gK.windowedBy(TimeWindows.of("PT1H")).reduce().mapValues().toStream().mapValues().selectKey();
g2 = gK.reduce().mapValues();
g1.leftJoin(g2).to();
If you notice, this is a rhomb-like topology that starts at single input topic and ends in the single output topic with messages flowing through two parallel flows that eventually get joined together at the end. One flow applies (tumbling?) windowing, the other does not. Both parts of the flow work on the same key (apart from the WindowedKey intermediately introduced by the windowing).
The timestamp for my messages is event-time. That is, they get picked from the message body by my custom configured TimestampExtractor implementation. The actual timestamps in my messages are several years in the past.
That all works well at first sight in my unit tests with a couple of input/output messages and in the runtime environment (with real Kafka).
The problem seems to come when the number of messages starts being significant (e.g. 40K).
My failing scenario is the following:
1) ~40K records with the same key get uploaded into the input topic first.
2) ~40K updates come out of the output topic, as expected.
3) Another ~40K records, all sharing one key that differs from the key in step 1), get uploaded into the input topic.
4) Only ~100 updates come out of the output topic, instead of the expected ~40K new updates. There is nothing special about those ~100 updates: their contents seem to be right, but they cover only certain time windows. For other time windows there are no updates, even though the flow logic and input data should definitely generate 40K records. In fact, when I swap the datasets in steps 1) and 3), I get exactly the same situation, with ~40K updates coming from the second dataset and the same ~100 from the first.
I can easily reproduce this issue in the unit tests using TopologyTestDriver locally (but only on bigger numbers of input records).
In my tests, I've tried disabling caching with StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG. Unfortunately, that didn't make any difference.
UPDATE
I tried both, reduce() calls and aggregate() calls instead. The issue persists in both cases.
What else I'm noticing is that, both with StreamsConfig.TOPOLOGY_OPTIMIZATION set to StreamsConfig.OPTIMIZE and without it, the mapValues() handler gets called in the debugger before the preceding reduce() (or aggregate()) handlers, at least the first time. I didn't expect that.
I tried both join() and leftJoin(); unfortunately, the result is the same.
In debugger the second portion of the data doesn't trigger reduce() handler in the "left" flow at all, but does trigger reduce() handler in the "right" flow.
With my configuration, if the number of records in both datasets is 100 each, the problem doesn't manifest itself and I get 200 output messages, as I expect. When I raise the number to 200 in each dataset, I get fewer than the expected 400 messages out.
So, it seems at the moment that something like "old" windows get dropped and the new records for those old windows get ignored by the stream.
There is a window retention setting that can be set, but with the default value that I use I was expecting windows to retain their state and stay active for at least 12 hours (which exceeds the time of my unit test run significantly).
I tried to amend the left reducer with the following window storage config:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        Duration.ofHours(1), false)
)
Still no difference in results.
The same issue persists even with only the single "left" flow, without the "right" flow and without join(). It seems that the problem is in the window retention settings of my setup. Timestamps (event-time) of my input records span 2 years. The second dataset starts from the beginning of those 2 years again. This place in Kafka Streams makes sure that the second dataset's records get ignored:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/InMemoryWindowStore.java#L125
Kafka Streams Version is 2.4.0. Also using Confluent dependencies version 5.4.0.
My questions are
What could be the reason for such behaviour?
Did I miss anything in my stream topology?
Is such topology expected to work at all?
After some debugging time I found the reason for my problem.
My input datasets contain records with timestamps that span 2 years. I am loading the first dataset, and with that the "observed" time of my stream gets set to the maximum timestamp from the input dataset.
The upload of the second dataset, which starts with records whose timestamps are 2 years before the new observed time, causes the stream internals to drop the messages. This can be seen if you set the Kafka logging to TRACE level.
So, to fix my problem I had to configure the retention and grace period for my windows:
instead of
.windowedBy(TimeWindows.of(windowSize))
I have to specify
.windowedBy(TimeWindows.of(windowSize).grace(Duration.ofDays(5 * 365)))
Also, I had to explicitly configure reducer storage settings as:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        windowSize, false)
)
That's it, the output is as expected.
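The dropping behaviour described above can be illustrated with a small stand-alone simulation. This is not Kafka Streams code: the class is hypothetical, and the acceptance rule (a record is kept only while its window end is later than the observed stream time minus the grace period) reflects my reading of the behaviour, so the exact comparison inside the library may differ.

```java
import java.time.Duration;

final class GraceSimulation {
    private final long windowSizeMs;
    private final long graceMs;
    private long streamTime = Long.MIN_VALUE;   // highest timestamp observed so far

    GraceSimulation(Duration windowSize, Duration grace) {
        this.windowSizeMs = windowSize.toMillis();
        this.graceMs = grace.toMillis();
    }

    /** Returns true if a record with this (non-negative) timestamp would be kept. */
    boolean accept(long timestampMs) {
        // Stream time only moves forward
        streamTime = Math.max(streamTime, timestampMs);
        // Tumbling window that contains the record
        long windowStart = timestampMs - (timestampMs % windowSizeMs);
        long windowEnd = windowStart + windowSizeMs;
        // Late records are kept only while their window is within the grace period
        return windowEnd > streamTime - graceMs;
    }
}
```

With a one-hour window and no grace, feeding a record two years ahead and then a record at timestamp 0 keeps the first and drops the second. With a grace of Duration.ofDays(5 * 365), as in the fix above, both are kept.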

Kafka consume message in reverse order

I use Kafka 0.10. I have a topic logs into which my IoT devices post their logs. The key of each message is the device-id, so all the logs of the same device are in the same partition.
I have an api /devices/{id}/tail-logs that needs to display the N last logs of one device at the moment the call was made.
Currently I have it implemented in a very inefficient (but working) way: I start from the beginning (i.e. the oldest logs) of the partition containing the device's logs and read until I reach the current timestamp.
A more efficient way would be to get the current latest offset and then consume the messages backward (I would need to filter out some messages to keep only those of the device I'm looking for).
Is it possible to do this with Kafka? If not, how can one solve this problem? (A heavier solution I can see would be to have a kafka-connect linked to an Elasticsearch and then query Elasticsearch, but having 2 more components for this seems a bit overkill...)
As you are on 0.10.2, I would recommend writing a Kafka Streams application. The application will be stateful and the state will hold the last N records/logs per device-id -- if new data is written to the input topic, the Kafka Streams application will just update its state (without the need to re-read the whole topic).
Furthermore, the application also serves your request ("api /devices/{id}/tail-logs") using the Interactive Queries feature.
Thus, I would not build a stateless application that has to recompute the answer for each request, but a stateful application that eagerly computes the result (and updates it automatically all the time) for all possible requests (i.e., for all device-ids) and just returns the already computed result when a request comes in.
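The state such an application keeps could be shaped roughly like this. It is a plain-Java sketch of the "last N logs per device-id" idea only: the TailLogs class is hypothetical, and a real Kafka Streams application would hold this in a state store and expose it through Interactive Queries rather than in a HashMap.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TailLogs {
    private final int n;
    private final Map<String, Deque<String>> byDevice = new HashMap<>();

    TailLogs(int n) {
        this.n = n;
    }

    /** Called for every new record; keeps only the newest n logs per device. */
    void onRecord(String deviceId, String log) {
        Deque<String> logs = byDevice.computeIfAbsent(deviceId, k -> new ArrayDeque<>());
        logs.addLast(log);
        if (logs.size() > n) {
            logs.removeFirst();   // drop the oldest entry
        }
    }

    /** Answers /devices/{id}/tail-logs from the precomputed state, oldest first. */
    List<String> tail(String deviceId) {
        Deque<String> logs = byDevice.get(deviceId);
        if (logs == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(logs);
    }
}
```

Each incoming record costs one append and at most one removal, and a request is a single map lookup, so nothing has to be re-read from the topic when the API is called.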