Kafka: consume messages in reverse order

I use Kafka 0.10. I have a topic logs into which my IoT devices post their logs. The key of my messages is the device-id, so all the logs of the same device are in the same partition.
I have an API /devices/{id}/tail-logs that needs to display the last N logs of one device at the moment the call is made.
Currently I have it implemented in a very inefficient (but working) way: I start from the beginning (i.e. the oldest logs) of the partition containing the device's logs and read until I reach the current timestamp.
A more efficient way would be to get the current latest offset and then consume the messages backwards (I would need to filter out some messages to keep only those of the device I'm looking for).
Is it possible to do this with Kafka? If not, how can one solve this problem? (A heavier solution I could see would be to have Kafka Connect feed an Elasticsearch index and then query Elasticsearch, but adding two more components for this seems a bit overkill...)

As you are on 0.10.2, I would recommend writing a Kafka Streams application. The application will be stateful and the state will hold the last N records/logs per device-id -- when new data is written to the input topic, the Kafka Streams application will just update its state (without the need to re-read the whole topic).
Furthermore, the application can also serve your requests (the /devices/{id}/tail-logs API) using the Interactive Queries feature.
Thus, I would not build a stateless application that has to recompute the answer for each request, but a stateful application that eagerly computes the result (and keeps it up to date automatically) for all possible requests (i.e., for all device-ids) and just returns the already computed result when a request comes in.
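For illustration, here is a minimal sketch of that idea in the current Kafka Streams DSL (0.10.x used slightly different builder classes). The topic, store, and application names are made up, and the last N lines are kept as a single newline-separated string to avoid a custom serde:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class TailLogsApp {
    static final int N = 100; // how many recent log lines to keep per device (illustrative)

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("logs")          // hypothetical input topic, keyed by device-id
            .groupByKey()
            .aggregate(
                () -> "",                               // initial (empty) tail per device
                (deviceId, logLine, tail) -> {
                    // append the new line and keep only the last N lines
                    String[] lines = (tail.isEmpty() ? logLine : tail + "\n" + logLine).split("\n");
                    int from = Math.max(0, lines.length - N);
                    return String.join("\n", Arrays.copyOfRange(lines, from, lines.length));
                },
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("device-tail-store")
                    .withKeySerde(Serdes.String())
                    .withValueSerde(Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "tail-logs-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // The /devices/{id}/tail-logs endpoint can then read "device-tail-store" for a given
        // device-id via Interactive Queries, without scanning the topic at all.
    }
}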

Kafka - different configuration settings

I am going through the documentation, and there seem to be a lot of moving parts with respect to message processing, like exactly-once processing and at-least-once processing, and the settings are scattered here and there. There doesn't seem to be a single place that documents, even roughly, the properties that need to be configured for exactly-once and at-least-once processing.
I know there are many moving parts involved and it always depends. However, as I mentioned before, what are the settings to be configured, at a minimum, to provide exactly-once, at-most-once, and at-least-once processing?
You might be interested in the first part of the Kafka FAQ, which describes some approaches to avoiding duplication during data production (i.e. on the producer side):
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
1. Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded.
2. Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
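For a rough idea with a modern client (the idempotent/transactional producer arrived in Kafka 0.11, after this FAQ entry was written), these are the settings that usually come up. The values and the transactional id below are illustrative, not a complete recipe:

import java.util.Properties;

public class DeliverySemanticsConfig {

    // Rough at-least-once building blocks. Producer: don't lose sends; consumer: don't
    // commit offsets until processing has actually succeeded.
    static Properties atLeastOnce() {
        Properties p = new Properties();
        p.put("acks", "all");                    // producer: wait for all in-sync replicas
        p.put("retries", Integer.MAX_VALUE);     // producer: retry transient failures
        p.put("enable.auto.commit", "false");    // consumer: commit manually after processing
        return p;
    }

    // Rough exactly-once building blocks (Kafka 0.11+). The transactional id is illustrative.
    static Properties exactlyOnce() {
        Properties p = new Properties();
        p.put("enable.idempotence", "true");         // producer: broker deduplicates retries
        p.put("transactional.id", "my-app-tx");      // producer: enables transactions
        p.put("isolation.level", "read_committed");  // consumer: skip aborted transactional data
        // In Kafka Streams, the equivalent switch is processing.guarantee=exactly_once.
        return p;
    }
}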

Brod: Every brod_group_subscriber_v2 needs its own brod client

DISCLAIMER: This question is specifically about the Erlang/OTP Kafka client library Brod (no tag available yet).
I am trying to establish three consumer groups: one should just write messages plainly to the console, another should update a state-representing API with certain messages, and the third should store every message in a long-term database (Crate). I use a supervisor to start three corresponding brod_group_subscriber_v2 processes (see this gist). If I first start three brod (Kafka) clients and attach its own client to each group subscriber, everything works perfectly: offsets are committed to Kafka for every group and reads start from the latest committed offset.
If I use only one client (as should be possible and is intended according to the Brod docs and issues, see the reference by zmstone), only the last group in my CHILD_SPEC works; the other two do not receive handle_message calls.
At the moment, starting a client for every group is not an issue for me, as there are only three. But as our project grows, we plan to establish quite a few consumer groups, and I don't think it would be a good idea to run 20 to 30 brod clients, each blocking resources.

Avoid Data Loss While Processing Messages from Kafka

I'm looking for the best approach for designing my Kafka consumer. Basically, I would like to know the best way to avoid data loss in case there are any exceptions/errors while processing the messages.
My use case is as follows.
a) The reason I am using a SERVICE to process the messages is that in the future I am planning to write an ERROR PROCESSOR application which would run at the end of the day and would try to process the failed messages again (not all messages, but only those which fail because of a dependency, such as a missing parent).
b) I want to make sure there is zero message loss, so I will save the message to a file in case there are any issues while saving it to the DB.
c) In a production environment there can be multiple instances of the consumer and services running, so there is a high chance that multiple applications try to write to the same file.
Q-1) Is writing to a file the only option to avoid data loss?
Q-2) If it is the only option, how do I make sure multiple applications can write to the same file and read from it at the same time? Please consider that in the future, once the error processor is built, it might be reading messages from the same file while another application is trying to write to it.
ERROR PROCESSOR - Our source follows an event-driven mechanism and there is a high chance that sometimes the dependent event (for example, the parent entity of something) might get delayed by a couple of days. In that case, I want my ERROR PROCESSOR to be able to process the same messages multiple times.
I've run into something similar before. So, diving straight into your questions:
Not necessarily. You could perhaps send those messages back to Kafka in a new topic (let's say, error-topic). Then, when your error processor is ready, it could just listen to this error-topic and consume those messages as they come in.
I think this question is addressed by the answer to the first one. Instead of using a file to write to and read from, and opening multiple file handles to do this concurrently, Kafka might be a better choice, as it is designed for exactly this kind of problem (see the sketch below).
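As a rough sketch of the error-topic idea (the topic name, key, and processing call are made up for illustration), the service simply produces any message it fails to handle onto the error topic:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ErrorTopicRouting {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> errorProducer = new KafkaProducer<>(props)) {
            String key = "some-entity-id";            // illustrative key
            String value = "{\"payload\": \"...\"}";  // illustrative message body
            try {
                processAndSave(key, value);           // placeholder for the SERVICE call / DB write
            } catch (Exception e) {
                // Instead of appending to a shared file, park the failed message on an error topic.
                // The ERROR PROCESSOR is then just another consumer group reading "error-topic".
                errorProducer.send(new ProducerRecord<>("error-topic", key, value));
            }
        }
    }

    static void processAndSave(String key, String value) { /* placeholder */ }
}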
Note: the following point is just some food for thought based on my limited understanding of your problem domain, so you may safely choose to ignore it.
One more point worth considering in your design for the service component: you might as well merge points 4 and 5 by sending all the error messages back to Kafka. That would enable you to process all error messages in a consistent way, as opposed to putting some messages in the error DB and some in Kafka.
EDIT: Based on the additional information on the ERROR PROCESSOR requirement, here's a diagrammatic representation of the solution design.
I've deliberately kept the output of the ERROR PROCESSOR abstract for now just to keep it generic.
I hope this helps!
If you don't commit the consumed message's offset before writing to the database, then nothing will be lost for as long as Kafka retains the message. The tradeoff is that if the consumer did write to the database but the Kafka offset commit then fails or times out, you'd end up consuming those records again and potentially have duplicates processed in your service (sketched below).
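A minimal sketch of that pattern (topic, group id, and database call are placeholders): auto-commit is disabled and offsets are committed only after the writes succeed, so a crash before the commit means re-delivery rather than data loss.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitAfterDbWrite {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "db-writer");            // illustrative group id
        props.put("enable.auto.commit", "false");      // commit only after the DB write succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("input-topic"));  // illustrative topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Placeholder DB write; it should be idempotent, because a failed offset
                    // commit below means these records will be delivered and processed again.
                    writeToDatabase(record);
                }
                if (!records.isEmpty()) {
                    consumer.commitSync();  // offsets advance only after the writes succeeded
                }
            }
        }
    }

    static void writeToDatabase(ConsumerRecord<String, String> record) { /* placeholder */ }
}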
Even if you did write to a file, you wouldn't be guaranteed ordering unless you opened a file per partition and ensured all consumers ran on a single machine (because you'd be preserving state there, which isn't fault-tolerant). Deduplication would still need to be handled as well.
Also, rather than writing your own consumer that loads a database, you could look into the Kafka Connect framework. For validating messages, you can similarly deploy a Kafka Streams application that filters bad messages from the input topic into a separate topic that is sent to the DB.

KSQL query adds too much delay to my request

I have a system that saves (X,Y) coordinates to a SQL table. Then, I have an endpoint that, when called, returns the (X,Y) coordinates.
However, my system takes up to 30 minutes to process and store an (X,Y) coordinate in the SQL table. Because of this, I am using KSQL to get that data faster.
I have added the KSQL call to the backend endpoint I mentioned. The problem is that this call adds 6 extra seconds to my request.
My endpoint includes a query that looks like this:
SELECT feature_a,feature_b FROM ksql_table;
The ksql_table has already been pre-processed by two previous streams. In my understanding, this query should be pretty straightforward and easy to compute, yet it takes 6 seconds to process.
When a KSQL query runs, it instantiates a Kafka Streams application that will build the table state requested. This is going to have a "spin-up" time, which doesn't matter when it's the stream processing application itself (since once it's running it stays running). However, if you're repeatedly calling it via the REST API as part of your application's flow then you are going to see this delay.
I think a better way to work with the stream of data in Kafka would be to use Kafka Streams to build and persist the required state in a KTable, and then serve it through Interactive Queries and a custom API that your Node.js application can interface with, as described here. Further examples are here and here.
There is also a Node.js Kafka Streams library, which I have not used but which might be worth checking out.
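For what it's worth, the serving side of that approach can be quite small. Here is a rough Java sketch (the store name is illustrative, and the KafkaStreams instance is the already-running application that materializes the KTable); your Node.js endpoint would call a thin HTTP layer in front of something like this:

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class CoordinatesEndpoint {
    private final KafkaStreams streams;  // the already-running Streams app that materializes the KTable

    public CoordinatesEndpoint(KafkaStreams streams) {
        this.streams = streams;
    }

    // Called by the HTTP layer: returns the pre-computed value for a key, with no per-request spin-up.
    public String lookup(String key) {
        ReadOnlyKeyValueStore<String, String> store = streams.store(
            StoreQueryParameters.fromNameAndType("coordinates-store",   // illustrative store name
                                                 QueryableStoreTypes.keyValueStore()));
        return store.get(key);
    }
}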

Writing directly to a kafka state store

We've started experimenting with Kafka to see if it can be used to aggregate our application data. I think our use case is a match for Kafka Streams, but we aren't sure if we are using the tool correctly. The proof of concept we've built seems to work as designed, but I'm not sure we are using the APIs appropriately.
Our proof of concept uses Kafka Streams to keep a running tally of information about a program in an output topic, e.g.
{
  "numberActive": 0,
  "numberInactive": 0,
  "lastLogin": "01-01-1970T00:00:00Z"
}
Computing the tally is easy; it is essentially a compare-and-swap (CAS) operation based on the input topic and output field.
The local state contains the most recent program for a given key. We join an input stream against the state store and run the CAS operation using a TransformerSupplier, which explicitly writes the data to the state store using
context.put(...)
context.commit();
Is this an appropriate use of the local state store? Is there another approach to keeping a stateful running tally in a topic?
Your design sounds right to me (I presume you are using the Processor API, not the Streams DSL): you read in one stream and call transform() on it, with a state store associated with that operator. Since your update logic seems to be key-dependent only, it can be embarrassingly parallelized by the Streams library based on key partitioning.
One thing to note: it seems you are calling context.commit() after every single put call, which is not a recommended pattern. This is because commit() is a pretty heavy operation that involves flushing the state store, sending an offset-commit request to the Kafka broker, etc.; calling it for every single record would result in very low throughput. It is recommended to call commit() only after a bunch of records have been processed, or to simply rely on the Streams config commit.interval.ms and let the Streams library call commit() internally at that interval. Note that this will not affect your processing semantics on graceful shutdown, since on shutdown Streams will always enforce a commit() call.
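As a rough sketch of that recommendation (the store name, types, and tally logic are placeholders), the transformer just reads and writes its state store and never calls commit() itself, leaving commits to commit.interval.ms:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class TallyTransformer implements Transformer<String, String, KeyValue<String, String>> {
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        // "tally-store" must be registered on the topology and connected to this transformer.
        store = (KeyValueStore<String, String>) context.getStateStore("tally-store");
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        String current = store.get(key);               // read the previous tally
        String updated = updateTally(current, value);  // placeholder for the CAS/update logic
        store.put(key, updated);                       // write the new tally to the state store
        // Deliberately no context.commit() here: flushing the store and committing offsets on
        // every record is expensive, so we let commit.interval.ms drive commits instead.
        return KeyValue.pair(key, updated);            // forward the updated tally downstream
    }

    @Override
    public void close() { }

    private String updateTally(String current, String value) {
        return value;  // placeholder
    }
}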