Spring Batch + Kafka: KafkaItemReader run forever? - apache-kafka

I want to monitor a Kafka topic continuously and execute some batch job whenever a message comes in (hitting a REST API and storing the response). I set this up with KafkaItemReader, however it shuts down if it doesn't receive a message for 30 seconds, based on pollTimeout. How can I make it run indefinitely? Since this is not an obvious option, I'm wondering whether I am using the right tool for the job.

Likely answer: you are not supposed to do this.
That's correct. Batch processing is about processing finite data sets. If your data source is an infinite stream of records and you want to monitor it continuously, then a streaming solution is more appropriate for your use case.
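With a streaming consumer, the per-record work (call the REST API, store the response) simply lives inside the listener and runs for as long as the application is up. Here is a minimal sketch, assuming Spring for Apache Kafka is on the classpath; the topic name, the API URL and the storeResponse() method are placeholders, not anything prescribed by Spring Batch or Kafka.

    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.stereotype.Component;
    import org.springframework.web.client.RestTemplate;

    @Component
    public class TopicMonitor {

        private final RestTemplate restTemplate = new RestTemplate();

        // Runs for every record, indefinitely; there is no pollTimeout to worry about.
        @KafkaListener(topics = "my-topic", groupId = "monitor")
        public void onMessage(String message) {
            String response = restTemplate.postForObject(
                    "https://example.com/api/process", message, String.class);
            storeResponse(response);   // hypothetical persistence call
        }

        private void storeResponse(String response) {
            // e.g. save via a Spring Data repository
        }
    }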

Related

Avoid Data Loss While Processing Messages from Kafka

I'm looking for the best approach to designing my Kafka consumer. Basically, I want to know the best way to avoid data loss if there are any exceptions/errors while processing the messages.
My use case is as follows.
a) The reason I am using a SERVICE to process the messages is that, in the future, I plan to write an ERROR PROCESSOR application that runs at the end of the day and tries to reprocess the failed messages (not all messages, only those that failed because of a dependency, such as a missing parent).
b) I want to make sure there is zero message loss, so I will save the message to a file if there are any issues while saving it to the DB.
c) In a production environment there can be multiple instances of the consumer and services running, so there is a high chance that multiple applications will try to write to the same file.
Q-1) Is writing to a file the only option to avoid data loss?
Q-2) If it is the only option, how do I make sure multiple applications can write to and read from the same file at the same time? Keep in mind that once the error processor is built, it might be reading messages from the same file while another application is trying to write to it.
ERROR PROCESSOR - Our source follows an event-driven mechanism, and there is a high chance that a dependent event (for example, the parent entity of something) gets delayed by a couple of days. In that case, I want my ERROR PROCESSOR to be able to process the same messages multiple times.
I've run into something similar before. So, diving straight into your questions:
Not necessarily; you could instead send those messages back to Kafka on a new topic (say, error-topic). Then, when your error processor is ready, it can simply listen to this error-topic and consume the messages as they come in.
I think this has largely been addressed by the answer to the first question. Instead of writing to and reading from a file, with multiple file handles open concurrently, Kafka is likely the better choice, since it is designed for exactly this kind of problem. A rough sketch of the idea follows.
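As a rough illustration of the error-topic idea, using the plain Java clients with string keys and values; the topic names and the process() method standing in for your SERVICE are assumptions for the example:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ErrorTopicExample {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "order-service");
            consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                consumer.subscribe(Collections.singletonList("orders"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        try {
                            process(record.value());   // your SERVICE / DB write
                        } catch (Exception e) {
                            // Instead of a shared file, park the failed message on an error
                            // topic so the ERROR PROCESSOR can replay it later.
                            producer.send(new ProducerRecord<>("error-topic", record.key(), record.value()));
                        }
                    }
                }
            }
        }

        private static void process(String value) { /* placeholder for the real processing */ }
    }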
Note: The following point is just some food for thought based on my limited understanding of your problem domain. So, you may just choose to ignore this safely.
One more point worth considering in the design of your service component: you might consider merging points 4 and 5 by sending all the error messages back to Kafka. That would let you process all error messages in a consistent way, rather than putting some in the error DB and some in Kafka.
EDIT: Based on the additional information on the ERROR PROCESSOR requirement, here's a diagrammatic representation of the solution design.
I've deliberately kept the output of the ERROR PROCESSOR abstract for now just to keep it generic.
I hope this helps!
If you don't commit the consumed message's offset before writing to the database, then nothing is lost for as long as Kafka retains the message. The tradeoff is that if the consumer did write to the database but the Kafka offset commit then fails or times out, you'd end up consuming those records again and potentially processing duplicates in your service.
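In code, that ordering looks roughly like the following sketch; auto-commit is disabled and saveToDatabase() is a hypothetical stand-in for the real write:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class CommitAfterWrite {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "db-writer");
            props.put("enable.auto.commit", "false");   // offsets are only committed explicitly
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("orders"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        saveToDatabase(record.value());   // hypothetical DB write
                    }
                    // Commit only after the writes succeeded. A crash before this line
                    // means re-delivery (possible duplicates), never loss.
                    consumer.commitSync();
                }
            }
        }

        private static void saveToDatabase(String value) { /* placeholder */ }
    }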
Even if you did write to a file, you wouldn't be guaranteed ordering unless you opened one file per partition and ensured all consumers only ran on a single machine (because you're preserving state there, which isn't fault-tolerant). Deduplication would still need to be handled as well.
Also, rather than writing your own consumer that loads a database, you could look into the Kafka Connect framework. For validating messages, you can similarly deploy a Kafka Streams application that filters the bad messages out of the input topic into a separate topic, leaving a clean topic to send to the DB.
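For the validation part, a Kafka Streams topology could look something like this; the topic names, the string serdes and the isValid() rule are assumptions made for the example:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class ValidationStream {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> input = builder.stream("orders");

            // Route valid records to the topic a sink connector would read from,
            // and everything else to an error topic.
            input.filter((key, value) -> isValid(value)).to("orders-validated");
            input.filterNot((key, value) -> isValid(value)).to("orders-invalid");

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-validation");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            new KafkaStreams(builder.build(), props).start();
        }

        private static boolean isValid(String value) {
            return value != null && !value.isEmpty();   // stand-in for the real validation rule
        }
    }

A sink connector (for example a JDBC sink) would then read orders-validated, while orders-invalid feeds the error processor.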

KSQL query adds too much delay to my request

I have a system that saves (X,Y) coordinates to a SQL table, and an endpoint that, when called, returns those (X,Y) coordinates.
However, my system takes up to 30 minutes to process and store an (X,Y) coordinate in the SQL table, so I am using KSQL to get that data faster.
I have added the call to KSQL in the endpoint of the backend I mentioned. The problem is that this call adds 6 extra seconds to my request.
My endpoint includes a query that looks like this:
SELECT feature_a,feature_b FROM ksql_table;
The ksql_table has already been pre-processed by two earlier streams. To my understanding, this query should be pretty straightforward and cheap to compute, yet it is taking 6 seconds.
When a KSQL query runs, it instantiates a Kafka Streams application that builds the requested table state. This has a "spin-up" time, which doesn't matter for the stream-processing application itself (once it's running, it stays running). However, if you're repeatedly launching it via the REST API as part of your application's request flow, you are going to see this delay every time.
I think a better way to work with the stream of data in Kafka would be to use Kafka Streams to build and persist the required state in a KTable, and then serve it through an Interactive Query and a custom API that your nodejs application can call, as described here. Further examples are here and here.
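As a sketch of that approach, with made-up topic and store names and string keys/values, the Streams application materializes the topic into a local state store and the API layer queries it directly:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StoreQueryParameters;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

    public class CoordinatesService {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "coordinates-service");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            // Materialize the (already pre-processed) topic as a queryable KTable.
            StreamsBuilder builder = new StreamsBuilder();
            builder.table("coordinates", Materialized.as("coordinates-store"));

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();

            // Inside the API handler: a point lookup against warm, local state instead of
            // spinning up a new KSQL query per request. In a real service you would wait
            // for the Streams instance to reach RUNNING before querying.
            ReadOnlyKeyValueStore<String, String> store = streams.store(
                    StoreQueryParameters.fromNameAndType("coordinates-store",
                            QueryableStoreTypes.keyValueStore()));
            System.out.println(store.get("some-id"));   // hypothetical key
        }
    }

Your nodejs endpoint would then call a thin REST layer in front of this lookup rather than KSQL.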
There is also a nodejs Kafka Streams library, which I have not used but might be worth checking out.

How often is put() triggered in Kafka Connect sink tasks?

Can I control the intervals at which the put() method of my Kafka Connect sink tasks is triggered? What is the expected behavior of the Kafka Connect framework in this respect? Ideally, I would like to specify, for example, "don't call me unless you have X new records/Y new bytes, or Z milliseconds have passed since the last invocation". This could make the batching logic within the sink task simpler (quoting the documentation, "in many cases internal buffering will be useful so an entire batch of records can be sent at once, reducing the overhead of inserting events into the downstream data store").
Today, put() on a SinkTask is only called when deliverMessages is invoked in WorkerSinkTask. The good news is that deliverMessages only happens inside poll, so you have some control over how often new records are polled by overriding consumer properties.
If you want to do internal buffering, you can look at how the HDFS connector handles this in its SinkTask implementation. Right now, however, Connect will immediately put() any records returned by the poll.
All of that said, if you are really looking to batch messages before they hit the downstream system, you might look into offset.flush.interval.ms and offset.flush.timeout.ms, which control how often flush() is invoked.
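To make the buffering idea concrete, here is a sketch of a SinkTask that accumulates records in put() and writes them out in flush(). BufferingSinkTask and writeBatch() are hypothetical; the pattern loosely follows what the HDFS connector does.

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.connect.sink.SinkRecord;
    import org.apache.kafka.connect.sink.SinkTask;

    public class BufferingSinkTask extends SinkTask {

        private final List<SinkRecord> buffer = new ArrayList<>();

        @Override
        public void put(Collection<SinkRecord> records) {
            // Called whenever the framework has polled new records; just buffer them here.
            buffer.addAll(records);
        }

        @Override
        public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
            // Invoked on the offset.flush.interval.ms cadence; write the whole batch at once.
            writeBatch(buffer);   // hypothetical downstream write
            buffer.clear();
        }

        private void writeBatch(List<SinkRecord> batch) { /* placeholder */ }

        @Override
        public void start(Map<String, String> props) { }

        @Override
        public void stop() { }

        @Override
        public String version() { return "0.1"; }
    }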

Apache Flink: changing state parameters at runtime from outside

I'm currently working on a streaming ML pipeline and need exactly-once event processing. I was interested in Flink, but I'm wondering whether there is any way to alter/update the execution state from outside.
The ML algorithm's state is kept by Flink, and that's fine, but I'd also like to change some execution parameters at runtime, and I cannot find a viable solution for that. Basically, an external web app (in Go) is used to tune the parameters, and the changes should be reflected in Flink for subsequent events.
I thought about:
a shared Redis with pub/sub (as polling for each event would kill throughput)
writing a custom solution in Go :D
...
The state would be kept by key, related to the source of one of the multiple event streams coming in from Kafka.
Thanks
You could use a CoMapFunction/CoFlatMapFunction to achieve what you described. One input is the normal data input, and on the other input you receive the state-changing commands. The easiest way to ingest those is probably a dedicated Kafka topic.
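A rough sketch of that pattern, with illustrative types and a made-up "threshold" parameter; the essential part is the two flatMap methods sharing keyed state:

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
    import org.apache.flink.util.Collector;

    // Data events arrive on input 1, parameter updates (key, new threshold) on input 2.
    // The parameter lives in keyed state, so it survives failures and applies per key.
    public class TunableThresholdFunction
            extends RichCoFlatMapFunction<Tuple2<String, Double>, Tuple2<String, Double>, String> {

        private transient ValueState<Double> threshold;

        @Override
        public void open(Configuration parameters) {
            threshold = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("threshold", Double.class));
        }

        @Override
        public void flatMap1(Tuple2<String, Double> event, Collector<String> out) throws Exception {
            double t = threshold.value() == null ? 0.5 : threshold.value();   // default until tuned
            if (event.f1 > t) {
                out.collect(event.f0 + " exceeded " + t);   // placeholder for the real ML logic
            }
        }

        @Override
        public void flatMap2(Tuple2<String, Double> update, Collector<String> out) throws Exception {
            threshold.update(update.f1);   // applies to all subsequent events for this key
        }
    }

    // Wiring (sources omitted): both streams keyed by the same field, then connected.
    // dataStream.connect(controlStream)
    //           .keyBy(e -> e.f0, c -> c.f0)
    //           .flatMap(new TunableThresholdFunction());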

Flink kafka to S3 with retry

I am new to Flink. I have a requirement to read data from Kafka, conditionally enrich the data (if a record belongs to category X) by calling an API, and write the result to S3.
I made a hello world Flink application with the above logic which works like a charm.
But the API I am using for enrichment doesn't have a 100% uptime SLA, so I need to design something with retry logic.
These are the options I have found:
Option 1) Retry with exponential backoff until I get a response from the API, but this will block the queue, so I don't like it.
Option 2) Use one more topic (called topic-failure) and publish the record to topic-failure if the API is down. That way it won't block the actual main queue. I will need one more worker to process the data from topic-failure. Again, this queue has to be used as a circular queue if the API is down for a long time: read a message from topic-failure, try to enrich it, push it back to topic-failure if it fails again, and consume the next message.
I prefer option 2, but it doesn't look easy to accomplish. Is there any standard Flink approach available for implementing option 2?
This is a rather common problem that comes up when migrating away from microservices. The proper solution would be to have the lookup data in Kafka as well, or in some DB that can be integrated into the same Flink application as an additional source.
If you cannot do that (for example, the API is external or the data cannot easily be mapped to a data store), both approaches are viable, and they have different advantages.
1) Lets you retain the order of the input events. If your downstream application expects ordering, then you need to retry.
2) The common term is dead letter queue (although it is more often used for invalid records). There are two easy ways to integrate it in Flink: either add a separate source, or use a topic pattern/list with one source.
Your topology would look like this:
Kafka Source ------\                                         /--> Filter good --> S3 sink
                    +--> Union --> Async I/O with timeout --+
Kafka Source dead -/               (for the API call)        \--> Filter bad  --> Kafka sink dead
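Sketched in code below, using the newer KafkaSource/KafkaSink/FileSink APIs; the topic names, the S3 path and the enrichment call are made up, and a topic list on a single source would work just as well as the union of two sources:

    import java.util.Collections;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.connector.file.sink.FileSink;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

    public class EnrichWithRetryTopology {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<String> input = env
                    .fromSource(stringSource("orders"), WatermarkStrategy.noWatermarks(), "orders")
                    .union(env.fromSource(stringSource("orders-dead"), WatermarkStrategy.noWatermarks(), "orders-dead"));

            // Async enrichment; a timeout or API failure flags the record as bad (f1 = false).
            DataStream<Tuple2<String, Boolean>> enriched = AsyncDataStream.unorderedWait(
                    input, new EnrichFunction(), 5, TimeUnit.SECONDS, 100);

            enriched.filter(r -> r.f1)
                    .map(r -> r.f0).returns(Types.STRING)
                    .sinkTo(FileSink.forRowFormat(new Path("s3://my-bucket/enriched"),
                            new SimpleStringEncoder<String>("UTF-8")).build());

            enriched.filter(r -> !r.f1)
                    .map(r -> r.f0).returns(Types.STRING)
                    .sinkTo(KafkaSink.<String>builder()
                            .setBootstrapServers("localhost:9092")
                            .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                                    .setTopic("orders-dead")
                                    .setValueSerializationSchema(new SimpleStringSchema())
                                    .build())
                            .build());

            env.execute("enrich-with-retry");
        }

        private static KafkaSource<String> stringSource(String topic) {
            return KafkaSource.<String>builder()
                    .setBootstrapServers("localhost:9092")
                    .setTopics(topic)
                    .setGroupId("enricher")
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();
        }

        /** Calls the flaky enrichment API; failures are flagged rather than retried in place. */
        static class EnrichFunction extends RichAsyncFunction<String, Tuple2<String, Boolean>> {
            @Override
            public void asyncInvoke(String record, ResultFuture<Tuple2<String, Boolean>> out) {
                CompletableFuture
                        .supplyAsync(() -> callEnrichmentApi(record))   // hypothetical API call
                        .whenComplete((result, error) -> out.complete(Collections.singleton(
                                error == null ? Tuple2.of(result, true)
                                              : Tuple2.of(record, false))));
            }

            @Override
            public void timeout(String record, ResultFuture<Tuple2<String, Boolean>> out) {
                out.complete(Collections.singleton(Tuple2.of(record, false)));   // treat timeout as bad
            }

            private String callEnrichmentApi(String record) {
                return record + "|enriched";   // placeholder
            }
        }
    }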