How to have sync producer batch messages? - apache-kafka

We were using the Kafka 0.8 async producer, but it is dropping messages (and there is no async response on another thread; if there were, we could keep using async).
We have set batch.num.messages to 500, but our consumer sees no change. I read that batch.num.messages only applies to the async producer, not the sync one, so I need to do the batching myself. We are using compression.codec=snappy and our own serializer class.
My question is two-fold:
Can I assume that I can just use our own serializer class and then send the message on my own?
Do I need to worry about any special snappy options/parameters that Kafka might be using?

Yes, that's because batch.num.messages controls the behaviour of the async producer only. This is stated explicitly in the relevant parameter guide:
The number of messages to send in one batch when using async mode. The producer will wait until either this number of messages are ready to send or queue.buffer.max.ms is reached.
In order to get batching with the sync producer, you have to send a list of messages:
// Uses com.google.common.collect.Lists (Guava) and kafka.producer.KeyedMessage
public void trySend(List<M> messages) {
    // Wrap each payload in a KeyedMessage for the target topic
    List<KeyedMessage<String, M>> keyedMessages = Lists.newArrayListWithExpectedSize(messages.size());
    for (M m : messages) {
        keyedMessages.add(new KeyedMessage<String, M>(topic, m));
    }
    try {
        // A single send() with a list is transmitted as one batch
        producer.send(keyedMessages);
    } catch (Exception ex) {
        log.error(ex);
    }
}
Note that I'm using kafka.javaapi.producer.Producer here.
Once send is executed, the whole batch is sent.
Can I assume that I can just use our own serializer class and then send the message on my own?
Do I need to worry about any special snappy options/parameters that Kafka might be using?
Both compression and the serializer are orthogonal features that don't affect batching; they are applied to individual messages.
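For reference, a rough sketch of how those settings could sit together on the 0.8 sync producer; the broker list and the serializer class name below are placeholders, not values from the question:
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.ProducerConfig;

Properties props = new Properties();
props.put("metadata.broker.list", "broker1:9092,broker2:9092");  // placeholder brokers
props.put("producer.type", "sync");                              // synchronous producer
props.put("compression.codec", "snappy");                        // compression applied to the messages being sent
props.put("serializer.class", "com.example.MySerializer");       // your own Encoder implementation (placeholder name)
props.put("request.required.acks", "1");

// M is the payload type used in trySend() above
Producer<String, M> producer = new Producer<String, M>(new ProducerConfig(props));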
Note that there will be API changes and the async/sync APIs will be unified.

Related

What happens to the consumer offset when an error occurs within a custom class in a KStream topology?

I'm aware that you can define a stream-processing Kafka application in the form of a topology that implicitly understands which record has gone through successfully, and can therefore correctly commit the consumer offset, so that when the microservice has to be restarted it will continue reading the input topic without missing messages.
But what happens when I introduce my own processing classes into the stream? For instance, perhaps I need to submit information from the input records to a web service with a big startup time. So I write my own processor class that accumulates, say, 1000 messages and then submits a batch request to the external service, like this:
KStream<String, Prediction> stream = new StreamsBuilder()
        .stream(inputTopic, Consumed.with(Serdes.String(), new MessageSerde()))
        // talk to web service
        .map((k, v) -> new KeyValue<>("", wrapper.consume(v.getPayload())))
        .flatMapValues((ValueMapper<List<Prediction>, Iterable<Prediction>>) value -> value);

// send downstream
stream.peek((k, v) -> metrics.countOutgoingMessage())
        .to(outputTopic, Produced.with(Serdes.String(), new PredictionSerde()));
Assume that the external service can issue zero, one, or more predictions of some kind for every input, and that my wrapper submits inputs in batches to increase throughput. It seems to me that KStream cannot possibly keep track of which input record corresponds to which output record, and therefore, no matter how it is implemented, it cannot guarantee that the correct consumer offset for the input topic is committed.
So in this paradigm, how can I give the library hints about which messages have been successfully processed? Or failing that, how can I get access to the consumer offset for the topic and perform commits explicitly so that no data loss can occur?
I think you might have a problem if you are using map. Combining remote calls with a DSL operator is not recommended. You might want to look into using the Processor API (docs). With ProcessorContext you can forward() or commit(), which could give you the flexibility you need.
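For illustration only, a minimal sketch of what such a processor could look like with the classic Processor API. The Message/Prediction value types come from the question; BatchingProcessor, PredictionClient and consumeBatch are hypothetical names, and the batch size of 1000 matches the question:
import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;

// Buffers input records, submits them to the external service in one call,
// forwards the resulting predictions downstream and only then requests a commit,
// so the consumer offset does not advance past unprocessed input.
public class BatchingProcessor implements Processor<String, Message> {

    private static final int BATCH_SIZE = 1000;

    private final PredictionClient wrapper;             // hypothetical web-service client
    private final List<Message> buffer = new ArrayList<>(BATCH_SIZE);
    private ProcessorContext context;

    public BatchingProcessor(PredictionClient wrapper) {
        this.wrapper = wrapper;
    }

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public void process(String key, Message value) {
        buffer.add(value);
        if (buffer.size() >= BATCH_SIZE) {
            for (Prediction prediction : wrapper.consumeBatch(buffer)) {  // hypothetical batch call
                context.forward("", prediction);         // hand each result to the downstream sink
            }
            buffer.clear();
            context.commit();                            // request an offset commit for the processed input
        }
    }

    @Override
    public void close() {
        buffer.clear();
    }
}
The processor would be wired up with the Topology API (addSource / addProcessor / addSink) rather than the DSL. Note that commit() is only a request; anything still sitting in the buffer when the application dies will be re-read after a restart, so the external call should be idempotent.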

How to efficiently produce messages out of a collection to Kafka

In my Scala (2.11) stream application I am consuming data from one queue in IBM MQ and writing it to a Kafka topic that has one partition. After consuming the data from the MQ, the message payload gets split into 3000 smaller messages that are stored in a sequence of strings. Then each of these 3000 messages is sent to Kafka (version 2.x) using KafkaProducer.
How would you send those 3000 messages?
I can't increase the number of queues in IBM MQ (not under my control) nor the number of partitions in the topic (ordering of messages is required, and writing a custom partitioner will impact too many consumers of the topic).
The Producer settings are currently:
acks=1
linger.ms=0
batch.size=65536
But optimizing them is probably a question of its own and not part of my current problem.
Currently, I am doing
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

private lazy val kafkaProducer: KafkaProducer[String, String] = new KafkaProducer[String, String](someProperties)

val messages: Seq[String] = Seq(String1, …, String3000)

for (msg <- messages) {
  val future = kafkaProducer.send(new ProducerRecord[String, String](someTopic, someKey, msg))
  val recordMetadata = future.get()
}
To me this does not look like the most elegant or efficient way. Is there a programmatic way to increase throughput?
Edit after the answer from @radai:
Thanks to the answer pointing me in the right direction, I had a closer look at the different producer send methods. The book Kafka: The Definitive Guide lists these methods:
Fire-and-forget
We send a message to the server and don’t really care if it arrives successfully or not. Most of the time, it will arrive successfully, since Kafka is highly available and the producer will retry sending messages automatically. However, some messages will get lost using this method.
Synchronous send
We send a message, the send() method returns a Future object, and we use get() to wait on the future and see if the send() was successful or not.
Asynchronous send
We call the send() method with a callback function, which gets triggered when it receives a response from the Kafka broker.
And now my code looks like this (leaving out error handling and the definition of the callback class):
val asyncProducer = new KafkaProducer[String, String](someProperties)

for (msg <- messages) {
  val record = new ProducerRecord[String, String](someTopic, someKey, msg)
  asyncProducer.send(record, new compareProducerCallback)
}
asyncProducer.flush()
I have compared all three methods for 10,000 very small messages. Here are my measured results:
Fire-and-forget: 173683464ns
Synchronous send: 29195039875ns
Asynchronous send: 44153826ns
To be honest, there is probably more potential to optimize all of them by choosing the right properties (batch.size, linger.ms, ...).
The biggest reason I can see for your code being slow is that you're waiting on every single send future.
Kafka was designed to send batches. By sending one record at a time, you're waiting a full round trip for every single record and you're not getting any benefit from compression.
The "idiomatic" thing to do would be to send everything, and then block on all the resulting futures in a second loop.
Also, if you intend to do this, I'd bump linger.ms back up (otherwise your first record would result in a batch of size one, slowing you down overall; see https://en.wikipedia.org/wiki/Nagle%27s_algorithm) and call flush() on the producer once your send loop is done.
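A rough sketch of that pattern, written in Java for illustration (someTopic, someKey, messages and kafkaProducer stand in for the question's values):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// First loop: hand everything to the producer so it can fill batches in the background.
List<Future<RecordMetadata>> futures = new ArrayList<>(messages.size());
for (String msg : messages) {
    futures.add(kafkaProducer.send(new ProducerRecord<>(someTopic, someKey, msg)));
}
kafkaProducer.flush();                // push out the last, partially filled batch

// Second loop: block on the futures to surface any send failures.
for (Future<RecordMetadata> future : futures) {
    try {
        future.get();                 // throws if the corresponding send failed
    } catch (Exception e) {
        // handle or log the failed send
    }
}
Together with a non-zero linger.ms, this gives the producer a chance to group records into larger batches before flush() forces them out.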

max.in.flight.requests.per.connection and Spring Kafka Producer Synchronous Event Publishing with KafkaTemplate

I'm a bit confused about the relationship between max.in.flight.requests.per.connection for Kafka Producers and synchronous publishing of events using Spring-Kafka and was hoping someone might be able to clear up the relationship between the two.
I'm looking to set up synchronous event publishing with Spring Kafka using Spring Kafka's KafkaTemplate. The Spring Kafka documentation provides an example using ListenableFuture's get(SOME_TIME, TimeUnit) to enable synchronous publishing of events (duplicated below for reference).
public void sendToKafka(final MyOutputData data) {
    final ProducerRecord<String, String> record = createRecord(data);

    try {
        template.send(record).get(10, TimeUnit.SECONDS);
        handleSuccess(data);
    }
    catch (ExecutionException e) {
        handleFailure(data, record, e.getCause());
    }
    catch (TimeoutException | InterruptedException e) {
        handleFailure(data, record, e);
    }
}
I was looking at Kafka's Producer Configuration Documentation and saw that Kafka has a setting, max.in.flight.requests.per.connection, which is described as follows:
The maximum number of unacknowledged requests the client will send on a single connection before blocking. Note that if this setting is set to be greater than 1 and there are failed sends, there is a risk of message re-ordering due to retries (i.e., if retries are enabled).
What does setting max.in.flight.requests.per.connection to a value of 1 give when event publishing is handled asynchronously? Does setting max.in.flight.requests.per.connection to 1 force synchronous publishing of events for a Kafka producer? If I want to set up synchronous publishing of events for a Kafka producer and take the approach recommended by Spring-Kafka, should I be concerned about max.in.flight.requests.per.connection, or is it safe to ignore it?
I don't believe they are related at all. The send is still asynchronous; setting it to 1 simply means that the second send will block until the first completes:
future1 = template.send(...);
future2 = template.send(...); // this will block
future1.get(); // and this will return almost immediately
future2.get();
You still need to get the result of the future, to test success/failure.
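If re-ordering under retries is what you actually care about, the property can be set on the producer factory independently of how you wait on the futures. A minimal Spring Kafka sketch (the broker address is a placeholder):
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.KafkaTemplate;

Map<String, Object> config = new HashMap<>();
config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
// Prevents re-ordering when retries kick in; it does not make sends synchronous.
config.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);

KafkaTemplate<String, String> template =
        new KafkaTemplate<>(new DefaultKafkaProducerFactory<>(config));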

Kafka Java API consumer and producer offset value comparison?

I have a requirement to match the Kafka producer offset value to the consumer offset using the Java API.
I am new to Kafka. Could anyone suggest how to proceed with this?
Depending on your exact use case there are a couple of ways that you could go about this, but all of them will require an external system.
First of all, Confluent offers Confluent Control Center as part of their commercial offering; this would probably be the easiest way to go about it, if you are willing to spend the money.
If that is not for you, then you'd need to implement some sort of system to keep track of what you are producing and what you are consuming. For example, you could simply use a database with topic, partition and offset as the primary key and columns for produced_at and consumed_at.
Every time your producer writes a message to the cluster, have it update the produced_at column (look at ProducerInterceptor). On the consumer side, you could likewise implement an interceptor that confirms having read the message, or confirm from the consumer itself once the message has been successfully processed.
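As a rough illustration of the producer side, an interceptor along these lines could write produced_at once the broker acknowledges the record (TrackingStore and its methods are hypothetical stand-ins for your database access):
import java.util.Map;

import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Records topic/partition/offset with a produced_at timestamp once the broker has acked the write.
public class ProducedAtInterceptor implements ProducerInterceptor<String, byte[]> {

    private TrackingStore store;                        // hypothetical DAO for the tracking table

    @Override
    public void configure(Map<String, ?> configs) {
        this.store = TrackingStore.fromConfig(configs); // hypothetical factory
    }

    @Override
    public ProducerRecord<String, byte[]> onSend(ProducerRecord<String, byte[]> record) {
        return record;                                  // nothing to change on the way out
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        if (exception == null && metadata != null) {
            store.markProduced(metadata.topic(), metadata.partition(), metadata.offset(),
                    System.currentTimeMillis());
        }
    }

    @Override
    public void close() {
        store.close();
    }
}
It would be registered through the interceptor.classes producer property; a matching ConsumerInterceptor (or the consumer's own processing code) would fill in consumed_at.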
Or, if you don't need every message confirmed, you could implement regular checkpointing every 10k messages or so and trust that the consumer has read everything up to the last offset it confirmed.
There's also the possibility of injecting checkpoint messages into the stream at regular intervals; when the consumer sees one of these it triggers an action - again, you have to trust the consumer that it got everything in between the checkpoints.
As I said initially, it all depends on your exact use case, if you give us more detail I'm sure we can come up with something that works for you.
Update:
If you want to retrieve the offset after sending a message to Kafka, you need to check the Future that the producer returns on send; this will contain the offset:
// Send the message and store the future
Future<RecordMetadata> messageFuture = producer.send(new ProducerRecord<String, byte[]>(topic, serialize(currentMessage)));
producer.flush();

// As flush() blocks until all operations have been completed (regardless of success or failure),
// we can be sure that our future has a result at this point
try {
    RecordMetadata metaData = messageFuture.get();
    System.out.println("Sent message with offset: " + metaData.offset());
} catch (Exception e) {
    // do some error handling
}
You can expose the offsets of the producer and the consumer via Java Management Beans (MBeans). Thereby you can do the comparison in real time using the JConsole tool provided with the JDK.
Read about Gauge to see how to expose the offset position of the producer and the consumer.
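As one possible way to expose such a gauge-style value over JMX, a plain MXBean could hold the last offset seen on each side (all names here are made up for the sketch):
// OffsetGaugeMXBean.java - the management interface JConsole will display
public interface OffsetGaugeMXBean {
    long getLastOffset();
}

// OffsetGauge.java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;

import javax.management.ObjectName;

public class OffsetGauge implements OffsetGaugeMXBean {

    private final AtomicLong lastOffset = new AtomicLong(-1);

    public void update(long offset) {
        lastOffset.set(offset);
    }

    @Override
    public long getLastOffset() {
        return lastOffset.get();
    }

    // Register one gauge per side, e.g. "producer" and "consumer", so both show up in JConsole.
    public static OffsetGauge register(String side) throws Exception {
        OffsetGauge gauge = new OffsetGauge();
        ManagementFactory.getPlatformMBeanServer().registerMBean(
                gauge, new ObjectName("kafka.offsets:type=OffsetGauge,name=" + side));
        return gauge;
    }
}
The producer would call update(metadata.offset()) after each acknowledged send (see the Future example above), and the consumer would call it with the offset of each record it finishes processing; the two values can then be compared side by side in JConsole.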

custom Flume interceptor: intercept() method called multiple times for the same Event

TL;DR
When a Flume source fails to push a transaction to the next channel in the pipeline, does it always keep event instances for the next try?
In general, is it safe to have a stateful Flume interceptor, where processing of events depends on previously processed events?
Full problem description:
I am considering the possibility of leveraging guarantees offered by Apache Kafka regarding the way topic partitions are distributed among consumers in a consumer group to perform streaming deduplication in an existing Flume-based log consolidation architecture.
Using the Kafka Source for Flume and custom routing to Kafka topic partitions, I can ensure that every event that should go to the same logical "deduplication queue" will be processed by a single Flume agent in the cluster (for as long as there are no agent stops/starts within the cluster). I have the following setup using a custom-made Flume interceptor:
[KafkaSource with deduplication interceptor]-->(MemoryChannel)-->[HDFSSink]
It seems that when the Flume Kafka source runner is unable to push a batch of events to the memory channel, the event instances that are part of the batch are passed again to my interceptor's intercept() method. In this case, it was easy to add a tag (in the form of a Flume event header) to processed events to distinguish actual duplicates from events in a failed batch that got re-processed.
However, I would like to know if there is any explicit guarantee that Event instances in failed transactions are kept for the next try or if there is the possibility that events are read again from the actual source (in this case, Kafka) and re-built from zero. In that case, my interceptor will consider those events to be duplicates and discard them, even though they were never delivered to the channel.
EDIT
This is how my interceptor distinguishes an Event instance that was already processed from a non-processed event:
public Event intercept(Event event) {
    Map<String, String> headers = event.getHeaders();
    // tagHeaderName is the name of the header used to tag events, never null
    if (!tagHeaderName.isEmpty()) {
        // Don't look further if the event was already processed...
        if (headers.get(tagHeaderName) != null)
            return event;
        // Mark it as processed otherwise...
        else
            headers.put(tagHeaderName, "");
    }
    // Continue processing of event...
    return event;
}
I encountered a similar issue:
When a sink write fails, the Kafka Source still holds the data that has already been processed by the interceptors. On the next attempt, that data is sent to the interceptors and gets processed again and again. Having read the KafkaSource code, I believe it's a bug.
My interceptor strips some information from the original message and modifies it. Due to this bug, the retry mechanism will never work as expected.
So far, there is no easy solution.