How to fetch single record at a time in Kafka sink connector - apache-kafka

I am using a Kafka SinkTask to read records from a Kafka topic.
The put() method in SinkTask is the entry point through which all records are delivered.
Currently, when the connector starts, it fetches all uncommitted records together.
I want the worker task to fetch a single record at a time.
How can I do that?
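import java.util.Collection;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

class CustomSinkTask extends SinkTask {
    // start(), stop() and version() omitted for brevity
    @Override
    public void put(Collection<SinkRecord> records) {
        System.out.println("Inside put method");
        if (records != null) {
            System.out.println("number of records fetched are: " + records.size());
        }
    }
}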

You could try adding the following to the worker properties file:
consumer.max.poll.records=1

You can achieve this by setting max.poll.records to the desired number in the Kafka Connect worker properties file. Make sure you prefix the max.poll.records property with "consumer." so that it is passed to the sink task's consumer. To learn more about the worker properties, please refer to this page.
consumer.max.poll.records=n

Related

Roll back mechanism in kafka processor api?

I am using kafka processor api (not DSL)
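public class StreamProcessor implements Processor<String, String>
{
    public ProcessorContext context;
    // maps key -> "topicA|topicB|..." (initialization not shown)
    private KeyValueStore<String, String> stateStore;

    public void init(ProcessorContext context)
    {
        this.context = context;
        context.commit();
        // statestore initialized with key,value
    }

    public void process(String key, String val)
    {
        try
        {
            // "|" is a regex metacharacter, so it has to be escaped
            String[] topicList = stateStore.get(key).split("\\|");
            for (String topic : topicList)
            {
                // forward same message to list of topics (1..n topics); rollback if write to some topics failed?
                // assumes each child sink node is named after its topic
                context.forward(key, val, To.child(topic));
            }
        }
        catch (Exception e)
        {
            // rollback handling would go here
        }
    }

    public void close() {}
}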
Scenario: we read data from a source topic, and the stream processor writes it to multiple sink topics (topicList above).
Question: how can I implement a rollback mechanism with the Kafka Streams Processor API when one or more of the topics in topicList fails to receive the message?
My understanding is that the Processor API can roll back each record it failed to send. Can a rollback also be done for an entire batch of failed messages? Since the process method of the Processor interface is called per record rather than per batch, I would surmise it can only be done per record. Is this assumption correct? If not, please suggest how to achieve per-record and per-batch rollbacks for failed topics using the Processor API.
You would need to implement it yourself. For example, you could use two stores, a main store and a "buffer" store: first update only the buffer store, then call context.forward() to make sure all writes are in the output topics, and afterward merge the buffer store into the main store.
If you need to roll back, you drop the content of the buffer store.
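The two-store idea can be sketched without any Kafka dependency. This is only an illustration under stated assumptions: plain HashMaps stand in for the state stores, and the send hook, class name and method names are hypothetical, not part of the Kafka Streams API. The sketch only rolls back the store; records already accepted by earlier topics are not retracted.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// In-memory stand-in for the main/buffer store pattern (hypothetical names).
class BufferedStoreSketch {
    final Map<String, String> mainStore = new HashMap<>();
    final Map<String, String> bufferStore = new HashMap<>();

    /**
     * Stages the update in the buffer, tries every topic, then merges or rolls back.
     * `send` is a hypothetical hook that returns false when a write to a topic fails.
     */
    boolean process(String key, String value, List<String> topics,
                    BiPredicate<String, String> send) {
        bufferStore.put(key, value);          // 1. update only the buffer store
        for (String topic : topics) {
            if (!send.test(topic, value)) {   // 2. a write failed
                bufferStore.remove(key);      //    roll back: drop the buffered content
                return false;
            }
        }
        mainStore.putAll(bufferStore);        // 3. all writes succeeded: merge into main
        bufferStore.clear();
        return true;
    }
}
```

Calling process("k1", "v1", topics, (t, v) -> true) leaves "k1" in mainStore, while a send hook that returns false for any topic leaves mainStore unchanged and the buffer empty.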

Is it possible to reset offsets to a topic for a kafka consumer group in a kafka connector?

My kafka sink connector reads from multiple topics (configured with 10 tasks) and processes upwards of 300 records from all topics. Based on the information held in each record, the connector may perform certain operations.
Here is an example of the key:value pair in a trigger record:
"REPROCESS":"my-topic-1"
Upon reading this record, I would then need to reset the offsets of the topic 'my-topic-1' to 0 in each of its partitions.
I have read in many places that creating a new KafkaConsumer, subscribing to the topic, and then seeking to the desired offsets from a ConsumerRebalanceListener is the recommended way. For example,
public class MyTask extends SinkTask {
    @Override
    public void put(Collection<SinkRecord> records) {
        records.forEach(record -> {
            if (record.key().toString().equals("REPROCESS")) {
                reprocessTopicRecords(record);
            } else {
                // do something else
            }
        });
    }

    private void reprocessTopicRecords(SinkRecord record) {
        KafkaConsumer<JsonNode, JsonNode> reprocessorConsumer =
                new KafkaConsumer<>(reprocessorProps, deserializer, deserializer);
        reprocessorConsumer.subscribe(Arrays.asList(record.value().toString()),
                new ConsumerRebalanceListener() {
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {}
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        // do offset reset here
                    }
                }
        );
    }
}
However, the above strategy does not work for my case because:
1. It depends on a group rebalance taking place (does not always happen)
2. The 'partitions' passed to the onPartitionsAssigned method are dynamically assigned, meaning they are only a subset of the full set of partitions whose offsets need to be reset. For example, this SinkTask will be assigned only 2 of the 8 partitions that hold the records for 'my-topic-1'.
I've also looked into using assign() but this is not compatible with the distributed consumer model (consumer groups) in the SinkConnector/SinkTask implementation.
I am aware that the kafka command line tool kafka-consumer-groups can do exactly what I want (I think):
https://gist.github.com/marwei/cd40657c481f94ebe273ecc16601674b
To summarize, I want to reset the offsets of all partitions for a given topic using Java APIs and let the Sink Connector pick up the offset changes and continue to do what it has been doing (processing records).
Thanks in advance.
I was able to achieve resetting offsets for a kafka connect consumer group by using a series of Confluent's kafka-rest-proxy APIs: https://docs.confluent.io/current/kafka-rest/api.html
This implementation no longer requires the 'trigger record' approach first described in the original post and is purely REST API based.
1. Temporarily delete the kafka connector (this deletes the connector's consumers and )
2. Create a consumer instance for the same consumer group ("connect-")
3. Have the instance subscribe to the topic you want to reset
4. Do a dummy poll ('subscribe' is evaluated lazily)
5. Reset the consumer group's offsets for the specified topic
6. Do a dummy poll ('seek' is evaluated lazily), then commit the current offset state (in the proxy) for the consumer
7. Re-create the kafka connector (with the same connector name); after rebalancing, its consumers will join the group and read the last committed offset (starting from 0)
8. Delete the temporary consumer instance
If you are able to use the CLI, Steps 2-6 can be replaced with:
kafka-consumer-groups --bootstrap-server <kafkahost:port> --group <group_id> --topic <topic_name> --reset-offsets --to-earliest --execute
As for those of you trying to do this in the kafka connector code through native Java APIs, you're out of luck :-(
You're looking for the seek method. Either to an offset
consumer.seek(new TopicPartition("topic-name", partition), offset);
Or use seekToBeginning.
However, I feel like you'd be competing with the Connect Sink API's consumer group. In other words, assuming you set up the consumer with a separate group id, you're essentially consuming records from the source topic twice: once by Connect, and once by your own consumer instance.
Unless you also explicitly seek Connect's own consumer instance (which is not exposed), you'd get into a weird state. For example, your task only executes on new records in the topic even though your own consumer would be looking at an old offset, or you'd keep receiving even newer events while still processing old ones.
Also, you might eventually get a reprocess event at the very beginning of the topic, for example because retention policies expired older records. That would cause your consumer to make no progress at all and to constantly rebalance its group by seeking to the beginning.
We had to do a very similar offset resetting exercise.
KafkaConsumer.seek() combined with KafkaConsumer.commitSync() worked well.
There is another option that is worth mentioning, if you are dealing with lots of topics and partitions (javadoc):
AdminClient.alterConsumerGroupOffsets(
String groupId,
Map<TopicPartition,OffsetAndMetadata> offsets
)
We were lucky because we had the luxury to stop the Kafka Connect instance for a while, so there's no consumer group competing.

how to identify and merge the messages of different queues in kafka

Background:
We previously used Hibernate Search, Lucene and a JBoss HornetQ queue for indexing.
Our application is the producer and sends the metadata (unique information that identifies a record in the database) to HornetQ.
The consumer receives this metadata and queries the database to fetch the complete record details (including child objects).
This is a very database-centric approach.
Now we want to eliminate the database-centric approach for indexing. We have decided to use Kafka rather than HornetQ.
There is no issue when the user creates data.
We see a potential problem when the user edits data (say, a parent entity with two child objects). When the data is pulled from the database for display to the user, we push the same data to Kafka topic1. When the user modifies the data (say, parent-level data) and submits it, we get only the parent-level data (not the child objects' data) and push the changed data to topic2. Now we have to merge the message in topic1 (child objects) with the corresponding message in topic2 (parent-level data).
Note: we have to take this route because, as you know, indexing has no update operation; it is a delete followed by an insert.
Questions:
1. If I go with the above approach, how can I map a specific message in topic1 to the corresponding message in topic2? Is there a way to provide the same message ids in topic1 and topic2?
2. Is there any way to resolve this issue if I use a single topic?
3. Is there any better design/approach to resolve the above issue?
Thanks in advance.
If i go with the above approach, how can I map the specific message present in topic1 with the specific message in topic2. Is there a way to provide the same message ids in topic1 and topic2 ?
To map or join specific messages between topics in the same Kafka cluster, Kafka Streams and KSQL may be a good direction. You can find the reference here.
There are many ways to make an object unique; I suggest using the parent entity id as the message key when you send messages to topic1 and topic2. Sample Java code follows:
// parentEntity and childEntity are the instances being sent; both records are keyed by the parent id
ProducerRecord<String, ParentEntity> parentRecord =
        new ProducerRecord<>(topic1, parentEntity.getId(), parentEntity);
ListenableFuture<SendResult<String, ParentEntity>> parentFuture =
        kafkaTemplate.send(parentRecord);
parentFuture.addCallback(new ListenableFutureCallback<SendResult<String, ParentEntity>>() {
    @Override
    public void onSuccess(SendResult<String, ParentEntity> result) {}

    @Override
    public void onFailure(Throwable ex) {
        // print out error log
    }
});

ProducerRecord<String, ChildEntity> childRecord =
        new ProducerRecord<>(topic2, childEntity.getParentEntityId(), childEntity);
ListenableFuture<SendResult<String, ChildEntity>> childFuture =
        kafkaTemplate.send(childRecord);
childFuture.addCallback(new ListenableFutureCallback<SendResult<String, ChildEntity>>() {
    @Override
    public void onSuccess(SendResult<String, ChildEntity> result) {}

    @Override
    public void onFailure(Throwable ex) {
        // print out error log
    }
});
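To illustrate why a shared key makes the merge possible, here is a minimal in-memory sketch. It assumes, as in the question, that topic1 carries the full record with child objects and topic2 carries parent-level changes only. The class and method names are hypothetical, and a HashMap stands in for what would be a table or state store in Kafka Streams or KSQL.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory join keyed by parent entity id; not Kafka Streams API.
class ParentChildJoinSketch {
    // Latest child payloads seen on "topic1", keyed by parent entity id.
    private final Map<String, List<String>> childrenByParentId = new HashMap<>();

    // Called for every message read from topic1 (record including child objects).
    void onTopic1(String parentId, List<String> children) {
        childrenByParentId.put(parentId, children);
    }

    // Called for every message read from topic2 (parent-level change only).
    // Returns the merged document to index, or null if no children are known yet.
    String onTopic2(String parentId, String parentData) {
        List<String> children = childrenByParentId.get(parentId);
        if (children == null) {
            return null;
        }
        return parentData + " " + children;
    }
}
```

Because both messages carry the same key, the consumer can look up the children with a single map access instead of querying the database.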
Is there any way to resolve this issue if i use the single topic ?
You can create a new table (say, A) in the database to store the full message to be sent for indexing. Every time a user creates or updates data, the message is also inserted into or updated in table A. Finally, your Kafka client pulls message objects from table A and produces them to a single topic in the Kafka cluster.
Is there any better design/approach to resolve the above issue ?
You can try Kafka Streams and KSQL, as I mentioned above.

Apache Flink dynamic number of Sinks

I am using Apache Flink and the KafkaConsumer to read some values from a Kafka Topic.
I also have a stream obtained from reading a file.
Depending on the received values, I would like to write this stream on different Kafka Topics.
Basically, I have a network with a leader linked to many children. For each child, the Leader needs to write the stream read in a child-specific Kafka Topic, so that the child can read it.
When the child is started, it registers itself in the Kafka topic read from the Leader.
The problem is that I don't know a priori how many children I have.
For example, I read 1 from the Kafka Topic, I want to write the stream in just one Kafka Topic named Topic1.
I read 1-2, I want to write on two Kafka Topics (Topic1 and Topic2).
I don't know if it is possible because in order to write on the Topic, I am using the Kafka Producer along with the addSink method and to my understanding (and from my attempts) it seems that Flink requires to know the number of sinks a priori.
But then, is there no way to obtain such behavior?
If I understood your problem well, I think you can solve it with a single sink, since you can choose the Kafka topic based on the record being processed. It also seems that one element from the source might be written to more than one topic, in which case you would need a FlatMapFunction to replicate each source record N times (one for each output topic). I would recommend to output as a pair (aka Tuple2) with (topic, record).
DataStream<Tuple2<String, MyValue>> stream = input.flatMap(
        new FlatMapFunction<MyValue, Tuple2<String, MyValue>>() {
    @Override
    public void flatMap(MyValue value, Collector<Tuple2<String, MyValue>> out) {
        // emit one (topic, record) pair per output topic
        for (String topic : topics) {
            out.collect(Tuple2.of(topic, value));
        }
    }
});
Then you can use the topic previously computed by creating the FlinkKafkaProducer with a KeyedSerializationSchema in which you implement getTargetTopic to return the first element of the pair.
stream.addSink(new FlinkKafkaProducer10<>(
        "default-topic",
        new KeyedSerializationSchema<Tuple2<String, MyValue>>() {
            @Override
            public String getTargetTopic(Tuple2<String, MyValue> element) {
                // the first element of the pair is the target topic
                return element.f0;
            }
            ...
        },
        kafkaProperties)
);
KeyedSerializationSchema is now deprecated. Instead, you have to use KafkaSerializationSchema.
The same can be achieved by overriding the serialize method.
@Override
public ProducerRecord<byte[], byte[]> serialize(String inputString, @Nullable Long aLong) {
    return new ProducerRecord<>(customTopicName,
            key.getBytes(StandardCharsets.UTF_8),
            inputString.getBytes(StandardCharsets.UTF_8));
}

Apache Kafka: Exactly Once in Version 0.10

To achieve exactly-once processing of messages by Kafka consumer I am committing one message at a time, like below
public void commitOneRecordConsumer(long seconds) {
    KafkaConsumer<String, String> consumer = consumerConfigFactory.getConsumerConfig();
    try {
        while (running) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            try {
                for (ConsumerRecord<String, String> record : records) {
                    processingService.process(record);
                    // commit the offset of the next record to read for this partition
                    consumer.commitSync(Collections.singletonMap(
                            new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1)));
                    System.out.println("Committed Offset" + ": " + record.offset());
                }
            } catch (CommitFailedException e) {
                // application specific failure handling
            }
        }
    } finally {
        consumer.close();
    }
}
The above code delegates the processing of each message asynchronously to the class below.
@Service
public class ProcessingService {
    @Async
    public void process(ConsumerRecord<String, String> record) throws InterruptedException {
        Thread.sleep(5000L);
        Map<String, Object> map = new HashMap<>();
        map.put("partition", record.partition());
        map.put("offset", record.offset());
        map.put("value", record.value());
        System.out.println("Processed" + ": " + map);
    }
}
However, this still does not guarantee exactly-once delivery: if the processing of one message fails, later messages might still be committed, and the failed message will never be processed or committed. What are my options here?
Original answer for 0.10.2 and older releases (for 0.11 and later releases see answer below)
Currently, Kafka cannot provide exactly-once processing out-of-the box. You can either have at-least-once processing if you commit messages after you successfully processed them, or you can have at-most-once processing if you commit messages directly after poll() before you start processing.
(see also paragraph "Delivery Guarantees" in http://docs.confluent.io/3.0.0/clients/consumer.html#synchronous-commits)
However, an at-least-once guarantee is "good enough" if your processing is idempotent, i.e., the final result is the same even if you process a record twice. An example of idempotent processing is adding a message to a key-value store: even if you add the same record twice, the second insert just replaces the existing key-value pair, and the KV store still holds the correct data.
In your example code above, you update a HashMap, which is an idempotent operation. You might end up with an inconsistent state in case of failure, for example if only two put calls are executed before the crash. However, this inconsistent state would be fixed when the same record is reprocessed.
The call to println() is not idempotent, though, because it is an operation with a "side effect". But I guess the print is for debugging purposes only.
As an alternative, you would need to implement transaction semantics in your user code, which requires you to "undo" (partly executed) operations in case of failure. In general, this is a hard problem.
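The idempotency argument can be made concrete with a small sketch. It is not Kafka code: the class and field names are hypothetical, a HashMap stands in for the key-value store, and a list stands in for any append-style side effect such as println(). Processing the same record twice simulates a redelivery under at-least-once semantics.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: the same record applied to an idempotent and a non-idempotent sink.
class IdempotencySketch {
    final Map<String, String> kvStore = new HashMap<>();  // put overwrites: idempotent
    final List<String> appendLog = new ArrayList<>();     // add duplicates: not idempotent

    void process(String key, String value) {
        kvStore.put(key, value);           // second call replaces the same entry
        appendLog.add(key + "=" + value);  // second call adds a duplicate line
    }
}
```

After calling process("k", "v") twice, kvStore still holds a single entry for "k", whereas appendLog holds two identical lines. Only the first kind of sink tolerates redelivery without extra work.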
Update for Apache Kafka 0.11+ (for pre 0.11 releases see answer above)
Since 0.11, Apache Kafka supports idempotent producers, transactional producers, and exactly-once processing using Kafka Streams. It also adds a "read_committed" mode to the consumer so that it reads only committed messages (and drops/filters aborted messages).
https://kafka.apache.org/documentation/#semantics
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
https://www.confluent.io/blog/transactions-apache-kafka/
https://www.confluent.io/blog/enabling-exactly-kafka-streams/
Apache Kafka 0.11.0.0 has just been released, and it now supports exactly-once delivery.
http://kafka.apache.org/documentation/#upgrade_11_exactly_once_semantics
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
I think exactly-once processing can be achieved with Kafka 0.10.x itself, but there's a catch. I'm sharing the high-level idea from this book. The relevant content is in the section "Seek and Exactly Once Processing" in chapter 4, "Kafka Consumers - Reading Data from Kafka". You can view the contents of that book with a (free) safaribooksonline account, or buy it once it's out, or maybe get it from other sources, which we shall not speak about.
Idea:
Think about this common scenario: Your application reads events from Kafka, processes the data, and then stores the results in a database. Suppose that we really don’t want to lose any data, nor do we want to store the same results in the database twice.
It's doable if there is a way to store both the record and the offset in one atomic action. Either both the record and the offset are committed, or neither of them are committed.
To achieve that, we need to write both the record and the offset to the database, in one transaction. Then we’ll know that either we are done with the record and the offset is committed or we are not, and the record will be reprocessed.
Now the only problem is: if the record is stored in a database and not in Kafka, how will our consumer know where to start reading when it is assigned a partition? This is exactly what seek() can be used for. When the consumer starts or when new partitions are assigned, it can look up the offset in the database and seek() to that location.
Sample code from the book:
public class SaveOffsetsOnRebalance implements ConsumerRebalanceListener {
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        commitDBTransaction();
    }

    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // resume each newly assigned partition from the offset stored in the database
        for (TopicPartition partition : partitions)
            consumer.seek(partition, getOffsetFromDB(partition));
    }
}

consumer.subscribe(topics, new SaveOffsetsOnRebalance());
consumer.poll(0);

for (TopicPartition partition : consumer.assignment())
    consumer.seek(partition, getOffsetFromDB(partition));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
    {
        processRecord(record);
        storeRecordInDB(record);
        storeOffsetInDB(record.topic(), record.partition(), record.offset());
    }
    // results and offsets are committed together in one database transaction
    commitDBTransaction();
}