How to check messages for uniqueness in Kafka? - apache-kafka

There is a topic Users with partitions.
Each partition holds messages with user data.
How can I avoid duplicates, for example not allow inserting the same user's name twice?
If I understand this correctly, I should create a separate topic Usernames and append all requested usernames to it.
Then, before adding a new user to the topic Users, I check that there are no duplicates in the topic Usernames, right?
Presumably this is done using Streams?

I assume you are talking about a scenario where you are trying to publish events to a Kafka topic from a microservice.
Also assuming you want to publish user profiles with the username as the key and the user profile as the value.
There are two deduplication issues here:
1.) Your service might receive the same username at different times and publish it to the topic more than once.
2.) Duplicate message processing - during a broker failure (ack not received) or a Kafka client failure, the same message can be reprocessed because the Kafka client has no ack information.
This can be taken care of by enabling idempotence on the Kafka producers and using atomic transactions (refer to exactly-once processing).
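For point 2.), a minimal sketch of what enabling idempotence and transactions on a plain producer might look like (the broker address, transactional id and JSON payload are placeholders, not from the question):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");        // placeholder broker
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");                 // broker drops duplicate retries
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "user-service-tx");        // assumed id, enables atomic writes

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("users", "alice", "{\"name\":\"alice\"}"));
    producer.commitTransaction();
} catch (KafkaException e) {
    producer.abortTransaction();
}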
I believe your question is about 1.), where your service receives duplicate messages.
Solution 1:
If you are using a microservice, you can keep an in-memory cache/DB of usernames and only publish to Kafka if no duplicate is found.
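A minimal sketch of Solution 1, assuming a single service instance and an in-memory set (for several instances you would need a shared cache or DB instead; "users" is the topic from the question):
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical helper inside the service.
private final Set<String> seenUsernames = ConcurrentHashMap.newKeySet();

public void publishUser(KafkaProducer<String, String> producer, String username, String profileJson) {
    // add() returns false if the username was already seen, so duplicates are never published
    if (seenUsernames.add(username)) {
        producer.send(new ProducerRecord<>("users", username, profileJson));
    }
}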
Solution 2: (handle it on Kafka itself using Streams)
Input topic: users
Build a Kafka Streams client with a state store (KeyValueStore) and a transformer that implements your dedupe logic.
Your Kafka Streams client consumes the events from the users topic, transforms them in UsersDedupeTransformer (where the dedupe logic lives), and then produces to the output topic (as per your requirements).
StoreBuilder<KeyValueStore<String, String>> storeBuilder = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("userDedupeStore"),
        Serdes.String(),
        Serdes.String())
    .withCachingEnabled();

builder.addStateStore(storeBuilder)
       .stream("users-topic", Consumed.with(Serdes.String(), Serdes.String()))
       .transform(() -> new UsersDedupeTransformer(), "userDedupeStore")
       .to("destination-topic");
In UsersDedupeTransformer, the dedupe store is wired up in init() and the transform method is overridden:
@Override
public void init(ProcessorContext context) {
    this.context = context;
    dedupeStore = (KeyValueStore<String, String>) context.getStateStore("userDedupeStore");
}

@Override
public KeyValue<String, String> transform(String key, String value) {
    // Forward only keys that have not been seen before; drop duplicates.
    if (key != null && dedupeStore.get(key) == null) {
        dedupeStore.put(key, value);
        return KeyValue.pair(key, value);
    }
    return null;
}
The dedupe store can be configured as in-memory or persisted using RocksDB.
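For example, switching the store supplier is all that should be needed to go from the RocksDB-backed store shown above to an in-memory one (store name as above):
// Same builder as above, but with an in-memory supplier; the store contents are lost
// on restart unless restored from the store's changelog topic.
StoreBuilder<KeyValueStore<String, String>> inMemoryStoreBuilder = Stores.keyValueStoreBuilder(
        Stores.inMemoryKeyValueStore("userDedupeStore"),
        Serdes.String(),
        Serdes.String());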

Related

How to test a ProducerInterceptor in a Kafka Streams topology?

I have the requirement to pipe records from one topic to another, keeping the original partitioning intact (the original producers use a non-native Kafka partitioner). The reason for this is that the source topic is uncompressed, and we wish to "reprocess" the data into a compressed topic, transparently from the point of view of the original producers and consumers.
I have a trivial KStreams topology that does this using a ProducerInterceptor:
void buildPipeline(StreamsBuilder streamsBuilder) {
    streamsBuilder
        .stream(topicProperties.getInput().getName())
        .to(topicProperties.getOutput().getName());
}
together with:
interceptor.classes: com.acme.interceptor.PartitionByHeaderInterceptor
This interceptor looks in the message headers (which contain a partition id header) and simply redirects the ProducerRecord to the original partition:
@Override
public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record) {
    int partition = extractSourcePartition(record);
    return new ProducerRecord<>(record.topic(), partition, record.timestamp(), record.key(), record.value(), record.headers());
}
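extractSourcePartition is not shown in the question; a sketch of what it might look like, assuming the header is called "source-partition" and stores the partition id as a UTF-8 string (both assumptions):
// Hypothetical helper: the real header name and encoding are not given in the question.
// Requires: import org.apache.kafka.common.header.Header; import java.nio.charset.StandardCharsets;
private int extractSourcePartition(ProducerRecord<K, V> record) {
    Header header = record.headers().lastHeader("source-partition"); // assumed header key
    if (header == null) {
        throw new IllegalStateException("source partition header missing");
    }
    return Integer.parseInt(new String(header.value(), StandardCharsets.UTF_8));
}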
My question is: how can I test this interceptor in a test topology (i.e. integration test)?
I've tried adding:
streamsConfiguration.put(StreamsConfig.producerPrefix("interceptor.classes"),
PartitionByHeaderInterceptor.class.getName());
(which is how I enable the interceptor in production code)
to my test topology stream configuration, but my interceptor is not called by the test topology driver.
Is what I'm trying to do currently technically possible?

Rollback mechanism in the Kafka Processor API?

I am using the Kafka Processor API (not the DSL).
public class StreamProcessor implements Processor<String, String> {

    public ProcessorContext context;
    private KeyValueStore<String, String> stateStore;  // state store initialized with key -> "|"-separated topic list

    public void init(ProcessorContext context) {
        this.context = context;
        context.commit();
    }

    public void process(String key, String val) {
        try {
            String[] topicList = stateStore.get(key).split("\\|");  // split on a literal pipe
            for (String topic : topicList) {
                context.forward(key, val, To.child(topic));
            }
            // forward the same message to a list of topics (1..n topics);
            // roll back if the write to some topics failed?
        } catch (Exception e) {
            // ...
        }
    }
}
Scenario: we are reading data from a source topic and the stream
processor writes data to multiple sink topics (topicList above).
Question: how can a rollback mechanism be implemented with the Kafka Streams
Processor API when one or more of the topics in topicList above
fails to receive the message?
My understanding is that the Processor API can roll back each record it
failed to send, but can a rollback for an entire batch of failed
messages be achieved as well? Since the process method in the Processor
interface is called per record rather than per batch, I would surmise it can
only be done per record. Is this a correct assumption? If not, please suggest
how to achieve per-record and per-batch rollbacks for failed topics using the Processor API.
You would need to implement it yourself. For example, you could use two stores: a main store and a "buffer" store. First update only the buffer store, then call context.forward() to make sure all writes reach the output topic, and afterwards merge the "buffer" store into the main store.
If you need to roll back, you drop the contents of the buffer store.
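A rough sketch of that two-store idea with the Processor API (store names and value types are illustrative; note that forward() only hands the record downstream, so this shows the structure of the approach rather than a delivery guarantee):
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class BufferedProcessor implements Processor<String, String> {

    private ProcessorContext context;
    private KeyValueStore<String, String> mainStore;
    private KeyValueStore<String, String> bufferStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        mainStore = (KeyValueStore<String, String>) context.getStateStore("main-store");
        bufferStore = (KeyValueStore<String, String>) context.getStateStore("buffer-store");
    }

    @Override
    public void process(String key, String value) {
        bufferStore.put(key, value);            // 1. stage the update in the buffer store
        try {
            context.forward(key, value);        // 2. forward to the downstream sink(s)
            mainStore.put(key, value);          // 3. merge the buffered update into the main store
            bufferStore.delete(key);
        } catch (RuntimeException e) {
            bufferStore.delete(key);            // "rollback": drop the buffered update
            throw e;
        }
    }

    @Override
    public void close() {}
}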

Kafka exactly_once processing - do you need your streams app to produce to kafka topic as well?

I have a Kafka Streams app consuming from a Kafka topic. It only consumes and processes the data but doesn't produce anything.
For Kafka's exactly_once processing to work, do you also need your Streams app to write to a Kafka topic?
How can you achieve exactly_once if your Streams app wants to process the message only once but not produce anything?
Providing “exactly-once” processing semantics really means that distinct updates to the state of an operator that is managed by the stream processing engine are only reflected once. “Exactly-once” by no means guarantees that processing of an event, i.e. execution of arbitrary user-defined logic, will happen only once.
Above is the "Exactly once" semantics explanation.
It is not always necessary to publish the output to a topic in a Kafka Streams application.
When you are using Kafka Streams applications, you have to define an application.id for each one, which uses a consumer in the background. In the application you have to configure a few
parameters, such as setting processing.guarantee to exactly_once and enabling enable.idempotence.
Here are the details:
https://kafka.apache.org/22/documentation/streams/developer-guide/config-streams#processing-guarantee
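A minimal sketch of the relevant configuration (application id and broker address are placeholders):
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
// This single setting makes Streams use idempotent/transactional producers
// and read_committed consumers internally.
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);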
I am not disputing the exactly-once stream pattern, because that is the beauty of Kafka Streams; however, it is possible to use Kafka Streams without producing to other topics.
The exactly-once stream pattern is simply the ability to execute a read-process-write operation exactly one time. This means you consume one message at a time, process it, publish it to another topic, and commit. The commit is handled by Streams automatically, one message at a time.
Kafka Streams achieves this by setting the parameters below, which cannot be overridden:
isolation.level (read_committed): consumers will only ever read committed data.
enable.idempotence (true): producers will always have idempotence enabled.
max.in.flight.requests.per.connection (5): producers will allow at most five in-flight requests per connection, which still preserves ordering when idempotence is enabled.
In case of an error in the consumer or producer, Kafka Streams always retries a configured number of attempts.
Kafka Streams does not give guarantees about your own processing logic; you still need to handle that yourself. For example, if there is a DB operation and the DB connection fails, Kafka is not aware of it, so you need to handle it on your own.
As per the pattern definition, yes, we need a consumer, processing, and a producer topic, but in general nothing stops you from not producing to another topic. You can still consume exactly one item at a time with the default commit interval (DEFAULT_COMMIT_INTERVAL_MS), and you need to handle failures in your own transactional logic yourself.
Here is a sample example:
StreamsBuilder builder = new StreamsBuilder();
Properties props = getStreamProperties();
KStream<String, String> textLines = builder.stream(Pattern.compile("topic"));
textLines.process(() -> new ProcessInternal());

KafkaStreams streams = new KafkaStreams(builder.build(), props);
final CountDownLatch latch = new CountDownLatch(1);
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    logger.info("Completed VQM stream");
    streams.close();
}));

logger.info("Streaming start...");
try {
    streams.start();
    latch.await();
} catch (Throwable e) {
    System.exit(1);
}

class ProcessInternal implements Processor<String, String> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public void close() {
        // Any code for clean up would go here.
    }

    @Override
    public void process(String key, String value) {
        // Your transactional process business logic
    }
}

Kafka Consumer API jumping offsets

I am using Kafka version 2.0 and the Java consumer API to consume messages from a topic. We are using a single-node Kafka server with one consumer per partition. I have observed that the consumer is losing some of the messages.
The scenario is:
The consumer polls the topic.
I have created one consumer per thread.
It fetches the messages and gives them to a handler to handle.
Then it commits the offsets using at-least-once Kafka consumer semantics.
In parallel, I have another consumer running with a different group id. In that consumer, I'm simply increasing a message counter and committing the offset. There is no message loss in that consumer.
try {
    //kafkaConsumer.registerTopic();
    consumerThread = new Thread(() -> {
        final String topicName1 = "topic-0";
        final String topicName2 = "topic-1";
        final String topicName3 = "topic-2";
        final String topicName4 = "topic-3";
        String groupId = "group-0";

        final Properties consumerProperties = new Properties();
        consumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.13.49:9092");
        consumerProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        consumerProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
        consumerProperties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        consumerProperties.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 1000);

        try {
            consumer = new KafkaConsumer<>(consumerProperties);
            consumer.subscribe(Arrays.asList(topicName1, topicName2, topicName3, topicName4));
        } catch (KafkaException ke) {
            logTrace(MODULE, ke);
        }

        while (service.isServiceStateRunning()) {
            ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(100));
            for (TopicPartition partition : records.partitions()) {
                List<ConsumerRecord<String, byte[]>> partitionRecords = records.records(partition);
                for (ConsumerRecord<String, byte[]> record : partitionRecords) {
                    processMessage(record);
                }
            }
            consumer.commitSync();
        }
        kafkaConsumer.closeResource();
    }, "KAKFA_CONSUMER");
} catch (Exception e) {
}
There seems to be a problem with the usage of subscribe() here.
subscribe() is used to subscribe to topics, not to partitions. To use specific partitions you need to use assign(). Read the extract from the documentation below:
public void subscribe(java.util.Collection topics)
Subscribe to the given list of topics to get dynamically assigned
partitions. Topic subscriptions are not incremental. This list will
replace the current assignment (if there is one). It is not possible
to combine topic subscription with group management with manual
partition assignment through assign(Collection). If the given list of
topics is empty, it is treated the same as unsubscribe(). This is a
short-hand for subscribe(Collection, ConsumerRebalanceListener), which
uses a noop listener. If you need the ability to seek to particular
offsets, you should prefer subscribe(Collection,
ConsumerRebalanceListener), since group rebalances will cause
partition offsets to be reset. You should also provide your own
listener if you are doing your own offset management since the
listener gives you an opportunity to commit offsets before a rebalance
finishes.
public void assign(java.util.Collection partitions)
Manually assign a list of partitions to this consumer. This interface
does not allow for incremental assignment and will replace the
previous assignment (if there is one). If the given list of topic
partitions is empty, it is treated the same as unsubscribe(). Manual
topic assignment through this method does not use the consumer's group
management functionality. As such, there will be no rebalance
operation triggered when group membership or cluster and topic
metadata change. Note that it is not possible to use both manual
partition assignment with assign(Collection) and group assignment with
subscribe(Collection, ConsumerRebalanceListener).
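For illustration, a minimal sketch of manual assignment with assign() (topic names and partition numbers are only examples; consumerProperties is the Properties object from the question):
import java.util.Arrays;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(consumerProperties);
// Manually pin this consumer to specific partitions; no group rebalancing is involved.
consumer.assign(Arrays.asList(
        new TopicPartition("topic-0", 0),
        new TopicPartition("topic-1", 0)));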
You probably shouldn't do what you're doing. You should use subscribe(), with multiple partitions per topic and multiple consumers in the group for high availability, and let the consumer handle the offsets for you.
You don't describe why you're trying to process your topics in this custom way; it's advanced and leads to issues.
The timestamps on your instances should not have to be synchronised to do normal topic processing.
If you're looking for more performance, or to isolate records more carefully to avoid "head of line blocking", consider something like Parallel Consumer (PC).
It also tracks per-record acknowledgement, among other things. Check out Parallel Consumer on GitHub (it's open source, BTW, and I'm the author).

How to identify and merge the messages of different queues in Kafka

Background:
We previously used Hibernate Search, Lucene and a JBoss HornetQ queue for indexing.
Our application is the producer and sends the metadata (unique information identifying a record in the database) to HornetQ.
The consumer receives this metadata and queries the database to fetch the complete record details (including child objects).
This is a very database-centric approach.
Now we want to eliminate the database-centric approach to indexing. We have decided to use Kafka rather than HornetQ.
There is no issue when the user creates the data.
We see a potential problem when the user edits the data (say a parent entity with two child objects). When the data is pulled from the database for display to the user,
we push the same data to Kafka topic1. When the user modifies the data (say the parent-level data) and submits it, we get only the parent-level data (not the child objects' data), and we push the changed data to topic2. Now we have to merge the message present in topic1 (child objects) with the corresponding message in topic2 (parent-level data).
Note: we have to take this route because, as you know, there is no update in indexing; it is a delete followed by an insert.
Questions:
If I go with the above approach, how can I map the specific
message present in topic1 to the specific message in topic2? Is
there a way to provide the same message IDs in topic1 and topic2?
Is there any way to resolve this issue if I use a single topic?
Is there any better design/approach to resolve the above issue?
Thanks in advance.
If I go with the above approach, how can I map the specific message present in topic1 to the specific message in topic2? Is there a way to provide the same message IDs in topic1 and topic2?
To map or join specific messages between topics in the same Kafka cluster, Kafka Streams and KSQL are a good direction to take. You can find the reference here.
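As a rough Kafka Streams sketch, assuming both topics are keyed by the parent entity id and the values are serialized as strings (mergeForIndexing and the output topic name are placeholders):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();
// topic1 holds the full record (with child objects), topic2 the edited parent-level data
KTable<String, String> fullRecords = builder.table("topic1", Consumed.with(Serdes.String(), Serdes.String()));
KStream<String, String> parentUpdates = builder.stream("topic2", Consumed.with(Serdes.String(), Serdes.String()));

parentUpdates
        .join(fullRecords, (parentUpdate, fullRecord) -> mergeForIndexing(parentUpdate, fullRecord)) // merge logic is application-specific
        .to("indexing-topic", Produced.with(Serdes.String(), Serdes.String()));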
There are many ways to make an object unique, and I suggest using the parent entity id as the key when you send messages to topic1 and topic2. Sample Java code:
ProducerRecord<String, ParentEntity> parentRecord = new ProducerRecord<>(topic1,
        parentEntity.getId(), parentEntity);
ListenableFuture<SendResult<String, ParentEntity>> parentFuture =
        kafkaTemplate.send(parentRecord);
parentFuture.addCallback(new ListenableFutureCallback<SendResult<String, ParentEntity>>() {
    @Override
    public void onSuccess(SendResult<String, ParentEntity> result) {}

    @Override
    public void onFailure(Throwable ex) {
        // print out error log
    }
});

ProducerRecord<String, ChildEntity> childRecord = new ProducerRecord<>(topic2,
        childEntity.getParentEntityId(), childEntity);
ListenableFuture<SendResult<String, ChildEntity>> childFuture =
        kafkaTemplate.send(childRecord);
childFuture.addCallback(new ListenableFutureCallback<SendResult<String, ChildEntity>>() {
    @Override
    public void onSuccess(SendResult<String, ChildEntity> result) {}

    @Override
    public void onFailure(Throwable ex) {
        // print out error log
    }
});
Is there any way to resolve this issue if I use a single topic?
You can create a new table (say A) in the database to store the full message to be sent for indexing. Every time the user creates or updates data, the message is also inserted/updated in table A. Finally, your Kafka client pulls message objects from table A and produces them to a single topic in the Kafka cluster.
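A very small sketch of that idea (table name, column names and topic are all hypothetical, and a real implementation would also mark rows as published or delete them after a successful send):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

void publishPendingIndexMessages(Connection conn, KafkaProducer<String, String> producer) throws SQLException {
    try (PreparedStatement stmt = conn.prepareStatement(
            "SELECT id, payload FROM table_a WHERE published = false");  // hypothetical table/columns
         ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
            // key by the entity id so all versions of a record land in the same partition
            producer.send(new ProducerRecord<>("indexing-topic", rs.getString("id"), rs.getString("payload")));
        }
    }
}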
Is there any better design/approach to resolve the above issue?
You can try Kafka Streams and KSQL as I mentioned above.