I have a situation where I need to check whether a particular message already exists in a topic or not; I need absolutely no duplicates in the topic.
Can anyone suggest a graceful way of doing this, rather than consuming all the messages and checking against them?
I do not consider myself an expert in Kafka, but I think what you are trying to do goes "against" the essence of Kafka.
However, I came up with a solution using the Kafka Streams library for Java. Basically, the process is the following:
Map every message into a new key-value where the key is a combination of the previous key and its value: (key1, message1) -> (key1-message1, message1)
Group the messages using the keys, as a result of this operation you obtain a KGroupedStream.
Apply a reduce function, modifying the value to some custom value such as the string "Duplicated value".
Convert the resulting KTable after the reduce into a KStream and push it into a new Kafka Topic.
There are quite a few assumptions in the previous explanation, so I am going to provide some code to shed some light:
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> resources = builder.stream("topic-where-the-messages-are-sent");

// re-key every message so the new key is the combination of the original key and the value
KeyValueMapper<String, String, KeyValue<String, String>> kvMapper = new KeyValueMapper<String, String, KeyValue<String, String>>() {
    public KeyValue<String, String> apply(String key, String value) {
        return new KeyValue<String, String>(key + "-" + value, value);
    }
};

// any second (or later) occurrence of the same key-value pair collapses into a marker value
Reducer<String> reducer = new Reducer<String>() {
    public String apply(String value1, String value2) {
        return "Duplicated message";
    }
};

resources.map(kvMapper)
         .groupByKey()
         .reduce(reducer, "test-store-name")
         .toStream()
         .to("unique-message-output");

KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();
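The code above uses the older KStreamBuilder API. A rough sketch of the same topology with the newer StreamsBuilder DSL (Kafka Streams 1.0+) might look like the following; it assumes the default key/value serdes in props are String serdes, and the topic and store names are the same placeholders as above:

// same dedup idea, newer DSL (sketch, not tested against a specific version)
StreamsBuilder newBuilder = new StreamsBuilder();
newBuilder.<String, String>stream("topic-where-the-messages-are-sent")
          // key becomes "key-value" so identical messages land on the same key
          .map((key, value) -> KeyValue.pair(key + "-" + value, value))
          .groupByKey()
          // later occurrences of the same key-value pair are reduced into a marker value
          .reduce((value1, value2) -> "Duplicated message", Materialized.as("test-store-name"))
          .toStream()
          .to("unique-message-output");

KafkaStreams newStreams = new KafkaStreams(newBuilder.build(), props);
newStreams.start();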
Bear in mind that this is probably not an optimal solution, and maybe you would not consider it a "graceful" way of solving your problem.
I hope it helps.
Could you please help me configure a Spring Cloud Stream app based on Kafka? I'm facing an issue with the selectKey operation.
Let me explain what I'm trying to achieve:
2 incoming topics: Person, RefGenre
Person contains the key of RefGenre (in its value)
public class Person {
String nom;
String prenom;
String codeGenre; <<--- here is the key of the second topic refgenre
}
So I'm using the selectKey operator to prepare my stream before the join operation.
A new topic is created by selectKey (my-app-KSTREAM-KEY-SELECT-0000000004-repartition), and then a serialization issue happens:
Exception in thread "my-app-3c57b31c-28e5-4199-b07d-87f8940425ab-StreamThread-1" org.apache.kafka.streams.errors.StreamsException: ClassCastException while producing data to topic my-app-KSTREAM-KEY-SELECT-0000000004-repartition. A serializer (key: org.apache.kafka.common.serialization.StringSerializer / value: statefull.serde.PersonWithGenreSerde) is not compatible to the actual key or value type (key type: java.lang.String / value type: statefull.model.Person). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters (for example if using the DSL, #to(String topic, Produced<K, V> produced) with Produced.keySerde(WindowedSerdes.timeWindowedSerdeFrom(String.class))).
Where can I specify the serde for this repartition topic, and can I specify the name of this "internal" topic?
@Bean
public BiFunction<KStream<String, Person>, KTable<String, ReferentielGenre>, KStream<Long, PersonWithGenre>> joinKtable() {
    return (persons, referentielGenres) ->
            persons.selectKey((k, v) -> v.getCodeGenre())
                   .join(referentielGenres,
                         (person, genre) -> new PersonWithGenre(person.getNom(), person.getPrenom(), genre),
                         Joined.with(Serdes.String(), new PersonWithGenreSerde(), null));
}
Here is the full code of my non-working job: https://github.com/YohanAlard/joinkstream
Is there a better way to handle this use case?
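(Not part of the original thread, just a hedged sketch: the serdes passed via Joined.with are the ones Kafka Streams uses for the repartition topic created by selectKey, so the second argument should match the stream's value type, Person, rather than the join result. PersonSerde below is a hypothetical serde for the Person class, and the repartition()/Repartitioned variant needs Kafka Streams 2.6+.)

// Fix the serde used for the selectKey repartition topic (PersonSerde is hypothetical):
persons.selectKey((k, v) -> v.getCodeGenre())
       .join(referentielGenres,
             (person, genre) -> new PersonWithGenre(person.getNom(), person.getPrenom(), genre),
             Joined.with(Serdes.String(), new PersonSerde(), null));

// Or, with Kafka Streams 2.6+, declare the repartition explicitly and give the internal topic a name:
persons.selectKey((k, v) -> v.getCodeGenre())
       .repartition(Repartitioned.<String, Person>as("person-by-genre")
                                 .withKeySerde(Serdes.String())
                                 .withValueSerde(new PersonSerde()))
       .join(referentielGenres,
             (person, genre) -> new PersonWithGenre(person.getNom(), person.getPrenom(), genre),
             Joined.with(Serdes.String(), new PersonSerde(), null));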
I have a microservice that performs some stateful processing. The application constructs a KStream from an input topic, does some stateful processing, then writes data into the output topic.
I will be running 3 of these applications in the same group. There are 3 parameters that I need to store so that, in the event the microservice goes down, the microservice that takes over can query the shared state store and continue where the crashed service left off.
I am thinking of pushing these 3 parameters into a state store and querying the data when the other microservice takes over. From my research, I have seen a lot of examples where people perform event counting using a state store, but that's not exactly what I want. Does anyone know an example, or what the right approach for this problem is?
So you want to do 2 things:
a. the service going down has to store the parameters:
If you want to do it in a straightforward way, then all you have to do is write a message to the topic associated with the state store (the one you are reading with a KTable). Use the Kafka Producer API or a KStream (could be kTable.toStream()) to do it, and that's it.
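For example, a minimal sketch with the plain Producer API; the bootstrap address, topic name, parameter keys and values below are all assumptions:

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (Producer<String, String> producer = new KafkaProducer<>(props)) {
    // "parameters-topic" must be the topic backing the KTable/state store; the keys identify the 3 parameters
    producer.send(new ProducerRecord<>("parameters-topic", "param-1", "value-1"));
    producer.send(new ProducerRecord<>("parameters-topic", "param-2", "value-2"));
    producer.send(new ProducerRecord<>("parameters-topic", "param-3", "value-3"));
}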
Otherwise you could create a state store manually:
// take these serdes as just an example
Serde<String> keySerde = Serdes.String();
Serde<String> valueSerde = Serdes.String();

KeyValueBytesStoreSupplier storeSupplier = Stores.inMemoryKeyValueStore(stateStoreName);
streamsBuilder.addStateStore(Stores.keyValueStoreBuilder(storeSupplier, keySerde, valueSerde));
then use it in a transformer or processor to add items to it; you'll have to declare this in the transformer/processor:
// depending on the serde above you might have something else then String
private KeyValueStore<String, String> stateStore;
and initialize the stateStore variable:
@Override
public void init(ProcessorContext context) {
    stateStore = (KeyValueStore<String, String>) context.getStateStore(stateStoreName);
}
and later use the stateStore variable:
@Override
public KeyValue<String, String> transform(String key, String value) {
    // using stateStore among other actions you might take here
    String processedValue = value; // placeholder for whatever processing you apply
    stateStore.put(key, processedValue);
    return new KeyValue<>(key, processedValue);
}
b. read the parameters in the service taking over:
You could do it with a plain Kafka consumer, but with Kafka Streams you first have to make the store available. The easiest way is to create a KTable; then you get the queryable store name that is automatically created with the KTable, obtain access to the store, and finally extract a record value from the store (i.e. a parameter value by its key).
// this example is a modified copy of KTable javadocs example
final StreamsBuilder streamsBuilder = new StreamsBuilder();
// Creating a KTable over the topic containing your parameters a store shall automatically be created.
//
// The serde for your MyParametersClassType could be
// new org.springframework.kafka.support.serializer.JsonSerde(MyParametersClassType.class)
// though further configurations might be necessary here - e.g. setting the trusted packages for the ObjectMapper behind JsonSerde.
//
// If the parameter-value class is a String then you could use Serdes.String() instead of a MyParametersClassType serde.
final KTable paramsTable = streamsBuilder.table("parametersTopicName", Consumed.with(Serdes.String(), <<your InstanceOfMyParametersClassType serde>>));
...
// see the example from KafkaStreams javadocs for more KafkaStreams related details
final KafkaStreams streams = ...;
streams.start();
...
// get the queryable store name that is automatically created with the KTable
final String queryableStoreName = paramsTable.queryableStoreName();
// get access to the store (the store backing a KTable is a timestamped store, so values are wrapped in ValueAndTimestamp)
ReadOnlyKeyValueStore<String, ValueAndTimestamp<InstanceOfMyParametersClassType>> view =
        streams.store(queryableStoreName, QueryableStoreTypes.timestampedKeyValueStore());
// extract a record value from the store
InstanceOfMyParametersClassType parameter = view.get(key).value();
I have millions of records each with unique identifier.
All the records are categorized by series number, let's say 10k records belong to series-1, another 10k to series-2 and so on..
Now I want to publish all the series-1 records to partition-1, all series-2 records to partition-2, and so on.
To achieve this I don't want to use the message key; is there any other alternative?
I am new to Kafka, so please correct me if the question is wrong or lacks proper details.
You can use the methods below to publish a message to a specific partition.
Simple Kafka Producer
/**
Creates a record to be sent to a specified topic and partition
**/
public ProducerRecord(String topic, Integer partition, K key, V value) {
this(topic, partition, null, key, value, null);
}
Basic example of publishing a message to a specific partition
Properties properties = new Properties();
properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, <bootstrap server detail>);
properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// Decide the partition based on your condition: series-1 -> partition 1, series-2 -> partition 2, and so on
int partition=getPartitionOnCondition.....
String topic=""..
Producer<String, String> producer = new KafkaProducer<String, String>(properties);
ProducerRecord<String, String> record = new ProducerRecord<String, String>(topic, partition, key, value);
producer.send(record);
Custom Partitioner
You can also use a custom partitioner for a producer or Stream partitioner
https://kafka.apache.org/documentation.html
Custom Stream Partitioner (in case you are using Kafka Streams)
If you are using Kafka Streams, it also provides a way to plug a custom partitioner into the stream:
https://kafka.apache.org/23/javadoc/org/apache/kafka/streams/processor/StreamPartitioner.html
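If it helps, a minimal sketch of a StreamPartitioner plugged in via Produced; MyRecord, getSeriesNumber(), myRecordSerde, the topic name, and the series-to-partition mapping are all assumptions:

// seriesStream is assumed to be a KStream<String, MyRecord>, where MyRecord exposes getSeriesNumber()
StreamPartitioner<String, MyRecord> seriesPartitioner =
        (topic, key, value, numPartitions) -> (value.getSeriesNumber() - 1) % numPartitions;

// route each record to the partition derived from its series number
seriesStream.to("output-topic", Produced.with(Serdes.String(), myRecordSerde, seriesPartitioner));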
It's better to create a custom partitioner class for your producer application.
For every record you publish with the producer application, the custom partitioner's partition() method will be invoked with the message's key and value. There you can write your logic to parse the field that determines the partition number to which the message should be written.
Create a custom partitioner class similar to the one below:
public class CustomPartitioner extends DefaultPartitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        if (valueBytes != null) {
            String valueStr = new String(valueBytes);
            /* code to extract the partition number from the value */
            /* return the partition number decided based on the value */
        }
        // fall back to the default partitioning when the value does not determine a partition
        return super.partition(topic, key, keyBytes, value, valueBytes, cluster);
    }
}
Assign the partitioner in your Producer class and start publishing the messages
props.put("partitioner.class", "com.example.CustomPartitioner");
I'm trying to write a simple Kafka Streams application (targeting Kafka 2.2/Confluent 5.2) to transform an input topic with at-least-once semantics into an exactly-once output stream. I'd like to encode the following logic:
For each message with a given key:
Read a message timestamp from a string field in the message value
Retrieve the greatest timestamp we've previously seen for this key from a local state store
If the message timestamp is less than or equal to the timestamp in the state store, don't emit anything
If the timestamp is greater than the timestamp in the state store, or the key doesn't exist in the state store, emit the message and update the state store with the message's key/timestamp
(This is guaranteed to provide correct results based on ordering guarantees that we get from the upstream system; I'm not trying to do anything magical here.)
At first I thought I could do this with the Kafka Streams flatMapValues operator, which lets you map each input message to zero or more output messages with the same key. However, that documentation explicitly warns:
This is a stateless record-by-record operation (cf. transformValues(ValueTransformerSupplier, String...) for stateful value transformation).
That sounds promising, but the transformValues documentation doesn't make it clear how to emit zero or one output messages per input message. Unless that's what the // or null aside in the example is trying to say?
flatTransform also looked somewhat promising, but I don't need to manipulate the key, and if possible I'd like to avoid repartitioning.
Anyone know how to properly perform this kind of filtering?
You could use a Transformer to implement stateful operations as you described above. In order not to propagate a message downstream, you need to return null from the transform method; this is mentioned in the Transformer javadoc. And you can control propagation via processorContext.forward(key, value). A simplified example is provided below:
kStream.transform(() -> new DemoTransformer(stateStoreName), stateStoreName)
public class DemoTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private ProcessorContext processorContext;
    private String stateStoreName;
    private KeyValueStore<String, String> keyValueStore;

    public DemoTransformer(String stateStoreName) {
        this.stateStoreName = stateStoreName;
    }

    @Override
    public void init(ProcessorContext processorContext) {
        this.processorContext = processorContext;
        this.keyValueStore = (KeyValueStore) processorContext.getStateStore(stateStoreName);
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        String existingValue = keyValueStore.get(key);
        if (/* your condition */) {
            processorContext.forward(key, value);
            keyValueStore.put(key, value);
        }
        return null;
    }

    @Override
    public void close() {
    }
}
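For completeness, the transformer above needs its state store registered on the builder before kStream.transform(...) can reference it. A sketch, where the topic names and String serdes are assumptions:

StreamsBuilder builder = new StreamsBuilder();
// a persistent (changelog-backed) store so the last-seen timestamps survive restarts
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore(stateStoreName),
        Serdes.String(), Serdes.String()));

KStream<String, String> kStream = builder.stream("input-topic");
kStream.transform(() -> new DemoTransformer(stateStoreName), stateStoreName)
       .to("deduplicated-output-topic");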
I'm working with Apache Kafka and I've been experimenting with the Kafka Streams functionality.
What I'm trying to achieve is very simple, at least in words, and it can be achieved easily with the regular plain Consumer/Producer approach:
Read from a dynamic list of topics
Do some processing on the message
Push the message to another topic whose name is computed based on the message content
Initially I thought I could create a custom Sink or inject some kind of endpoint resolver in order to programmatically define the topic name for each single message, but ultimately I couldn't find any way to do that.
So I dug into the code and found the ProducerInterceptor class that is (quoting from the JavaDoc):
A plugin interface that allows you to intercept (and possibly mutate)
the records received by the producer before they are published to the
Kafka cluster.
And its onSend method:
This is called from KafkaProducer.send(ProducerRecord) and
KafkaProducer.send(ProducerRecord, Callback) methods, before key and
value get serialized and partition is assigned (if partition is not
specified in ProducerRecord).
It seemed like the perfect solution for me as I can effectively return a new ProducerRecord with the topic name I want.
However, apparently there's a bug (I've opened an issue on their JIRA: KAFKA-4691): that method is called after the key and value have already been serialized.
Bummer, as I don't think doing an additional deserialization at this point is acceptable.
So my question to you more experienced and knowledgeable users: what would be an efficient and elegant way of achieving this? Any input, ideas, or suggestions are welcome.
Thanks in advance for your help/comments/suggestions/ideas.
Below are some code snippets of what I've tried:
public static void main(String[] args) throws Exception {
StreamsConfig streamingConfig = new StreamsConfig(getProperties());
StringDeserializer stringDeserializer = new StringDeserializer();
StringSerializer stringSerializer = new StringSerializer();
MyObjectSerializer myObjectSerializer = new MyObjectSerializer();
TopologyBuilder topologyBuilder = new TopologyBuilder();
// note: addSource expects key/value deserializers here, so myObjectSerializer is assumed to also handle deserialization of MyObject
topologyBuilder.addSource("SOURCE", stringDeserializer, myObjectSerializer, Pattern.compile("input-.*"))
               .addProcessor("PROCESS", MyCustomProcessor::new, "SOURCE");
System.out.println("Starting PurchaseProcessor Example");
KafkaStreams streaming = new KafkaStreams(topologyBuilder, streamingConfig);
streaming.start();
System.out.println("Now started PurchaseProcessor Example");
}
private static Properties getProperties() {
Properties props = new Properties();
.....
.....
props.put(StreamsConfig.producerPrefix(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG), "com.test.kafka.streams.OutputTopicRouterInterceptor");
return props;
}
OutputTopicRouterInterceptor onSend implementation:
@Override
public ProducerRecord<String, MyObject> onSend(ProducerRecord<String, MyObject> record) {
    MyObject obj = record.value();
    String topic = computeTopicName(obj);
    ProducerRecord<String, MyObject> newRecord = new ProducerRecord<String, MyObject>(topic, record.partition(), record.timestamp(), record.key(), obj);
    return newRecord;
}