I use Kafka Streams with the exactly-once guarantee; however, when I debug it, producer.commitTransaction doesn't execute and the consumer group gets rebalanced, and I always end up with duplicated outputs.
My application just filters and re-routes topic messages.
The config is shown below:
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootStrapServers);
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.Long().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass());
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "5");
props.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), "5");
I'm new to Kafka and I need to copy data from one Kafka topic to another. I'm wondering what the possible ways of doing this are. The ways I can think of are the following:
Kafka consumer + Kafka producer
Kafka streams
Kafka sink connector + producer
Kafka consumer + source connector
My question is: is it possible to use two Kafka connectors in between, e.g. a sink connector + a source connector? If so, could you please provide some good examples, or some hints on how to do so?
Thanks in advance!
All the methods you listed are possible. Which one is best really depends on how much control you want over the process and whether it's a one-off operation or something you want to keep running.
Kafka Streams offers an easy way to flow one topic into another via the DSL.
You could do something like this (demo code, obviously not for production!):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

final Serde<byte[]> bytesSerdes = Serdes.ByteArray();
final StreamsBuilder builder = new StreamsBuilder();

// Read the input topic as raw bytes and write it unchanged to the output topic
KStream<byte[], byte[]> input = builder.stream(
        "input-topic",
        Consumed.with(bytesSerdes, bytesSerdes)
);
input.to("output-topic", Produced.with(bytesSerdes, bytesSerdes));

final KafkaStreams streams = new KafkaStreams(builder.build(), props);
try {
    streams.start();
    Thread.sleep(60000L); // run for one minute, then shut down
} catch (Exception e) {
    e.printStackTrace();
} finally {
    streams.close();
}
I am new to the Kafka ecosystem. In my case I'm using a Java producer, but I have no need to send a key along with the record value, which is serialized Avro. Is there a way to build a Java producer that will not send keys, or are keys a requirement when sending messages in Kafka?
Like #gasparms said, there are built-in ways to produce without sending a key. Most people use Kafka this way, since they just want to send a stream of messages with no key. Using keys is only really required if you need log compaction.
Here's a really good explanation -
https://stackoverflow.com/a/29515696/236528
ProducerRecord has several constructors; one of them doesn't take a key, so you don't have to provide one.
Example:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++) {
    // topic + value only: this ProducerRecord constructor leaves the key null
    producer.send(new ProducerRecord<>("my-topic", "myValue"));
}
producer.close();
I am trying the latest Kafka version, 1.1.0.
There is a point about producer behavior that is bothering me.
Below is a small piece of code:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 3);
props.put("max.in.flight.requests.per.connection", 1);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++) {
    producer.send(new ProducerRecord<String, String>("my-topic",
            Integer.toString(i), Integer.toString(i)), new CallBack()); // CallBack is my own Callback implementation
}
Assumptions:
Each message is sent to the same partition of a topic.
Each message is large enough that it is sent to the broker immediately (not held in the buffer).
Now, if the send for index 0 fails but the subsequent send calls don't fail, will the messages reach the broker out of sequence? That is, will the index-0 message not be the first to reach the broker, even with retries?
Will the behavior be the same if I add the configuration property below?
enable.idempotence=true
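Concretely, I mean adding it to the same props object shown above (just a sketch of what I mean, with the rest of the config unchanged):
// assumption: same props object as in the snippet above
props.put("enable.idempotence", "true");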
Is there any elegant approach to handle this situation, i.e., to maintain the order of the messages?
Thanks in advance
I want to receive the latest data from Kafka in my Flink program, but Flink is reading the historical data instead.
I have set auto.offset.reset to latest as shown below, but it did not work:
properties.setProperty("auto.offset.reset", "latest");
The Flink program receives the data from Kafka using the code below:
// getting the stream from Kafka and assigning timestamps and watermarks
DataStream<JoinedStreamEvent> raw_stream = environment
        .addSource(new FlinkKafkaConsumer09<JoinedStreamEvent>("test", new JoinSchema(), properties))
        .assignTimestampsAndWatermarks(new IngestionTimeExtractor<>());
I was following the discussion on https://issues.apache.org/jira/browse/FLINK-4280, which suggests adding the source in the way shown below:
Properties props = new Properties();
...
FlinkKafkaConsumer kafka = new FlinkKafkaConsumer("topic", schema, props);
kafka.setStartFromEarliest();
kafka.setStartFromLatest();
kafka.setEnableCommitOffsets(boolean); // if true, commits on checkpoint if checkpointing is enabled, otherwise, periodically.
kafka.setForwardMetrics(boolean);
...
env.addSource(kafka)
I did the same; however, I was not able to access setStartFromLatest():
FlinkKafkaConsumer09 kafka = new FlinkKafkaConsumer09<JoinedStreamEvent>( "test", new JoinSchema(),properties);
What should I do to receive the latest values that are being sent to
Kafka rather than receiving values from history?
The problem was solved by creating a new group id named test1 for both the sender and the consumer, while keeping the topic name the same (test).
Now I am wondering: is this the best way to solve this issue? Because this way I would need to give a new group id every time.
Is there some way I can just read data that is being sent to Kafka?
I believe this could work for you. It did for me. Modify the properties and your Kafka topic.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "ip:port");
    properties.setProperty("zookeeper.connect", "ip:port");
    properties.setProperty("group.id", "your-group-id");

    // consume the topic and write whatever arrives to a text file
    DataStream<String> stream = env
            .addSource(new FlinkKafkaConsumer09<>("your-topic", new SimpleStringSchema(), properties));
    stream.writeAsText("your-path", FileSystem.WriteMode.OVERWRITE)
            .setParallelism(1);

    env.execute();
}
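If your Flink version is 1.3 or newer, the setStartFrom* methods from that JIRA discussion are available on the Kafka consumer, so a sketch closer to what you asked for (reusing the topic, schema, and properties names from your question) could look like this:
// Sketch, assuming Flink 1.3+ where setStartFromLatest() exists on FlinkKafkaConsumer09
FlinkKafkaConsumer09<JoinedStreamEvent> kafkaSource =
        new FlinkKafkaConsumer09<>("test", new JoinSchema(), properties);
kafkaSource.setStartFromLatest(); // start from the latest records, ignoring committed group offsets
DataStream<JoinedStreamEvent> rawStream = environment
        .addSource(kafkaSource)
        .assignTimestampsAndWatermarks(new IngestionTimeExtractor<>());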
Kafka is confusing me. I am running it locally with default values.
Only automatic topic creation is turned on: 1 partition, 1 node, everything local and simple.
If I write
consumer.subscribe("test_topic");
consumer.poll(10);
It simply won't work and never finds any data.
If I instead assign a partition like
consumer.assign(new TopicPartition("test_topic",0));
and check the position, I am sitting at offset 995, and now I can poll and receive all the data my producer put in.
What is it that I don't understand about subscriptions? I don't need multiple consumers each handling only a part of the data. My consumer needs to get all the data of a certain topic. Why does the standard subscription approach shown in all the tutorials not work for me?
I do understand that partitions are for load-balancing consumers. I don't understand what I am doing wrong with the subscription.
Consumer config properties:
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "postproc-" + EnvUtils.getAppInst()); // jeder ist eine eigene gruppe -> kriegt alles
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
KafkaConsumer<Long, byte[]> consumer = new KafkaConsumer<Long, byte[]>(props);
Producer config:
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("retries", 2);
props.put("batch.size", 16384);
props.put("linger.ms", 5000);
props.put("buffer.memory", 1024 * 1024 * 10); // 10mb
props.put("key.serializer", "org.apache.kafka.common.serialization.LongSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
return new KafkaProducer(props);
Producer execution:
try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
    event.writeDelimitedTo(out);
    for (long a = 10; a < 20; a++) {
        long rand = new Random(a).nextLong();
        producer.send(new ProducerRecord<>("test_topic", rand, out.toByteArray()));
    }
    producer.flush();
} catch (IOException e) {
    e.printStackTrace();
}
Consumer execution:
consumer.subscribe(Arrays.asList("test_topic"));
ConsumerRecords<Long, byte[]> records = consumer.poll(10);
for (ConsumerRecord<Long, byte[]> r : records) { ...
I managed to solve the issue. The problem was timeouts. When polling, I didn't give it enough time to complete. I assume assigning a partition is just a lot faster and therefore completed in time. The standard subscription poll takes longer, never actually finished, and did not commit.
At least I think that was the problem. With longer timeouts it works.
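Roughly, what ended up working is a sketch like this: polling in a loop with a longer timeout so the group join and the fetch have time to complete (the 1000 ms value is arbitrary, not something I tuned carefully):
consumer.subscribe(Arrays.asList("test_topic"));
while (true) {
    // a longer timeout than poll(10) gives the subscription time to join the group and fetch data
    ConsumerRecords<Long, byte[]> records = consumer.poll(1000);
    for (ConsumerRecord<Long, byte[]> r : records) {
        // process the record
    }
}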
You are missing this property, I think:
auto.offset.reset=earliest
What to do when there is no initial offset in Kafka or if the current
offset does not exist any more on the server (e.g. because that data
has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
Reference: http://kafka.apache.org/documentation.html#highlevelconsumerapi
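With the consumer properties from your question, that is just one extra line (a sketch using the plain string key rather than the ConsumerConfig constant):
// fall back to the earliest available offset when the group has no committed offset yet
props.put("auto.offset.reset", "earliest");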