Reading latest data from Kafka broker in Apache Flink

I want my Flink program to receive the latest data from Kafka, but Flink keeps reading the historical data.
I have set auto.offset.reset to latest as shown below, but it did not work:
properties.setProperty("auto.offset.reset", "latest");
The Flink program receives the data from Kafka using the code below:
// get the stream from Kafka and assign timestamps and watermarks
DataStream<JoinedStreamEvent> raw_stream = environment
        .addSource(new FlinkKafkaConsumer09<JoinedStreamEvent>("test", new JoinSchema(), properties))
        .assignTimestampsAndWatermarks(new IngestionTimeExtractor<>());
I was following the discussion on https://issues.apache.org/jira/browse/FLINK-4280, which suggests adding the source in the way shown below:
Properties props = new Properties();
...
FlinkKafkaConsumer kafka = new FlinkKafkaConsumer("topic", schema, props);
kafka.setStartFromEarliest();
kafka.setStartFromLatest();
kafka.setEnableCommitOffsets(boolean); // if true, commits on checkpoint if checkpointing is enabled, otherwise, periodically.
kafka.setForwardMetrics(boolean);
...
env.addSource(kafka)
I did the same; however, I was not able to access setStartFromLatest():
FlinkKafkaConsumer09<JoinedStreamEvent> kafka = new FlinkKafkaConsumer09<JoinedStreamEvent>("test", new JoinSchema(), properties);
What should I do to receive the latest values that are being sent to
Kafka rather than receiving values from history?

The problem was solved by creating a new group id named test1 for both the sender and the consumer, while keeping the topic name the same (test).
Now I am wondering: is this the best way to solve this issue? Because every time I need to give a new group id.
Is there some way I can read only the data that is currently being sent to Kafka?

I believe this could work for you; it did for me. Modify the properties and your Kafka topic as needed:
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "ip:port");
    properties.setProperty("zookeeper.connect", "ip:port");
    properties.setProperty("group.id", "your-group-id");

    DataStream<String> stream = env
            .addSource(new FlinkKafkaConsumer09<>("your-topic", new SimpleStringSchema(), properties));

    stream.writeAsText("your-path", FileSystem.WriteMode.OVERWRITE)
            .setParallelism(1);

    env.execute();
}
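Regarding the follow-up about needing a new group id every run: in Flink versions where the FLINK-4280 work has been released (1.3.0 and later; I'm assuming you can use such a version), the start position can be set directly on the consumer instead of rotating group ids. A minimal sketch, reusing the topic, schema, and properties from above:
FlinkKafkaConsumer09<JoinedStreamEvent> kafka =
        new FlinkKafkaConsumer09<>("test", new JoinSchema(), properties);
// ignore any committed group offsets and start from the latest records only
kafka.setStartFromLatest();
DataStream<JoinedStreamEvent> stream = env.addSource(kafka);
With an explicit start position, the group.id can stay the same across runs; it is then only used for committing offsets back to Kafka, not for deciding where to start reading.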

Related

Kafka Consumer reading from some topics but not others

I have a Kafka Streams application that reads from a set of topics, enriches the messages with some extra data, and then outputs onto another set of topics, e.g.:
topic.blue.unprocessed -> Kafka Streams App -> topic.blue.processed
topic.yellow.unprocessed -> Kafka Streams App -> topic.yellow.processed
The consumer group is set up with a regex topic pattern and will read topics with the prefix topic.
This was working just fine for some time, but I recently noticed it had stopped reading messages from some of the topics, e.g. no messages from topic.yellow.unprocessed are being read, while topic.blue.unprocessed is still functioning fine.
I investigated the logs and could see that the app was still reading from topic.yellow.unprocessed a month ago; however, there was a large delay of 5 days between a message appearing on the topic and it being read by the Streams application. Now it is not reading them at all.
Does anyone have an idea why this may be occurring for only some topics? I would expect an issue with the app or the consumer ACL to affect all topics, but that is not what I am seeing.
I have confirmed topic.yellow.unprocessed is deployed and is receiving messages; they just are not being consumed by the application. Debug logging is enabled but is showing nothing.
See the consumer code below:
#Value("${kafka.configuration.inputTopicRegex}")
private String inputTopicRegex;
#Value("${kafka.configuration.deadLetterTopic}")
private String deadLetterTopic;
#Value("${kafka.configuration.brokerAddress}")
private String brokerAddress;
#Autowired
AvroRecodingSerde avroSerde;
public KafkaStreams createStreams() {
return new KafkaStreams(createTopology(), createKakfaProperties());
}
private Properties createKakfaProperties() {
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic.color.app");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokerAddress);
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, LogAndContinueExceptionHandler.class.getName());
config.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
config.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 86400000);
return config;
}
public Topology createTopology() {
StreamsBuilder builder = new StreamsBuilder();
// stream of records
KStream<String, GenericRecord> ingressStream = builder.stream(Pattern.compile(inputTopicRegex), Consumed.with(Serdes.String(), avroSerde));
KStream<String, GenericRecord> processedStream = ingressStream.transformValues(enrichMessage);
processedStream.to(destinationOrDeadletter, Produced.with(Serdes.String(), avroSerde));
return builder.build();
}

How does the producer communicate with the registry and what does it send to the registry

I'm trying to understand this by reading the documentation, but maybe because I'm not an advanced programmer, I do not really understand it.
For example, I'm looking at this example from the documentation:
https://docs.confluent.io/current/schema-registry/serdes-develop/serdes-protobuf.html#protobuf-serializer
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer");
props.put("schema.registry.url", "http://127.0.0.1:8081");

Producer<String, MyRecord> producer = new KafkaProducer<String, MyRecord>(props);

String topic = "testproto";
String key = "testkey";
OtherRecord otherRecord = OtherRecord.newBuilder()
        .setOtherId(123).build();
MyRecord myrecord = MyRecord.newBuilder()
        .setF1("value1").setF2(otherRecord).build();

ProducerRecord<String, MyRecord> record
        = new ProducerRecord<String, MyRecord>(topic, key, myrecord);
producer.send(record).get();
producer.close();
I see here that you define the schema registry URL, and then somehow the producer knows that it should contact the registry and provide some metadata about the messages.
Now I would like to understand better how this actually works and what is exchanged between the producer and the registry (or is it Kafka that contacts the registry)?
Anyway, my question is: imagine I have a record that is in Protobuf format.
I'm putting that Protobuf record into Kafka on a certain topic.
Now I want to activate the schema registry, so would the producer just send the proto definition to the schema registry?
Does the producer just get the schema definition directly from the record?
Would it try to update it on every new message sent to the queue? Would this not increase the latency a bit when pushing the data to Kafka?
Sorry if these are all very basic questions, but I'm just trying to get the bigger picture and I'm missing this piece.
Thanks for any information, sorry if this is already clear from the documentation.
(I need to have this so that I can use ksql to deserialize my messages in kafka)
best regards,
"would the producer just send the proto definition into the schema registry?"
The serializer does, not the producer directly.
MyRecord is serialized to binary, the schema is sent over HTTP to the registry, which returns an ID, and the message that is then sent to Kafka contains 0x0 + ID + binary-protobuf-value.
"would it try to update in any new message into the queue?"
The schema is registered before any messages are sent. Existing messages are untouched.
"would this not increase a bit the latency when pushing the data to kafka?"
Only for the first message, since the schema ID gets cached by the serializer afterwards.
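To make that framing concrete, here is a rough illustration (not the actual Confluent serializer code, just the byte layout described above; the schema ID value is made up) of what ends up as the Kafka record value:
import java.nio.ByteBuffer;

byte[] protobufPayload = myrecord.toByteArray();   // the Protobuf-encoded record
int schemaId = 42;                                 // ID returned by the registry over HTTP (example value)

ByteBuffer framed = ByteBuffer.allocate(1 + 4 + protobufPayload.length);
framed.put((byte) 0x0);          // magic byte
framed.putInt(schemaId);         // 4-byte schema ID
framed.put(protobufPayload);     // the actual Protobuf bytes
byte[] kafkaRecordValue = framed.array();
(For Protobuf specifically, the real serializer also writes a small message-index section after the schema ID, but the overall shape is the same: a tiny header plus the serialized record.)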

Copy data between kafka topics using kafka connectors

I'm new to Kafka, and I now need to copy data from one Kafka topic to another. I'm wondering what the possible ways to do so are. The ways I can think of are the following:
Kafka consumer + Kafka producer
Kafka streams
Kafka sink connector + producer
Kafka consumer + source connector
My question is: is it possible to use two Kafka connectors in between, e.g. a sink connector + a source connector? If so, could you please provide me with some good examples, or some hints on how to do so?
Thanks in advance!
All the methods you listed are possible. Which one is best really depends on the control you want over the process and whether it's a one-off operation or something you want to keep running.
Kafka Streams offers an easy way to flow one topic into another via the DSL.
You could do something like this (demo code, obviously not for production!):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

final Serde<byte[]> bytesSerdes = Serdes.ByteArray();
final StreamsBuilder builder = new StreamsBuilder();

KStream<byte[], byte[]> input = builder.stream(
        "input-topic",
        Consumed.with(bytesSerdes, bytesSerdes)
);
input.to("output-topic", Produced.with(bytesSerdes, bytesSerdes));

final KafkaStreams streams = new KafkaStreams(builder.build(), props);
try {
    streams.start();
    Thread.sleep(60000L);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    streams.close();
}
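For comparison, the first option on your list (plain Kafka consumer + producer) could look roughly like this; a minimal sketch with assumed topic names and a local broker, again not production code:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "topic-copy");
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "localhost:9092");
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
     KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
    consumer.subscribe(Collections.singletonList("input-topic"));
    // loop forever; stop the process to end the copy
    while (true) {
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<byte[], byte[]> record : records) {
            // copy key and value unchanged to the target topic
            producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
        }
    }
}
Using byte[] for keys and values means the records are copied verbatim, exactly as the Streams example above does.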

FlinkKafkaConsumer082 auto.offset.reset setting doesn't work?

I have a Flink streaming program which reads data from a Kafka topic. In the program, auto.offset.reset is set to "smallest". When testing in the IDE (IntelliJ IDEA), the program always reads data from the beginning of the topic. Then I set up a Flink/Kafka cluster and produced some data into the Kafka topic. The first time I ran the streaming job, it read data from the beginning of the topic. But after I stopped the streaming job and ran it again, it no longer read data from the beginning of the topic. How can I make the program always read data from the beginning of the topic?
Properties properties = new Properties();
properties.put("bootstrap.servers", kafkaServers);
properties.put("zookeeper.connect", zkConStr);
properties.put("group.id", group);
properties.put("topic", topics);
properties.put("auto.offset.reset", offset);

DataStream<String> stream = env
        .addSource(new FlinkKafkaConsumer082<String>(topics, new SimpleStringSchema(), properties));
If you want to always read from the beginning, you need to disable checkpointing in your stream context.
Also disable offset committing at the level of the consumer properties:
enable.auto.commit=false or auto.commit.enable=false (depending on the Kafka version)
Another way: you can keep checkpointing for failover but generate a new group.id whenever you need to read from the beginning (just clean up ZooKeeper occasionally).
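A minimal sketch of what that could look like in the consumer properties (using the 0.8-era property names from the snippet above; the group id prefix is just an example):
properties.put("auto.offset.reset", "smallest");   // 0.8-era equivalent of "earliest"
properties.put("auto.commit.enable", "false");     // don't commit offsets to ZooKeeper
// or: generate a fresh group id per run so there are never committed offsets to resume from
properties.put("group.id", "my-flink-job-" + java.util.UUID.randomUUID());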

Kafka High Level Consumer Fetch All Messages From Topic Using Java API (Equivalent to --from-beginning)

I am testing the Kafka High Level Consumer using the ConsumerGroupExample code from the Kafka site. I would like to retrieve all the existing messages on the topic called "test" that I have in the Kafka server config. Looking at other blogs, auto.offset.reset should be set to "smallest" to be able to get all messages:
private static ConsumerConfig createConsumerConfig(String a_zookeeper, String a_groupId) {
    Properties props = new Properties();
    props.put("zookeeper.connect", a_zookeeper);
    props.put("group.id", a_groupId);
    props.put("auto.offset.reset", "smallest");
    props.put("zookeeper.session.timeout.ms", "10000");
    return new ConsumerConfig(props);
}
The question I really have is this: what is the equivalent Java api call for the High Level Consumer that is the equivalent of:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Basically, whenever a consumer with a new group id starts consuming a topic (and auto.offset.reset is set to "smallest"), it will read messages from the beginning. If you are just consuming from the beginning each time for testing purposes, initialise your consumer with a new group id every time, and it will read the messages from the beginning. Here's how I did it:
properties.put("group.id", UUID.randomUUID().toString());
and it read messages from the beginning each time!
Looks like you need to use the "low level SimpleConsumer API"
"For most applications, the high level consumer Api is good enough. Some applications want features not exposed to the high level consumer yet (e.g., set initial offset when restarting the consumer). They can instead use our low level SimpleConsumer Api. The logic will be a bit more complicated and you can follow the example in here."
That example worked for getting all messages from a topic with the following arguments (note that the port is the Kafka port, not the ZooKeeper port, and the topic was set up as in that example):
10 my-replicated-topic 0 localhost 9092
Specifically, there is a method to get readOffset which takes kafka.api.OffsetRequest.EarliestTime():
long readOffset = getLastOffset(consumer,a_topic, a_partition, kafka.api.OffsetRequest.EarliestTime(), clientName);
Here is another post that may provide some alternate ideas on how to sort this out: How to get data from old offset point in Kafka?
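For reference, the getLastOffset helper used above looks roughly like this in the 0.8-era SimpleConsumer example (a sketch based on that example, using the long-deprecated kafka.javaapi classes):
import java.util.HashMap;
import java.util.Map;
import kafka.api.PartitionOffsetRequestInfo;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public static long getLastOffset(SimpleConsumer consumer, String topic, int partition,
                                 long whichTime, String clientName) {
    TopicAndPartition topicAndPartition = new TopicAndPartition(topic, partition);
    Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo = new HashMap<>();
    // ask for one offset at the given time; EarliestTime() means "the beginning of the log"
    requestInfo.put(topicAndPartition, new PartitionOffsetRequestInfo(whichTime, 1));
    kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(
            requestInfo, kafka.api.OffsetRequest.CurrentVersion(), clientName);
    OffsetResponse response = consumer.getOffsetsBefore(request);
    if (response.hasError()) {
        throw new RuntimeException("Error fetching offset data: "
                + response.errorCode(topic, partition));
    }
    long[] offsets = response.offsets(topic, partition);
    return offsets[0];
}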
To fetch messages from the beginning, you can do this:
import kafka.utils.ZkUtils;
ZkUtils.maybeDeletePath("zkhost:zkport", "/consumers/group.id");
then just follow the routine work...
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("auto.offset.reset", "earliest");
props.put("group.id", UUID.randomUUID().toString());
These properties will help you.
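Pulling those properties together into a runnable snippet with the newer Java consumer API (a minimal sketch; the topic name and string deserializers are assumptions):
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("auto.offset.reset", "earliest");             // start from the beginning when there are no committed offsets
props.put("group.id", UUID.randomUUID().toString());    // fresh group id, so there never are committed offsets
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("test"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("offset=%d key=%s value=%s%n", record.offset(), record.key(), record.value());
        }
    }
}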