I have a Kafka Streams application that reads from a set of topics, enriches the messages with some extra data, and then outputs onto another set of topics, e.g.:
topic.blue.unprocessed -> Kafka Streams App -> topic.blue.processed
topic.yellow.unprocessed -> Kafka Streams App -> topic.yellow.processed
The consumer group is set up with a regex topic pattern and will read any topic with the prefix "topic.".
This was working just fine for some time, but I recently noticed it had stopped reading messages from some of the topics, e.g. no messages from topic.yellow.unprocessed are being read, while topic.blue.unprocessed is still functioning fine.
I investigated the logs and could see that the app was still reading from topic.yellow.unprocessed a month ago; however, there was a large delay of 5 days between when a message appeared on the topic and when it was read by the Streams application. Now it is not reading them at all.
Does anyone have an idea why this may be occurring for only some topics? I would expect an issue with the app or the consumer ACL to affect all topics, but that is not what I'm seeing.
I have confirmed topic.yellow.unprocessed is deployed and is receiving messages - they just are not being consumed by the application. Debug logs are enabled but are showing nothing.
See the consumer code below:
#Value("${kafka.configuration.inputTopicRegex}")
private String inputTopicRegex;
#Value("${kafka.configuration.deadLetterTopic}")
private String deadLetterTopic;
#Value("${kafka.configuration.brokerAddress}")
private String brokerAddress;
#Autowired
AvroRecodingSerde avroSerde;
public KafkaStreams createStreams() {
    return new KafkaStreams(createTopology(), createKafkaProperties());
}

private Properties createKafkaProperties() {
    Properties config = new Properties();
    config.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic.color.app");
    config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokerAddress);
    config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
    config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    config.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, LogAndContinueExceptionHandler.class.getName());
    config.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
    config.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 86400000);
    return config;
}

public Topology createTopology() {
    StreamsBuilder builder = new StreamsBuilder();
    // stream of records from all topics matching the input regex
    KStream<String, GenericRecord> ingressStream = builder.stream(Pattern.compile(inputTopicRegex), Consumed.with(Serdes.String(), avroSerde));
    KStream<String, GenericRecord> processedStream = ingressStream.transformValues(enrichMessage);
    processedStream.to(destinationOrDeadletter, Produced.with(Serdes.String(), avroSerde));
    return builder.build();
}
I have the requirement to pipe records from one topic to another, keeping the original partitioning intact (the original producers use a non-native Kafka partitioner). The reason for this is that the source topic is uncompressed, and we wish to "reprocess" the data into a compressed topic - transparently, from the point of view of the original producers and consumers.
I have a trivial KStreams topology that does this using a ProducerInterceptor:
void buildPipeline(StreamsBuilder streamsBuilder) {
    streamsBuilder
            .stream(topicProperties.getInput().getName())
            .to(topicProperties.getOutput().getName());
}
together with:
interceptor.classes: com.acme.interceptor.PartitionByHeaderInterceptor
This interceptor looks in the message headers (which contain a partition id header) and simply redirects the ProducerRecord to the original partition:
@Override
public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record) {
    int partition = extractSourcePartition(record);
    return new ProducerRecord<>(record.topic(), partition, record.timestamp(), record.key(), record.value(), record.headers());
}
My question is: how can I test this interceptor in a test topology (i.e. integration test)?
I've tried adding:
streamsConfiguration.put(StreamsConfig.producerPrefix("interceptor.classes"),
PartitionByHeaderInterceptor.class.getName());
(which is how I enable the interceptor in production code)
to my test topology stream configuration, but my interceptor is not called by the test topology driver.
Is what I'm trying to do currently technically possible?
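For reference, the test itself looks roughly like this, using the kafka-streams-test-utils TestInputTopic API (simplified; class and topic names differ in the real project):

StreamsBuilder streamsBuilder = new StreamsBuilder();
buildPipeline(streamsBuilder);

Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "interceptor-test");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");
streamsConfiguration.put(StreamsConfig.producerPrefix("interceptor.classes"),
        PartitionByHeaderInterceptor.class.getName());

try (TopologyTestDriver driver = new TopologyTestDriver(streamsBuilder.build(), streamsConfiguration)) {
    TestInputTopic<String, String> input = driver.createInputTopic(
            "input-topic", new StringSerializer(), new StringSerializer());
    // a breakpoint in PartitionByHeaderInterceptor.onSend() is never hit from here
    input.pipeInput("key", "value");
}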
I have a Kafka Streams app consuming from a Kafka topic. It only consumes and processes the data but doesn't produce anything.
For Kafka's exactly_once processing to work, do you also need your Streams app to write to a Kafka topic?
How can you achieve exactly_once if your Streams app wants to process the message only once but not produce anything?
Providing “exactly-once” processing semantics really means that distinct updates to the state of an operator that is managed by the stream processing engine are only reflected once. “Exactly-once” by no means guarantees that processing of an event, i.e. execution of arbitrary user-defined logic, will happen only once.
The above is the explanation of "exactly-once" semantics.
It is not always necessary to publish the output to a topic in a Kafka Streams application.
When you are using Kafka Streams applications, you have to define an application.id for each one, and each application uses a consumer in the backend. In the application, you have to configure a few parameters, such as setting processing.guarantee to exactly_once and enable.idempotence to true.
Here are the details:
https://kafka.apache.org/22/documentation/streams/developer-guide/config-streams#processing-guarantee
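For example, a minimal configuration for a consume-and-process-only app could look roughly like this (the application id and broker address are placeholders):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-consume-only-app"); // placeholder id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
// exactly_once also turns on idempotence and transactions for the embedded clients
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);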
I am not disputing the exactly-once stream pattern, because that is the beauty of Kafka Streams; however, it is possible to use Kafka Streams without producing to other topics.
The exactly-once stream pattern is simply the ability to execute a read-process-write operation exactly one time. This means you consume one message at a time, process it, publish it to another topic, and commit. The commit is handled by Streams automatically, one message at a time.
Kafka Streams achieves this by setting the parameters below, which cannot be overwritten:
isolation.level (read_committed) - consumers will only ever read committed data
enable.idempotence (true) - the producer will always have idempotency enabled
max.in.flight.requests.per.connection (5) - the producer is capped at five in-flight requests per connection, the maximum compatible with idempotence
If there is an error in the consumer or producer, Kafka Streams always retries a configured number of attempts.
Kafka Streams gives no guarantees inside your processing logic; you still need to handle that yourself. For example, if your processing performs a DB operation and the DB connection fails, Kafka is not aware of it, so you need to handle it on your own.
As per the pattern definition, yes, we need a consumer, a processor, and a producer topic, but in general nothing stops you from not producing to another topic. You can still consume exactly one item at a time, with commits made at the default commit interval (DEFAULT_COMMIT_INTERVAL_MS), and again you need to handle failures in your own processing logic yourself.
Here is a small example:
StreamsBuilder builder = new StreamsBuilder();
Properties props = getStreamProperties();

KStream<String, String> textLines = builder.stream(Pattern.compile("topic"));
textLines.process(() -> new ProcessInternal());

KafkaStreams streams = new KafkaStreams(builder.build(), props);

final CountDownLatch latch = new CountDownLatch(1);
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    logger.info("Completed VQM stream");
    streams.close();
    latch.countDown(); // release the waiting main thread
}));

logger.info("Streaming start...");
try {
    streams.start();
    latch.await();
} catch (Throwable e) {
    System.exit(1);
}
class ProcessInternal implements Processor<String, String> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public void close() {
        // Any code for clean up would go here.
    }

    @Override
    public void process(String key, String value) {
        // Your transactional process business logic goes here.
    }
}
I want to receive the latest data from Kafka to Flink Program, but Flink is reading the historical data.
I have set auto.offset.reset to latest as shown below, but it did not work
properties.setProperty("auto.offset.reset", "latest");
The Flink program receives the data from Kafka using the code below:
// getting the stream from Kafka and assigning timestamps and watermarks
DataStream<JoinedStreamEvent> raw_stream = envrionment
        .addSource(new FlinkKafkaConsumer09<JoinedStreamEvent>("test", new JoinSchema(), properties))
        .assignTimestampsAndWatermarks(new IngestionTimeExtractor<>());
I was following the discussion on https://issues.apache.org/jira/browse/FLINK-4280, which suggests adding the source in the following way:
Properties props = new Properties();
...
FlinkKafkaConsumer kafka = new FlinkKafkaConsumer("topic", schema, props);
kafka.setStartFromEarliest();
kafka.setStartFromLatest();
kafka.setEnableCommitOffsets(boolean); // if true, commits on checkpoint if checkpointing is enabled, otherwise, periodically.
kafka.setForwardMetrics(boolean);
...
env.addSource(kafka)
I did the same; however, I was not able to access setStartFromLatest():
FlinkKafkaConsumer09<JoinedStreamEvent> kafka = new FlinkKafkaConsumer09<JoinedStreamEvent>("test", new JoinSchema(), properties);
What should I do to receive the latest values that are being sent to Kafka, rather than receiving values from history?
The problem was solved by creating a new group id named test1 for both the sender and the consumer, keeping the topic name the same (test).
Now I am wondering: is this the best way to solve this issue? It means I have to give a new group id every time.
Is there some way I can just read data that is being sent to Kafka?
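For reference, the only change that made it work was the group id (the addresses are placeholders):

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "ip:port");
properties.setProperty("group.id", "test1");            // was "test" before
// auto.offset.reset only applies when the group has no committed offsets,
// which is why a brand-new group id starts from the latest messages
properties.setProperty("auto.offset.reset", "latest");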
I believe this could work for you. It did for me. Modify the properties and your Kafka topic.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "ip:port");
    properties.setProperty("zookeeper.connect", "ip:port");
    properties.setProperty("group.id", "your-group-id");

    DataStream<String> stream = env
            .addSource(new FlinkKafkaConsumer09<>("your-topic", new SimpleStringSchema(), properties));

    stream.writeAsText("your-path", FileSystem.WriteMode.OVERWRITE)
            .setParallelism(1);

    env.execute();
}
We have a topology that has multiple kafka spout tasks. Each spout task is supposed to read a subset of messages from a set of Kafka topics. Topics have to be subscribed using a wild card such as AAA.BBB.*. The expected behaviour would be that all spout tasks collectively will consume all messages in all of the topics that match the wild card. Each message is only routed to a single spout task (Ignore failure scenarios). Is this currently supported?
Perhaps you could use the DynamicBrokersReader class.
Map<String, Object> conf = new HashMap<>();
...
conf.put("kafka.topic.wildcard.match", true);

DynamicBrokersReader wildCardBrokerReader =
        new DynamicBrokersReader(conf, connectionString, masterPath, "AAA.BBB.*");
List<GlobalPartitionInformation> partitions = wildCardBrokerReader.getBrokerInfo();
...
for (GlobalPartitionInformation eachTopic : partitions) {
    StaticHosts hosts = new StaticHosts(eachTopic);
    SpoutConfig spoutConfig = new SpoutConfig(hosts, eachTopic.topic, zkRoot, id);
    KafkaSpout spout = new KafkaSpout(spoutConfig);
}
... // Wrap the created spout instances into a container
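Each spout could then be registered with the topology, for example (a sketch, assuming the spouts above were collected into a list named spouts):

TopologyBuilder topologyBuilder = new TopologyBuilder();
int i = 0;
for (KafkaSpout spout : spouts) {
    // ids just need to be unique; one spout instance per matched topic
    topologyBuilder.setSpout("kafka-spout-" + i++, spout, 1);
}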
I am using Kafka 0.9.0.1.
The first time I start up my application, it takes 20-30 seconds to retrieve the "latest" message from the topic.
I've used different Kafka brokers (with different configs), yet I still see this behaviour. There is usually no slowness for subsequent messages.
Is this expected behaviour? You can clearly see it by running the sample application below and changing the broker/topic name to your own settings.
public class KafkaProducerConsumerTest {

    public static final String KAFKA_BROKERS = "...";
    public static final String TOPIC = "...";

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        new KafkaProducerConsumerTest().run();
    }

    public void run() throws ExecutionException, InterruptedException {
        Properties consumerProperties = new Properties();
        consumerProperties.setProperty("bootstrap.servers", KAFKA_BROKERS);
        consumerProperties.setProperty("group.id", "Test");
        consumerProperties.setProperty("auto.offset.reset", "latest");
        consumerProperties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProperties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        MyKafkaConsumer kafkaConsumer = new MyKafkaConsumer(consumerProperties, TOPIC);
        Executors.newFixedThreadPool(1).submit(() -> kafkaConsumer.consume());

        Properties producerProperties = new Properties();
        producerProperties.setProperty("bootstrap.servers", KAFKA_BROKERS);
        producerProperties.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProperties.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        MyKafkaProducer kafkaProducer = new MyKafkaProducer(producerProperties, TOPIC);
        kafkaProducer.publish("Test Message");
    }
}
class MyKafkaConsumer {

    private final Logger logger = LoggerFactory.getLogger(MyKafkaConsumer.class);
    private KafkaConsumer<String, Object> kafkaConsumer;

    public MyKafkaConsumer(Properties properties, String topic) {
        kafkaConsumer = new KafkaConsumer<String, Object>(properties);
        kafkaConsumer.subscribe(Lists.newArrayList(topic));
    }

    public void consume() {
        while (true) {
            logger.info("Started listening...");
            ConsumerRecords<String, Object> consumerRecords = kafkaConsumer.poll(Long.MAX_VALUE);
            logger.info("Received records {}", consumerRecords.iterator().next().value());
        }
    }
}
class MyKafkaProducer {

    private KafkaProducer<String, Object> kafkaProducer;
    private String topic;

    public MyKafkaProducer(Properties properties, String topic) {
        this.kafkaProducer = new KafkaProducer<String, Object>(properties);
        this.topic = topic;
    }

    public void publish(Object object) throws ExecutionException, InterruptedException {
        ProducerRecord<String, Object> producerRecord = new ProducerRecord<>(topic, "key", object);
        Future<RecordMetadata> response = kafkaProducer.send(producerRecord);
        response.get();
    }
}
The first message should take longer than the rest because, when you start a new consumer in the consumer group specified by the statement consumerProperties.setProperty("group.id", "Test");, Kafka will balance the partitions such that each partition is consumed by at most one consumer and will distribute the partitions for the topic across multiple consumer processes.
Also, with Kafka 0.9, there is a separate __consumer_offsets topic which Kafka uses to manage the offsets for each consumer in a consumer group. It is likely that when you start the consumer for the first time, it looks at this topic to fetch the latest offset (there might have been a consumer consuming from this topic earlier which got killed, so it is necessary to fetch from the correct offset).
These two factors cause a higher latency in the consumption of the first set of messages. I can't comment on the exact latency of 20-30 seconds, but I guess this is the default behaviour.
PS: The exact number might also depend on secondary factors, such as whether you are running the broker and the consumers on the same machine (where there would be no network latency) or on different ones, where they would be communicating over TCP.
According to this link:
Try setting group_id=None in your consumer, or call consumer.close()
before ending script, or use assign() not subscribe (). Otherwise you are
rejoining an existing group that has known but unresponsive members. The
group coordinator will wait until those members checkin/leave/timeout.
Since the consumers no longer exist (it's your prior script runs) they have
to timeout.
And consumer.poll() blocks during group rebalance.
So it is correct behavior if you join a group with unresponsive members (maybe you terminated the application ungracefully).
Please confirm you call "consumer.close()" before exiting your application.
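For example, adapting the consume() loop from your question, something along these lines (a sketch, not your exact code):

Runtime.getRuntime().addShutdownHook(new Thread(kafkaConsumer::wakeup)); // interrupts the blocking poll()

try {
    while (true) {
        ConsumerRecords<String, Object> records = kafkaConsumer.poll(Long.MAX_VALUE);
        records.forEach(record -> logger.info("Received record {}", record.value()));
    }
} catch (WakeupException e) {
    // expected on shutdown
} finally {
    kafkaConsumer.close(); // leaves the group cleanly, so the next run does not wait for a session timeout
}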
I just tried your code, with minimal logging additions, many times. Here is a typical log output:
2016-07-24 15:12:51,417 Start polling...|INFO|KafkaProducerConsumerTest
2016-07-24 15:12:51,604 producer has send message|INFO|KafkaProducerConsumerTest
2016-07-24 15:12:51,619 producer got response, exiting|INFO|KafkaProducerConsumerTest
2016-07-24 15:12:51,679 Received records [Test Message]|INFO|KafkaProducerConsumerTest
2016-07-24 15:12:51,679 Start polling...|INFO|KafkaProducerConsumerTest
2016-07-24 15:12:54,680 returning on empty poll result|INFO|KafkaProducerConsumerTest
The sequence of events is as expected and timely. The consumer starts polling, the producer sends the message and receives a result, and the consumer receives the message, all within 300 ms. Then the consumer starts polling again and returns 3 seconds later, because I changed the poll timeout accordingly.
I am using Kafka 0.9.0.1 for broker and client libraries. The connection is on localhost and it is a test environment with no load at all.
For completeness, here is the log from the server that was triggered by the exchange above.
[2016-07-24 15:12:51,599] INFO [GroupCoordinator 0]: Preparing to restabilize group Test with old generation 0 (kafka.coordinator.GroupCoordinator)
[2016-07-24 15:12:51,599] INFO [GroupCoordinator 0]: Stabilized group Test generation 1 (kafka.coordinator.GroupCoordinator)
[2016-07-24 15:12:51,617] INFO [GroupCoordinator 0]: Assignment received from leader for group Test for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-07-24 15:13:24,635] INFO [GroupCoordinator 0]: Preparing to restabilize group Test with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-07-24 15:13:24,637] INFO [GroupCoordinator 0]: Group Test generation 1 is dead and removed (kafka.coordinator.GroupCoordinator)
You may want to compare with your server logs for the same exchange.