We have a topology that has multiple kafka spout tasks. Each spout task is supposed to read a subset of messages from a set of Kafka topics. Topics have to be subscribed using a wild card such as AAA.BBB.*. The expected behaviour would be that all spout tasks collectively will consume all messages in all of the topics that match the wild card. Each message is only routed to a single spout task (Ignore failure scenarios). Is this currently supported?
Perhaps you could use DynamicBrokersReader class.
Map conf = new HashMap();
...
conf.put("kafka.topic.wildcard.match", true);
wildCardBrokerReader = new DynamicBrokersReader(conf, connectionString, masterPath, "AAA.BBB.*");
List<GlobalPartitionInformation> partitions = wildCardBrokerReader.getBrokerInfo();
...
for (GlobalPartitionInformation eachTopic: partitions) {
StaticHosts hosts = new StaticHosts(eachTopic);
SpoutConfig spoutConfig = new SpoutConfig(hosts, eachTopic.topic, zkRoot, id);
KafkaSpout spout = new KafkaSpout(spoutConfig);
}
... // Wrap those created spout instances into a container
Related
I have a Kafka Streams application that reads from a set of topics, enriches the messages with some extra data and then outputs onto another set of topics eg:
topic.blue.unprocessed -> Kafka Streams App -> topic.blue.processed
topic.yellow.unprocessed -> Kafka Streams App -> topic.yellow.processed
The consumer group is set up with a regex topic pattern and will read topics with the prefix topic.
This was working just fine for some time but I recently noticed it had stopped reading messages from some of the topics eg. no messages from topic.yellow.unprocessed are being read but topic.blue.unprocessed is still functioning fine.
I investigated the logs and could see that the app was still reading from topic.yellow.unprocessed a month ago, however there was a large delay of 5 days from when the message appeared on the topic until it was read by the Streams application. Now it is not reading them at all.
Wondering if anyone has an idea why this may be occurring for only some topics - I would expect if there was an issue with the app or consumer ACL that it would affect all topics but I'm not seeing that happening.
I have confirmed topic.yellow.unprocessed is deployed and is receiving messages - they just are not being consumed by the application. Debug logs are enabled but are showing nothing.
See below consumer code:
#Value("${kafka.configuration.inputTopicRegex}")
private String inputTopicRegex;
#Value("${kafka.configuration.deadLetterTopic}")
private String deadLetterTopic;
#Value("${kafka.configuration.brokerAddress}")
private String brokerAddress;
#Autowired
AvroRecodingSerde avroSerde;
public KafkaStreams createStreams() {
return new KafkaStreams(createTopology(), createKakfaProperties());
}
private Properties createKakfaProperties() {
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic.color.app");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokerAddress);
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, LogAndContinueExceptionHandler.class.getName());
config.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
config.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 86400000);
return config;
}
public Topology createTopology() {
StreamsBuilder builder = new StreamsBuilder();
// stream of records
KStream<String, GenericRecord> ingressStream = builder.stream(Pattern.compile(inputTopicRegex), Consumed.with(Serdes.String(), avroSerde));
KStream<String, GenericRecord> processedStream = ingressStream.transformValues(enrichMessage);
processedStream.to(destinationOrDeadletter, Produced.with(Serdes.String(), avroSerde));
return builder.build();
}
We want to achieve parallelism while reading a message form kafka. hence we wanted to specify partition number in flinkkafkaconsumer. It will read messages from all partition in kafka instead of specific partition number. Below is sample code:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "Message-Test-Consumers");
properties.setProperty("partition", "1"); //not sure about this syntax.
FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<String>("EventLog", new SimpleStringSchema(), properties);
Please suggest any better option to get the parallelism.
I don't believe there is a mechanism to restrict which partitions Flink will read from. Nor do I see how this would help you achieve your goal of reading from the partitions in parallel, which Flink does regardless.
The Flink Kafka source connector reads from all available partitions, in parallel. Simply set the parallelism of the kafka source connector to whatever parallelism you desire, keeping in mind that the effective parallelism cannot exceed the number of partitions. In this way, each instance of Flink's Kafka source connector will read from one or more partitions. You can also configure the kafka consumer to automatically discover new partitions that may be created while the job is running.
I am using Kafka Version 2.0 and java consumer API to consume messages from a topic. We are using a single node Kafka server with one consumer per partition. I have observed that the consumer is loosing some of the messages.
The scenario is:
Consumer polls the topic.
I have created One Consumer Per Thread.
Fetches the messages and gives it to a handler to handle the message.
Then it commits the offsets using "At-least-once" Kafka Consumer semantics to commit Kafka offset.
In parallel, I have another consumer running with a different group-id. In this consumer, I'm simply increasing the message counter and committing the offset. There's no message loss in this consumer.
try {
//kafkaConsumer.registerTopic();
consumerThread = new Thread(() -> {
final String topicName1 = "topic-0";
final String topicName2 = "topic-1";
final String topicName3 = "topic-2";
final String topicName4 = "topic-3";
String groupId = "group-0";
final Properties consumerProperties = new Properties();
consumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.13.49:9092");
consumerProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
consumerProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
consumerProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
consumerProperties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
consumerProperties.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 1000);
try {
consumer = new KafkaConsumer<>(consumerProperties);
consumer.subscribe(Arrays.asList(topicName1, topicName2, topicName3, topicName4));
} catch (KafkaException ke) {
logTrace(MODULE, ke);
}
while (service.isServiceStateRunning()) {
ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(100));
for (TopicPartition partition : records.partitions()) {
List<ConsumerRecord<String, byte[]>> partitionRecords = records.records(partition);
for (ConsumerRecord<String, byte[]> record : partitionRecords) {
processMessage(simpleMessage);
}
}
consumer.commitSync();
}
kafkaConsumer.closeResource();
}, "KAKFA_CONSUMER");
} catch (Exception e) {
}
There seems to be a problem with usage of subscribe() here.
Subscribe is used to subscribe to topics and not to partitions. To use specific partitions you need to use assign(). Read up the extract from the documentation:
public void subscribe(java.util.Collection topics)
Subscribe to the given list of topics to get dynamically assigned
partitions. Topic subscriptions are not incremental. This list will
replace the current assignment (if there is one). It is not possible
to combine topic subscription with group management with manual
partition assignment through assign(Collection). If the given list of
topics is empty, it is treated the same as unsubscribe(). This is a
short-hand for subscribe(Collection, ConsumerRebalanceListener), which
uses a noop listener. If you need the ability to seek to particular
offsets, you should prefer subscribe(Collection,
ConsumerRebalanceListener), since group rebalances will cause
partition offsets to be reset. You should also provide your own
listener if you are doing your own offset management since the
listener gives you an opportunity to commit offsets before a rebalance
finishes.
public void assign(java.util.Collection partitions)
Manually assign a list of partitions to this consumer. This interface
does not allow for incremental assignment and will replace the
previous assignment (if there is one). If the given list of topic
partitions is empty, it is treated the same as unsubscribe(). Manual
topic assignment through this method does not use the consumer's group
management functionality. As such, there will be no rebalance
operation triggered when group membership or cluster and topic
metadata change. Note that it is not possible to use both manual
partition assignment with assign(Collection) and group assignment with
subscribe(Collection, ConsumerRebalanceListener).
You probably shouldn't do what you're doing. You should use subscribe, and use multiple partitions per topic, and multiple consumers in the group for high availability, and allow the consumer to handle the offsets for you.
You don't describe why you're trying to process your topics in this custom way? It's advanced and leads to issues.
The timestamps on your instances should not have to be synchronised to do normal topic processing.
If you're looking for more performance or to isolate records more carefully to avoid "head of line blocking" consider something like Parallel Consumer (PC).
It also tracks per record acknowledgement, among other things. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).
Working with Kafka(v2.11-0.10.1.0)-spark-streaming(v-2.0.1-bin-hadoop2.7).
I have Kafka Producer and Spark-streaming consumer to produce and consume. All works fine till I stop consumer(for approx 2-min) and start again. The consumer starts and reads data, absolutely perfect. But, I'm lost with the 2-min data, where consumer was off.
Kafka consumer/server.properties are unchanged.
Kafka producer with properties:
Properties properties = new Properties();
properties.put("bootstrap.servers", AppCoding.KAFKA_HOST);
properties.put("auto.create.topics.enable", true);
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("retries", 1);
logger.info("Initializing Kafka Producer.");
Producer<String, String> producer = new KafkaProducer<>(properties);
producer.send(new ProducerRecord<String, String>(AppCoding.KAFKA_TOPIC, "", documentAsString));
Consuming using Spark-streaming api as:
SparkConf sparkConf = new SparkConf().setMaster(args[4]).setAppName("Streaming");
// Create the context with 60 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(60000 * 5));
//input arguments:localhost:2181 sparkS incoming 10 local[*]
Set<String> topicsSet = new HashSet<>(Arrays.asList(args[2].split(";")));
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", args[0]);
//input arguments: localhost:9092 "" incoming 10 local[*]
JavaPairInputDStream<String, String> kafkaStream =
KafkaUtils.createDirectStream(jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet);
On the other end i have been using ActiveMQ. While ActiveMQ Consumer could fetch me the data while its off.
Help me out if there's a confuguration problem.
In Kafka, consumers actually have no direct relationship with producers. Each consumer has an offset which tracks what has been consumed in the partitions. If a consumer has no offset tracked, Kafka will automatically reset its offset to the largest one because of the default value of config 'auto.offset.reset'. In your case, when the brand-new consumer is started, due to the default policy, it does not see the messages produced previously. You could set 'auto.offset.reset' to earliest (for new consumer) or smallest (for old consumer).
Kafka maintains offset per partition per record basis. While consumer was off for 2 minute duration, offset value would be stored in topic metadata for new-consumer, and again when the consumer is started back after 2minutes, it would read last offset which was stored in kafka topic.
I think what you need to check is kafka broker data retention policy if it is less than 2 minutes , data would be lost , if data corresponding to offset is not present , it would start reading from latest as by default value is set to latest auto.offset.reset=latest for new data arriving.
I would suggest to check and change kafka data retention policy accordingly if it is less than 2 minutes
I am using storm-kafka-0.9.3 to read data from the Kafka and process those data in Storm. Below is the Kafka Spout I am using.But the problem is when I kill the Storm cluster, it does not read old data which was sent during the time it was dead, it start reading from the latest offset.
BrokerHosts hosts = new ZkHosts(Constants.ZOOKEEPER_HOST);
SpoutConfig spoutConfig = new SpoutConfig(hosts, CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME
, "/" + CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME,UUID.randomUUID().toString());
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
//Never should make this true
spoutConfig.forceFromStart=false;
spoutConfig.startOffsetTime =-2;
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
return kafkaSpout;
Thanks All,
Since I was running the Topology in Local mode,Storm did not store Offset in ZK, when I ran the topology in Prod mode It got resolved.
Sougata
I believe this might happen because while the topology is running it used to keep all the state information to zookeeper using the following path SpoutConfig.zkRoot+ "/" + SpoutConfig.id so that in case of failure it can resume from the last written offset in zookeeper.
Got this from the doc
Important:When re-deploying a topology make sure that the settings for SpoutConfig.zkRoot and SpoutConfig.id were not modified, otherwise the spout will not be able to read its previous consumer state information (i.e. the offsets) from ZooKeeper -- which may lead to unexpected behavior and/or to data loss, depending on your use case.
In your case as the SpoutConfig.id is a random value UUID.randomUUID().toString() Its not able to retrieve the last committed offset.
Also read from the same page
when a topology has run once the setting KafkaConfig.startOffsetTime will not have an effect for subsequent runs of the topology because now the topology will rely on the consumer state information (offsets) in ZooKeeper to determine from where it should begin (more precisely: resume) reading. If you want to force the spout to ignore any consumer state information stored in ZooKeeper, then you should set the parameter KafkaConfig.ignoreZkOffsets to true. If true, the spout will always begin reading from the offset defined by KafkaConfig.startOffsetTime as described above
You could possibly use a static id to see if it is able to retrieve.
You need to set the spoutConfig.zkServers and spoutConfig.zkPort :
BrokerHosts hosts = new ZkHosts(Constants.ZOOKEEPER_HOST);
SpoutConfig spoutConfig = new SpoutConfig(hosts, CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME
, "/" + CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME,"test");
spoutConfig.zkPort=Constants.ZOOKEEPER_PORT;
spoutConfig.zkServers=Constants.ZOOKEEPER_SERVERS;
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
return kafkaSpout;