Consuming from the beginning of a kafka topic with Flink - scala

How do I make sure I always consume from the beginning of a Kafka topic with Flink?
With the Kafka 0.9.x consumer that is part of Flink 1.0.2, it appears that it's no longer Kafka but Flink to control the offset:
Flink snapshots the offsets internally as part of its
distributed checkpoints. The offsets committed to Kafka / ZooKeeper
are only to bring the outside view of progress in sync with Flink's
view of the progress. That way, monitoring and other jobs can get a
view of how far the Flink Kafka consumer has consumed a topic.
This is how far I got, but my Flink program always starts where it left off, and doesn't return to the beginning as the configuration instructs it to:
val props = new Properties()
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "myflinkservice")
props.setProperty("auto.offset.reset", "earliest")
val incomingData = env.addSource(
new FlinkKafkaConsumer09[IncomingDataRecord](
"my.topic.name",
new IncomingDataSchema,
props
)
)

Use:
consumer.setStartFromEarliest();

I think you can get around this by specifying a random group.id:
val props = new Properties()
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", s"myflinkservice_${UUID.randomUUID}")
props.setProperty("auto.offset.reset", "smallest") // "smallest", not "earliest"
auto.offset.reset only works when there's no initial offset available in ZooKeeper.

Related

How to consume Kafka Topic in Consumer MQ Topic

I have a requirement where I need to consume Kafka Topic and write it into MQ Topic. Can someone advise me the best way to do it, I am new to Kafka.
I have read about the IBM MQ Connector in confluent but could not get the idea how to implement it.
The best way to move data from Kafka to MQ is to use the IBM MQ sink connector: https://github.com/ibm-messaging/kafka-connect-mq-sink
This is a Kafka Connect connector. The README contains details for building and running it.
Kafka has a component called Kafka Connect. It is used to read and write data to/from Kafka into other systems such as Database in your case MQ.
Kafka connect can have two kind of connectors
Source connectors - Read data from an external system and write to Kafka (For eg. Read inserted/modified rows from a table in DB and insert into a topic in Kafka)
Sink Connector - Read message from Kafka write to external system.
The link you have added is a source connector, it will read messages from the MQ and write to Kafka.
For simple use case you do not need Kafka connect. You can write a simple Kafka consumer that will read data from Kafka topic and write it to MQ.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records)
//Insert code to append to MQ here
System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
}

Too many UnknownProducerIdException in kafka broker for kafka internal topics created for kafka streams application

One of Kafka stream application is generating a lot of Unknown Producer Id errors in the Kafka brokers as well as on the consumer side.
Stream Configs are as below:
final Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, appName);
streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG,appName + "-Client");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, this.bootstrapServer);
streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.Long().getClass().getName());
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,StreamsConfig.EXACTLY_ONCE);
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, offset);
streamsConfiguration.put(StreamsConfig.STATE_DIR_CONFIG,state_dir);
streamsConfiguration.put(StreamsConfig.REPLICATION_FACTOR_CONFIG,defaultReplication);
return streamsConfiguration;
Error on the broker side:
Error on the consumer side:
custom configuration for repartition internal topic:
prod.Prod-Job-Summary-v0.4-KTABLE-AGGREGATE-STATE-STORE-0000000049-repartition
What can be the reason behind these?
It's a known issue. See KAFKA-7190
Under low traffic conditions purging repartition topics cause WARN statements about UNKNOWN_PRODUCER_ID and KIP-360: Improve handling of unknown producer.

Kafka spout integration

I am using kafka 0.10.1.1 and storm 1.0.2. In the storm documentation for kafka integration , i can see that offsets are still maintained using zookeeper as we are initializing kafka spout using zookeeper servers.
How can i bootstrap the spout using kafka servers .Is there any example for this .
Example from storm docs
BrokerHosts hosts = new ZkHosts(zkConnString);
SpoutConfig spoutConfig = new SpoutConfig(hosts, topicName, "/" + topicName, UUID.randomUUID().toString());
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
This option using zookeeper is working fine and is consuming the messages . but i was not able to see the consumer group or storm nodes as consumers in kafkamanager ui .
Alternate approach tried is this .
KafkaSpoutConfig<String, String> kafkaSpoutConfig = newKafkaSpoutConfig();
KafkaSpout<String, String> spout = new KafkaSpout<>(kafkaSpoutConfig);
private static KafkaSpoutConfig<String, String> newKafkaSpoutConfig() {
Map<String, Object> props = new HashMap<>();
props.put(KafkaSpoutConfig.Consumer.BOOTSTRAP_SERVERS, bootstrapServers);
props.put(KafkaSpoutConfig.Consumer.GROUP_ID, GROUP_ID);
props.put(KafkaSpoutConfig.Consumer.KEY_DESERIALIZER,
"org.apache.kafka.common.serialization.StringDeserializer");
props.put(KafkaSpoutConfig.Consumer.VALUE_DESERIALIZER,
"org.apache.kafka.common.serialization.StringDeserializer");
props.put(KafkaSpoutConfig.Consumer.ENABLE_AUTO_COMMIT, "true");
String[] topics = new String[1];
topics[0] = topicName;
KafkaSpoutStreams kafkaSpoutStreams =
new KafkaSpoutStreamsNamedTopics.Builder(new Fields("message"), topics).build();
KafkaSpoutTuplesBuilder<String, String> tuplesBuilder =
new KafkaSpoutTuplesBuilderNamedTopics.Builder<>(new TuplesBuilder(topicName)).build();
KafkaSpoutConfig<String, String> spoutConf =
new KafkaSpoutConfig.Builder<>(props, kafkaSpoutStreams, tuplesBuilder).build();
return spoutConf;
}
But this solution is showing CommitFailedException after reading few messages from kafka.
Storm-kafka writes consumer information in a different location and different format in zookeeper with common kafka client. So you can't see it in kafkamanager ui.
You can find some other monitor tools, like
https://github.com/keenlabs/capillary.
On your alternate approach, you're likely getting CommitFailedException due to:
props.put(KafkaSpoutConfig.Consumer.ENABLE_AUTO_COMMIT, "true");
Up to Storm 2.0.0-SNAPSHOT (and since 1.0.6) -
KafkaConsumer autocommit is unsupported
From the docs:
Note that KafkaConsumer autocommit is unsupported. The
KafkaSpoutConfig constructor will throw an exception if the
"enable.auto.commit" property is set, and the consumer used by the
spout will always have that property set to false. You can configure
similar behavior to autocommit through the setProcessingGuarantee
method on the KafkaSpoutConfig builder.
References:
http://storm.apache.org/releases/2.0.0-SNAPSHOT/storm-kafka-client.html
http://storm.apache.org/releases/1.0.6/storm-kafka-client.html

Flink with Kafka Consumer doesn't work

I want to benchmark Spark vs Flink, for this purpose I am making several tests. However Flink doesn't work with Kafka, meanwhile with Spark works perfect.
The code is very simple:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "myGroup")
println("topic: "+args(0))
val stream = env.addSource(new FlinkKafkaConsumer09[String](args(0), new SimpleStringSchema(), properties))
stream.print
env.execute()
I use kafka 0.9.0.0 with the same topics (in consumer[Flink] and producer[Kafka console]), but when I send my jar to the cluster, nothing happens:
Cluster Flink
What it could be happening?
Your stream.print will not print in console on flink .It will write to flink0.9/logs/recentlog. Other-wise you can add your own logger for confirming output.
For this particular case (a Source chained into a Sink) the Webinterface will never report Bytes/Records sent/received. Note that this will change in the somewhat near future.
Please check whether the job-/taskmanager logs do not contain any output.

How can I get the group.id of a topic in command line in Kafka?

I installed kafka on my server and want to learn how to use it,
I found a sample code written by scala, below is part of it,
def createConsumerConfig(zookeeper: String, groupId: String): ConsumerConfig = {
val props = new Properties()
props.put("zookeeper.connect", zookeeper)
props.put("group.id", groupId)
props.put("auto.offset.reset", "largest")
props.put("zookeeper.session.timeout.ms", "400")
props.put("zookeeper.sync.time.ms", "200")
props.put("auto.commit.interval.ms", "1000")
val config = new ConsumerConfig(props)
config
}
but I don't know how to find the group id on my server.
The group id is something you define yourself for your consumer by providing a string id for it. All consumers started with the same id will "cooperate" and read topics in a coordinated way where each consumer instance will handle a subset of the messages in a topic. Providing a non-existent group id will be considered to be a new consumer and create a new entry in Zookeeper where committed offsets will be stored.
You could get a Zookeeper shell and list path where Kafka stores consumers' offsets like this:
./bin/zookeeper-shell.sh localhost:2181
ls /consumers
You'll get a list of all groups.
EDIT: I missed the part where you said that you're setting this up yourself so I thought that you want to list the consumer groups of an existing cluster.
Lundahl is right, this is a property that you define, which is used to coordinate consumer threads so that they don't consume "each other's" messages (each consumes a subset). If you, for example, use 2 consumers with different groups, they'll each consume the whole topic.
/kafkadir/kafka-consumer-groups.sh --all-topics --bootstrap-server hostname:port --list