How can Flink read the newest data from Kafka - apache-kafka

In my scenario, Flink should read only the newest data from Kafka every time.
For example,
Kafka produces:
log1
log2
log3
When reading, only log3 is needed.
In the Kafka consumer API, seekToEnd() can do this.
Does FlinkKafkaConsumer have an equivalent?

Flink 1.3 has this function:
FlinkKafkaConsumer09 flinkKafkaConsumer09 = new FlinkKafkaConsumer09<>(
        properties.getProperty("topic"),
        new RowDeserializationSchema(properties.getProperty("separator"), resultType),
        properties);
flinkKafkaConsumer09.setStartFromLatest();
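For context, here is a minimal end-to-end sketch of where that call fits. This is an illustration rather than the asker's exact job: it uses SimpleStringSchema instead of the custom RowDeserializationSchema, and the topic name and broker address are placeholders.

import java.util.Properties;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class ReadLatestFromKafka {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        properties.setProperty("group.id", "read-latest-demo");        // placeholder

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        FlinkKafkaConsumer09<String> consumer =
                new FlinkKafkaConsumer09<>("logs", new SimpleStringSchema(), properties);

        // The Flink counterpart of KafkaConsumer#seekToEnd(): ignore committed offsets
        // and start from the latest offset of each partition, i.e. read only records
        // that arrive after the job starts.
        consumer.setStartFromLatest();

        env.addSource(consumer).print();
        env.execute("read-latest-from-kafka");
    }
}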

Related

Is the Kafka producer API 0.8.2.1 incompatible with a 1.0.1 broker?

I was using a Kafka producer (version 0.8.2.1) to write asynchronously to a Kafka broker (version 1.0.1).
My code is like below:
KafkaProducer<String, String> producer = new KafkaProducer<>(configs);
ProducerRecord<String, String> producerRecord = new ProducerRecord<>("topic", "key", "value");
producer.send(producerRecord, new Callback() {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (metadata != null) {
            System.out.println(metadata.partition() + "|" + metadata.offset());
        }
    }
});
I found that the partition offset printed in my producer app's log in the onCompletion method was bigger than the broker's offset queried with the shell command "./kafka-run-class.sh kafka.tools.GetOffsetShell".
My producer was configured with acks=all.
For example, partition 0's offset is 30000 in the log, but 10000 when queried by the shell command.
Is this caused by a version compatibility problem?
The producer API was rewritten around Kafka 0.9 such that offsets are stored in Kafka, not Zookeeper. It's not clear whether you used GetOffsetShell with the Zookeeper option or not.
Newer brokers are mostly backwards compatible down to version 0.10.2, but you shouldn't expect clients older than those versions to work correctly with newer broker versions.
https://cwiki.apache.org/confluence/display/KAFKA/Compatibility+Matrix
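As a cross-check independent of GetOffsetShell, the broker-side log-end offsets can also be queried with a modern Java consumer. A rough sketch, assuming a 0.10.1+ client (where endOffsets() is available); the topic name and broker address are placeholders:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class LogEndOffsets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "offset-check");            // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("topic").stream()
                    .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
                    .collect(Collectors.toList());
            // Log-end offsets as reported by the broker, per partition.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
            endOffsets.forEach((tp, offset) -> System.out.println(tp + " -> " + offset));
        }
    }
}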

Flink Kafka connector to eventhub

I am using Apache Flink and trying to connect to Azure Event Hubs using the Apache Kafka protocol to receive messages from it. I managed to connect to Event Hubs and receive messages, but I can't use the Flink feature "setStartFromTimestamp(...)" as described here (https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-start-position-configuration).
When I try to get messages from a timestamp, the Kafka client says that the message format on the broker side is older than 0.10.0.
Has anybody faced this?
Apache Kafka client version is 2.0.1
Apache Flink version is 1.7.2
UPDATE: I tried the Azure Event Hubs for Kafka quickstart examples (https://github.com/Azure/azure-event-hubs-for-kafka/tree/master/quickstart/java); in the consumer package I added code to get an offset by timestamp, and it returns null, as expected when the message format version is below Kafka 0.10.0:
List<PartitionInfo> partitionInfos = consumer.partitionsFor(TOPIC);
List<TopicPartition> topicPartitions = partitionInfos.stream()
        .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
        .collect(Collectors.toList());
Map<TopicPartition, Long> topicPartitionToTimestampMap = topicPartitions.stream()
        .collect(Collectors.toMap(tp -> tp, tp -> 0L));
Map<TopicPartition, OffsetAndTimestamp> offsetAndTimestamp = consumer.offsetsForTimes(topicPartitionToTimestampMap);
System.out.println(offsetAndTimestamp);
Sorry we missed this. Kafka offsetsForTimes() is now supported in EH (previously unsupported).
Feel free to open an issue against our GitHub in the future. https://github.com/Azure/azure-event-hubs-for-kafka
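With offsetsForTimes() now supported on the Event Hubs side, the setStartFromTimestamp(...) feature from the question should work. A minimal sketch, assuming Flink 1.7's universal FlinkKafkaConsumer and the usual Event Hubs Kafka endpoint settings; the namespace, connection string, and topic name are placeholders:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class EventHubFromTimestamp {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "mynamespace.servicebus.windows.net:9093"); // placeholder
        props.setProperty("security.protocol", "SASL_SSL");
        props.setProperty("sasl.mechanism", "PLAIN");
        props.setProperty("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"$ConnectionString\" password=\"<event-hubs-connection-string>\";"); // placeholder
        props.setProperty("group.id", "$Default");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("mytopic", new SimpleStringSchema(), props);

        // Relies on the broker answering offsetsForTimes(): start from the first record
        // whose timestamp is at or after the given epoch milliseconds (here, one hour ago).
        consumer.setStartFromTimestamp(System.currentTimeMillis() - 3600_000L);

        env.addSource(consumer).print();
        env.execute("eventhub-from-timestamp");
    }
}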

Does PySpark support the spark-streaming-kafka-0-10 lib?

My Kafka cluster version is 0.10.0.0, and I want to use PySpark Streaming to read Kafka data. But in the Spark Streaming + Kafka Integration Guide, http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
there is no Python code example.
So can PySpark use spark-streaming-kafka-0-10 to integrate with Kafka?
Thank you in advance for your help!
I also use Spark Streaming with a Kafka 0.10.0 cluster. After adding the following line to your configuration, you are good to go:
spark.jars.packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0
And here is a sample in Python:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Initialize SparkContext
sc = SparkContext(appName="sampleKafka")
# Initialize spark stream context
batchInterval = 10
ssc = StreamingContext(sc, batchInterval)
# Set kafka topic
topic = {"myTopic": 1}
# Set application groupId
groupId = "myTopic"
# Set zookeeper parameter
zkQuorum = "zookeeperhostname:2181"
# Create Kafka stream
kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, groupId, topic)
#Do as you wish with your stream
# Start stream
ssc.start()
ssc.awaitTermination()
You can use spark-streaming-kafka-0-8 when your brokers are 0.10 and later. spark-streaming-kafka-0-8 supports newer broker versions, while spark-streaming-kafka-0-10 does not support older broker versions. spark-streaming-kafka-0-10, as of now, is still experimental and has no Python support.

Flink Kafka consumer groupId not working

I am using Kafka with Flink.
In a simple program, I used Flink's FlinkKafkaConsumer09 and assigned a group id to it.
According to Kafka's behavior, when I run 2 consumers on the same topic with the same group.id, it should work like a message queue. I think it's supposed to work like this:
If 2 messages are sent to Kafka, the Flink programs together would process the 2 messages in total (let's say 2 lines of output in total).
But the actual result is that each program receives both messages.
I have tried the consumer client that comes with the Kafka server download. It worked in the documented way (2 messages processed in total).
I tried using 2 Kafka consumers in the same main function of a Flink program: 4 messages processed in total.
I also tried running 2 instances of Flink, each given the same Kafka consumer program: 4 messages.
Any ideas?
This is the output I expect:
1> Kafka and Flink2 says: element-65
2> Kafka and Flink1 says: element-66
Here's the wrong output I always get:
1> Kafka and Flink2 says: element-65
1> Kafka and Flink1 says: element-65
2> Kafka and Flink2 says: element-66
2> Kafka and Flink1 says: element-66
And here is the segment of code:
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    ParameterTool parameterTool = ParameterTool.fromArgs(args);
    DataStream<String> messageStream = env.addSource(
            new FlinkKafkaConsumer09<>(parameterTool.getRequired("topic"),
                    new SimpleStringSchema(), parameterTool.getProperties()));
    messageStream.rebalance().map(new MapFunction<String, String>() {
        private static final long serialVersionUID = -6867736771747690202L;

        @Override
        public String map(String value) throws Exception {
            return "Kafka and Flink1 says: " + value;
        }
    }).print();
    env.execute();
}
I have tried running it twice, and also the other way:
creating 2 data streams and calling env.execute() for each one in the main function.
There was a quite similar question on the Flink user mailing list today, but I can't find the link to post it here. So here is part of the answer:
"Internally, the Flink Kafka connectors don’t use the consumer group
management functionality because they are using lower-level APIs
(SimpleConsumer in 0.8, and KafkaConsumer#assign(…) in 0.9) on each
parallel instance for more control on individual partition
consumption. So, essentially, the “group.id” setting in the Flink
Kafka connector is only used for committing offsets back to ZK / Kafka
brokers."
Maybe that clarifies things for you.
Also, there is a blog post about working with Flink and Kafka that may help you (https://data-artisans.com/blog/kafka-flink-a-practical-how-to).
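To make the quoted point concrete, here is a hedged sketch with the plain Kafka client showing the two modes: subscribe() uses group management (consumers sharing a group.id split the partitions between them), while assign(), which the Flink connector effectively uses on each parallel instance, reads fixed partitions regardless of other members of the group. Topic, partition, and broker names are placeholders.

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SubscribeVersusAssign {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "flink-demo");              // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // subscribe(): group management. Two consumers with this group.id share the
        // partitions, so each record is delivered to only one of them (queue-like behaviour).
        KafkaConsumer<String, String> subscribing = new KafkaConsumer<>(props);
        subscribing.subscribe(Collections.singletonList("mytopic"));

        // assign(): manual partition assignment. The consumer reads exactly the partitions
        // it is given, independent of any other consumer using the same group.id. This is
        // why two separate Flink jobs with the same group.id each see every message.
        KafkaConsumer<String, String> assigning = new KafkaConsumer<>(props);
        assigning.assign(Collections.singletonList(new TopicPartition("mytopic", 0)));

        subscribing.close();
        assigning.close();
    }
}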
Since there is not much use of the Flink Kafka consumer's group.id other than committing offsets to Zookeeper, is there any way of monitoring offsets as far as the Flink Kafka consumer is concerned? I can see there is a way [with the help of consumer-groups/consumer-offset-checker] for console consumers, but not for Flink Kafka consumers.
We want to see how far our Flink Kafka consumer is lagging behind the Kafka topic size [total number of messages in the topic at a given point in time]; it is fine to have it at the partition level.
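One rough way to approximate that lag from outside Flink, relying only on the offsets the connector commits under its group.id, is to compare each partition's committed offset with its log-end offset. A hedged sketch (group, topic, and broker names are placeholders; it assumes the Flink consumer is committing offsets back to Kafka):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class FlinkConsumerLag {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "my-flink-group");          // the group.id configured on the Flink consumer
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("mytopic").stream()
                    .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
                    .collect(Collectors.toList());

            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
            for (TopicPartition tp : partitions) {
                // Offset last committed for this group (by the Flink connector), or null if none yet.
                OffsetAndMetadata committed = consumer.committed(tp);
                long lag = committed == null ? -1 : endOffsets.get(tp) - committed.offset();
                System.out.println(tp + " lag=" + lag);
            }
        }
    }
}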

Kafka 0.8: is it possible to create a topic with partitions and replication using Java code?

In Kafka 0.8 beta, a topic can be created using a command like the one below, as mentioned here:
bin/kafka-create-topic.sh --zookeeper localhost:2181 --replica 2 --partition 3 --topic test
the above command will create a topic named "test" with 3 partitions and 2 replicas per partition.
Can I do the same thing using Java?
So far, what I have found is that using Java we can create a producer, as seen below:
Producer<String, String> producer = new Producer<String, String>(config);
producer.send(new KeyedMessage<String, String>("mytopic", msg));
This will create a topic named "mytopic" with the number of partitions specified via the "num.partitions" attribute, and start producing.
But is there a way to define the partitions and replication as well? I couldn't find any such example. If we can't, does that mean we always need to create the topic with partitions and replication (as per our requirements) beforehand, and then use the producer to produce messages to that topic? For example, would it be possible to create "mytopic" the same way but with a different number of partitions (overriding the num.partitions attribute)?
Note: My answer covers Kafka 0.8.1+, i.e. the latest stable version available as of April 2014.
Yes, you can create a topic programmatically via the Kafka API. And yes, you can specify the desired number of partitions as well as the replication factor for the topic.
Note that the recently released Kafka 0.8.1+ provides a slightly different API than Kafka 0.8.0 (which was used by Biks in his linked reply). I added a code example to create a topic in Kafka 0.8.1+ to my reply to the question How Can we create a topic in Kafka from the IDE using API that Biks was referring to above.
import java.util.Properties;

import kafka.admin.AdminUtils;
import kafka.utils.ZKStringSerializer$;
import kafka.utils.ZkUtils;
import org.I0Itec.zkclient.ZkClient;
import org.I0Itec.zkclient.ZkConnection;

// 'topic' is assumed to be your own descriptor object carrying the topic name,
// partition count, and replication factor.
String zkConnect = "localhost:2181";
ZkClient zkClient = new ZkClient(zkConnect, 10 * 1000, 8 * 1000, ZKStringSerializer$.MODULE$);
ZkUtils zkUtils = new ZkUtils(zkClient, new ZkConnection(zkConnect), false);

Properties pop = new Properties();
AdminUtils.createTopic(zkUtils, topic.getTopicName(), topic.getPartitionCount(),
        topic.getReplicationFactor(), pop);

zkClient.close();