How to read values from a Kafka topic using a consumer through akka-streams/alpakka-kafka? - scala

When I run this with Consumer.plainSource, nothing happens. Isn't this one way of reading from a Kafka topic through streams?
val consumerSettings2 = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
  .withBootstrapServers("localhost:3333")
  .withGroupId("ssss")
  .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
val source: Source[ConsumerRecord[String, String], Consumer.Control] =
  Consumer.plainSource(consumerSettings2, Subscriptions.topics("candy"))
val sink =
  Sink.foreach[ConsumerRecord[String, String]](x => println("consumed " + x))
source.runWith(sink)

Related

Flink crash with “java.lang.IllegalArgumentException” after I fetch some data to kafka topic

I have a Flink program consuming a Kafka topic. I use Spark to send some messages (a JSON string I copied from the topic) to the topic; what I want to do is manually trigger the Flink computation. Flink then crashed instantly with the following error:
java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:275)
at kafka.message.Message.sliceDelimited(Message.scala:236)
at kafka.message.Message.payload(Message.scala:218)
at org.apache.flink.streaming.connectors.kafka.internals.SimpleConsumerThread.run(SimpleConsumerThread.java:338)
Can anyone tell me why this happened and how to resolve it?
Here is the Spark code I use to write the JSON string to Kafka:
// Connect Kafka
println("Connecting kafka")
val KAFKA_QUEUE_TIME = 5000
val KAFKA_BATCH_SIZE = 16384
val brokerList = "kafka05broker01.cnsuning.com:9092,kafka05broker02.cnsuning.com:9092,kafka05broker03.cnsuning.com:9092"
val props = new Properties
props.put("serializer.class", "kafka.serializer.StringEncoder")
props.put("partitioner.class", "utils.SimplePartitioner")
props.put("metadata.broker.list", brokerList)
props.put("producer.type", "async")
props.put("queue.time", "5000")
props.put("batch.size", "16384")
val config = new ProducerConfig(props)
val producer = new Producer[AnyRef, AnyRef](config)
// Send Kafka
println("Sending msg to kafka")
val topic = "xxxxxx"
val msg = "xxx"
for (i <- 0 to 1000) {
  println(i)
  val randomPartition = "" + new Random().nextInt(255)
  val message = new KeyedMessage[AnyRef, AnyRef](topic, randomPartition, msg)
  producer.send(message)
}
Here is how I consume in Flink:
val allActProperties = kafkaPropertiesGen( GroupId, BrokerServer, ZKConnect)
val streamComsumer = new FlinkKafkaConsumer08[TraitRecord](topic, new TraitRecordSchema(), allActProperties)
val stream: DataStream[TraitRecord] = env.addSource(streamComsumer ).setParallelism(12)
Which version of Kafka are you using? It seems that the Kafka jar version does not correspond to your Kafka version, or that FlinkKafkaConsumer08 does not correspond to your Kafka version. See the related question "java.lang.IllegalArgumentException kafka console consumer".

Apache Kafka: How to receive latest message from Kafka?

I am consuming and processing messages in a Kafka consumer application using Spark in Scala. Sometimes it takes a little more time than usual to process messages from the Kafka message queue. When that happens, I need to consume the latest message, ignoring the earlier ones that have been published by the producer and are yet to be consumed.
Here is my consumer code:
object KafkaSparkConsumer extends MessageProcessor {
  def main(args: scala.Array[String]): Unit = {
    val properties = readProperties()
    val streamConf = new SparkConf().setMaster("local[*]").setAppName("Kafka-Stream")
    val ssc = new StreamingContext(streamConf, Seconds(1))
    val group_id = Random.alphanumeric.take(4).mkString("dfhSfv")
    val kafkaParams = Map("metadata.broker.list" -> properties.getProperty("broker_connection_str"),
      "zookeeper.connect" -> properties.getProperty("zookeeper_connection_str"),
      "group.id" -> group_id,
      "auto.offset.reset" -> properties.getProperty("offset_reset"),
      "zookeeper.session.timeout" -> properties.getProperty("zookeeper_timeout"))
    val msgStream = KafkaUtils.createStream[scala.Array[Byte], String, DefaultDecoder, StringDecoder](
      ssc,
      kafkaParams,
      Map("moved_object" -> 1),
      StorageLevel.MEMORY_ONLY_SER
    ).map(_._2)
    msgStream.foreachRDD { x =>
      x.foreach {
        msg => println("Message: " + msg)
          processMessage(msg)
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
Is there any way to make sure the consumer always gets the most recent message in the consumer application? Or do I need to set a property in the Kafka configuration to achieve the same?
Any help on this would be greatly appreciated. Thank you.
The Kafka consumer API includes the method
void seekToEnd(Collection<TopicPartition> partitions)
So you can get the assigned partitions from the consumer and seek to the end for all of them. There is a similar method, seekToBeginning.
You can leverage two KafkaConsumer APIs to get the very last message from a partition (assuming log compaction won't be an issue):
public Map<TopicPartition, Long> endOffsets(Collection<TopicPartition> partitions): This gives you the end offset of the given partitions. Note that the end offset is the offset of the next message to be delivered.
public void seek(TopicPartition partition, long offset): Run this for each partition, passing its end offset from the call above minus 1 (assuming the end offset is greater than 0).
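The offset arithmetic in the two steps above can be sketched in plain Scala. This is a minimal sketch, not Kafka's API: LastMessageOffsets and seekTargets are hypothetical names, and partitions are modeled as (topic, partition) pairs instead of TopicPartition so that the snippet runs without the Kafka client on the classpath.

```scala
object LastMessageOffsets {
  // Hypothetical stand-in for org.apache.kafka.common.TopicPartition.
  type Partition = (String, Int)

  // Input: the end offset reported for each partition, i.e. the offset of the
  // NEXT message to be delivered. Output: the offset to seek to in order to
  // re-read the last message. Partitions whose end offset is 0 are skipped,
  // because "end offset minus 1" would be negative for them.
  def seekTargets(endOffsets: Map[Partition, Long]): Map[Partition, Long] =
    endOffsets.collect { case (tp, end) if end > 0 => tp -> (end - 1) }
}
```

With a real consumer, each entry of the result would be passed to seek(partition, offset) as described above.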
You can always generate a new (random) group id when connecting to Kafka - that way you will start consuming new messages when you connect.
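A minimal sketch of that approach, assuming the consumer property names group.id, bootstrap.servers and auto.offset.reset; the FreshGroup object, the freshGroupProps helper and the "latest-reader-" prefix are made up for illustration. A brand-new group has no committed offsets, so where it starts reading is decided by auto.offset.reset; the sketch sets it to latest explicitly (the old ZooKeeper-based consumer used in the question spells this value largest).

```scala
import java.util.{Properties, UUID}

object FreshGroup {
  // Build consumer properties whose group id has never been used before,
  // so no committed offsets exist for it.
  def freshGroupProps(bootstrapServers: String): Properties = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("group.id", s"latest-reader-${UUID.randomUUID()}")
    // Without committed offsets, start from the newest messages.
    props.put("auto.offset.reset", "latest")
    props
  }
}
```

One trade-off of this design is that every restart leaves an unused consumer group behind, and messages published while the application was down are never read.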
Yes, you can set startingOffsets to latest to consume the latest messages.
val spark = SparkSession
  .builder
  .appName("kafka-reading")
  .getOrCreate()
import spark.implicits._
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("startingOffsets", "latest")
  .option("subscribe", topicName)
  .load()

Kafka and Spark: Get first offset of a topic via API

I am playing with Spark Streaming and Kafka (with the Scala API), and would like to read messages from a set of Kafka topics with Spark Streaming.
The following method:
val kafkaParams = Map("metadata.broker.list" -> configuration.getKafkaBrokersList(), "auto.offset.reset" -> "smallest")
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
reads from Kafka up to the latest available offset, but doesn't give me the metadata that I need (since I am reading from a set of topics, I need the topic of every message I read). This other method, KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Tuple2[String, String]](ssc, kafkaParams, currentOffsets, messageHandler), explicitly wants an offset that I don't have.
I know that there is this shell command that gives you the last offset:
kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list <broker>:<port> \
  --topic <topic-name> --time -1 --offsets 1
and KafkaCluster.scala is a developer API that used to be public and gives you exactly what I would like.
Any hints?
You can use the code from GetOffsetShell.scala (see the Kafka API documentation):
val consumer = new SimpleConsumer(leader.host, leader.port, 10000, 100000, clientId)
val topicAndPartition = TopicAndPartition(topic, partitionId)
val request = OffsetRequest(Map(topicAndPartition -> PartitionOffsetRequestInfo(time, nOffsets)))
val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(topicAndPartition).offsets
Or you can create a new consumer with a unique groupId and use it to get the first offset:
val consumer = new KafkaConsumer[String, String](createConsumerConfig(config.brokerList))
consumer.partitionsFor(config.topic).foreach(pi => {
  val topicPartition = new TopicPartition(pi.topic(), pi.partition())
  consumer.assign(List(topicPartition))
  consumer.seekToBeginning()
  val firstOffset = consumer.position(topicPartition)
  ...

Spark Streaming + Kafka: how to check name of topic from kafka message

I am using Spark Streaming to read from a list of Kafka topics.
I am following the official API documentation. The method I am using is:
val kafkaParams = Map("metadata.broker.list" -> configuration.getKafkaBrokersList(), "auto.offset.reset" -> "largest")
val topics = Set(configuration.getKafkaInputTopic())
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
I am wondering how the executors will read the messages from the list of topics. What will their policy be? Will they read one topic and then, when they finish its messages, move on to the other topics?
And most importantly, how can I, after calling this method, check the topic of a message in the RDD?
stream.foreachRDD(rdd => rdd.map(t => {
  val key = t._1
  val json = t._2
  val topic = ???
}))
"I am wondering how the executors will read the messages from the list of topics. What will their policy be? Will they read one topic and then, when they finish its messages, move on to the other topics?"
In the direct streaming approach, the driver is responsible for reading the offsets of the Kafka topics you want to consume. It creates a mapping between topics, partitions and the offsets that need to be read. After that, the driver assigns each worker a range of offsets to read from a specific Kafka topic partition. This means that if a single worker can run 2 tasks simultaneously (just for the sake of the example; it can usually run many more), it can potentially read from two separate Kafka topics concurrently.
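The bookkeeping described above can be illustrated with a simplified model in plain Scala. This is not Spark's implementation: OffsetRange here is a local case class (Spark has a class of the same name, but this one is only a stand-in), and planBatch is a made-up helper that pairs the offsets already consumed with the latest available offsets to produce one range per topic partition.

```scala
object DirectStreamModel {
  // One unit of work for a task: read offsets [fromOffset, untilOffset)
  // of a single topic partition.
  final case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

  // "Driver side" planning: for every known topic partition, pair the offset
  // consumed so far (0 if the partition has not been read yet) with the
  // latest offset currently available.
  def planBatch(
      consumed: Map[(String, Int), Long],
      latest: Map[(String, Int), Long]
  ): Seq[OffsetRange] =
    latest.toSeq
      .sortBy(_._1)
      .map { case ((topic, partition), until) =>
        val from = consumed.getOrElse((topic, partition), 0L)
        OffsetRange(topic, partition, from, until)
      }
}
```

Each range in the result can be handed to a separate task, which is why ranges belonging to different topics may be read at the same time.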
"How can I, after calling this method, check the topic of a message in the RDD?"
You can use the overload of createDirectStream that takes a messageHandler function of type MessageAndMetadata[K, V] => R:
val topicsToPartitions: Map[TopicAndPartition, Long] = ???
val stream: DStream[(String, String)] =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc,
    kafkaParams,
    topicsToPartitions,
    (mam: MessageAndMetadata[String, String]) => (mam.topic, mam.message()))

Spark-Stream Differentiating Kafka Topics

If I feed my Spark-Streaming Application more than one topic like so:
val ssc = new StreamingContext(sc, Seconds(2))
val topics = Set("raw_1", "raw_2")
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
When I run my application, how can I figure out from my stream which topic each message is pulled from? Is there a way to do that? If I do something like
val lines = stream.print()
I get nothing that differentiates them. Is the only way to do it to make the Kafka message key a denoting factor?
Yes, you can use the MessageAndMetadata version of createDirectStream, which allows you to access message metadata.
You can find an example implementation here.