Kafka producer.send() is Stopped by producer.close() - scala

I am trying to send the output of a word-count program (in Spark with Scala) to a Kafka topic named "test". See the code below:
val Dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
val lines = Dstream.map(f => f._2)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

wordCounts.foreachRDD(
  rdd => rdd.foreach(
    f => {
      val sendProps = new Properties()
      sendProps.put("metadata.broker.list", brokers)
      sendProps.put("serializer.class", "kafka.serializer.StringEncoder")
      sendProps.put("producer.type", "async")

      val config = new ProducerConfig(sendProps)
      val producer = new Producer[String, String](config)
      producer.send(new KeyedMessage[String, String]("test", f._1 + " " + f._2))
      producer.close()
    }))
The problem is that some words are randomly missing from the output. I also noticed that if I remove the statement
producer.close()
there is no data loss.
Does this mean producer.close() interrupts producer.send() before it actually puts the data in the buffer, so that particular tuple is never sent to the consumer? If yes, how should I close the producer without risking data loss?
The above was my initial problem, and it was solved by Vale's answer.
Now, when I change the producer.type property, data goes missing randomly again:
sendProps.put("producer.type", "sync")
To clarify: producer.send() runs for every word I need to put on the output topic, but some words still go missing and never show up in the output Kafka topic.

This is weird. The close() method should wait for the send to have finished, which is why a close(time) method was introduced, as you can see here.
So, I use Java 7. Is rdd.foreach operating on each partition inside it, or on each tuple (as I think it is)?
If the latter, could you try rdd.foreachPartition (refer to this)? Because you are creating a producer for each line you take, and I fear this could be causing problems (although theoretically it shouldn't).
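For reference, a minimal sketch of what that could look like, assuming the same legacy kafka.producer API as in the question and that brokers is visible on the executors:

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

// Sketch only: build one producer per partition instead of one per record.
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val sendProps = new Properties()
    sendProps.put("metadata.broker.list", brokers)
    sendProps.put("serializer.class", "kafka.serializer.StringEncoder")
    sendProps.put("producer.type", "async")

    val producer = new Producer[String, String](new ProducerConfig(sendProps))
    partition.foreach { case (word, count) =>
      producer.send(new KeyedMessage[String, String]("test", word + " " + count))
    }
    // Close once, after every record of this partition has been handed to send().
    producer.close()
  }
}

This keeps producer creation (and close() calls) down to one per partition per batch, which also reduces the chance of a close() racing an in-flight asynchronous send.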

Related

Apache Flink: Kafka Producer in IDE execution not working as expected

I have a sample streaming WordCount example written in Flink (Scala). In it, I want to write the result to Kafka using the Flink Kafka producer, but it is not working as expected.
My code is as follows:
object WordCount {
  def main(args: Array[String]) {
    // set up the execution environment
    val env = StreamExecutionEnvironment
      .getExecutionEnvironment
      .setStateBackend(new RocksDBStateBackend("file:///path/to/checkpoint", true))

    // start a checkpoint every 1000 ms
    env.enableCheckpointing(1000)
    // set mode to exactly-once (this is the default)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    // make sure 500 ms of progress happen between checkpoints
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
    // checkpoints have to complete within one minute, or are discarded
    env.getCheckpointConfig.setCheckpointTimeout(60000)
    // prevent the tasks from failing if an error happens in their checkpointing; the checkpoint will just be declined
    env.getCheckpointConfig.setFailOnCheckpointingErrors(false)
    // allow only one checkpoint to be in progress at the same time
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)

    // prepare Kafka consumer properties
    val kafkaConsumerProperties = new Properties
    kafkaConsumerProperties.setProperty("zookeeper.connect", "localhost:2181")
    kafkaConsumerProperties.setProperty("group.id", "flink")
    kafkaConsumerProperties.setProperty("bootstrap.servers", "localhost:9092")

    // set up Kafka Consumer
    val kafkaConsumer = new FlinkKafkaConsumer[String]("input", new SimpleStringSchema, kafkaConsumerProperties)

    println("Executing WordCount example.")

    // get text from Kafka
    val text = env.addSource(kafkaConsumer)

    val counts: DataStream[(String, Int)] = text
      // split up the lines in pairs (2-tuples) containing: (word, 1)
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      // group by the tuple field "0" and sum up tuple field "1"
      .keyBy(0)
      .mapWithState((in: (String, Int), count: Option[Int]) =>
        count match {
          case Some(c) => ((in._1, c), Some(c + in._2))
          case None    => ((in._1, 1), Some(in._2 + 1))
        })

    // emit result
    println("Printing result to stdout.")
    counts.map(_.toString()).addSink(new FlinkKafkaProducer[String]("output", new SimpleStringSchema,
      kafkaProperties))

    // execute program
    env.execute("Streaming WordCount")
  }
}
The data I sent to Kafka input topic is:
hi
hello
I don't get any output in the Kafka topic "output". Since I am a newbie to Apache Flink, I don't know how to achieve the expected result. Can anyone help me get the correct behavior?
I ran your code in my local environment, and everything is OK. I think you can try the command below:
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic output --from-beginning

Spark Streaming + Kafka: how to check name of topic from kafka message

I am using Spark Streaming to read from a list of Kafka Topics.
I am following the official API at this link. The method I am using is:
val kafkaParams = Map("metadata.broker.list" -> configuration.getKafkaBrokersList(), "auto.offset.reset" -> "largest")
val topics = Set(configuration.getKafkaInputTopic())
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
I am wondering how the executors will read the messages from the list of topics. What will their policy be? Will they read one topic and then, when they finish its messages, move on to the other topics?
And most importantly, after calling this method, how can I check which topic a message in the RDD came from?
stream.foreachRDD(rdd => rdd.map(t => {
  val key = t._1
  val json = t._2
  val topic = ???
}))
I am wondering how the executors will read the messages from the list of topics. What will their policy be? Will they read one topic and then, when they finish its messages, move on to the other topics?
In the direct streaming approach, the driver is responsible for reading the offsets of the Kafka topics you want to consume. What it does is create a mapping between topics, partitions, and the offsets that need to be read. After that happens, the driver assigns each worker a range to read from a specific Kafka topic. This means that if a single worker can run 2 tasks simultaneously (just for the sake of the example, it can usually run many more), then it can potentially read from two separate Kafka topics concurrently.
After calling this method, how can I check which topic a message in the RDD came from?
You can use the overload of createDirectStream that takes a messageHandler: MessageAndMetadata[K, V] => R:
val topicsToPartitions: Map[TopicAndPartition, Long] = ???

val stream: DStream[(String, String)] =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc,
    kafkaParams,
    topicsToPartitions,
    (mam: MessageAndMetadata[String, String]) => (mam.topic, mam.message()))
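With that handler, every record in the stream already carries its topic, so (as a hedged usage sketch) you can pattern match on the pair downstream:

stream.foreachRDD { rdd =>
  rdd.foreach { case (topic, json) =>
    // `topic` is the Kafka topic this record was read from
    println(s"$topic -> $json")
  }
}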

Spark Streaming using Kafka: empty collection exception

I'm developing an algorithm using Kafka and Spark Streaming. This is part of my receiver:
val Array(brokers, topics) = args
val sparkConf = new SparkConf().setAppName("Traccia2014")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

val slice = 30
val lines = messages.map(_._2)
val dStreamDst = lines.transform(rdd => {
  val y = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
  rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong / slice).round * slice + " " + x.split(",")(2), 1)).reduceByKey(_ + _)
})
dStreamDst.print()
on which I get the following error:
ERROR JobScheduler: Error generating jobs for time 1484927230000 ms
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
What does it mean? How can I solve it?
Any kind of help is truly appreciated. Thanks in advance.
Update:
Solved. Don't use the transform or print() methods. Use foreachRDD; it is the best solution.
You are encountering this because you are interacting with the DStream using the transform() API. When using that method, you are given the RDD that represents the snapshot of data in time, in your case the 10 second window. Your code is failing because at a particular time window there was no data, and the RDD you are operating on is empty, giving you the "empty collection" error when you invoke reduce().
Use rdd.isEmpty() to ensure that the RDD is not empty before invoking your operation.
lines.transform(rdd => {
  if (rdd.isEmpty)
    rdd
  else {
    // rest of transformation
  }
})
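As a hedged sketch of the foreachRDD route mentioned in the update, the same emptiness guard can wrap the original computation (variable names taken from the question; collect().foreach(println) stands in for print() and assumes each batch is small enough to bring to the driver):

lines.foreachRDD { rdd =>
  // Skip empty micro-batches entirely: reduce() on an empty RDD throws.
  if (!rdd.isEmpty) {
    val y = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
    rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong / slice).round * slice + " " + x.split(",")(2), 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)
  }
}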

Spark-Stream Differentiating Kafka Topics

If I feed my Spark-Streaming Application more than one topic like so:
val ssc = new StreamingContext(sc, Seconds(2))
val topics = Set("raw_1", "raw_2")
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
When I run my application, how can I figure out from my stream which topic a given message is being pulled from? Is there a way to do this? If I do something like
val lines = stream.print()
I get nothing that differentiates the topics. Is the only way to do it to make the Kafka message key a denoting factor?
Yes, you can use the MessageAndMetadata version of createDirectStream, which allows you to access the message metadata.
You can find an example implementation here.
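As a hedged sketch of that overload (the full setup is in the linked example), it needs starting offsets and returns whatever your handler builds, e.g. a (topic, message) pair per record; fromOffsets is left as a placeholder here:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Starting offset for each topic/partition you want to consume (placeholder).
val fromOffsets: Map[TopicAndPartition, Long] = ???

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc,
  kafkaParams,
  fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message()))

stream.foreachRDD { rdd =>
  rdd.foreach { case (topic, message) =>
    // topic tells you which of "raw_1" / "raw_2" the record came from
    println(s"$topic: $message")
  }
}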

Spark Streaming MQTT

I've been using Spark to stream data from Kafka, and it's pretty easy.
I thought using the MQTT utils would also be easy, but for some reason it is not.
I'm trying to execute the following piece of code.
val sparkConf = new SparkConf(true).setAppName("amqStream").setMaster("local")
val ssc = new StreamingContext(sparkConf, Seconds(10))

val actorSystem = ActorSystem()
implicit val kafkaProducerActor = actorSystem.actorOf(Props[KafkaProducerActor])

MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
  .foreachRDD { rdd =>
    println("got rdd: " + rdd.toString())
    rdd.foreach { msg =>
      println("got msg: " + msg)
    }
  }

ssc.start()
ssc.awaitTermination()
The weird thing is that Spark logs the message I sent in the console, but not my println.
It logs something like this:
19:38:18.803 [RecurringTimer - BlockGenerator] DEBUG o.a.s.s.receiver.BlockGenerator - Last element in input-0-1435790298600 is SOME MESSAGE
foreach is a distributed action, so your println may be executing on the workers. If you want to see some of the messages printed out locally, you could use the built-in print function on the DStream, or, instead of your current foreachRDD logic, collect (or take) some of the elements back to the driver and print them there. Hope that helps, and best of luck with Spark Streaming :)
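For instance, a minimal sketch of the take-and-print-on-the-driver option, reusing the stream from the question:

MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
  .foreachRDD { rdd =>
    // take() ships a handful of elements back to the driver, so println runs locally
    rdd.take(10).foreach(msg => println("got msg: " + msg))
  }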
If you wish to just print incoming messages, try something like this instead of the foreach (translating from a working Python version, so do check for Scala typos):
val mqttStream = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
mqttStream.print()