Kafka connection with Flink not working, but only with YARN deployment - apache-kafka

I developed a simple Flink job which consumes a Kafka stream. Everything works fine in my local environment, but when I try to run the Flink job on my cluster with YARN, nothing comes through (and I don't get any error message).
./bin/flink run -m yarn-cluster -yn 4 /Flink/Flink_Stream_Kafka/jar-with-dependencies.jar --server X.X.X.X:9092 --topic logs
I have of course checked that my cluster has access to the stream, and it does. I can even run the job as a plain Java program on each machine of the cluster and it works.
Any insights to explain this?
Thank you
EDIT:
object KafkaConsuming {
  def main(args: Array[String]) {
    // Flink ParameterTool: allows passing arguments to this job
    val params: ParameterTool = ParameterTool.fromArgs(args)
    // set up the streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // make parameters available globally
    env.getConfig.setGlobalJobParameters(params)

    val properties = new Properties()
    properties.setProperty("bootstrap.servers", params.get("server"))
    // only required for Kafka 0.8
    //properties.setProperty("zookeeper.connect", "X.X.X.X:2181")

    val stream: DataStream[String] = env.addSource(
      new FlinkKafkaConsumer010[String](params.get("topic"), new SimpleStringSchema(), properties))
    //stream.print()

    val logs: DataStream[MinifiedLog] = stream.map(x => LogParser2.parse(x))
    val sessions = logs.map { x => (x.timestamp, x.sent, 1L) }

    val sessionCnt: DataStream[(Long, Long, Long)] = sessions
      // key the stream by the third tuple field
      .keyBy(2)
      // tumbling processing-time window of 10 seconds
      .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
      .reduce((x: (Long, Long, Long), y: (Long, Long, Long)) => (x._1, x._2 + y._2, x._3 + y._3))
      .map { z => (z._1, z._2 / 10, z._3 / 10) }

    sessionCnt.print()

    env.execute("Kafka consuming")
  }
}

The output of sessionCnt.print() is written to standard output on the TaskManagers.
On YARN, the easiest way to access that output is to retrieve all logs via YARN log aggregation (yarn logs -applicationId <appid>).
I'm not sure whether access to the stdout files through the Flink UI works correctly.
Another way of checking whether data arrives is the Flink Web UI.
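If relying on TaskManager stdout is inconvenient, a sketch of an alternative (not part of the original answer; the broker address and output topic name are placeholders) is to write the aggregated result back to Kafka instead of printing it:

// Hypothetical sink: emit the windowed tuples to a Kafka topic instead of stdout.
sessionCnt
  .map(_.toString)
  .addSink(new FlinkKafkaProducer010[String]("X.X.X.X:9092", "session-counts", new SimpleStringSchema()))

You can then read that topic with a console consumer to confirm whether records flow when the job runs on YARN.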

Related

Kafka + Spark ERROR MicroBatchExecution: Query

I'm trying to run the program specified in this IBM Developer code pattern; for now, I am only doing the local deployment: https://github.com/IBM/kafka-streaming-click-analysis?cm_sp=Developer-_-determine-trending-topics-with-clickstream-analysis-_-Get-the-Code
Since it's a little old, my versions of Kafka and Spark aren't exactly what the code pattern calls for. The versions I am using are:
Spark: 2.4.6
Kafka: 0.10.2.1
At the last step, I get the following error:
ERROR MicroBatchExecution: Query [id = f4dfe12f-1c99-427e-9f75-91a77f6e51a7,
runId = c9744709-2484-4ea1-9bab-28e7d0f6b511] terminated with error
org.apache.spark.sql.catalyst.errors.package$TreeNodeException
Along with the execution tree
The steps I am following are as follows:
1. Start Zookeeper
2. Start Kafka
3. cd kafka_2.10-0.10.2.1
4. tail -200 data/2017_01_en_clickstream.tsv | bin/kafka-console-producer.sh --broker-list ip:port --topic clicks --producer.config=config/producer.properties
I have downloaded the dataset and stored it in a directory called data inside the kafka_2.10-0.10.2.1 directory.
cd $SPARK_DIR
bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6
Since SPARK_DIR wasn't set during the Spark installation, I am navigating to the directory containing Spark to run this command.
scala> import scala.util.Try
scala> case class Click(prev: String, curr: String, link: String, n: Long)
scala> def parseVal(x: Array[Byte]): Option[Click] = {
         val split: Array[String] = new Predef.String(x).split("\\t")
         if (split.length == 4) {
           Try(Click(split(0), split(1), split(2), split(3).toLong)).toOption
         } else
           None
       }
scala> val records = spark.readStream.format("kafka")
         .option("subscribe", "clicks")
         .option("failOnDataLoss", "false")
         .option("kafka.bootstrap.servers", "localhost:9092").load()
scala> val messages = records.select("value").as[Array[Byte]]
         .flatMap(x => parseVal(x))
         .groupBy("curr")
         .agg(Map("n" -> "sum"))
         .sort($"sum(n)".desc)
       val query = messages.writeStream
         .outputMode("complete")
         .option("truncate", "false")
         .format("console")
         .start()
The last statement, query=... is giving the error mentioned above. Any help would be greatly appreciated. Thanks in advance!
A required library for interacting with Apache Kafka is likely missing or incompatible, so you may need to install the missing dependency or update to a version that matches your Spark and Scala versions.
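As a hedged illustration (the coordinates below are assumptions to be checked against your setup): the artifact passed to --packages has to match both the Spark version and the Scala binary version of the shell. In an sbt build the same dependency would look like this:

// Hypothetical sbt dependency; %% picks the Scala suffix (_2.11 / _2.12) automatically.
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.6"

When passing --packages to spark-shell directly, the suffix is spelled out explicitly, e.g. org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6 as in the question.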

No File writen down to HDFS in flink

I'm trying to consume Kafka with Flink and save the result to HDFS, but no file is ever produced and no error message is raised.
By the way, saving to a local file works fine, but when I change the path to HDFS, I get nothing.
object kafka2Hdfs {
  private val ZOOKEEPER_HOST = "ip1:2181,ip2:2181,ip3:2181"
  private val KAFKA_BROKER = "ip1:9092,ip2:9092,ip3:9092"
  private val TRANSACTION_GROUP = "transaction"
  val topic = "tgt3"

  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.enableCheckpointing(1000L)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)

    // configure Kafka consumer
    val kafkaProps = new Properties()
    .... // topic infos
    kafkaProps.setProperty("fs.default-scheme", "hdfs://ip:8020")
    val consumer = new FlinkKafkaConsumer010[String](topic, new SimpleStringSchema(), kafkaProps)
    val source = env.addSource(consumer)

    val path = new Path("/user/jay/data")
    // sink
    val rollingPolicy: RollingPolicy[String, String] = DefaultRollingPolicy.create()
      .withRolloverInterval(15000)
      .build()
    val sink: StreamingFileSink[String] = StreamingFileSink
      .forRowFormat(path, new SimpleStringEncoder[String]("UTF-8"))
      .withRollingPolicy(rollingPolicy)
      .build()

    source.addSink(sink)
    env.execute("test")
  }
}
I'm very confused.
Off the top of my head, there are two things to look into:
Is HDFS properly configured on the Flink side (namenode address and scheme), so that Flink actually writes to HDFS instead of the local disk?
What do the NodeManager and TaskManager logs say? The sink could be failing due to a permissions issue on HDFS.
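One concrete thing to try (a sketch, not part of the original answer; the namenode host and port are placeholders): make the scheme explicit in the sink path rather than setting fs.default-scheme inside the Kafka consumer properties, where it has no effect.

// Hypothetical fully qualified HDFS path for the StreamingFileSink.
val path = new Path("hdfs://namenode-host:8020/user/jay/data")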

Apache Flink: Kafka Producer in IDE execution not working as expected

I have a sample streaming WordCount example written in Flink (Scala). In it, I want to write the result to Kafka using the Flink Kafka producer, but it is not working as expected.
My code is as follows:
object WordCount {
  def main(args: Array[String]) {
    // set up the execution environment
    val env = StreamExecutionEnvironment
      .getExecutionEnvironment
      .setStateBackend(new RocksDBStateBackend("file:///path/to/checkpoint", true))

    // start a checkpoint every 1000 ms
    env.enableCheckpointing(1000)
    // set mode to exactly-once (this is the default)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    // make sure 500 ms of progress happen between checkpoints
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
    // checkpoints have to complete within one minute, or are discarded
    env.getCheckpointConfig.setCheckpointTimeout(60000)
    // do not fail the tasks if an error happens during checkpointing; the checkpoint is just declined
    env.getCheckpointConfig.setFailOnCheckpointingErrors(false)
    // allow only one checkpoint to be in progress at the same time
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)

    // prepare Kafka consumer properties
    val kafkaConsumerProperties = new Properties
    kafkaConsumerProperties.setProperty("zookeeper.connect", "localhost:2181")
    kafkaConsumerProperties.setProperty("group.id", "flink")
    kafkaConsumerProperties.setProperty("bootstrap.servers", "localhost:9092")

    // set up Kafka consumer
    val kafkaConsumer = new FlinkKafkaConsumer[String]("input", new SimpleStringSchema, kafkaConsumerProperties)

    println("Executing WordCount example.")
    // get text from Kafka
    val text = env.addSource(kafkaConsumer)

    val counts: DataStream[(String, Int)] = text
      // split up the lines into pairs (2-tuples) containing: (word, 1)
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      // group by the tuple field "0" and sum up tuple field "1"
      .keyBy(0)
      .mapWithState((in: (String, Int), count: Option[Int]) =>
        count match {
          case Some(c) => ((in._1, c), Some(c + in._2))
          case None => ((in._1, 1), Some(in._2 + 1))
        })

    // emit result
    println("Printing result to stdout.")
    counts.map(_.toString()).addSink(new FlinkKafkaProducer[String]("output", new SimpleStringSchema,
      kafkaConsumerProperties))

    // execute program
    env.execute("Streaming WordCount")
  }
}
The data I sent to Kafka input topic is:
hi
hello
I don't get any output in Kafka topic output. Since I am a newbie to Apache Flink, I don't know how to achieve the expected result. Can anyone help me achieve the correct behavior?
I ran your code in my local environment, and everything is OK. I think you can try the command below:
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic output --from-beginning
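A small, hedged refinement (not part of the original answer): rather than reusing the consumer properties for the producer, give the producer its own Properties containing only bootstrap.servers; the host and topic below are the ones from the question.

// Hypothetical producer configuration for the sink writing to the "output" topic.
val kafkaProducerProperties = new Properties
kafkaProducerProperties.setProperty("bootstrap.servers", "localhost:9092")
counts.map(_.toString()).addSink(
  new FlinkKafkaProducer[String]("output", new SimpleStringSchema, kafkaProducerProperties))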

Stopping Spark Streaming: exception in the cleaner thread but it will continue to run

I'm working on a Spark Streaming application; I'm just trying to get a simple example of a Kafka direct stream working:
package com.username

import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MyApp extends App {
  val topic = args(0)   // 1 topic
  val brokers = args(1) // localhost:9092
  val spark = SparkSession.builder().master("local[2]").getOrCreate()
  val sc = spark.sparkContext
  val ssc = new StreamingContext(sc, Seconds(1))

  val topicSet = topic.split(",").toSet
  val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
  val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)

  // Just print out the data within the topic
  val parsers = directKafkaStream.map(v => v)
  parsers.print()

  ssc.start()
  val endTime = System.currentTimeMillis() + (5 * 1000) // 5 second loop
  while (System.currentTimeMillis() < endTime) {
    // write something to the topic
    Thread.sleep(1000) // 1 second pause between iterations
  }
  ssc.stop()
}
This mostly works: whatever I write into the Kafka topic gets included in the streaming batch and printed out. My only concern is what happens at ssc.stop():
dd/mm/yy hh:mm:ss WARN FileSystem: exception in the cleaner thread but it will continue to run
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:143)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:164)
at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:2989)
at java.lang.Thread.run(Thread.java:748)
This exception doesn't cause my app to fail or exit, though. I know I could wrap ssc.stop() in a try/catch block to suppress it, but looking at the API docs leads me to believe this is not its intended behavior. I've been looking around online for a solution, but nothing involving Spark mentions this exception. Is there any way for me to properly fix this?
I encountered the same problem when starting the process directly with sbt run. But if I package the project and start it with YOUR_SPARK_PATH/bin/spark-submit --class [classname] --master local[4] [package_path], it works correctly. Hope this helps.
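If the warning at shutdown is still a concern, one further option to try (a sketch, not mentioned in the answer above) is to stop the streaming context gracefully so in-flight batches finish before the underlying context is torn down:

// Sketch: both flags are standard parameters of StreamingContext.stop.
ssc.stop(stopSparkContext = true, stopGracefully = true)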

Why do I see periodic pulses in the processing time chart when using mapWithState/checkpoint in Spark Streaming?

I wrote a stateful word-count Spark Streaming application which continuously receives data from Kafka. My code includes a mapWithState function and runs correctly. When I check the Streaming Statistics in the Spark UI, I see some periodic pulses in the Processing Time chart. I think this may be caused by the use of checkpointing. I hope someone can explain this, many thanks!
Here is the completed batches table:
I find that some batches costing about 1 second occur periodically. When I step into a 1-second batch and a sub-second batch, I see that the 1-second batch has one more job than the other.
Comparing the two kinds of batches:
It seems to be caused by the checkpoint, but I'm not sure.
Can anyone explain it in detail for me? Thanks!
Here is my code:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf

object StateApp {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println(
        s"""
           |Usage: KafkaSpark_008_test <brokers> <topics> <batchDuration> <checkpointPath>
           |  <brokers> is a list of one or more Kafka brokers
           |  <topics> is a list of one or more Kafka topics to consume from
           |  <batchDuration> is the batch duration of Spark Streaming
           |  <checkpointPath> is the checkpoint directory
        """.stripMargin)
      System.exit(1)
    }
    val Array(brokers, topics, bd, cpp) = args

    // Create context with the given batch interval
    val sparkConf = new SparkConf().setAppName("KafkaSpark_080_test")
    val ssc = new StreamingContext(sparkConf, Seconds(bd.toInt))
    ssc.checkpoint(cpp)

    // Create direct Kafka stream with brokers and topics
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    // test the messages' receiving speed
    messages.foreachRDD(rdd =>
      println(System.currentTimeMillis() + "\t" + System.currentTimeMillis() / 1000 + "\t" + (rdd.count() / bd.toInt).toString))

    // the messages' value type is "timestamp port word", e.g. "1479700000000 10105 ABC"
    // wordDstream: (word, 1), e.g. (ABC, 1)
    val wordDstream = messages.map(_._2).map(msg => (msg.split(" ")(2), 1))

    // this is from the Spark source code example in streaming/StatefulNetworkWordCount.scala
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      val output = (word, sum)
      state.update(sum)
      output
    }
    val stateDstream = wordDstream.mapWithState(
      StateSpec.function(mappingFunc)).print()

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
These little spikes you see are caused by checkpointing your data to persistent storage. In order for Spark to do stateful transformations, it needs to reliably store your data at every defined interval so it can recover in case of failure.
Notice the spikes are consistent in time: they occur every 50 seconds. The calculation is (batch time * default multiplier), where the current default multiplier is 10. In your case this is 5 * 10 = 50, which explains why the spike is visible every 50 seconds.
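If the 50-second spikes are a problem, the checkpoint interval of the stateful stream can also be set explicitly instead of relying on the 10x-batch-duration default (a sketch; the 50-second value is just an example):

// Hypothetical explicit checkpoint interval on the mapWithState stream.
wordDstream.mapWithState(StateSpec.function(mappingFunc))
  .checkpoint(Seconds(50))
  .print()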