Spark Streaming - Join on multiple kafka stream operation is slow - scala

I have 3 Kafka streams with 600k+ records each, and Spark Streaming takes more than 10 minutes to process a simple join between the streams.
Spark Cluster config:
This is how I'm reading the Kafka streams into temp views in Spark (Scala):
spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "KAFKASERVER")
.option("subscribe", TOPIC1)
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest").load()
.selectExpr("CAST(value AS STRING) as json")
.select( from_json($"json", schema=SCHEMA1).as("data"))
.select($"COL1", $"COL2")
.createOrReplaceTempView("TABLE1")
I join the 3 tables using Spark SQL:
select COL1, COL2 from TABLE1
JOIN TABLE2 ON TABLE1.PK = TABLE2.PK
JOIN TABLE3 ON TABLE2.PK = TABLE3.PK
Execution of Job:
Am I missing some Spark configuration that I should look into?

I ran into the same problem. I found that a stream-to-stream join needs more memory than I had imagined, and the problem disappeared when I increased the cores per executor.
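For illustration, here is a minimal sketch of raising executor resources through standard Spark configuration properties; the values below are placeholders to tune for your own cluster, not recommendations.
import org.apache.spark.sql.SparkSession

// Sketch only: the config keys are standard Spark properties,
// but the values are placeholders to tune for your cluster.
val spark = SparkSession.builder()
  .appName("kafka-stream-join")
  .config("spark.executor.cores", "4")           // more cores per executor -> more concurrent tasks
  .config("spark.executor.memory", "8g")         // stream-to-stream joins are memory hungry
  .config("spark.sql.shuffle.partitions", "200") // parallelism of the join shuffle
  .getOrCreate()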

Unfortunately there was no test data nor expected result data to play with, so I cannot give an exact answer.
@Asteroid's comment is valid, as we can see that the number of tasks for each stage is 1. Normally a Kafka stream uses a receiver to consume the topic, and each receiver only creates one task. One approach is to use multiple receivers, split partitions, or increase your resources (number of cores) to increase parallelism, as in the sketch below.
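This is only a hedged illustration of the parallelism idea for the batch-style read in the question; the partition count (48) and the join-key column name (PK) are placeholders for your own values.
import spark.implicits._

// Sketch only: 48 and "PK" are placeholders for your partition count and join key.
// Repartitioning by the join key spreads the data over more tasks before the join.
val t1 = spark.table("TABLE1").repartition(48, $"PK")
t1.createOrReplaceTempView("TABLE1")

// Alternatively, raise the shuffle parallelism used by the SQL join itself.
spark.conf.set("spark.sql.shuffle.partitions", "200")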
If this still doesn't work, another way is to use the Kafka API's createDirectStream. According to the documentation, https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/kafka/KafkaUtils.html, it creates an input stream that pulls messages directly from the Kafka brokers without using any receiver.
I drafted a preliminary sample for creating a direct stream below. You may want to study it and adapt it to your own preferences.
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "KAFKASERVER",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "earliest" // DStream consumers take consumer configs, not startingOffsets/endingOffsets
)
val topics = Array(TOPIC1)
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

val schema = StructType(Seq(StructField("data", StringType, nullable = true)))
var df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

val dstream = stream.map(_.value())
dstream.foreachRDD { (rdd: RDD[String], time: Time) =>
  val tdf = spark.read.schema(schema).json(rdd)
  df = df.union(tdf)
  df.createOrReplaceTempView("TABLE1")
}
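As with any DStream job, nothing runs until the streaming context is started; a minimal sketch of the usual closing lines:
// Start the streaming job and block until it is stopped.
streamingContext.start()
streamingContext.awaitTermination()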
Some related materials:
https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2/ (Scroll down to Kafka Consumer Code portion. The other section is irrelevant)
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html (Spark Doc for create direct stream)
Good luck!

Related

Spark Streaming 1.6 + Kafka: Too many batches in "queued" status

I'm using Spark Streaming to consume messages from a Kafka topic, which has 10 partitions. I'm using the direct approach to consume from Kafka, and the code can be found below:
def createStreamingContext(conf: Conf): StreamingContext = {
val dateFormat = conf.dateFormat.apply
val hiveTable = conf.tableName.apply
val sparkConf = new SparkConf()
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.set("spark.driver.allowMultipleContexts", "true")
val sc = SparkContextBuilder.build(Some(sparkConf))
val ssc = new StreamingContext(sc, Seconds(conf.batchInterval.apply))
val kafkaParams = Map[String, String](
"bootstrap.servers" -> conf.kafkaBrokers.apply,
"key.deserializer" -> classOf[StringDeserializer].getName,
"value.deserializer" -> classOf[StringDeserializer].getName,
"auto.offset.reset" -> "smallest",
"enable.auto.commit" -> "false"
)
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc,
kafkaParams,
conf.topics.apply().split(",").toSet[String]
)
val windowedKafkaStream = directKafkaStream.window(Seconds(conf.windowDuration.apply))
ssc.checkpoint(conf.sparkCheckpointDir.apply)
val eirRDD: DStream[Row] = windowedKafkaStream.map { kv =>
val fields: Array[String] = kv._2.split(",")
createDomainObject(fields, dateFormat)
}
eirRDD.foreachRDD { rdd =>
val schema = SchemaBuilder.build()
val sqlContext: HiveContext = HiveSQLContext.getInstance(Some(rdd.context))
val eirDF: DataFrame = sqlContext.createDataFrame(rdd, schema)
eirDF
.select(schema.map(c => col(c.name)): _*)
.write
.mode(SaveMode.Append)
.partitionBy("year", "month", "day")
.insertInto(hiveTable)
}
ssc
}
As can be seen from the code, I used window to achieve this (and please correct me if I'm wrong): since there's an action that inserts into a Hive table, I want to avoid writing to HDFS too often, so what I want is to hold enough data in memory and only then write to the filesystem. I thought that using window would be the right way to achieve it.
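For reference, a minimal sketch of the windowing pattern being described; the second argument (the slide interval) is what controls how often the output action fires, and using the same duration for both gives non-overlapping windows. The values reuse conf.windowDuration only as a placeholder.
import org.apache.spark.streaming.Seconds

// Sketch only: window(windowLength, slideInterval) makes foreachRDD fire once per
// slide interval over the last windowLength of batches, so a larger slide interval
// means fewer (but bigger) writes to HDFS.
val windowedKafkaStream = directKafkaStream.window(
  Seconds(conf.windowDuration.apply),  // how much data each write covers
  Seconds(conf.windowDuration.apply))  // how often the write action fires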
Now, in the image below, you can see that there are many batches being queued, and that the batch being processed takes forever to complete.
I'm also providing the details of the single batch being processed:
Why are there so many tasks for the insert action, when there aren't many events in the batch? Sometimes having 0 events also generates thousands of tasks that take forever to complete.
Is the way I process microbatches with Spark wrong?
Thanks for your help!
Some extra details:
YARN containers have a max of 2 GB.
In this YARN queue, the maximum number of containers is 10.
When I look at the details of the queue where this Spark application is being executed, the number of containers is extremely large: around 15k pending containers.
Well, I finally figured it out. Apparently Spark Streaming does not get along with empty events, so inside the foreachRDD portion of the code, I added the following:
eirRDD.foreachRDD { rdd =>
  if (rdd.take(1).length != 0) {
    //do action
  }
}
That way we skip empty micro-batches (the isEmpty() method did not work for me).
Hope this helps somebody else! ;)

How to consume from a different Kafka topic in each batch of a Spark Streaming job?

I am pretty sure that there is no simple way of doing this, but here is my use case:
I have a Spark Streaming job (version 2.1.0) with a 5 second duration for each micro batch.
My goal is to consume data from 1 different topic at every micro-batch interval, out of a total of 250 Kafka topics. You can take the code below as a simple example:
val groupId:String = "first_group"
val kafka_servers:String = "datanode1:9092,datanode2:9092,datanode3:9092"
val ss:SparkSession = SparkSession.builder().config("spark.streaming.unpersist","true").appName("ConsumerStream_test").getOrCreate()
val ssc:StreamingContext= new StreamingContext(ss.sparkContext,Duration(5000))
val kafka_parameters:Map[String,Object]=Map(
"bootstrap.servers" -> kafka_servers,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[ByteArrayDeserializer],
"heartbeat.interval.ms" -> (1000:Integer),
"max.poll.interval.ms" -> (100:Integer),
"enable.auto.commit" -> (false: java.lang.Boolean),
"autoOffsetReset" -> OffsetResetStrategy.EARLIEST,
//"connections.max.idle.ms" -> (5000:Integer),
"group.id" -> groupId
)
val r = scala.util.Random
val kafka_list_one_topic=List("topic_"+ r.nextInt(250))
val consumer:DStream[ConsumerRecord[String,Array[Byte]]] = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferBrokers, ConsumerStrategies.
Subscribe[String, Array[Byte]](kafka_list_one_topic , kafka_parameters))
consumer.foreachRDD( eachRDD => {
// DOING SOMETHING WITH THE DATA...
})
ssc.start()
ssc.awaitTermination()
But the issue with this approach is that Spark only runs the initial code (everything before the foreachRDD command) once, in order to create the Kafka consumer DStream; in the following micro-batches, it only runs the foreachRDD statement.
As an example, let's say that r.nextInt(250) returned 40. The Spark Streaming job will connect to topic_40 and process its data. But in the next micro batches, it will still connect to topic_40, and ignore all the commands before the foreachRDD statement.
I guess this is expected, since the code before the foreachRDD statement runs only on the Spark driver.
My question is, is there a way that I can do this without having to relaunch a Spark application every 5 seconds?
Thank you.
My approach would be really simple: if you want it to be really random and don't care about any other consequences, make kafka_list_one_topic a mutable variable and change it inside the streaming code.
val r = scala.util.Random
var kafka_list_one_topic=List("topic_"+ r.nextInt(250))
val consumer: DStream[ConsumerRecord[String, Array[Byte]]] =
  KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferBrokers,
    ConsumerStrategies.Subscribe[String, Array[Byte]](kafka_list_one_topic, kafka_parameters))
consumer.foreachRDD( eachRDD => {
// DOING SOMETHING WITH THE DATA...
kafka_list_one_topic=List("topic_"+ r.nextInt(250))
})
ssc.start()
ssc.awaitTermination()

Query Cassandra table for every Kafka Message

I am trying to query a Cassandra table for every single Kafka message.
Below is the code that I have been working on:
def main(args: Array[String]) {
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Spark SQL basic example")
.config("spark.cassandra.connection.host", "localhost")
.config("spark.cassandra.connection.port", "9042")
.getOrCreate()
val topicsSet = List("Test").toSet
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "12345",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
val lines = messages.map(_.value)
val lines_myobjects = lines.map(line =>
new Gson().fromJson(line, classOf[myClass]) // The myClass is a simple case class which extends serializable
//This changes every single message into an object
)
Now things get complicated: I cannot work out how to query the Cassandra table using values from each Kafka message. Every deserialized message object has a getter (e.g. getuserId) that returns the value I need.
I have tried multiple ways to get around this. For instance:
val transformed_data = lines_myobjects.map(myobject => {
  val forest = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("table" -> "mytable", "keyspace" -> "mydb"))
    .load()
    .filter("userid='" + myobject.getuserId + "'")
})
I have also tried ssc.cassandraTable which gave me no luck.
The main goal is to get all the rows from the database where the userid matches the userid that comes from the Kafka message.
One thing I would like to mention: even though loading or querying the Cassandra database every time is not efficient, the Cassandra data changes every time.
You can't do spark.read or ssc.cassandraTable inside .map(...), because that would try to create a new DataFrame/RDD for every single message; it doesn't work like that.
Please consider the following options:
1 - If you can fetch the required data with one or two CQL queries, try using CassandraConnector inside .mapPartitions(...). Something like this:
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._

val connector = ... // instantiate CassandraConnector once here

val transformed_data = lines_myobjects.mapPartitions(it => {
  connector.withSessionDo { session =>
    it.map(myobject => session.execute("CQL QUERY TO GET YOUR DATA HERE", myobject.getuserId))
  }
})
2 - Otherwise (if you select by primary/partition key) consider .joinWithCassandraTable. Something like this:
import com.datastax.spark.connector._
val mytableRDD = sc.cassandraTable("mydb", "mytable")
val transformed_data = lines_myobjects
.map(myobject => {
Tuple1(myobject.getuserId) // you need to wrap the ids in a tuple to join with Cassandra
})
.joinWithCassandraTable("mydb", "mytable")
// process results here
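A note on the design choice, to the best of my understanding: joinWithCassandraTable only fetches the rows whose keys appear in the RDD instead of scanning the whole table, which fits the case where the Cassandra data changes between batches.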
I would approach this a different way.
Route the data that is flowing into Cassandra through Kafka instead (and send it from Kafka to Cassandra with the Kafka Connect sink).
With your data in Kafka, you can then join between your streams of data, whether in Spark, or with Kafka's Streams API, or KSQL.
Both Kafka Streams and KSQL support the kind of stream-table join you're doing here. You can see it in action with KSQL here and here.
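If you stay within Spark, here is a hedged sketch of the same idea with Structured Streaming's stream-stream join (available since Spark 2.3); the topic names, the userid key column, the broker address and the watermark durations are all placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-join-sketch").getOrCreate()

// Sketch only: "events" and "reference" are placeholder topics keyed by userid.
def readTopic(topic: String, alias: String) = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", topic)
  .load()
  .selectExpr(
    "CAST(key AS STRING) AS userid",
    s"CAST(value AS STRING) AS ${alias}_json",
    s"timestamp AS ${alias}_ts")
  .withWatermark(s"${alias}_ts", "10 minutes")

val events    = readTopic("events", "event")
val reference = readTopic("reference", "ref")

// Inner stream-stream join on the shared key column.
val joined = events.join(reference, Seq("userid"))

joined.writeStream.format("console").start().awaitTermination()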

Apache Kafka: How to receive latest message from Kafka?

I am consuming and processing messages in a Kafka consumer application using Spark in Scala. Sometimes it takes a little more time than usual to process messages from the Kafka queue. At that point I need to consume only the latest message, ignoring the earlier ones that were published by the producer but have not yet been consumed.
Here is my consumer code:
object KafkaSparkConsumer extends MessageProcessor {
def main(args: scala.Array[String]): Unit = {
val properties = readProperties()
val streamConf = new SparkConf().setMaster("local[*]").setAppName("Kafka-Stream")
val ssc = new StreamingContext(streamConf, Seconds(1))
val group_id = Random.alphanumeric.take(4).mkString("dfhSfv")
val kafkaParams = Map("metadata.broker.list" -> properties.getProperty("broker_connection_str"),
"zookeeper.connect" -> properties.getProperty("zookeeper_connection_str"),
"group.id" -> group_id,
"auto.offset.reset" -> properties.getProperty("offset_reset"),
"zookeeper.session.timeout" -> properties.getProperty("zookeeper_timeout"))
val msgStream = KafkaUtils.createStream[scala.Array[Byte], String, DefaultDecoder, StringDecoder](
ssc,
kafkaParams,
Map("moved_object" -> 1),
StorageLevel.MEMORY_ONLY_SER
).map(_._2)
msgStream.foreachRDD { x =>
x.foreach {
msg => println("Message: "+msg)
processMessage(msg)
}
}
ssc.start()
ssc.awaitTermination()
}
}
Is there any way to make sure the consumer always gets the most recent message in the consumer application? Or do I need to set any property in Kafka configuration to achieve the same?
Any help on this would be greatly appreciated. Thank you
The Kafka consumer API includes the method
void seekToEnd(Collection<TopicPartition> partitions)
So you can get the assigned partitions from the consumer and seek to the end for all of them. There is a similar seekToBeginning method.
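A hedged sketch of that approach with the plain Kafka consumer API; the connection settings are placeholders and the topic name is the one from the question's code. A first poll is needed so that partitions actually get assigned.
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

// Sketch only: placeholder connection settings.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "latest-only-consumer")
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Arrays.asList("moved_object"))
consumer.poll(0)                          // triggers partition assignment
consumer.seekToEnd(consumer.assignment()) // skip everything already published
// subsequent poll() calls only return messages produced after this point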
You can leverage two KafkaConsumer APIs to get the very last message from a partition (assuming log compaction won't be an issue):
public Map<TopicPartition, Long> endOffsets(Collection<TopicPartition> partitions): This gives you the end offset of the given partitions. Note that the end offset is the offset of the next message to be delivered.
public void seek(TopicPartition partition, long offset): Run this for each partition and provide its end offset from the above call minus 1 (assuming it's greater than 0).
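A hedged sketch of reading just the last message per partition with those two calls; it assumes a consumer that already has partitions assigned (as in the sketch above) and that each partition contains at least one message.
import scala.collection.JavaConverters._

// Sketch only: assumes `consumer` already has an assignment (see previous sketch).
val partitions = consumer.assignment()
val endOffsets = consumer.endOffsets(partitions).asScala

// The end offset is the offset of the *next* message, so seek to endOffset - 1
// to re-read the most recent message in each partition.
endOffsets.foreach { case (tp, end) =>
  if (end > 0) consumer.seek(tp, end - 1)
}
val latest = consumer.poll(1000).asScala // last message of each partition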
You can always generate a new (random) group id when connecting to Kafka - that way you will start consuming new messages when you connect.
Yes, you can set startingOffsets to latest to consume the latest messages.
val spark = SparkSession
.builder
.appName("kafka-reading")
.getOrCreate()
import spark.implicits._
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "latest")
.option("subscribe", topicName)
.load()
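To actually see those messages, the DataFrame returned by load() still needs a streaming sink; a minimal hedged continuation that decodes the values and prints each micro-batch:
// Sketch only: cast the Kafka key/value bytes to strings and stream them to the console.
val query = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()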

Extract the time stamp from kafka messages in spark streaming?

Trying to read from a Kafka source. I want to extract the timestamp from the messages received, to do structured Spark streaming.
kafka(version 0.10.0.0)
spark streaming(version 2.0.1)
spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "your.server.com:9092")
.option("subscribe", "your-topic")
.load()
.select($"timestamp", $"value")
The field "timestamp" is what you are looking for. Its type is java.sql.Timestamp. Make sure you are connecting to a Kafka 0.10 server; there is no timestamp in earlier versions.
Full list of fields described here - http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries
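Since the question mentions Structured Streaming, here is a hedged sketch of the same select on a streaming read; the broker and topic are the placeholders from the snippet above, and the timestamp column is part of the Kafka source schema documented at that link.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-timestamp").getOrCreate()
import spark.implicits._

// Sketch only: streaming variant; each row carries Kafka's message timestamp.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "your.server.com:9092")
  .option("subscribe", "your-topic")
  .load()
  .select($"timestamp", $"value")

stream.writeStream.format("console").start().awaitTermination()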
I'd suggest a couple of things:
Suppose you create a stream via the latest Kafka streaming API (Kafka 0.10).
E.g. you use the dependency: "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.1"
Then you create a stream, according to the docs above:
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092,broker2:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[ByteArrayDeserializer],
  "group.id" -> "spark-streaming-test",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val sparkConf = new SparkConf()
// suppose you have a 60 second window
val ssc = new StreamingContext(sparkConf, Seconds(60))
ssc.checkpoint("checkpoint")

val topics = Array("your-topic") // placeholder: the topic(s) you subscribe to
val stream = KafkaUtils.createDirectStream(ssc, PreferConsistent,
  Subscribe[String, Array[Byte]](topics, kafkaParams))
Your stream will be a DStream of ConsumerRecord[String, Array[Byte]], and you can get the timestamp and the key/value as simply as:
stream.map { record => (record.timestamp(), record.key(), record.value()) }
Hope that helps.