Inconsistent fetch by spark kafka consumer - scala

I have written code to fetch records from Kafka into Spark, and I have come across some strange behaviour: it is consuming records inconsistently.
val conf = new SparkConf()
.setAppName("Test Data")
.set("spark.cassandra.connection.host", "192.168.0.40")
.set("spark.cassandra.connection.keep_alive_ms", "20000")
.set("spark.executor.memory", "1g")
.set("spark.driver.memory", "2g")
.set("spark.submit.deployMode", "cluster")
.set("spark.executor.instances", "4")
.set("spark.executor.cores", "3")
.set("spark.cores.max", "12")
.set("spark.driver.cores", "4")
.set("spark.ui.port", "4040")
.set("spark.streaming.backpressure.enabled", "true")
.set("spark.streaming.kafka.maxRatePerPartition", "30")
.set("spark.local.dir", "//tmp//")
.set("spark.sql.warehouse.dir", "/tmp/hive/")
.set("hive.exec.scratchdir", "/tmp/hive2")
val spark = SparkSession
.builder
.appName("Test Data")
.config(conf)
.getOrCreate()
import spark.implicits._
val sc = SparkContext.getOrCreate(conf)
val ssc = new StreamingContext(sc, Seconds(10))
val topics = Map("topictest" -> 1)
val kafkaParams = Map[String, String](
"zookeeper.connect" -> "192.168.0.40:2181",
"group.id" -> "=groups",
"auto.offset.reset" -> "smallest")
val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)
kafkaStream.foreachRDD(rdd =>
{
if (!rdd.partitions.isEmpty) {
try {
println("Count of rows " + rdd.count())
} catch {
case e: Exception => e.printStackTrace
}
} else {
println("blank rdd")
}
})
Initially I produced 10 million records into Kafka. The producer is now stopped, and I then started the Spark consumer application. Checking the Spark UI, I initially received 700,000-900,000 records per batch (every 10 seconds) per stream, but afterwards it dropped to 4-6K records per batch. I want to understand why the fetch count fell so badly even though the data is still present in Kafka; instead of 4K per batch, I would rather consume much larger batches directly. What can be done, and how?
Thanks,
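For reference, the batch size here is bounded by Spark Streaming's rate-control settings; below is a minimal sketch of the knobs involved (values are placeholders). Note that spark.streaming.kafka.maxRatePerPartition only applies to the direct stream API, while the receiver-based createStream above is governed by backpressure and spark.streaming.receiver.maxRate.
// Sketch of the rate-control settings that bound each 10-second batch. With
// backpressure enabled, Spark's rate estimator can shrink the ingest rate after
// the first batches; these settings cap (or raise the cap on) that rate.
val tunedConf = new SparkConf()
.set("spark.streaming.backpressure.enabled", "true")
.set("spark.streaming.backpressure.initialRate", "100000") // cap for the first batches (placeholder value)
.set("spark.streaming.receiver.maxRate", "100000") // per-receiver cap, applies to createStream (placeholder value)
.set("spark.streaming.kafka.maxRatePerPartition", "10000") // applies to the direct API only (placeholder value)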

Related

java.io.IOException: Failed to write statements to batch_layer.test. The latest exception was Key may not be empty

I am trying to count the number of words in a text and save the result to a Cassandra database.
The producer reads data from a file and sends it to Kafka. The consumer uses Spark Streaming to read and process the data, and then writes the result of the calculation to the table.
My producer looks like this:
object ProducerPlayground extends App {
val topicName = "test"
private def createProducer: Properties = {
val producerProperties = new Properties()
producerProperties.setProperty(
ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
"localhost:9092"
)
producerProperties.setProperty(
ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
classOf[IntegerSerializer].getName
)
producerProperties.setProperty(
ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
classOf[StringSerializer].getName
)
producerProperties
}
val producer = new KafkaProducer[Int, String](createProducer)
val source = Source.fromFile("G:\\text.txt", "UTF-8")
val lines = source.getLines()
var key = 0
for (line <- lines) {
producer.send(new ProducerRecord[Int, String](topicName, key, line))
key += 1
}
source.close()
producer.flush()
}
Consumer looks like this:
object BatchLayer {
def main(args: Array[String]) {
val brokers = "localhost:9092"
val topics = "test"
val groupId = "groupId-1"
val sparkConf = new SparkConf()
.setAppName("BatchLayer")
.setMaster("local[*]")
val ssc = new StreamingContext(sparkConf, Seconds(3))
val sc = ssc.sparkContext
sc.setLogLevel("OFF")
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
ConsumerConfig.GROUP_ID_CONFIG -> groupId,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false"
)
val stream =
KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
)
val cass = CassandraConnector(sparkConf)
cass.withSessionDo { session =>
session.execute(
s"CREATE KEYSPACE IF NOT EXISTS batch_layer WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }"
)
session.execute(s"CREATE TABLE IF NOT EXISTS batch_layer.test (key VARCHAR PRIMARY KEY, value INT)")
session.execute(s"TRUNCATE batch_layer.test")
}
stream
.map(v => v.value())
.flatMap(x => x.split(" "))
.filter(x => !x.contains(Array('\n', '\t')))
.map(x => (x, 1))
.reduceByKey(_ + _)
.saveToCassandra("batch_layer", "test", SomeColumns("key", "value"))
ssc.start()
ssc.awaitTermination()
}
}
After starting the producer, the program stops working with this error. What did I do wrong?
It makes very little sense to use the legacy DStream-based streaming in 2021: it is cumbersome to use, and you also need to track Kafka offsets yourself, etc. It's better to use Structured Streaming instead: it tracks offsets for you through checkpoints, and you work with the high-level Dataset APIs.
In your case the code could look like the following (not tested, but adapted from this working example):
val streamingInputDF = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.load()
val wordsCountsDF = streamingInputDF.selectExpr("CAST(value AS STRING) as value")
.selectExpr("split(value, '\\w+', -1) as words")
.selectExpr("explode(words) as word")
.filter("word != ''")
.groupBy($"word")
.count()
.select($"word", $"count")
// create table ...
val query = wordsCountsDF.writeStream
.outputMode(OutputMode.Update)
.format("org.apache.spark.sql.cassandra")
.option("checkpointLocation", "path_to_checkpoint)
.option("keyspace", "test")
.option("table", "<table_name>")
.start()
query.awaitTermination()
P.S. In your example, the most probable error is that you're trying to call .saveToCassandra directly on a DStream; it doesn't work that way.
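If you do want to stay on DStreams, an illustrative (untested) sketch of writing each micro-batch through the RDD-level saveToCassandra from the spark-cassandra-connector could look like this; the empty-word filter is there because Cassandra rejects empty partition keys, which is what the "Key may not be empty" error points at:
import com.datastax.spark.connector._
val counts = stream
.map(_.value())
.flatMap(_.split("\\s+"))
.filter(_.nonEmpty) // empty strings would become empty Cassandra keys
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.foreachRDD { rdd =>
if (!rdd.isEmpty()) {
// RDD-level save from the spark-cassandra-connector
rdd.saveToCassandra("batch_layer", "test", SomeColumns("key", "value"))
}
}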

How to turn this simple Spark Streaming code into a Multi threaded one?

I am learning Kafka in Scala. The attached code is just a word count implementation using Kafka and Spark Streaming.
How do I get a separate consumer execution per partition while streaming? Please help!
Here is my code:
class ConsumerM(topics: String, bootstrap_server: String, group_name: String) {
Logger.getLogger("org").setLevel(Level.ERROR)
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
.setMaster("local[*]")
.set("spark.executor.memory","1g")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val topicsSet = topics.split(",")
val kafkaParams = Map[String, Object](
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> bootstrap_server,
ConsumerConfig.GROUP_ID_CONFIG -> group_name,
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
"auto.offset.reset" ->"earliest")
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
val lines = messages.map(_.value)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
Assuming your input topic has multiple partitions, additionally setting local[*] means Spark will run one worker thread per CPU core, so each Kafka partition can be consumed in parallel by its own task.
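To see this in action, here is a small illustrative snippet (not from the original answer): with the direct stream each Kafka partition maps 1:1 to an RDD partition, so per-partition work can be expressed with foreachPartition rather than hand-rolled threads.
messages.foreachRDD { rdd =>
// One RDD partition per Kafka partition with the direct stream
println(s"Kafka partitions in this batch: ${rdd.getNumPartitions}")
rdd.foreachPartition { records =>
// Each partition is processed by its own task (its own core under local[*])
records.foreach(r => println(s"${r.topic()}-${r.partition()}: ${r.value()}"))
}
}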

Kafka + spark streaming : Multi topic processing in single job

There are 40 topics in Kafka, and I have written Spark Streaming jobs that process 5 topics each.
The only objective of the Spark Streaming job is to read 5 Kafka topics and write each one into its corresponding HDFS path. Most of the time it works fine, but sometimes it writes topic 1's data to another HDFS path.
Below is the code that tries to achieve one Spark Streaming job processing 5 topics and writing each into its corresponding HDFS path, but it writes topic 1's data to HDFS path 5 instead of HDFS path 1.
Please provide your suggestions:
import java.text.SimpleDateFormat
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{ SparkConf, TaskContext }
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.streaming.kafka010._
object SparkKafkaMultiConsumer extends App {
override def main(args: Array[String]) {
if (args.length < 1) {
System.err.println(s"""
|Usage: KafkaStreams auto.offset.reset latest/earliest table1,table2,etc
|
""".stripMargin)
System.exit(1)
}
val date_today = new SimpleDateFormat("yyyy_MM_dd");
val date_today_hour = new SimpleDateFormat("yyyy_MM_dd_HH");
val PATH_SEPERATOR = "/";
import com.typesafe.config.ConfigFactory
val conf = ConfigFactory.load("env.conf")
val topicconf = ConfigFactory.load("topics.conf")
// Create context with custom second batch interval
val sparkConf = new SparkConf().setAppName("pt_streams")
val ssc = new StreamingContext(sparkConf, Seconds(conf.getString("kafka.duration").toLong))
var kafka_topics="kafka.topics"
// Create direct kafka stream with brokers and topics
var topicsSet = topicconf.getString(kafka_topics).split(",").toSet
if (args.length == 2) {
print("This stream job will process table(s): " + args(1))
topicsSet = args(1).split(",").toSet
}
val topicList = topicsSet.toList
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> conf.getString("kafka.brokers"),
"zookeeper.connect" -> conf.getString("kafka.zookeeper"),
"group.id" -> conf.getString("kafka.consumergroups"),
"auto.offset.reset" -> args { 0 },
"enable.auto.commit" -> (conf.getString("kafka.autoCommit").toBoolean: java.lang.Boolean),
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"security.protocol" -> "SASL_PLAINTEXT")
val messages = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))
for (i <- 0 until topicList.length) {
/**
* set timer to see how much time takes for the filter operation for each topics
*/
val topicStream = messages.filter(_.topic().equals(topicList(i)))
val data = topicStream.map(_.value())
data.foreachRDD((rdd, batchTime) => {
// val data = rdd.map(_.value())
if (!rdd.isEmpty()) {
rdd.coalesce(1).saveAsTextFile(conf.getString("hdfs.streamoutpath") + PATH_SEPERATOR + topicList(i) + PATH_SEPERATOR + date_today.format(System.currentTimeMillis())
+ PATH_SEPERATOR + date_today_hour.format(System.currentTimeMillis()) + PATH_SEPERATOR + System.currentTimeMillis())
}
})
}
try{
// After all successful processing, commit the offsets to kafka
messages.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
} catch {
case e: Exception =>
e.printStackTrace()
print("error while commiting the offset")
}
// Start the computation
ssc.start()
ssc.awaitTermination()
}
}
You're better off using the HDFS connector for Kafka Connect. It is open source and available standalone or as part of Confluent Platform. A simple configuration file is enough to stream from Kafka topics to HDFS, and as a bonus it will create the Hive table for you if you have a schema for your data.
You're re-inventing the wheel if you try to code this yourself; it's a solved problem :)
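For illustration only, a minimal HDFS sink configuration could look roughly like this (property names per the Confluent HDFS connector; the topics, URL, and flush size are placeholders you would adjust):
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=5
topics=topic1,topic2,topic3,topic4,topic5
hdfs.url=hdfs://namenode:8020
flush.size=1000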

Save Scala Spark Streaming Data to MongoDB

Here's my simplified Apache Spark Streaming code, which gets input via Kafka streams, then combines, prints, and saves the data to a file. But now I want the incoming stream of data to be saved in MongoDB.
val conf = new SparkConf().setMaster("local[*]")
.setAppName("StreamingDataToMongoDB")
.set("spark.streaming.concurrentJobs", "2")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topicName1 = List("KafkaSimple").toSet
val topicName2 = List("SimpleKafka").toSet
val stream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName1)
val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName2)
val lines1 = stream1.map(_._2)
val lines2 = stream2.map(_._2)
val allThelines = lines1.union(lines2)
allThelines.print()
allThelines.repartition(1).saveAsTextFiles("File", "AllTheLinesCombined")
I have tried the Stratio Spark-MongoDB library and some other resources, but still no success. Could someone please help me proceed or point me to a useful working resource/tutorial? Cheers :)
If you want to write out to a format which isn't directly supported on DStreams, you can use foreachRDD to write out each batch one by one using the RDD-based API for Mongo.
lines1.foreachRDD ( rdd => {
rdd.foreach( data =>
if (data != null) {
// Save data here
} else {
println("Got no data in this window")
}
)
})
Do the same for lines2.
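As an illustration only (not part of the original answer), one way to fill in the save step is to open a MongoDB client per partition using the plain Java driver; the host, database, and collection names here are assumptions:
import com.mongodb.MongoClient
import org.bson.Document
lines1.foreachRDD { rdd =>
if (!rdd.isEmpty()) {
rdd.foreachPartition { records =>
// Create the client on the executor rather than capturing it in the driver's closure
val client = new MongoClient("localhost", 27017) // assumed host/port
val collection = client.getDatabase("streamingdb").getCollection("lines") // assumed names
records.foreach(line => collection.insertOne(new Document("line", line)))
client.close()
}
} else {
println("Got no data in this window")
}
}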

No messages received when using foreachPartition spark streaming

I am pulling from Kafka using Spark Streaming. When I use foreachPartition on my RDD, I never receive any messages. If I read the messages from the RDD using foreach, it works fine. However, I need to use the partition-based function so I can have a socket connection on each executor.
This is the code connecting to Spark and creating the stream:
val kafkaParams = Map(
"zookeeper.connect" -> zooKeepers,
"group.id" -> ("metric-group"),
"zookeeper.connection.timeout.ms" -> "5000")
val inputTopic = "threatflow"
val conf = new SparkConf().setAppName(applicationTitle).set("spark.eventLog.overwrite", "true")
val ssc = new StreamingContext(conf, Seconds(5))
val streams = (1 to numberOfStreams) map { _ =>
KafkaUtils.createStream[String,String,StringDecoder,StringDecoder](ssc, kafkaParams, Map(inputTopic -> 1), StorageLevel.MEMORY_ONLY_SER)
}
val kafkaStream = ssc.union(streams)
kafkaStream.foreachRDD { (rdd, time) =>
calcVictimsProcess(process, rdd, time.milliseconds)
}
ssc.start()
ssc.awaitTermination()
Here is my code that attempts to process the messages using foreachPartition instead of foreach:
val threats = rdd.map(message => gson.fromJson(message._2.substring(1, message._2.length()), classOf[ThreatflowMessage]))
threats.flatMap(mapSrcVictim).reduceByKey((a,b) => a + b).foreachPartition{ partition =>
val socket = new Socket(InetAddress.getByName("localhost"),4242)
val writer = new BufferedOutputStream(socket.getOutputStream)
partition.foreach{ value =>
val parts = value._1.split("-")
val put = "put %s %d %d type=%s address=%s unique=%s\n".format("metric", bucket, value._2, parts(0),parts(1),unique)
Thread.sleep(10000)
}
writer.flush()
socket.close()
}
Simply switching this to foreach, as I said, will work; however, that is not an option because I need the sockets to be created per executor.
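For comparison, a minimal sketch of the per-partition socket pattern (illustrative only; it assumes the same localhost:4242 endpoint and that each formatted line is actually written to the socket's output stream):
import java.net.{InetAddress, Socket}
import java.io.BufferedOutputStream
rdd.foreachPartition { partition =>
if (partition.nonEmpty) {
val socket = new Socket(InetAddress.getByName("localhost"), 4242)
val writer = new BufferedOutputStream(socket.getOutputStream)
partition.foreach { value =>
val line = s"$value\n" // format the record as needed
writer.write(line.getBytes("UTF-8")) // without this write, nothing reaches the socket
}
writer.flush()
socket.close()
}
}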