foreachRDD in Spark Streaming will not write a file - Scala

This is my Spark Streaming job:
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object StreamingTest {
  def main(args: Array[String]) = {
    val sparkConf = new SparkConf().setAppName("StreamingTest")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    ssc.checkpoint("checkpoint")

    val lines = KafkaUtils.createStream(
      ssc,
      "localhost:2181/kafka",
      "streaming-group",
      Map("streaming-topic" -> 1),
      StorageLevel.MEMORY_AND_DISK
    ).map(_._2)

    lines.flatMap(_.split(" ")).map(token => s"${token} => ${token.hashCode}")
      .foreachRDD(rdd => {
        rdd.saveAsTextFile(s"/results/raw-${System.currentTimeMillis()}.txt")
      })

    lines.flatMap(_.split(" ")).map(token => s"${token} => ${token.hashCode}")
      .saveAsTextFiles("/results/raw", "test")

    ssc.start()
    ssc.awaitTermination()
  }
}
The last save operation works: it writes files to /results/raw. The one inside the foreachRDD call does not. Can someone explain why?


java.io.IOException: Failed to write statements to batch_layer.test. The latest exception was Key may not be empty

I am trying to count the number of words in a text and save the result to a Cassandra database.
The producer reads the data from a file and sends it to Kafka. The consumer uses Spark Streaming to read and process the data, and then writes the result of the calculation to the table.
My producer looks like this:
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.{IntegerSerializer, StringSerializer}

import scala.io.Source

object ProducerPlayground extends App {
  val topicName = "test"

  private def createProducer: Properties = {
    val producerProperties = new Properties()
    producerProperties.setProperty(
      ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
      "localhost:9092"
    )
    producerProperties.setProperty(
      ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      classOf[IntegerSerializer].getName
    )
    producerProperties.setProperty(
      ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      classOf[StringSerializer].getName
    )
    producerProperties
  }

  val producer = new KafkaProducer[Int, String](createProducer)

  val source = Source.fromFile("G:\\text.txt", "UTF-8")
  val lines = source.getLines()
  var key = 0
  for (line <- lines) {
    producer.send(new ProducerRecord[Int, String](topicName, key, line))
    key += 1
  }
  source.close()
  producer.flush()
}
The consumer looks like this:
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.spark.connector.streaming._
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchLayer {
  def main(args: Array[String]) {
    val brokers = "localhost:9092"
    val topics = "test"
    val groupId = "groupId-1"

    val sparkConf = new SparkConf()
      .setAppName("BatchLayer")
      .setMaster("local[*]")
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    val sc = ssc.sparkContext
    sc.setLogLevel("OFF")

    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
      ConsumerConfig.GROUP_ID_CONFIG -> groupId,
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false"
    )

    val stream =
      KafkaUtils.createDirectStream[String, String](
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
      )

    val cass = CassandraConnector(sparkConf)
    cass.withSessionDo { session =>
      session.execute(
        s"CREATE KEYSPACE IF NOT EXISTS batch_layer WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }"
      )
      session.execute(s"CREATE TABLE IF NOT EXISTS batch_layer.test (key VARCHAR PRIMARY KEY, value INT)")
      session.execute(s"TRUNCATE batch_layer.test")
    }

    stream
      .map(v => v.value())
      .flatMap(x => x.split(" "))
      .filter(x => !x.contains(Array('\n', '\t')))
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      .saveToCassandra("batch_layer", "test", SomeColumns("key", "value"))

    ssc.start()
    ssc.awaitTermination()
  }
}
After starting the producer, the program stops working with this error. What did I do wrong?
It makes very little sense to use the legacy DStream-based streaming in 2021 - it's very cumbersome to use, and you also need to track Kafka offsets yourself, etc. It's better to use Structured Streaming instead - it will track offsets for you through the checkpoints, and you will work with the high-level Dataset API, etc.
In your case the code could look as follows (not tested, but adapted from this working example):
import org.apache.spark.sql.streaming.OutputMode
import spark.implicits._ // for the $"..." column syntax

val streamingInputDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()

val wordsCountsDF = streamingInputDF.selectExpr("CAST(value AS STRING) as value")
  .selectExpr("split(value, '\\\\W+', -1) as words") // split on runs of non-word characters
  .selectExpr("explode(words) as word")
  .filter("word != ''")
  .groupBy($"word")
  .count()
  .select($"word", $"count")

// create table ...

val query = wordsCountsDF.writeStream
  .outputMode(OutputMode.Update)
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", "path_to_checkpoint")
  .option("keyspace", "test")
  .option("table", "<table_name>")
  .start()

query.awaitTermination()
P.S. In your example, the most probable error is that you're trying to use .saveToCassandra directly on a DStream - it doesn't work that way.
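For completeness, if you do stay on DStreams, the per-batch pattern would look roughly like this (an untested sketch: wordCounts stands for the result of the reduceByKey step in the question, and the empty-word filter is only an assumption based on the "Key may not be empty" message):
import com.datastax.spark.connector._

// Write each micro-batch with the RDD-based connector API instead of
// calling saveToCassandra on the DStream itself.
wordCounts.foreachRDD { rdd =>
  rdd
    .filter { case (word, _) => word.nonEmpty } // Cassandra rejects empty partition keys
    .saveToCassandra("batch_layer", "test", SomeColumns("key", "value"))
}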

Save Scala Spark Streaming Data to MongoDB

Here's my simplified Apache Spark Streaming code, which gets input via Kafka streams, then combines, prints, and saves the lines to a file. But now I want the incoming stream of data to be saved in MongoDB.
val conf = new SparkConf().setMaster("local[*]")
  .setAppName("StreamingDataToMongoDB")
  .set("spark.streaming.concurrentJobs", "2")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topicName1 = List("KafkaSimple").toSet
val topicName2 = List("SimpleKafka").toSet

val stream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName1)
val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName2)

val lines1 = stream1.map(_._2)
val lines2 = stream2.map(_._2)

val allThelines = lines1.union(lines2)
allThelines.print()
allThelines.repartition(1).saveAsTextFiles("File", "AllTheLinesCombined")
I have tried the Stratio Spark-MongoDB library and some other resources, but still no success. Could someone help me proceed or point me to a useful working resource/tutorial? Cheers :)
If you want to write out to a format which isn't directly supported on DStreams, you can use foreachRDD to write out each batch one by one using the RDD-based API for Mongo.
lines1.foreachRDD ( rdd => {
  rdd.foreach( data =>
    if (data != null) {
      // Save data here: convert the record into whatever shape your
      // MongoDB client or connector expects and insert it
    } else {
      println("Got no data in this window")
    }
  )
})
Do the same for lines2.
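As one concrete, untested sketch of that per-batch approach, here is what it could look like with the MongoDB Spark Connector's RDD API (the output URI below is a placeholder, and allThelines is the unioned stream from the question):
import com.mongodb.spark.MongoSpark
import org.bson.Document

// Assumes "spark.mongodb.output.uri" is set on the SparkConf, e.g.
// "mongodb://localhost:27017/mydb.mycollection" (placeholder database/collection)
allThelines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Wrap each incoming line in a BSON document and save the whole batch
    val docs = rdd.map(line => new Document("line", line))
    MongoSpark.save(docs)
  }
}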

No messages received when using foreachPartition spark streaming

I am pulling from Kafka using Spark Streaming. When I use foreachPartition on my RDD, I never receive any messages. If I read the messages from the RDD using foreach, it works fine. However, I need to use the partition-based function so I can have a socket connection on each executor.
This is the code that connects to Spark and creates the stream:
val kafkaParams = Map(
  "zookeeper.connect" -> zooKeepers,
  "group.id" -> ("metric-group"),
  "zookeeper.connection.timeout.ms" -> "5000")
val inputTopic = "threatflow"
val conf = new SparkConf().setAppName(applicationTitle).set("spark.eventLog.overwrite", "true")
val ssc = new StreamingContext(conf, Seconds(5))
val streams = (1 to numberOfStreams) map { _ =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Map(inputTopic -> 1), StorageLevel.MEMORY_ONLY_SER)
}
val kafkaStream = ssc.union(streams)

kafkaStream.foreachRDD { (rdd, time) =>
  calcVictimsProcess(process, rdd, time.milliseconds)
}

ssc.start()
ssc.awaitTermination()
Here is my code that attempts to process the messages using foreachPartition instead of foreach:
val threats = rdd.map(message => gson.fromJson(message._2.substring(1, message._2.length()), classOf[ThreatflowMessage]))

threats.flatMap(mapSrcVictim).reduceByKey((a, b) => a + b).foreachPartition { partition =>
  val socket = new Socket(InetAddress.getByName("localhost"), 4242)
  val writer = new BufferedOutputStream(socket.getOutputStream)
  partition.foreach { value =>
    val parts = value._1.split("-")
    val put = "put %s %d %d type=%s address=%s unique=%s\n".format("metric", bucket, value._2, parts(0), parts(1), unique)
    writer.write(put.getBytes("UTF-8")) // presumably the formatted line is meant to be sent over the socket
    Thread.sleep(10000)
  }
  writer.flush()
  socket.close()
}
Simply switching this to foreach, as I said, will work; however, that is not an option, because I need the sockets to be created per executor.
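For reference, a minimal self-contained sketch of the socket-per-partition pattern described above (untested; the host, port, and output line format are placeholders rather than values from the question):
import java.io.BufferedOutputStream
import java.net.{InetAddress, Socket}

import org.apache.spark.rdd.RDD

// Open one socket per partition and stream each record through it
def writePartitions(counts: RDD[(String, Int)]): Unit = {
  counts.foreachPartition { partition =>
    val socket = new Socket(InetAddress.getByName("localhost"), 4242) // placeholder host/port
    val writer = new BufferedOutputStream(socket.getOutputStream)
    try {
      partition.foreach { case (key, count) =>
        writer.write(s"put metric $count key=$key\n".getBytes("UTF-8")) // placeholder line format
      }
      writer.flush()
    } finally {
      socket.close()
    }
  }
}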

Spark Streaming, kafka: java.lang.StackOverflowError

I am getting the error below in my spark-streaming application; I am using Kafka for the input stream. When I was working with a socket it was fine, but when I changed to Kafka it gives this error. Does anyone have an idea why it is throwing the error? Do I need to change my batch interval and checkpointing interval?
ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.StackOverflowError
My program:
def main(args: Array[String]): Unit = {
  // Function to create and setup a new StreamingContext
  def functionToCreateContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("HBaseStream")
    val sc = new SparkContext(conf)
    // create a StreamingContext, the main entry point for all streaming functionality
    val ssc = new StreamingContext(sc, Seconds(5))
    val brokers = args(0)
    val topics = args(1)
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)
    val inputStream = messages.map(_._2)
    // val inputStream = ssc.socketTextStream(args(0), args(1).toInt)
    ssc.checkpoint(checkpointDirectory)
    inputStream.print(1)
    val parsedStream = inputStream
      .map(line => {
        val splitLines = line.split(",")
        (splitLines(1), splitLines.slice(2, splitLines.length).map(_.trim.toLong))
      })
    import breeze.linalg.{DenseVector => BDV}
    import scala.util.Try
    val state: DStream[(String, Array[Long])] = parsedStream.updateStateByKey(
      (current: Seq[Array[Long]], prev: Option[Array[Long]]) => {
        prev.map(_ +: current).orElse(Some(current))
          .flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
      })
    state.checkpoint(Duration(10000))
    state.foreachRDD(rdd => rdd.foreach(Blaher.blah))
    ssc
  }
  // Get StreamingContext from checkpoint data or create a new one
  val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
}
}
Try deleting the checkpoint directory.
I'm not sure, but it seems that your streaming context fails to restore from the checkpoints.
Anyway, it worked for me.
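If the checkpoint lives on a local or HDFS path, here is a small untested sketch of clearing it before calling StreamingContext.getOrCreate (checkpointDirectory is the same value used in the question):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Delete the old checkpoint data so getOrCreate builds a fresh context
// instead of trying to restore from possibly incompatible checkpoints.
val checkpointPath = new Path(checkpointDirectory)
val fs = FileSystem.get(checkpointPath.toUri, new Configuration())
if (fs.exists(checkpointPath)) {
  fs.delete(checkpointPath, true) // recursive delete
}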

Why does saveAsTextFile do nothing?

I'm trying to implement a simple WordCount in Scala + Spark. Here is my code:
import org.apache.spark.{SparkConf, SparkContext}

object FirstObject {
  def main(args: Array[String]) {
    val input = "/Data/input"
    val conf = new SparkConf().setAppName("Simple Application")
      .setMaster("spark://192.168.1.162:7077")
    val sparkContext = new SparkContext(conf)
    val text = sparkContext.textFile(input).cache()
    val wordCounts = text.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
      .sortByKey()
    wordCounts.saveAsTextFile("/Data/output")
  }
}
This job runs for 54 s and ultimately does nothing: it does not write any output to /Data/output.
Also, if I replace saveAsTextFile with foreach(println), it produces the desired output.
You should check your user's rights on the /Data/output folder.
The folder must be writable by the specific user that runs the job.
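As a quick sanity check (an untested sketch, not part of the original answer), you can verify from the machine that runs the job that the output location is writable by your user:
import java.nio.file.{Files, Paths}

// Check that the parent of the output path exists and is writable
// by the user running the Spark job (local filesystem check).
val outputParent = Paths.get("/Data")
println(s"exists:   ${Files.exists(outputParent)}")
println(s"writable: ${Files.isWritable(outputParent)}")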