storing the string from kafkaStream into a variable for processing - scala

I need to get my messages from a Kafka producer, find the words that contain % in those messages, and generate a message for the different % values. Finally I need to send it to ElasticSearch.
I am able to see the values in the console using kafkaStream.print(), but I need to process the string to match the required keywords and generate the message.
My code:
package rnd
import org.apache.spark.SparkConf
import kafka.serializer.StringDecoder
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
object WordFind {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local").setAppName("KafkaReceiver")
val checkpointDir = "/usr/local/kafka/kafka_2.11-0.11.0.2/checkpoint/"
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
val batchIntervalSeconds = 2
val ssc = new StreamingContext(conf, Seconds(10))
import org.apache.spark.streaming.kafka.KafkaUtils
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("wordcounttopic" -> 5))
val s = kafkaStream.print()
println(" the words are: " + s)
ssc.remember(Minutes(1))
ssc.checkpoint(checkpointDir)
ssc
ssc.start()
ssc.awaitTerminationOrTimeout(batchIntervalSeconds * 5 * 1000)
}
}
If I pass "The usage is 75%" through the Kafka producer, I should generate a message saying "Increase ram by 25%" in ElasticSearch.
The output that I am getting is:
18/02/09 16:38:27 INFO BlockManagerMasterEndpoint: Registering block manager localhost:37879 with 2.4 GB RAM, BlockManagerId(driver, localhost, 37879)
18/02/09 16:38:27 INFO BlockManagerMaster: Registered BlockManager
18/02/09 16:38:27 WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
***the words are: ()***
I want the String that I am passing to appear in place of () in s.

The val kafkaStream is a ReceiverInputDStream[(String, String)], where the data is (kafkaMetaData, kafkaMessage);
for more information see https://github.com/apache/spark/blob/f830bb9170f6b853565d9dd30ca7418b93a54fe3/external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/KafkaInputDStream.scala#L135.
We need to extract the second element of the tuple and do the pattern matching (i.e. filter the ReceiverInputDStream to find the words that contain %), and then use map to generate the output (i.e. a message for the different % values). As mentioned by @stefanobaghino, the print() function just prints the output to the console and doesn't return any string from the record.
For example:
import org.apache.spark.streaming.dstream.ReceiverInputDStream
val kafkaStream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(sparkStreamingContext, "localhost:2181",
"spark-streaming-consumer-group", Map("wordcounttopic" -> 5))
import org.apache.spark.streaming.dstream.DStream
val filteredStream: DStream[(String, String)] = kafkaStream
.filter(record => record._2.contains("%")) // TODO : pattern matching here
val outputDStream: DStream[String] = filteredStream
.map(record => record._2.toUpperCase()) // just assuming some operation
outputDStream.print()
Write the outputDStream to ElasticSearch. Hope this helps.
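For the specific example in the question ("The usage is 75%" should produce "Increase ram by 25%"), a minimal sketch of the map step could look like this; the regex and the "100 - usage" rule are assumptions taken from that single example, not part of the original code:
val percentPattern = "(\\d+)%".r // assumed message format, e.g. "The usage is 75%"
val alerts: DStream[String] = kafkaStream
  .map(_._2)                                // keep only the Kafka message value
  .filter(message => message.contains("%")) // keep messages that mention a percentage
  .flatMap { message =>
    percentPattern.findFirstMatchIn(message).map { m =>
      val usage = m.group(1).toInt
      s"Increase ram by ${100 - usage}%"    // assumed business rule from the example
    }.toList
  }
alerts.print()
// alerts can then be written to ElasticSearch (e.g. via the elasticsearch-hadoop connector, not shown here)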

Related

How to check if the batches are empty in Spark streaming (wordcount with socketTextStream)

I am working on a simple Spark Streaming wordcount example to count the number of words in text data received from a data server listening on a TCP socket.
I would like to check whether the batch from the streaming source is empty before I save the content of every transformation to text files. Currently I am using the Spark shell. This is my code.
I have tried this code, and it works fine, but it does not check whether the batch is empty:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN)
val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
lines.saveAsTextFiles("/stream_test/testLine.txt")
val words = lines.flatMap(_.split(" "))
words.saveAsTextFiles("/stream_test/testWords.txt")
val pairs = words.map((_, 1))
pairs.saveAsTextFiles("/stream_test/testPairs.txt")
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.saveAsTextFiles("/stream_test/testWordsCounts.txt")
wordCounts.print()
ssc.start()
I have tried to use foreachRDD but it gives me the error: value saveAsTextFiles is not a member of org.apache.spark.rdd.RDD[String]
This is my code:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN)
val ssc = new StreamingContext(sc, Seconds(3))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
lines.foreachRDD(rdd => {
if(!rdd.partitions.isEmpty)
{
lines.saveAsTextFiles("/stream_test/testLine.txt")
val words = lines.flatMap(_.split(" "))
words.saveAsTextFiles("/stream_test/testWords.txt")
val pairs = words.map((_, 1))
pairs.saveAsTextFiles("/stream_test/testPairs.txt")
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.saveAsTextFiles("/stream_test/testWordsCounts.txt")
wordCounts.print()
}
})
ssc.start()
I need to check whether the batch from the streaming source is empty before I save the content to text files. I appreciate your help.
I used to do it with the following code: loop over each RDD in the stream and use rdd.count() to judge whether the RDD is empty. If all RDDs are empty, nothing happens. Hope it can help you.
kafkaStream.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    // do something
  }
}
You can try the code snippet below to check whether your streaming batches are empty:
if(!rdd.partitions.isEmpty)
rdd.saveAsTextFile(outputDir)
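Putting this together with the word count from the question, a minimal sketch (assuming the Spark shell's sc and the same socket source and output paths as above) could look like this; note that inside foreachRDD you are working with an RDD, so the methods are isEmpty/saveAsTextFile rather than the DStream's saveAsTextFiles:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel

val ssc = new StreamingContext(sc, Seconds(3))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

wordCounts.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // write one directory per non-empty batch, suffixed with the batch time
    rdd.saveAsTextFile(s"/stream_test/testWordsCounts-${time.milliseconds}")
  }
}
ssc.start()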

Avoid multiple connections to mongoDB from spark streaming

We developed a Spark Streaming application that sources data from Kafka and writes to MongoDB. We are noticing performance implications while creating connections inside foreachRDD on the input DStream. The Spark Streaming application does a few validations before inserting into MongoDB. We are exploring options to avoid connecting to MongoDB for each message that is processed; instead we would like to process all messages within one batch interval at once. Following is a simplified version of the Spark Streaming application. One of the things we did is append all the messages to a DataFrame and try inserting the contents of that DataFrame outside of foreachRDD. But when we run this application, the code that writes the DataFrame to MongoDB does not get executed.
Please note that I commented out the part of the code inside foreachRDD that we used to insert each message into MongoDB. The existing approach is very slow because we are inserting one message at a time. Any suggestions on performance improvement are much appreciated.
Thank you
package com.testing
import org.apache.spark.streaming._
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.{ SQLContext, Row, Column, DataFrame }
import java.util.HashMap
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.joda.time._
import org.joda.time.format._
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
import com.mongodb.util.JSON
import scala.io.Source._
import java.util.Properties
import java.util.Calendar
import scala.collection.immutable
import org.json4s.DefaultFormats
object Sample_Streaming {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Sample_Streaming")
.setMaster("local[4]")
val sc = new SparkContext(sparkConf)
sc.setLogLevel("ERROR")
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))
val props = new HashMap[String, Object]()
val bootstrap_server_config = "127.0.0.100:9092"
val zkQuorum = "127.0.0.101:2181"
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
val TopicMap = Map("sampleTopic" -> 1)
val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)
val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
.option("spark.mongodb.input.uri", "connectionURI")
.option("spark.mongodb.input.collection", "schemaCollectionName")
.load()
val outSchema = schemaDf.schema
var outDf = sqlContext.createDataFrame(sc.emptyRDD[Row], outSchema)
KafkaDstream.foreachRDD(rdd => rdd.collect().map { x =>
{
val jsonInput: JValue = parse(x)
/*Do all the transformations using Json libraries*/
val json4s_transformed = "transformed json"
val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
val df = sqlContext.read.schema(outSchema).json(rdd)
//Earlier we were inserting each message into mongoDB, which we would like to avoid and process all at once
/* df.write.option("spark.mongodb.output.uri", "connectionURI")
.option("collection", "Collection")
.mode("append").format("com.mongodb.spark.sql").save()*/
outDf = outDf.union(df)
}
}
)
//Added this part of the code in expectation to access the unioned dataframe and insert all messages at once
//println(outDf.count())
if(outDf.count() > 0)
{
outDf.write
.option("spark.mongodb.output.uri", "connectionURI")
.option("collection", "Collection")
.mode("append").format("com.mongodb.spark.sql").save()
}
// Run the streaming job
ssc.start()
ssc.awaitTermination()
}
}
It sounds like you want to reduce the number of connections to MongoDB. For this purpose you should use foreachPartition when you open the connection to MongoDB (see the documentation); the code will look like this:
rdd.repartition(1).foreachPartition { partition =>
  // get an instance of the connection
  // write/read to MongoDB in batches
  // close the connection
}
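A minimal sketch of this pattern, applied to the KafkaDstream from the question; createMongoClient, toDocument and insertMany are placeholders for whichever MongoDB driver or connector you use, not real API names:
KafkaDstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.foreachPartition { partition =>
      val client = createMongoClient("connectionURI")          // hypothetical helper: open one connection per partition
      val docs = partition.map(message => toDocument(message)) // hypothetical per-message transformation/validation
      insertMany(client, "Collection", docs.toSeq)             // hypothetical batch insert instead of one insert per message
      client.close()                                           // hypothetical close of the per-partition connection
    }
  }
}
This keeps the connection count at one per partition per batch instead of one per message, and the batch insert amortizes the round trips to MongoDB.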

do not want string as type when using foreach in scala spark streaming?

Code snippet:
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
val write2hdfs = lines.filter(x => x._1 == "lineitem").map(_._2)
write2hdfs.foreachRDD(rdd => {
rdd.foreach(avroRecord => {
println(avroRecord)
//val rawByte = avroRecord.getBytes("UTF-8")
Issue faced:
avroRecord holds Avro-encoded messages received from the Kafka stream.
By default avroRecord is a String when the above code is used, and Strings in Scala use UTF-16 encoding by default.
Because of this the deserialization is not correct and I am facing issues.
The messages were encoded into Avro with UTF-8 when they were sent to the Kafka stream.
I need avroRecord as raw bytes instead of getting it as a String and then converting it to bytes (internally the String would apply UTF-16 encoding), or a way to get avroRecord itself in UTF-8. I am stuck at this deadlock.
I need a way forward for this problem statement.
Thanks in advance.
UPDATE:
Changed code snippet:
val ssc = new StreamingContext(sparkConf, Seconds(5))
//val ssc = new JavaStreamingContext(sparkConf, Seconds(5))
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val kafkaParams = Map[String, String](
  "zookeeper.connect" -> zkQuorum,
  "group.id" -> group,
  "zookeeper.connection.timeout.ms" -> "10000")
//val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
val lines = KafkaUtils.createStream[String, Message, StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topics, StorageLevel.NONE)
Imports done:
import org.apache.spark.streaming._
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
import org.apache.avro
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord,
GenericDatumWriter, GenericData}
import org.apache.avro.io.{DecoderFactory, DatumReader, DatumWriter,
BinaryDecoder}
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import java.io.{File, IOException}
//import java.io.*
import org.apache.commons.io.IOUtils;
import _root_.kafka.serializer.{StringDecoder, DefaultDecoder}
import _root_.kafka.message.Message
import scala.reflect._
Compilation error:
Compiling 1 Scala source to /home/spark_scala/spark_stream_project/target/scala-2.10/classes...
[error] /home/spark_scala/spark_stream_project/src/main/scala/sparkStreaming.scala:34: overloaded method value createStream with alternatives:
[error] (jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,keyTypeClass: Class[String],valueTypeClass: Class[kafka.message.Message],keyDecoderClass: Class[kafka.serializer.StringDecoder],valueDecoderClass: Class[kafka.serializer.DefaultDecoder],kafkaParams: java.util.Map[String,String],topics: java.util.Map[String,Integer],storageLevel: org.apache.spark.storage.StorageLevel)org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream[String,kafka.message.Message]
[error] (ssc: org.apache.spark.streaming.StreamingContext,kafkaParams: scala.collection.immutable.Map[String,String],topics: scala.collection.immutable.Map[String,Int],storageLevel: org.apache.spark.storage.StorageLevel)(implicit evidence$1: scala.reflect.ClassTag[String], implicit evidence$2: scala.reflect.ClassTag[kafka.message.Message], implicit evidence$3: scala.reflect.ClassTag[kafka.serializer.StringDecoder], implicit evidence$4: scala.reflect.ClassTag[kafka.serializer.DefaultDecoder])org.apache.spark.streaming.dstream.ReceiverInputDStream[(String, kafka.message.Message)]
[error] cannot be applied to (org.apache.spark.streaming.StreamingContext, scala.collection.immutable.Map[String,String], String, org.apache.spark.storage.StorageLevel)
[error] val lines = KafkaUtils.createStream[String,Message,StringDecoder,DefaultDecoder]
[error] ^
[error] one error found
What is wrong here?
Also, I don't see the suggested overload defined in the KafkaUtils API doc.
The API doc I am referring to:
https://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html
Looking forward to your support.
Thanks.
UPDATE 2:
Tried with the suggested corrections!
Code snippet:
val lines = KafkaUtils.createStream[String, Message, StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_2)
val write2hdfs = lines.filter(x => x._1 == "lineitem").map(_._2)
Facing a runtime exception:
java.lang.ClassCastException: [B cannot be cast to kafka.message.Message
On the line:
KafkaUtils.createStream[String, Message, StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_2)
val write2hdfs = lines.filter(x => x._1 == "lineitem").map(_._2)
Ideally filtering this DStream[(String, Message)] should also work, right?
Do I need to extract the payload from Message before applying the map?
I need your inputs, please.
Thanks
You could do something like this:
import kafka.serializer.{StringDecoder, DefaultDecoder}
import kafka.message.Message
val kafkaParams = Map[String, String](
"zookeeper.connect" -> zkQuorum, "group.id" -> group,
"zookeeper.connection.timeout.ms" -> "10000")
val lines = KafkaUtils.createStream[String, Message, StringDecoder, DefaultDecoder](
ssc, kafkaParams, topics, storageLevel)
This should get you a DStream[(String, kafka.message.Message)], and you should be able to retrieve the raw bytes and convert to Avro from there.
This worked for me:
val lines = KafkaUtils.createStream[String, Array[Byte], StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK_2)
My requirement was to get the byte array, so I changed kafka.message.Message to Array[Byte].
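Once the values arrive as Array[Byte], a minimal sketch of decoding them into Avro records could look like the following, reusing the Avro classes already imported in the question; the schema string and field names are assumptions, and this presumes plain Avro binary payloads without any schema-registry framing:
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// assumed writer schema; replace with the schema the producer actually used
val schemaJson = """{"type":"record","name":"LineItem","fields":[{"name":"id","type":"string"}]}"""

lines.map(_._2).foreachRDD { rdd =>
  rdd.foreach { bytes =>
    // parse the schema inside the task so nothing non-serializable is captured by the closure
    val schema = new Schema.Parser().parse(schemaJson)
    val reader = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    val record: GenericRecord = reader.read(null, decoder)
    println(record)
  }
}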

Spark streaming: How to write cumulative output?

I have to write a single output file for my streaming job.
Question: when will my job actually stop? I killed the server but it did not work.
I want to stop my job from the command line (if that is possible).
Code:
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.Seconds
import org.apache.spark._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.collection.mutable.ArrayBuffer
object MAYUR_BELDAR_PROGRAM5_V1 {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", args(0).toInt)
val words = lines.flatMap(_.split(" "))
val class1 = words.filter(a => a.charAt(0).toInt%2==0).map(a => a).filter(a => a.length%2==0)
val class2 = words.filter(a => a.charAt(0).toInt%2==0).map(a => a).filter(a => a.length%2==1)
val class3 = words.filter(a => a.charAt(0).toInt%2==1).map(a => a).filter(a => a.length%2==0)
val class4 = words.filter(a => a.charAt(0).toInt%2==1).map(a => a).filter(a => a.length%2==1)
class1.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class1","txt")
class2.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class2", "txt")
class3.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class3","txt")
class4.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/class4","txt")
ssc.start() // Start the computation
ssc.awaitTermination()
ssc.stop()
}
}
A stream by definition does not have an end, so it will not stop unless you call the method to stop it. In my case I have a business condition that tells me when the process is finished, so when I reach that point I call JavaStreamingContext.close(). I also have a monitor that checks whether the process has received no data in the past few minutes, in which case it also closes the stream.
In order to accumulate data you have to use the method updateStateByKey (on a PairDStream). This method requires checkpointing to be enabled.
I have checked the Spark code and found that saveAsTextFiles uses foreachRDD, so in the end it saves each RDD separately and previous RDDs are not taken into account. With updateStateByKey it will still save multiple files, but each file will take into account all RDDs that were processed before.
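A minimal sketch of updateStateByKey for this use case, building on the words DStream from the question; the checkpoint directory and the pairs mapping are assumptions:
// checkpointing is required for stateful transformations such as updateStateByKey
ssc.checkpoint("hdfs://hadoop1:9000/mbeldar/checkpoint") // assumed path

val pairs = words.map((_, 1))
// running total per word across all batches seen so far
val cumulativeCounts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))
}
cumulativeCounts.saveAsTextFiles("hdfs://hadoop1:9000/mbeldar/cumulative", "txt")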

spark streaming wordcount is not printing the results

I am trying to run this app in Spark Streaming; the code is from a book I am reading, but unfortunately I am not getting the expected results. There is a Java class in which I open a socket and wait for input. I run the socket code and connect it properly to the Spark job. Then I submit the following job and get a message saying I connected successfully. When I type something into the socket I want a wordcount result printed in the terminal, but instead I am getting this message:
INFO BlockManagerInfo: Added input-0-1480077969600 in memory on 192.168.1.4:38818 (size: 7.0 B, free: 265.1 MB)
Where is the problem? See the code below; thanks in advance.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming._
import org.apache.spark.storage.StorageLevel._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.dstream.ForEachDStream
object ScalaFirstStreamingExample {
def main(args: Array[String]){
println("Creating Spark Configuration") //Create an Object of Spark Configuration
val conf = new SparkConf() //Set the logical and user defined Name of this Application
conf.setAppName("My First Spark Streaming Application")
println("Retreiving Streaming Context from Spark Conf") //Retrieving Streaming Context from SparkConf Object.
//Second parameter is the time interval at which
//streaming data will be divided into batches
val streamCtx = new StreamingContext(conf, Seconds(2)) //Define the type of Stream. Here we are using TCP
//Socket as textstream,
//It will keep watching for the incoming data from a
//specific machine (localhost) and port (9087)
//Once the data is retrieved it will be saved in the
//memory and in case memory
//is not sufficient, then it will store it on the Disk
//It will further read the Data and convert it into DStream
val lines = streamCtx.socketTextStream("localhost", 9087, MEMORY_AND_DISK_SER_2) //Apply the Split() function to all elements of DStream
//which will further generate multiple new records from
//each record in Source Stream
//And then use flatmap to consolidate all records and
//create a new DStream.
val words = lines.flatMap(x => x.split(" ")) //Now, we will count these words by applying a using map()
//map() helps in applying a given function to each
//element in an RDD.
val pairs = words.map(word => (word, 1)) //Further we will aggregate the value of each key by
//using/applying the given function.
val wordCounts = pairs.reduceByKey(_ + _) //Lastly we will print all Values
//wordCounts.print(20)
myPrint(wordCounts,streamCtx)
//Most important statement which will initiate the
//Streaming Context
streamCtx.start();
//Wait till the execution is completed.
streamCtx.awaitTermination();
}
def myPrint(stream:DStream[(String,Int)],streamCtx: StreamingContext){
stream.foreachRDD(foreachFunc)
def foreachFunc = (rdd: RDD[(String,Int)]) => {
val array = rdd.collect()
println("---------Start Printing Results----------")
for(res<-array){
println(res)
}
println("---------Finished Printing Results----------")
}
}
}