Spark Streaming with Kafka - Scala

I want to do simple machine learning in Spark.
First the application should learn from historical data in a file and train the machine learning model, then read input from Kafka to give predictions in real time. To do that I believe I should use Spark Streaming. However, I'm afraid I don't really understand how Spark Streaming works.
The code looks like this:
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("test App")
  val sc = new SparkContext(conf)
  val fromFile = parse(sc, Source.fromFile("my_data_.csv").getLines.toArray)
  ML.train(fromFile)
  real_time(sc)
}
Here ML is a class with some machine learning logic in it, and train gives it data to train on. There is also a method classify which calculates predictions based on what it has learned.
The first part seems to work fine, but real_time is a problem:
def real_time(sc: SparkContext): Unit = {
  val ssc = new StreamingContext(new SparkConf(), Seconds(1))
  val topic = "my_topic".split(",").toSet
  val params = Map[String, String](("metadata.broker.list", "localhost:9092"))
  val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topic)
  var lin = dstream.map(_._2)
  val str_arr = new Array[String](0)
  lin.foreach {
    str_arr :+ _.collect()
  }
  val lines = parse(sc, str_arr).map(i => i.features)
  ML.classify(lines)
  ssc.start()
  ssc.awaitTermination()
}
What I would like it to do is check the Kafka stream and run the computation whenever new lines arrive. That doesn't seem to happen: I added some prints and nothing is printed.
How does Spark Streaming work, and how should it be used in my case?
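For reference, here is a minimal sketch (added for illustration, assuming the parse and ML helpers above can operate on each collected batch) of how a Kafka DStream is usually consumed: the per-batch work goes inside foreachRDD instead of trying to collect the stream into an array up front.
def real_time(sc: SparkContext): Unit = {
  // Reuse the existing SparkContext instead of building a second SparkConf.
  val ssc = new StreamingContext(sc, Seconds(1))
  val topics = Set("my_topic")
  val params = Map[String, String]("metadata.broker.list" -> "localhost:9092")

  val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topics)

  // The body of foreachRDD runs on the driver once per batch interval.
  dstream.map(_._2).foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      // collect() is only reasonable here if each batch is small enough for the driver.
      val lines = parse(sc, rdd.collect()).map(_.features)
      ML.classify(lines)
    }
  }

  ssc.start()
  ssc.awaitTermination()
}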

Related

Getting error while saving PairRDD in Spark Streaming [duplicate]

I am trying to save my pair RDD in Spark Streaming, but I get an error at the last step when saving.
Here is my sample code:
def main(args: Array[String]) {
  val inputPath = args(0)
  val output = args(1)
  val noOfHashPartitioner = args(2).toInt

  println("IN Streaming ")
  val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val hadoopConf = sc.hadoopConfiguration
  //hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  val ssc = new org.apache.spark.streaming.StreamingContext(sc, Seconds(60))
  val input = ssc.textFileStream(inputPath)

  val pairedRDD = input.map { row =>
    val split = row.split("\\|")
    val fileName = split(0)
    val fileContent = split(1)
    (fileName, fileContent)
  }

  import org.apache.hadoop.io.NullWritable
  import org.apache.hadoop.io.compress.GzipCodec
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
  import org.apache.spark.HashPartitioner

  class RddMultiTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
    override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
    override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
  }

  //print(pairedRDD)
  pairedRDD.partitionBy(new HashPartitioner(noOfHashPartitioner)).saveAsHadoopFile(output, classOf[String], classOf[String], classOf[RddMultiTextOutputFormat], classOf[GzipCodec])

  ssc.start()            // Start the computation
  ssc.awaitTermination() // Wait for the computation to terminate
}
I am getting this error at the last step, while saving. I am new to Spark Streaming, so I must be missing something here.
The error is:
value partitionBy is not a member of org.apache.spark.streaming.dstream.DStream[(String, String)]
Please help.
pairedRDD is of type DStream[(String, String)], not RDD[(String, String)]. The method partitionBy is not available on DStreams.
Look into foreachRDD instead, which is available on DStreams and hands you the underlying RDD of each batch.
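A minimal sketch of that approach, reusing the names from the question (pairedRDD, noOfHashPartitioner, output, RddMultiTextOutputFormat, GzipCodec); the per-batch output path is an assumption added here so that successive batches don't overwrite each other:
pairedRDD.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    rdd.partitionBy(new HashPartitioner(noOfHashPartitioner))
      .saveAsHadoopFile(
        s"$output/batch-${time.milliseconds}", // assumed per-batch subdirectory
        classOf[String], classOf[String],
        classOf[RddMultiTextOutputFormat], classOf[GzipCodec])
  }
}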
EDIT: A bit more context. textFileStream sets up a directory watch on the specified path and streams the content of any new files that appear there; that is where the streaming aspect comes from. Is that what you want, or do you just want to read the contents of the directory "as is" once? In that case use SparkContext.wholeTextFiles (or textFile), which returns a regular RDD rather than a stream.

Partitioning RDF datasets by subject in Spark Scala

I am a newbie to functional programming languages and I am trying to learn Spark with Scala.
The goal is to partition an RDF dataset by subject.
The code is below:
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val sparkConf =
      new SparkConf()
        .setAppName("SimpleApp")
        .setMaster("local[2]")
        .set("spark.executor.memory", "1g")
    val sc = new SparkContext(sparkConf)
    val data = sc.textFile("/home/hduser/Bureau/11.txt")
    val subject = data.map(_.split("\\s+")(0)).distinct.collect
  }
}
So I manage to recover the subjects, but this returns an array of strings. I also looked at mapPartitions(func) and mapPartitionsWithIndex(func), but their func needs to work on an iterator.
So how do I proceed?
Partitioning your RDD by subject would probably best be done using a HashPartitioner. The HashPartitioner works on a pair RDD (an RDD of key-value tuples) and routes every record with the same key to the same partition, e.g.
myPairRDD:
("sub1", "desc1")
("sub2", "desc2")
("sub1", "desc3")
("sub2", "desc4")
myPairRDD.partitionBy(new HashPartitioner(2))
becomes:
partition 1:
("sub1", "desc1")
("sub1", "desc3")
partition 2:
("sub2", "desc2")
("sub2", "desc4")
Therefore, your subjects RDD should probably be built as a pair RDD, with the subject as the key:
val subjectTuples = data.map { line =>
  val tokens = line.split("\\s+")
  (tokens(0), tokens(1))
}
See the diagrams here for more info: https://blog.knoldus.com/2015/06/19/shufflling-and-repartitioning-of-rdds-in-apache-spark/
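Putting it together, a minimal sketch (the partition count of 2 and the glom-based inspection are purely illustrative):
import org.apache.spark.HashPartitioner

// All records with the same subject end up in the same partition.
val bySubject = subjectTuples.partitionBy(new HashPartitioner(2))

// Inspect which subjects landed in which partition.
bySubject.glom().collect().zipWithIndex.foreach { case (part, idx) =>
  println(s"partition $idx: ${part.map(_._1).distinct.mkString(", ")}")
}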

Save Scala Spark Streaming Data to MongoDB

Here's my simplified Apache Spark Streaming code which gets input via Kafka streams, combines them, prints them and saves them to a file. But now I want the incoming stream of data to be saved in MongoDB.
val conf = new SparkConf().setMaster("local[*]")
  .setAppName("StreamingDataToMongoDB")
  .set("spark.streaming.concurrentJobs", "2")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topicName1 = List("KafkaSimple").toSet
val topicName2 = List("SimpleKafka").toSet

val stream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName1)
val stream2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicName2)

val lines1 = stream1.map(_._2)
val lines2 = stream2.map(_._2)

val allThelines = lines1.union(lines2)
allThelines.print()
allThelines.repartition(1).saveAsTextFiles("File", "AllTheLinesCombined")
I have tried the Stratio Spark-MongoDB library and some other resources, but still no success. Could someone please help me proceed or point me to a useful working resource/tutorial? Cheers :)
If you want to write out to a format which isn't directly supported on DStreams, you can use foreachRDD to write out each batch one by one using the RDD-based API for Mongo.
lines1.foreachRDD { rdd =>
  rdd.foreach { data =>
    if (data != null) {
      // Save data here
    } else {
      println("Got no data in this window")
    }
  }
}
Do the same for lines2.
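To make the "Save data here" part concrete, here is a minimal sketch using the plain MongoDB Java driver inside foreachRDD/foreachPartition; the host, database name ("streamdb") and collection name ("lines") are placeholders, and one client per partition (rather than per record) is just the usual pattern for this kind of sink.
import com.mongodb.MongoClient
import org.bson.Document

allThelines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One client per partition, created on the executor that processes it.
    val client = new MongoClient("localhost", 27017)
    try {
      val collection = client.getDatabase("streamdb").getCollection("lines")
      partition.foreach { line =>
        collection.insertOne(new Document("value", line))
      }
    } finally {
      client.close()
    }
  }
}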

How to use feature extraction with DStream in Apache Spark

I have data that arrives from Kafka through a DStream. I want to perform feature extraction on it in order to obtain some keywords.
I do not want to wait for the arrival of all the data (as it is intended to be a continuous stream that potentially never ends), so I hope to perform the extraction in chunks - it doesn't matter to me if the accuracy suffers a bit.
So far I have put together something like this:
def extractKeywords(stream: DStream[Data]): Unit = {
  val spark: SparkSession = SparkSession.builder.getOrCreate

  val streamWithWords: DStream[(Data, Seq[String])] = stream map extractWordsFromData
  val streamWithFeatures: DStream[(Data, Array[String])] = streamWithWords transform extractFeatures(spark) _
  val streamWithKeywords: DStream[DataWithKeywords] = streamWithFeatures map addKeywordsToData

  streamWithFeatures.print()
}

def extractFeatures(spark: SparkSession)
                   (rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] = {
  val df = spark.createDataFrame(rdd).toDF("data", "words")

  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(numOfFeatures)
  val rawFeatures = hashingTF.transform(df)

  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(rawFeatures)
  val rescaledData = idfModel.transform(rawFeatures)

  import spark.implicits._
  rescaledData.select("data", "features").as[(Data, Array[String])].rdd
}
However, I received java.lang.IllegalStateException: Haven't seen any document yet. I am not surprised, as I am just trying to scrape something together, and I understand that since I am not waiting for any data to arrive, the generated model might be empty when I try to use it on the data.
What would be the right approach to this problem?
I used the advice from the comments and split the procedure into two runs:
one that calculates the IDF model and saves it to a file:
def trainFeatures(idfModelFile: File, rdd: RDD[(String, Seq[String])]) = {
  val session: SparkSession = SparkSession.builder.getOrCreate

  val wordsDf = session.createDataFrame(rdd).toDF("data", "words")

  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
  val featurizedDf = hashingTF.transform(wordsDf)

  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(featurizedDf)

  idfModel.write.save(idfModelFile.getAbsolutePath)
}
and one that reads the IDF model from the file and simply runs it on all incoming data:
val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)
val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
val wordsDf = tokenizer.transform(documentDf)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")
val featuresDf = extractor.transform(featurizedDf)
featuresDf.select("update", "features")
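A possible way to wire this second run into the stream (stream, spark and idfModelFile are taken from the snippets above; running the per-batch steps inside foreachRDD is my own assumption about how it is hooked up):
// Load the pre-trained model once, on the driver.
val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Assumes rdd is an RDD[(String, String)] of (update, document) pairs.
    val documentDf = spark.createDataFrame(rdd).toDF("update", "document")

    val wordsDf = new Tokenizer().setInputCol("document").setOutputCol("words").transform(documentDf)
    val featurizedDf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").transform(wordsDf)
    val featuresDf = idfModel.setInputCol("rawFeatures").setOutputCol("features").transform(featurizedDf)

    featuresDf.select("update", "features").show(truncate = false) // or hand off to downstream logic
  }
}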

Saving Twitter streams into a single file with Spark Streaming, Scala

So after help from this answer, Spark Streaming: Join Dstream batches into single output Folder, I was able to create a single file for my Twitter streams. However, now I don't see any tweets being saved in this file. Please find my code snippet for this below. What am I doing wrong?
val ssc = new StreamingContext(sparkConf, Seconds(5))
val stream = TwitterUtils.createStream(ssc, None, filters)
val tweets = stream.map(r => r.getText)

tweets.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits._
  val df = rdd.map(t => Record(t)).toDF()
  df.save("com.databricks.spark.csv", SaveMode.Append, Map("path" -> "tweetstream.csv"))
}

ssc.start()
ssc.awaitTermination()
}
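For what it's worth, here is the same write expressed with the DataFrameWriter API introduced in Spark 1.4 (just a sketch; it does not by itself explain an empty output file, which also depends on whether the Twitter stream actually delivers tweets matching the filters):
tweets.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
    import sqlContext.implicits._
    rdd.map(t => Record(t)).toDF()
      .write
      .format("com.databricks.spark.csv")
      .mode(SaveMode.Append)
      .save("tweetstream.csv")
  }
}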