Transform an input stream into a stream of key-value pairs - Scala

I am new to Spark and Scala, so my question is probably rather easy, but I still struggle to find an answer. I need to join two Spark streams, but I have problems converting those streams to the appropriate format. Please see my code below:
val lines7 = ssc.socketTextStream("localhost", 9997)
val pairs7 = lines7.map(line => (line.split(" ")[0], line))
val lines8 = ssc.socketTextStream("localhost", 9998)
val pairs8 = lines8.map(line => (line.split(" ")[0], line))
val newStream = pairs7.join(pairs8)
This doesn't work because the join function expects streams in the format DStream[String, String], while the result of the map function is DStream[(String, String)].
And now my question is: how should I write this map function to get the appropriate output (a little explanation would also be great)?
Thanks in advance.

This works as expected:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(30))
val lines7 = ssc.socketTextStream("localhost", 9997)
val pairs7 = lines7.map(line => (line.split(" ")(0), line))
val lines8 = ssc.socketTextStream("localhost", 9998)
val pairs8 = lines8.map(line => (line.split(" ")(0), line))
val newStream = pairs7.join(pairs8)
newStream.foreachRDD(rdd => println(rdd.collect.map(_.toString).mkString(",")))
ssc.start
The only issue I see is a syntax error: line.split(" ")[0] should be line.split(" ")(0). Scala indexes collections with parentheses (which call apply), not square brackets, but I guess the compiler would flag that anyway.
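For reference, a standalone snippet showing the same transformation with the correct indexing:
val line = "key1 some payload"
val tokens = line.split(" ")   // Array[String]
// tokens(0) is shorthand for tokens.apply(0); tokens[0] does not compile in Scala
val key = tokens(0)            // "key1"
val pair = (key, line)         // (String, String), the element type the pair DStream carries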

Related

Unable to fetch specific content from tweets using Scala on spark-shell

I am working on Hortonworks. I have stored tweets from Twitter in a Kafka topic. I am performing sentiment analysis on the tweets, using Kafka as a producer and Spark as a consumer, with Scala on spark-shell. However, I want to fetch only specific content from each tweet: the text, the hashtag, whether the tweet is positive or negative, and the words from the tweet that I selected as positive or negative words. My training data is Data.txt.
I added these dependencies:
org.apache.spark:spark-streaming-kafka_2.10:1.6.2, org.apache.spark:spark-streaming_2.10:1.6.2
Here is my code:
import org.apache.spark._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka._
val conf = new SparkConf().setMaster("local[4]").setAppName("KafkaReceiver")
val ssc = new StreamingContext(conf, Seconds(5))
val zkQuorum="sandbox.hortonworks.com:2181"
val group="test-consumer-group"
val topics="test"
val numThreads=5
val args=Array(zkQuorum, group, topics, numThreads)
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val hashTags = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordSentimentFilePath = "hdfs://sandbox.hortonworks.com:8020/TwitterData/Data.txt"
val wordSentiments = ssc.sparkContext.textFile(wordSentimentFilePath).map { line =>
  val Array(word, happiness) = line.split("\t")
  (word, happiness)
}.cache()
val happiest60 = hashTags.map(hashTag => (hashTag.tail, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60))
  .transform { topicCount => wordSentiments.join(topicCount) }
  .map { case (topic, tuple) => (topic, tuple._1 * tuple._2) }
  .map { case (topic, happinessValue) => (happinessValue, topic) }
  .transform(_.sortByKey(false))
happiest60.print()
ssc.start()
I got output like this:
(negative,fear)
(positive,fitness)
I want output like this:
(#sports,Text from the Tweets,fitness,positive)
But I cannot figure out how to keep the tweet text and the hashtag together like that.
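One way to keep the tweet text around is to key each tweet by its hashtag words instead of discarding the text after flatMap. A minimal sketch of that idea, reusing lines and wordSentiments from the code above (taggedTweets, withSentiment and result are hypothetical names, and it assumes each Kafka record is the raw tweet text):
// Keep (hashtagWord, tweetText) pairs so the text survives the join
val taggedTweets = lines.flatMap { tweet =>
  tweet.split(" ").filter(_.startsWith("#")).map(tag => (tag.tail, tweet))
}
// Join against the (word, sentiment) pairs loaded from Data.txt
val withSentiment = taggedTweets.transform(rdd => rdd.join(wordSentiments))
// Reshape to (#hashtag, tweet text, word, sentiment)
val result = withSentiment.map { case (word, (tweetText, sentiment)) =>
  ("#" + word, tweetText, word, sentiment)
}
result.print()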

Issue while fetching specific content from tweets using Spark in Scala

I am working on Hortonworks. I have stored tweets from Twitter in a Kafka topic. I am performing sentiment analysis on the tweets, using Kafka as a producer and Spark as a consumer, with Scala on spark-shell. However, I want to fetch only specific content from each tweet: the text, the hashtag, whether the tweet is positive or negative, and the words from the tweet that I selected as positive or negative words. My training data is Data.txt.
Data.txt contains words and their positive/negative labels, separated by a tab, for example:
like      positive
doom      negative
doomed    negative
doubt     positive
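For clarity, each of those lines is later split on the tab character to produce a (word, sentiment) pair; a standalone illustration of that step (sampleLine is just an example value, mirroring the map over Data.txt in the code below):
// Parsing one Data.txt line into a (word, sentiment) pair
val sampleLine = "doom\tnegative"
val Array(word, happiness) = sampleLine.split("\t")
val pair = (word, happiness)   // ("doom", "negative")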
I added these dependencies: org.apache.spark:spark-streaming-kafka_2.10:1.6.2, org.apache.spark:spark-streaming_2.10:1.6.2
Here is my code:
import org.apache.spark._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka._
val conf = new SparkConf().setMaster("local[4]").setAppName("KafkaReceiver")
val ssc = new StreamingContext(conf, Seconds(5))
val zkQuorum="sandbox.hortonworks.com:2181"
val group="test-consumer-group"
val topics="test"
val numThreads=5
val args=Array(zkQuorum, group, topics, numThreads)
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val hashTags = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordSentimentFilePath = "hdfs://sandbox.hortonworks.com:8020/TwitterData/Data.txt"
val wordSentiments = ssc.sparkContext.textFile(wordSentimentFilePath).map { line =>
  val Array(word, happiness) = line.split("\t")
  (word, happiness)
}.cache()
val happiest60 = hashTags.map(hashTag => (hashTag.tail, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60))
  .transform { topicCount => wordSentiments.join(topicCount) }
  .map { case (topic, tuple) => (topic, tuple._1 * tuple._2) }
  .map { case (topic, happinessValue) => (happinessValue, topic) }
  .transform(_.sortByKey(false))
happiest60.print()
ssc.start()
I got output like this:
(negative,fear) (positive,fitness)
I want output like this:
(#sports,Text from the Tweets,fitness,positive)
But I cannot figure out how to keep the tweet text and the hashtag together like that.

How to use feature extraction with DStream in Apache Spark

I have data that arrives from Kafka through a DStream. I want to perform feature extraction on it in order to obtain some keywords.
I do not want to wait for all of the data to arrive (it is intended to be a continuous stream that potentially never ends), so I hope to perform the extraction in chunks; it does not matter to me if the accuracy suffers a bit.
So far I have put together something like this:
def extractKeywords(stream: DStream[Data]): Unit = {
  val spark: SparkSession = SparkSession.builder.getOrCreate
  val streamWithWords: DStream[(Data, Seq[String])] = stream map extractWordsFromData
  val streamWithFeatures: DStream[(Data, Array[String])] = streamWithWords transform extractFeatures(spark) _
  val streamWithKeywords: DStream[DataWithKeywords] = streamWithFeatures map addKeywordsToData
  streamWithFeatures.print()
}
def extractFeatures(spark: SparkSession)
                   (rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] = {
  val df = spark.createDataFrame(rdd).toDF("data", "words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(numOfFeatures)
  val rawFeatures = hashingTF.transform(df)
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(rawFeatures)
  val rescaledData = idfModel.transform(rawFeatures)
  import spark.implicits._
  rescaledData.select("data", "features").as[(Data, Array[String])].rdd
}
However, I received java.lang.IllegalStateException: Haven't seen any document yet. I am not surprised, as I am just trying to scrape things together, and I understand that since I am not waiting for any data to arrive, the generated model might be empty when I try to use it on the data.
What would be the right approach for this problem?
I used the advice from the comments and split the procedure into two runs:
one that computes the IDF model and saves it to a file
def trainFeatures(idfModelFile: File, rdd: RDD[(String, Seq[String])]) = {
  val session: SparkSession = SparkSession.builder.getOrCreate
  val wordsDf = session.createDataFrame(rdd).toDF("data", "words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
  val featurizedDf = hashingTF.transform(wordsDf)
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(featurizedDf)
  idfModel.write.save(idfModelFile.getAbsolutePath)
}
one that reads the IDF model from the file and simply applies it to all incoming data
val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)
val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
val wordsDf = tokenizer.transform(documentDf)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")
val featuresDf = extractor.transform(featurizedDf)
featuresDf.select("update", "features")
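For completeness, a minimal sketch of how that second part could be run on every micro-batch of the stream (applyIdfModel is a hypothetical name; the column names and the (update, document) pair layout follow the snippet above):
import org.apache.spark.ml.feature.{HashingTF, IDFModel, Tokenizer}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Applies the pre-trained IDF model to one micro-batch of (update, document) pairs
def applyIdfModel(spark: SparkSession, idfModel: IDFModel)(rdd: RDD[(String, String)]): Unit = {
  if (!rdd.isEmpty()) {
    val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
    val wordsDf = new Tokenizer().setInputCol("document").setOutputCol("words").transform(documentDf)
    val featurizedDf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").transform(wordsDf)
    val featuresDf = idfModel.setInputCol("rawFeatures").setOutputCol("features").transform(featurizedDf)
    featuresDf.select("update", "features").show(false)
  }
}
// Usage inside the streaming job, where stream is a DStream[(String, String)]:
// stream.foreachRDD(applyIdfModel(spark, idfModel) _)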

Why does saveAsTextFile do nothing?

I'm trying to implement a simple WordCount in Scala + Spark. Here is my code:
import org.apache.spark.{SparkConf, SparkContext}

object FirstObject {
  def main(args: Array[String]) {
    val input = "/Data/input"
    val conf = new SparkConf().setAppName("Simple Application")
      .setMaster("spark://192.168.1.162:7077")
    val sparkContext = new SparkContext(conf)
    val text = sparkContext.textFile(input).cache()
    val wordCounts = text.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
      .sortByKey()
    wordCounts.saveAsTextFile("/Data/output")
  }
}
This job runs for 54 s and, in the end, does nothing: it does not write any output to /Data/output.
Also, if I replace saveAsTextFile with foreach(println), it produces the desired output.
You should check your user's permissions on the /Data/output folder.
The user running the job needs write access to that folder.
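As a quick sanity check (just a sketch; the probe path is a placeholder), you can try writing a tiny RDD to the same location from the same user and look at the exception if it fails:
// If this throws a permission-related error, the problem is the output location,
// not the WordCount logic. Note that saveAsTextFile also fails if the target
// directory already exists.
val probe = sparkContext.parallelize(Seq("write-test"))
probe.saveAsTextFile("/Data/output-write-test")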

Apache Spark - Reduce Step Output (K, (V1, V2, V3, ...))

I have a chronological sequence of events (T1,K1,V1),(T2,K2,V3),(T3,K1,V2),(T4,K2,V4),(T5,K1,V5).
Both keys and values are strings.
I am trying to achieve the following using Spark
K1,(V1,V2,V5)
K2,(V3,V4)
This is what I tried
val inputFile = args(0)
val outputFile = args(1)
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
val rdd1 = sc.textFile(inputFile, 2).cache()
val rdd2 = rdd1.map { line =>
  val fields = line.split(" ")
  val key = fields(1)
  val v = fields(2)
  (key, v)
}
// TODO : rdd2.reduce to get the output I want
rdd2.saveAsTextFile(outputFile)
Could someone please point me towards how to get the reducer to produce the output I want? Many thanks in advance.
You just need to group your RDD by key to achieve the desired output: rdd2.groupByKey
This small spark-shell session illustrates the usage:
val events = List(("t1","k1","v1"), ("t2","k2","v3"), ("t3","k1","v2"), ("t4","k2","v4"), ("t5","k1","v5"))
val rdd = sc.parallelize(events)
val kv = rdd.map{case (t,k,v) => (k,v)}
val grouped = kv.groupByKey
// show the collection ('collect' used here only to show the contents)
grouped.collect
res0: Array[(String, Iterable[String])] = Array((k1,ArrayBuffer(v1, v2, v5)), (k2,ArrayBuffer(v3, v4)))
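Since the original code ends with saveAsTextFile, a short follow-up (continuing the same spark-shell session) showing one way to render each group as K,(V1,V2,...) before writing it out; the output path is a placeholder:
// Format each key's values as "K,(V1,V2,...)", one line per key
val formatted = grouped.map { case (k, vs) => s"$k,(${vs.mkString(",")})" }
// For the sample data this yields lines like k1,(v1,v2,v5) and k2,(v3,v4)
formatted.saveAsTextFile("/path/to/output")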