Transform an input stream into a stream of key-value pairs - Scala

I am new to Spark and Scala, so my question is probably rather easy, but I still struggle to find an answer. I need to join two Spark streams, but I have problems converting those streams to the appropriate format. Please see my code below:
val lines7 = ssc.socketTextStream("localhost", 9997)
val pairs7 = lines7.map(line => (line.split(" ")[0], line))
val lines8 = ssc.socketTextStream("localhost", 9998)
val pairs8 = lines8.map(line => (line.split(" ")[0], line))
val newStream = pairs7.join(pairs8)
This doesn't work because the join function expects streams in the format DStream[String, String], while the result of the map function is DStream[(String, String)].
And now my question is: how should I write this map function to get the appropriate output (a little explanation would also be great)?
Thanks in advance.

This works as expected:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(30))
val lines7 = ssc.socketTextStream("localhost", 9997)
val pairs7 = lines7.map(line => (line.split(" ")(0), line))
val lines8 = ssc.socketTextStream("localhost", 9998)
val pairs8 = lines8.map(line => (line.split(" ")(0), line))
val newStream = pairs7.join(pairs8)
newStream.foreachRDD(rdd => println(rdd.collect.map(_.toString).mkString(",")))
ssc.start
The only issue I see is a syntax error: line.split(" ")[0] should be line.split(" ")(0). Scala indexes collections with parentheses (which call apply), not square brackets, but I guess the compiler would flag that anyway.
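For reference, a standalone snippet showing the same transformation with the correct indexing:
val line = "key1 some payload"
val tokens = line.split(" ")   // Array[String]
// tokens(0) is shorthand for tokens.apply(0); tokens[0] does not compile in Scala
val key = tokens(0)            // "key1"
val pair = (key, line)         // (String, String), the element type the pair DStream carries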

Related

Unable to fetch specific content from tweets using Scala on spark-shell

I am working on Hortonworks. I have stored tweets from Twitter in a Kafka topic. I am performing sentiment analysis on the tweets, using Kafka as a producer and Spark as a consumer, with Scala on spark-shell. However, I want to fetch only specific content from each tweet: the text, the hashtag, whether the tweet is positive or negative, and the words from the tweet that I selected as positive or negative words. My training data is Data.txt.
I added these dependencies:
org.apache.spark:spark-streaming-kafka_2.10:1.6.2, org.apache.spark:spark-streaming_2.10:1.6.2
Here is my code:
import org.apache.spark._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka._
val conf = new SparkConf().setMaster("local[4]").setAppName("KafkaReceiver")
val ssc = new StreamingContext(conf, Seconds(5))
val zkQuorum="sandbox.hortonworks.com:2181"
val group="test-consumer-group"
val topics="test"
val numThreads=5
val args=Array(zkQuorum, group, topics, numThreads)
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val hashTags = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordSentimentFilePath = "hdfs://sandbox.hortonworks.com:8020/TwitterData/Data.txt"
val wordSentiments = ssc.sparkContext.textFile(wordSentimentFilePath).map { line =>
  val Array(word, happiness) = line.split("\t")
  (word, happiness)
}.cache()
val happiest60 = hashTags.map(hashTag => (hashTag.tail, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60))
  .transform { topicCount => wordSentiments.join(topicCount) }
  .map { case (topic, tuple) => (topic, tuple._1 * tuple._2) }
  .map { case (topic, happinessValue) => (happinessValue, topic) }
  .transform(_.sortByKey(false))
happiest60.print()
ssc.start()
I got output like this:
(negative,fear)
(positive,fitness)
I want output like this:
(#sports,Text from the Tweets,fitness,positive)
But I cannot figure out how to keep the tweet text and the hashtag together like that.
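One way to keep the tweet text around is to key each tweet by its hashtag words instead of discarding the text after flatMap. A minimal sketch of that idea, reusing lines and wordSentiments from the code above (taggedTweets, withSentiment and result are hypothetical names, and it assumes each Kafka record is the raw tweet text):
// Keep (hashtagWord, tweetText) pairs so the text survives the join
val taggedTweets = lines.flatMap { tweet =>
  tweet.split(" ").filter(_.startsWith("#")).map(tag => (tag.tail, tweet))
}
// Join against the (word, sentiment) pairs loaded from Data.txt
val withSentiment = taggedTweets.transform(rdd => rdd.join(wordSentiments))
// Reshape to (#hashtag, tweet text, word, sentiment)
val result = withSentiment.map { case (word, (tweetText, sentiment)) =>
  ("#" + word, tweetText, word, sentiment)
}
result.print()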

Issue while fetching specific content from tweets using Spark in Scala

I am working on Hortonworks. I have stored tweets from Twitter in a Kafka topic. I am performing sentiment analysis on the tweets, using Kafka as a producer and Spark as a consumer, with Scala on spark-shell. However, I want to fetch only specific content from each tweet: the text, the hashtag, whether the tweet is positive or negative, and the words from the tweet that I selected as positive or negative words. My training data is Data.txt.
Data.txt contains words and their positive/negative labels, separated by a tab, for example:
like      positive
doom      negative
doomed    negative
doubt     positive
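For clarity, each of those lines is later split on the tab character to produce a (word, sentiment) pair; a standalone illustration of that step (sampleLine is just an example value, mirroring the map over Data.txt in the code below):
// Parsing one Data.txt line into a (word, sentiment) pair
val sampleLine = "doom\tnegative"
val Array(word, happiness) = sampleLine.split("\t")
val pair = (word, happiness)   // ("doom", "negative")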
I added these dependencies: org.apache.spark:spark-streaming-kafka_2.10:1.6.2, org.apache.spark:spark-streaming_2.10:1.6.2
Here is my code:
import org.apache.spark._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka._
val conf = new SparkConf().setMaster("local[4]").setAppName("KafkaReceiver")
val ssc = new StreamingContext(conf, Seconds(5))
val zkQuorum="sandbox.hortonworks.com:2181"
val group="test-consumer-group"
val topics="test"
val numThreads=5
val args=Array(zkQuorum, group, topics, numThreads)
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val hashTags = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordSentimentFilePath = "hdfs://sandbox.hortonworks.com:8020/TwitterData/Data.txt"
val wordSentiments = ssc.sparkContext.textFile(wordSentimentFilePath).map { line =>
  val Array(word, happiness) = line.split("\t")
  (word, happiness)
}.cache()
val happiest60 = hashTags.map(hashTag => (hashTag.tail, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60))
  .transform { topicCount => wordSentiments.join(topicCount) }
  .map { case (topic, tuple) => (topic, tuple._1 * tuple._2) }
  .map { case (topic, happinessValue) => (happinessValue, topic) }
  .transform(_.sortByKey(false))
happiest60.print()
ssc.start()
I got output like this:
(negative,fear) (positive,fitness)
I want output like this:
(#sports,Text from the Tweets,fitness,positive)
But I cannot figure out how to keep the tweet text and the hashtag together like that.

How to use feature extraction with DStream in Apache Spark

I have data that arrives from Kafka through a DStream. I want to perform feature extraction on it in order to obtain some keywords.
I do not want to wait for all of the data to arrive (it is intended to be a continuous stream that potentially never ends), so I hope to perform the extraction in chunks; it does not matter to me if the accuracy suffers a bit.
So far I have put together something like this:
def extractKeywords(stream: DStream[Data]): Unit = {
  val spark: SparkSession = SparkSession.builder.getOrCreate
  val streamWithWords: DStream[(Data, Seq[String])] = stream map extractWordsFromData
  val streamWithFeatures: DStream[(Data, Array[String])] = streamWithWords transform extractFeatures(spark) _
  val streamWithKeywords: DStream[DataWithKeywords] = streamWithFeatures map addKeywordsToData
  streamWithFeatures.print()
}
def extractFeatures(spark: SparkSession)
                   (rdd: RDD[(Data, Seq[String])]): RDD[(Data, Array[String])] = {
  val df = spark.createDataFrame(rdd).toDF("data", "words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(numOfFeatures)
  val rawFeatures = hashingTF.transform(df)
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(rawFeatures)
  val rescaledData = idfModel.transform(rawFeatures)
  import spark.implicits._
  rescaledData.select("data", "features").as[(Data, Array[String])].rdd
}
However, I received java.lang.IllegalStateException: Haven't seen any document yet. I am not surprised, as I am just trying to scrape things together, and I understand that since I am not waiting for any data to arrive, the generated model might be empty when I try to use it on the data.
What would be the right approach for this problem?
I used the advice from the comments and split the procedure into two runs:
one that computes the IDF model and saves it to a file
def trainFeatures(idfModelFile: File, rdd: RDD[(String, Seq[String])]) = {
  val session: SparkSession = SparkSession.builder.getOrCreate
  val wordsDf = session.createDataFrame(rdd).toDF("data", "words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
  val featurizedDf = hashingTF.transform(wordsDf)
  val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
  val idfModel = idf.fit(featurizedDf)
  idfModel.write.save(idfModelFile.getAbsolutePath)
}
one that reads the IDF model from the file and simply applies it to all incoming data
val idfModel = IDFModel.load(idfModelFile.getAbsolutePath)
val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
val tokenizer = new Tokenizer().setInputCol("document").setOutputCol("words")
val wordsDf = tokenizer.transform(documentDf)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val featurizedDf = hashingTF.transform(wordsDf)
val extractor = idfModel.setInputCol("rawFeatures").setOutputCol("features")
val featuresDf = extractor.transform(featurizedDf)
featuresDf.select("update", "features")
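For completeness, a minimal sketch of how that second part could be run on every micro-batch of the stream (applyIdfModel is a hypothetical name; the column names and the (update, document) pair layout follow the snippet above):
import org.apache.spark.ml.feature.{HashingTF, IDFModel, Tokenizer}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Applies the pre-trained IDF model to one micro-batch of (update, document) pairs
def applyIdfModel(spark: SparkSession, idfModel: IDFModel)(rdd: RDD[(String, String)]): Unit = {
  if (!rdd.isEmpty()) {
    val documentDf = spark.createDataFrame(rdd).toDF("update", "document")
    val wordsDf = new Tokenizer().setInputCol("document").setOutputCol("words").transform(documentDf)
    val featurizedDf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").transform(wordsDf)
    val featuresDf = idfModel.setInputCol("rawFeatures").setOutputCol("features").transform(featurizedDf)
    featuresDf.select("update", "features").show(false)
  }
}
// Usage inside the streaming job, where stream is a DStream[(String, String)]:
// stream.foreachRDD(applyIdfModel(spark, idfModel) _)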

Why does saveAsTextFile do nothing?

I'm trying to implement a simple WordCount in Scala + Spark. Here is my code:
import org.apache.spark.{SparkConf, SparkContext}

object FirstObject {
  def main(args: Array[String]) {
    val input = "/Data/input"
    val conf = new SparkConf().setAppName("Simple Application")
      .setMaster("spark://192.168.1.162:7077")
    val sparkContext = new SparkContext(conf)
    val text = sparkContext.textFile(input).cache()
    val wordCounts = text.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
      .sortByKey()
    wordCounts.saveAsTextFile("/Data/output")
  }
}
This job runs for 54 s and, in the end, does nothing: it does not write any output to /Data/output.
Also, if I replace saveAsTextFile with foreach(println), it produces the desired output.
You should check your user's permissions on the /Data/output folder.
The user running the job needs write access to that folder.
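As a quick sanity check (just a sketch; the probe path is a placeholder), you can try writing a tiny RDD to the same location from the same user and look at the exception if it fails:
// If this throws a permission-related error, the problem is the output location,
// not the WordCount logic. Note that saveAsTextFile also fails if the target
// directory already exists.
val probe = sparkContext.parallelize(Seq("write-test"))
probe.saveAsTextFile("/Data/output-write-test")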

Apache Spark - Reduce Step Output (K, (V1, V2, V3, ...))

I have a chronological sequence of events (T1,K1,V1),(T2,K2,V3),(T3,K1,V2),(T4,K2,V4),(T5,K1,V5).
Both keys and values are strings.
I am trying to achieve the following using Spark
K1,(V1,V2,V5)
K2,(V3,V4)
This is what I tried
val inputFile = args(0)
val outputFile = args(1)
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
val rdd1 = sc.textFile(inputFile, 2).cache()
val rdd2 = rdd1.map { line =>
  val fields = line.split(" ")
  val key = fields(1)
  val v = fields(2)
  (key, v)
}
// TODO : rdd2.reduce to get the output I want
rdd2.saveAsTextFile(outputFile)
Could someone please point me towards how to get the reducer to produce the output I want? Many thanks in advance.
You just need to group your RDD by key to achieve the desired output: rdd2.groupByKey
This small spark-shell session illustrates the usage:
val events = List(("t1","k1","v1"), ("t2","k2","v3"), ("t3","k1","v2"), ("t4","k2","v4"), ("t5","k1","v5"))
val rdd = sc.parallelize(events)
val kv = rdd.map{case (t,k,v) => (k,v)}
val grouped = kv.groupByKey
// show the collection ('collect' used here only to show the contents)
grouped.collect
res0: Array[(String, Iterable[String])] = Array((k1,ArrayBuffer(v1, v2, v5)), (k2,ArrayBuffer(v3, v4)))
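Since the original code ends with saveAsTextFile, a short follow-up (continuing the same spark-shell session) showing one way to render each group as K,(V1,V2,...) before writing it out; the output path is a placeholder:
// Format each key's values as "K,(V1,V2,...)", one line per key
val formatted = grouped.map { case (k, vs) => s"$k,(${vs.mkString(",")})" }
// For the sample data this yields lines like k1,(v1,v2,v5) and k2,(v3,v4)
formatted.saveAsTextFile("/path/to/output")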