Unable to fetch specific content from tweets using scala on spark shell - scala

I am working on Hortonworks. I have stored tweets from Twitter in a Kafka topic. I am performing sentiment analysis on the tweets using Kafka as a producer and Spark as a consumer, with Scala on the Spark shell. But I want to fetch only specific content from the tweets: the text, the hashtag, whether the tweet is positive or negative, and the words from the tweet that I selected as positive or negative words. My training data is Data.txt.
I added these dependencies:
org.apache.spark:spark-streaming-kafka_2.10:1.6.2
org.apache.spark:spark-streaming_2.10:1.6.2
Here is my code:
import org.apache.spark._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka._
val conf = new SparkConf().setMaster("local[4]").setAppName("KafkaReceiver")
val ssc = new StreamingContext(conf, Seconds(5))
val zkQuorum="sandbox.hortonworks.com:2181"
val group="test-consumer-group"
val topics="test"
val numThreads=5
val args=Array(zkQuorum, group, topics, numThreads)
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val hashTags = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordSentimentFilePath = "hdfs://sandbox.hortonworks.com:8020/TwitterData/Data.txt"
// load the word/sentiment training data from HDFS and cache it for reuse in every batch
val wordSentiments = ssc.sparkContext.textFile(wordSentimentFilePath).map { line =>
  val Array(word, happiness) = line.split("\t")
  (word, happiness)
}.cache()
// count hashtags over a 60-second window and join them with the sentiment words
val happiest60 = hashTags.map(hashTag => (hashTag.tail, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60))
  .transform { topicCount => wordSentiments.join(topicCount) }
  .map { case (topic, tuple) => (topic, tuple._1 * tuple._2) }
  .map { case (topic, happinessValue) => (happinessValue, topic) }
  .transform(_.sortByKey(false))
happiest60.print()
ssc.start()
I got output like this:
(negative,fear)
(positive,fitness)
I want output like this:
(#sports,Text from the Tweets,fitness,positive)
But I cannot figure out how to carry the tweet text and the hashtag through to the output like that.
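One possible direction (a rough, untested sketch that reuses the lines and wordSentiments values defined above): instead of flatMapping the tweet text away, emit one (word, (hashtag, text)) record per hashtag/word combination, join that against the sentiment file, and keep all four fields in the result.

// Sketch only: keep the tweet text and hashtag together with each word so the
// sentiment join can return all of them. Field order matches the desired output.
val wordsWithContext = lines.flatMap { text =>
  val tags  = text.split(" ").filter(_.startsWith("#"))
  val words = text.split(" ").map(_.toLowerCase)
  for {
    tag  <- tags
    word <- words
  } yield (word, (tag, text))
}

val labelled = wordsWithContext.transform { rdd =>
  wordSentiments.join(rdd).map { case (word, (sentiment, (tag, text))) =>
    (tag, text, word, sentiment)   // e.g. (#sports, <tweet text>, fitness, positive)
  }
}
labelled.print()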

Related

Issue while fetching specific content from tweets using spark in scala

I am working on Hortonworks. I have stored tweets from Twitter in a Kafka topic. I am performing sentiment analysis on the tweets using Kafka as a producer and Spark as a consumer, with Scala on the Spark shell. But I want to fetch only specific content from the tweets: the text, the hashtag, whether the tweet is positive or negative, and the words from the tweet that I selected as positive or negative words. My training data is Data.txt.
Data.txt contains words and their positive or negative label, separated by a tab, for example:
like positive
doom negative
doomed negative
doubt positive
I added these dependencies:
org.apache.spark:spark-streaming-kafka_2.10:1.6.2
org.apache.spark:spark-streaming_2.10:1.6.2
Here is my code:
import org.apache.spark._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka._
val conf = new SparkConf().setMaster("local[4]").setAppName("KafkaReceiver")
val ssc = new StreamingContext(conf, Seconds(5))
val zkQuorum="sandbox.hortonworks.com:2181"
val group="test-consumer-group"
val topics="test"
val numThreads=5
val args=Array(zkQuorum, group, topics, numThreads)
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val hashTags = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordSentimentFilePath = "hdfs://sandbox.hortonworks.com:8020/TwitterData/Data.txt"
// load the word/sentiment training data from HDFS and cache it for reuse in every batch
val wordSentiments = ssc.sparkContext.textFile(wordSentimentFilePath).map { line =>
  val Array(word, happiness) = line.split("\t")
  (word, happiness)
}.cache()
// count hashtags over a 60-second window and join them with the sentiment words
val happiest60 = hashTags.map(hashTag => (hashTag.tail, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60))
  .transform { topicCount => wordSentiments.join(topicCount) }
  .map { case (topic, tuple) => (topic, tuple._1 * tuple._2) }
  .map { case (topic, happinessValue) => (happinessValue, topic) }
  .transform(_.sortByKey(false))
happiest60.print()
ssc.start()
I got output like this:
(negative,fear)
(positive,fitness)
I want output like this:
(#sports,Text from the Tweets,fitness,positive)
But I cannot figure out how to carry the tweet text and the hashtag through to the output like that.

Reuse kafka producer in Spark Streaming

We have a Spark Streaming application (code below) that sources data from Kafka and does some transformations (on each message) before inserting the data into MongoDB. We have a middleware application that pushes the messages (in bulk) into Kafka and waits for an acknowledgement (for each message) from the Spark Streaming application. If the middleware does not receive the acknowledgement within a certain period of time (5 seconds) after sending a message into Kafka, it re-sends the message. The Spark Streaming application is able to receive around 50-100 messages (in one batch) and send acknowledgements for all of them in under 5 seconds. But if the middleware application pushes over 100 messages, the delay in Spark Streaming sending the acknowledgements causes the middleware to re-send messages. In our current implementation, we create the producer each time we want to send an acknowledgement, which takes 3-4 seconds.
package com.testing
import org.apache.spark.streaming._
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.{ SQLContext, Row, Column, DataFrame }
import java.util.HashMap
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.joda.time._
import org.joda.time.format._
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
import com.mongodb.util.JSON
import scala.io.Source._
import java.util.Properties
import java.util.Calendar
import scala.collection.immutable
import org.json4s.DefaultFormats
object Sample_Streaming {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Sample_Streaming")
.setMaster("local[4]")
val sc = new SparkContext(sparkConf)
sc.setLogLevel("ERROR")
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))
val props = new HashMap[String, Object]()
val bootstrap_server_config = "127.0.0.100:9092"
val zkQuorum = "127.0.0.101:2181"
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
val TopicMap = Map("sampleTopic" -> 1)
val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)
val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
.option("spark.mongodb.input.uri", "connectionURI")
.option("spark.mongodb.input.collection", "schemaCollectionName")
.load()
val outSchema = schemaDf.schema
var outDf = sqlContext.createDataFrame(sc.emptyRDD[Row], outSchema)
KafkaDstream.foreachRDD(rdd => rdd.collect().map { x =>
  {
    val jsonInput: JValue = parse(x)
    /*Do all the transformations using Json libraries*/
    val json4s_transformed = "transformed json"
    val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
    val df = sqlContext.read.schema(outSchema).json(rdd)
    df.write.option("spark.mongodb.output.uri", "connectionURI")
      .option("collection", "Collection")
      .mode("append").format("com.mongodb.spark.sql").save()
    val producer = new KafkaProducer[String, String](props)
    val message = new ProducerRecord[String, String]("topic_name", null, "message_received")
    producer.send(message)
    producer.close()
  }
})
// Run the streaming job
ssc.start()
ssc.awaitTermination()
}
}
So we tried another approach: creating the producer outside of the foreachRDD and reusing it for the entire batch interval (code below). This seems to have helped, as we are no longer creating the producer each time we want to send an acknowledgement. But for some reason, when we monitor the application in the Spark UI, the streaming application's memory consumption is increasing steadily, which was not the case before. We tried using the --num-executors 1 option in spark-submit to limit the number of executors started by YARN.
object Sample_Streaming {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Sample_Streaming")
.setMaster("local[4]")
val sc = new SparkContext(sparkConf)
sc.setLogLevel("ERROR")
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sc, Seconds(1))
val props = new HashMap[String, Object]()
val bootstrap_server_config = "127.0.0.100:9092"
val zkQuorum = "127.0.0.101:2181"
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
val TopicMap = Map("sampleTopic" -> 1)
val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)
val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
.option("spark.mongodb.input.uri", "connectionURI")
.option("spark.mongodb.input.collection", "schemaCollectionName")
.load()
val outSchema = schemaDf.schema
val producer = new KafkaProducer[String, String](props)
KafkaDstream.foreachRDD { rdd =>
  rdd.collect().map { x =>
    val jsonInput: JValue = parse(x)
    /*Do all the transformations using Json libraries*/
    val json4s_transformed = "transformed json"
    val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
    val df = sqlContext.read.schema(outSchema).json(rdd)
    df.write.option("spark.mongodb.output.uri", "connectionURI")
      .option("collection", "Collection")
      .mode("append").format("com.mongodb.spark.sql").save()
    val message = new ProducerRecord[String, String]("topic_name", null, "message_received")
    producer.send(message)
    producer.close()
  }
}
// Run the streaming job
ssc.start()
ssc.awaitTermination()
}
}
My questions are:
How do I monitor the Spark application's memory consumption? Currently we are manually monitoring the application every 5 minutes until it exhausts the memory available in our cluster (2 nodes, 16 GB each).
What are the best practices followed in the industry when using Spark Streaming and Kafka?
Kafka is a broker: it gives you delivery guarantees for the producer and the consumer. It is overkill to implement an 'over the top' acknowledgement mechanism between the producer and the consumer. Ensure that the producer behaves correctly and that the consumer can recover in case of failure, and end-to-end delivery will be ensured.
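For example (an illustrative sketch, not the poster's code; the broker address and topic name are taken from the question above), the producer side can lean on Kafka's own acknowledgement and retry settings instead of a hand-rolled protocol:

import java.util.Properties
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.100:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.ACKS_CONFIG, "all")   // wait for the in-sync replicas to acknowledge each send
props.put(ProducerConfig.RETRIES_CONFIG, "3")  // let the client retry transient broker failures itself

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("topic_name", "message_received"))
producer.flush()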
Regarding the job, it is no wonder its performance is poor: the processing is being done sequentially, element by element, up to the point of the write to the external DB. This is plain wrong and should be addressed before attempting to fix any memory consumption issues.
This process could be improved like:
val producer = // create producer

val jsonDStream = kafkaDstream.transform { rdd =>
  rdd.map { elem =>
    val json = parse(elem)
    render(doAllTransformations(json)) // output should be a String-formatted JSON object
  }
}

jsonDStream.foreachRDD { rdd =>
  val df = sqlContext.read.schema(outSchema).json(rdd) // transform the complete collection, not element by element
  df.write.option("spark.mongodb.output.uri", "connectionURI") // write in bulk, not one by one
    .option("collection", "Collection")
    .mode("append").format("com.mongodb.spark.sql").save()
  val msg = // create message
  producer.send(msg)
  producer.flush() // force send. *DO NOT Close* otherwise it will not be able to send any more messages
}
This process could be improved further by replacing the string-centric JSON transformations with case class instances.
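A small sketch of that idea (the field names here are invented purely for illustration; json4s is already imported in the code above):

import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.json4s.jackson.Serialization.write

// Hypothetical message shapes - adjust the fields to the real payload.
case class Event(id: String, payload: String)
case class Enriched(id: String, payload: String, processedAt: Long)

implicit val formats: Formats = DefaultFormats

// JSON string -> case class -> enriched case class -> JSON string,
// with no manual string manipulation in between.
def transformMessage(raw: String): String = {
  val event = parse(raw).extract[Event]
  write(Enriched(event.id, event.payload, System.currentTimeMillis))
}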

Saving twitter streams into a single file with spark streaming, scala

So, after help from this answer (Spark Streaming : Join Dstream batches into single output Folder), I was able to create a single file for my Twitter streams. However, now I don't see any tweets being saved in this file. Please find my code snippet for this below. What am I doing wrong?
val ssc = new StreamingContext(sparkConf, Seconds(5))
val stream = TwitterUtils.createStream(ssc, None, filters)
val tweets = stream.map(r => r.getText)
tweets.foreachRDD{rdd =>
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._
val df = rdd.map(t => Record(t)).toDF()
df.save("com.databricks.spark.csv", SaveMode.Append, Map("path" -> "tweetstream.csv"))
}
ssc.start()
ssc.awaitTermination()
}
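A variant worth trying (a sketch only, assuming the same Record case class, SQLContextSingleton helper, and spark-csv package as above, and a Spark version with the DataFrameWriter API): write each batch through df.write and skip empty micro-batches, which also makes it obvious whether the stream is delivering any tweets at all.

import org.apache.spark.sql.SaveMode

tweets.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {                 // skip empty micro-batches
    val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
    import sqlContext.implicits._
    rdd.map(t => Record(t)).toDF()
      .write
      .format("com.databricks.spark.csv")
      .mode(SaveMode.Append)
      .save("tweetstream.csv")
  } else {
    println("Empty batch - no tweets received in this interval")
  }
}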

Transform input stream to key-values pairs stream

I am new to Spark and Scala, so my question is probably rather easy, but I still struggle to find an answer. I need to join two Spark streams, but I have problems converting those streams to the appropriate format. Please see my code below:
val lines7 = ssc.socketTextStream("localhost", 9997)
val pairs7 = lines7.map(line => (line.split(" ")[0], line))
val lines8 = ssc.socketTextStream("localhost", 9998)
val pairs8 = lines8.map(line => (line.split(" ")[0], line))
val newStream = pairs7.join(pairs8)
This doesn't work because the join function expects streams of the form DStream[String, String], while the result of the map function is DStream[(String, String)].
And now my question is: how do I write this map function to get the appropriate output (a little explanation would also be great)?
Thanks in advance.
This works as expected:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(30))
val lines7 = ssc.socketTextStream("localhost", 9997)
val pairs7 = lines7.map(line => (line.split(" ")(0), line))
val lines8 = ssc.socketTextStream("localhost", 9998)
val pairs8 = lines8.map(line => (line.split(" ")(0), line))
val newStream = pairs7.join(pairs8)
newStream.foreachRDD(rdd => println(rdd.collect.map(_.toString).mkString(",")))
ssc.start
The only issue I see is a syntax error: line.split(" ")[0] should be line.split(" ")(0), but I guess that would be caught by the compiler.
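As a quick illustration of that point (a standalone snippet, independent of the streaming code): Scala arrays are indexed with apply, i.e. parentheses, while square brackets are reserved for type parameters.

val parts = "a b c".split(" ")
parts(0)        // "a" - element access via apply()
// parts[0]     // does not compile: [] would be parsed as a type argument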

How to deserialize Flume's Avro events coming to Spark?

I have a Flume Avro sink and a Spark Streaming program that reads from the sink.
CDH 5.1, Flume 1.5.0, Spark 1.0, using Scala as the programming language on Spark.
I was able to run the Spark example and count the Flume Avro events.
However, I was not able to deserialize the Flume Avro event into a string/text and then parse the structured row.
Does anyone have an example of how to do so using Scala?
You can deserialize the Flume events with the code below:
val eventBody = stream.map(e => new String(e.event.getBody.array))
Here's an example of a Spark Streaming application for analyzing popular hashtags from Twitter, using a Flume Twitter source and an Avro sink to push the events to Spark:
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.twitter._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.{ SparkContext, SparkConf }
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume._
object PopularHashTags {
val conf = new SparkConf().setMaster("local[4]").setAppName("PopularHashTags").set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
def main(args: Array[String]) {
sc.setLogLevel("WARN")
System.setProperty("twitter4j.oauth.consumerKey", <consumerKey>)
System.setProperty("twitter4j.oauth.consumerSecret", <consumerSecret>)
System.setProperty("twitter4j.oauth.accessToken", <accessToken>)
System.setProperty("twitter4j.oauth.accessTokenSecret", <accessTokenSecret>)
val ssc = new StreamingContext(sc, Seconds(5))
val filter = args.takeRight(args.length)
val stream = FlumeUtils.createStream(ssc, <hostname>, <port>)
val tweets = stream.map(e => new String(e.event.getBody.array))
val hashTags = tweets.flatMap(status => status.split(" ").filter(_.startsWith("#")))
val topCounts60 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60))
.map { case (topic, count) => (count, topic) }
.transform(_.sortByKey(false))
// Print popular hashtags
topCounts60.foreachRDD(rdd => {
val topList = rdd.take(10)
println("\nPopular topics in last 60 seconds (%s total):".format(rdd.count()))
topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
})
stream.count().map(cnt => "Received " + cnt + " flume events.").print()
ssc.start()
ssc.awaitTermination()
}
}
You can implement a custom decoder in order to deserialize. Provide the expected type information along with it.
Try the code below:
stream.map(e => "Event:header:" + e.event.get(0).toString
+ "body: " + new String(e.event.getBody.array)).print