Apache Spark - Tweet Processing - Scala

Given a huge dataset of tweets I need to:
extract and count the hashtags,
extract and count the emoticons/emojis,
extract and count the words (lemmas).
So, the first thing that came to my mind is doing something like this:
val tweets = sparkContext.textFile(DATASET).cache

val hashtags = tweets
  .map(extractHashTags)
  .map((_, 1))
  .reduceByKey(_ + _)

val emoticonsEmojis = tweets
  .map(extractEmoticonsEmojis)
  .map((_, 1))
  .reduceByKey(_ + _)

val lemmas = tweets
  .map(extractLemmas)
  .map((_, 1))
  .reduceByKey(_ + _)
But this way each tweet is processed 3 times, is that right? If so, is there an efficient way to count all these elements separately while processing each tweet only once?
I was thinking of something like this:
sparkContext.textFile(DATASET)
  .map(extractor) // RDD[(List[String], List[String], List[String])]
But this way it becomes a nightmare, also because once I count the words (I am referring to the third requirement) I need to join the result with another RDD, which is very simple in the first version but not in the second.

Perhaps something like this?
sealed trait TokenType
object Hashtag extends TokenType
object Emoji extends TokenType
object Word extends TokenType

def extractTokens(tweet: String): Seq[(TokenType, String)] = {
  ...
}

val tokenCounts = tweets
  .flatMap(extractTokens)
  .map((_, 1))
  .reduceByKey(_ + _)
val hashtagCounts = tokenCounts.collect { case ((Hashtag, x), count) => (x, count) }
// similar for emojis and words
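For completeness, a minimal sketch of how the remaining counts and the join you mention could be recovered from tokenCounts; otherRdd below is a hypothetical stand-in for whatever RDD you need to join against:

// same pattern as hashtagCounts, using the partial-function overload of RDD.collect
val emojiCounts = tokenCounts.collect { case ((Emoji, x), count) => (x, count) }
val lemmaCounts = tokenCounts.collect { case ((Word, x), count) => (x, count) }

// lemmaCounts has the same shape as the original 'lemmas' (RDD[(String, Int)]),
// so the join stays as simple as in the first version:
// val joined = lemmaCounts.join(otherRdd)   // otherRdd: RDD[(String, V)], hypothetical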

Using the Dataset API:
val tweets = sparkContext.textFile(DATASET)
val tokens = tweets
  .flatMap(extractor)    // returns RDD[(String, String)]
  .toDF("type", "token") // type is one of ("hashtag", "emoticon", "lemma")
  .groupBy("type", "token")
  .count()               // Dataset[Row] with columns ("type", "token", "count")
val lemmas = tokens
  .where($"type" === lit("lemma"))
  .select("token", "count")
  .as[(String, Long)]
  .rdd // should be the same type as your original 'lemmas', for the future join
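A rough note on wiring this up: toDF, as[...] and the $"..." column syntax need the implicits of a SparkSession in scope, so a minimal sketch of the surrounding setup (names like spark are assumptions) might be:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().appName("TweetTokenCounts").getOrCreate()
import spark.implicits._ // enables .toDF, .as[...] and $"..."

val tweets = spark.sparkContext.textFile(DATASET) // DATASET: the same path placeholder as above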

Related

Scala flattening embedded list of lists

I have created a Twitter datastream that displays the hashtags, author, and mentioned users in the format below.
(List(timetofly, hellocake),Shera_Eyra,List(blxcknicotine, kimtheskimm))
I can't do analysis on this format because of the embedded lists. How can I create another datastream that displays the data in this format?
timetofly, Shera_Eyra, blxcknicotine
timetofly, Shera_Eyra, kimtheskimm
hellocake, Shera_Eyra, blxcknicotine
hellocake, Shera_Eyra, kimtheskimm
Here is my code to produce the data:
val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
val ssc = new StreamingContext(sparkConf, Seconds(sampleInterval))
val stream = TwitterUtils.createStream(ssc, None)
val data = stream.map { line =>
  (line.getHashtagEntities.map(_.getText),
   line.getUser().getScreenName(),
   line.getUserMentionEntities.map(_.getScreenName).toList)
}
In your code snippet, data is a DStream[(Array[String], String, List[String])]. To get a DStream[String] in your desired format, you can use flatMap and map:
val data = stream.map { line =>
  (line.getHashtagEntities.map(_.getText),
   line.getUser().getScreenName(),
   line.getUserMentionEntities.map(_.getScreenName).toList)
}

val data2 = data
  .flatMap(a => a._1.flatMap(b => a._3.map(c => (b, a._2, c))))
  .map { case (hash, user, mention) => s"$hash, $user, $mention" }
The flatMap results in a DStream[(String, String, String)] in which each tuple consists of a hashtag entity, user, and mention entity. The subsequent call to map with pattern matching creates a DStream[String] in which each String consists of the elements of each tuple, separated by a comma and space.
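To see the shape transformation in isolation, here is a small local sketch of the same flatMap/map logic on a single tuple (the sample values are copied from the question):

val a = (Array("timetofly", "hellocake"), "Shera_Eyra", List("blxcknicotine", "kimtheskimm"))

val flattened = a._1.flatMap(b => a._3.map(c => (b, a._2, c)))
// Array((timetofly,Shera_Eyra,blxcknicotine), (timetofly,Shera_Eyra,kimtheskimm),
//       (hellocake,Shera_Eyra,blxcknicotine), (hellocake,Shera_Eyra,kimtheskimm))

val lines = flattened.map { case (hash, user, mention) => s"$hash, $user, $mention" }
// Array("timetofly, Shera_Eyra, blxcknicotine", ...)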
I would use a for comprehension for this:
val data = (List("timetofly", "hellocake"), "Shera_Eyra", List("blxcknicotine", "kimtheskimm"))
val result = for {
  hashtag <- data._1
  user = data._2
  mentionedUser <- data._3
} yield (hashtag, user, mentionedUser)
result.foreach(println)
Output:
(timetofly,Shera_Eyra,blxcknicotine)
(timetofly,Shera_Eyra,kimtheskimm)
(hellocake,Shera_Eyra,blxcknicotine)
(hellocake,Shera_Eyra,kimtheskimm)
If you would prefer a seq of lists of strings, rather than a seq of tuples of strings, then change the yield to give you a list instead: yield List(hashtag, user, mentionedUser)
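For example, a minimal sketch of that variant, reusing the data value from above:

val listResult = for {
  hashtag <- data._1
  user = data._2
  mentionedUser <- data._3
} yield List(hashtag, user, mentionedUser)
// List(List(timetofly, Shera_Eyra, blxcknicotine), List(timetofly, Shera_Eyra, kimtheskimm), ...)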

How to use reduceByKey on key of value in tuple set up as (String, (String, Int))?

I am attempting to loop through an RDD of a text file, and make a tally of each unique word in the file, and then accumulate all of the words that follow each unique word, along with their counts. So far, this is what I have:
// connecting to the Spark driver
val conf = new SparkConf().setAppName("WordStats").setMaster("local")
val spark = new SparkContext(conf) // creates a new SparkContext object

// loads the specified file into an RDD
val lines = spark.textFile(System.getProperty("user.dir") + "/" + "basketball_words_only.txt")

// splits each line into (word, nextWord, 1) triples
val words = lines.flatMap(line => {
  val wordList = line.split(" ")
  for { i <- 0 until wordList.length - 1 }
    yield (wordList(i), wordList(i + 1), 1)
})
If I haven't been clear thus far, what I am trying to do is to accumulate the set of words that follow each word in the file, along with the number of times said words follow.
If I understand you correctly, we have something like this:
val lines: Seq[String] = ...
val words: Seq[(String, String, Int)] = ...
And we want something like this:
val frequencies: Map[String, Seq[(String, Int)]] = {
  words
    .groupBy(_._1) // word -> [(word, next, cc), ...]
    .mapValues { values =>
      values
        .map { case (w, n, cc) => (n, cc) }
        .groupBy(_._1)              // next -> [(next, cc), ...]
        .mapValues(_.map(_._2).sum) // next -> sum
        .toSeq
    }
}
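If you want to stay on the RDD rather than on local collections, a rough sketch with reduceByKey (assuming words is the RDD[(String, String, Int)] built in the question):

// sum the count per (word, next) combination
val pairCounts = words
  .map { case (w, next, cc) => ((w, next), cc) }
  .reduceByKey(_ + _)

// regroup so each word maps to the (next, count) pairs that follow it
val frequencies = pairCounts
  .map { case ((w, next), total) => (w, (next, total)) }
  .groupByKey() // RDD[(String, Iterable[(String, Int)])]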

Filtering One RDD based on another RDD using regex

I have two RDD's of the form:
data_wo_header: RDD[String], data_test_wo_header: RDD[String]
scala> data_wo_header.first
res2: String = 1,2,3.5,1112486027
scala> data_test_wo_header.first
res2: String = 1,2
RDD2 is smaller than RDD1. I am trying to filter RDD1 by removing the elements that match (via regex) an element of RDD2.
The 1,2 in the above example represents UserID,MovID. Since it's present in the test set, I want a new RDD with those rows removed from RDD1.
I have asked a similar question before, but that approach requires an unnecessary split of the RDD.
I am trying to do something of this sort but it's not working:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  var ratings_train = new ListBuffer[String]()
  data_wo_header.foreach(x => {
    data_test_wo_header.foreach(y => {
      if (x.indexOf(y) == 0) {
        ratings_train += x
      }
    })
  })
  val ratings_train_list = ratings_train.toList
  return ratings_train_list
}
How should I do a regex match and filter based on it?
You can use a broadcast variable to share the contents of rdd2 and then filter rdd1 based on that broadcast variable. I replicated your code and this works for me:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  val rdd2array = sparkSession.sparkContext.broadcast(data_test_wo_header.collect())
  val training_set = data_wo_header.filter {
    case (x) => rdd2array.value.filter(y => x.matches(y)).length == 0
  }
  training_set.collect().toList
}
Also, with Scala and Spark I recommend that you avoid foreach where possible and use a more functional style with the map, flatMap, and filter functions.
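For reference, a minimal usage sketch under the question's data shapes (the second sample row and the variable wiring are assumptions for illustration):

// hypothetical inputs mirroring the question's samples
val data_wo_header = sparkSession.sparkContext.parallelize(
  Seq("1,2,3.5,1112486027", "3,4,2.0,1112484727"))
val data_test_wo_header = sparkSession.sparkContext.parallelize(Seq("1,2"))

// rows of data_wo_header that match no pattern in data_test_wo_header
val training = create_training(data_wo_header, data_test_wo_header)

Note that String.matches anchors the pattern to the whole string, so if the intent is to drop rows that merely start with a test entry, the pattern would likely need to be extended, e.g. to y + ",.*" (an assumption about the intended matching).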

Apache Spark WordCount method not sorting Data

def wordCount(dataSet: RDD[String]): Map[String, Int] = {
  val counts = dataSet.flatMap(line => line.split(","))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortBy(_._2, ascending = false)
  counts.collectAsMap()
}
This method is not sorting the final result as expected.
.sortBy(_._2, ascending = false)
The output of this method should be in descending order, but it is still in random order.
Any reason or solution?
The method collectAsMap() internally creates a HashMap of the values, which is not ordered. Use collect or takeOrdered instead to preserve the sorted order.
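A minimal sketch of that fix, keeping everything else from the question's method but returning a Seq so the order survives:

import org.apache.spark.rdd.RDD

def wordCountSorted(dataSet: RDD[String]): Seq[(String, Int)] = {
  dataSet.flatMap(line => line.split(","))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortBy(_._2, ascending = false)
    .collect() // an Array keeps the descending order, unlike collectAsMap()
    .toSeq
}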

Scala spark reduce by key and find common value

I have a file of CSV data stored as a sequence file on HDFS, in the format name, zip, country, fav_food1, fav_food2, fav_food3, fav_colour. There can be many entries with the same name, and I need to find out what their favourite food was (i.e. count all the food entries across all the records with that name and return the most popular one). I am new to Scala and Spark, have gone through multiple tutorials and scoured the forums, but am stuck as to how to proceed. So far I have read the sequence file, converted its Text values into String format, and then filtered out the entries.
Here is the sample data entries one to a line in the file
Bob,123,USA,Pizza,Soda,,Blue
Bob,456,UK,Chocolate,Cheese,Soda,Green
Bob,12,USA,Chocolate,Pizza,Soda,Yellow
Mary,68,USA,Chips,Pasta,Chocolate,Blue
So the output should be the tuple (Bob, Soda) since soda appears the most amount of times in Bob's entries.
import org.apache.hadoop.io._

var lines = sc.sequenceFile("path", classOf[LongWritable], classOf[Text]).values.map(x => x.toString())
// converted to String since I could not get filter to run on Text, and removed the LongWritable

var filtered = lines.filter(_.split(",")(0) == "Bob")
// removed entries with all other users

var f_tuples = filtered.map(line => line.split(","))
// split all the values

var f_simple = f_tuples.map(line => (line(0), (line(3), line(4), line(5))))
// removed unnecessary fields
The issue I have now is that I think I have this [<name, [f, f, f]>] structure and don't really know how to proceed to flatten it out and get the most popular food. I need to combine all the entries so that I have one entry per name, and then get the most common element in the value. Any help would be appreciated. Thanks.
I tried this to get it to flatten out, but it seems the more I try, the more convoluted the data structure becomes.
var f_trial = fpairs.groupBy(_._1).mapValues(_.map(_._2))
// the resulting structure was of type org.apache.spark.rdd.RDD[(String, Iterable[(String, String, String)])]
Here is what a println of a record looks like after f_trial:
("Bob", List((Pizza, Soda,), (Chocolate, Cheese, Soda), (Chocolate, Pizza, Soda)))
Parenthesis breakdown:
("Bob",
  List(
    (Pizza, Soda, <missing value>),
    (Chocolate, Cheese, Soda),
    (Chocolate, Pizza, Soda)
  ) // ends the List paren
) // ends the first paren
I found time. Setup:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
val data = """
Bob,123,USA,Pizza,Soda,,Blue
Bob,456,UK,Chocolate,Cheese,Soda,Green
Bob,12,USA,Chocolate,Pizza,Soda,Yellow
Mary,68,USA,Chips,Pasta,Chocolate,Blue
""".trim
val records = sc.parallelize(data.split('\n'))
Extract the food choices, and for each make a tuple of ((name, food), 1)
val r2 = records.flatMap { r =>
  val Array(name, id, country, food1, food2, food3, color) = r.split(',')
  List(((name, food1), 1), ((name, food2), 1), ((name, food3), 1))
}
Total up each name/food combination:
val r3 = r2.reduceByKey((x, y) => x + y)
Remap so that the name (only) is the key
val r4 = r3.map { case ((name, food), total) => (name, (food, total)) }
Pick the food with the largest count at each step
val res = r4.reduceByKey((x, y) => if (y._2 > x._2) y else x)
And we're done
println(res.collect().mkString)
//(Mary,(Chips,1))(Bob,(Soda,3))
EDIT: To collect all the food items that have the same top count for a person, we just change the last two lines:
Start with a List of items with total:
val r5 = r3.map { case ((name, food), total) => (name, (List(food), total)) }
In the equal case, concatenate the list of food items with that score
val res2 = r5.reduceByKey((x, y) =>
  if (y._2 > x._2) y
  else if (y._2 < x._2) x
  else (y._1 ::: x._1, y._2))
//(Mary,(List(Chocolate, Pasta, Chips),1))
//(Bob,(List(Soda),3))
If you want, say, the top 3, then use aggregateByKey to assemble a list of the favourite foods per person instead of the second reduceByKey, as sketched below.
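A rough sketch of that idea, reusing r3 from above (the exact tie-breaking is left as an assumption):

val topN = 3

// (name, (food, total)) pairs, keeping only the topN best per name while aggregating
val topFoods = r3
  .map { case ((name, food), total) => (name, (food, total)) }
  .aggregateByKey(List.empty[(String, Int)])(
    (acc, v) => (v :: acc).sortBy(-_._2).take(topN), // fold one value into a partition's list
    (a, b) => (a ::: b).sortBy(-_._2).take(topN)     // merge the per-partition lists
  )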
Solutions provided by Paul and mattinbits shuffle your data twice - once to perform reduce-by-name-and-food and once to reduce-by-name. It is possible to solve this problem with only one shuffle.
/** Generate (name, food-count map) pairs from a split line **/
def bitsToKeyMapPair(xs: Array[String]): (String, Map[String, Long]) = {
  val key = xs(0)
  val map = xs
    .drop(3)                  // drop name..country
    .take(3)                  // take the food columns
    .filter(_.trim.size != 0) // ignore empty entries
    .map((_, 1L))             // generate k-v pairs
    .toMap                    // convert to a Map
    .withDefaultValue(0L)     // set the default value
  (key, map)
}

/** Combine two count maps **/
def combine(m1: Map[String, Long], m2: Map[String, Long]): Map[String, Long] = {
  (m1.keys ++ m2.keys).map(k => (k, m1(k) + m2(k))).toMap.withDefaultValue(0L)
}
val n: Int = ??? // number of favourite foods to keep per user
val records = lines.map(line => bitsToKeyMapPair(line.split(",")))
records.reduceByKey(combine).mapValues(_.toSeq.sortBy(-_._2).take(n))
If you're not a purist you can replace scala.collection.immutable.Map with scala.collection.mutable.Map to further improve performance.
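A rough sketch of that mutable variant (an illustration, not the author's code):

import scala.collection.mutable

// accumulate m2 into m1 in place; reusing m1 avoids building a fresh Map on every merge
def combineMutable(m1: mutable.Map[String, Long],
                   m2: mutable.Map[String, Long]): mutable.Map[String, Long] = {
  m2.foreach { case (k, v) => m1(k) = m1.getOrElse(k, 0L) + v }
  m1
}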
Here's a complete example:
import org.apache.spark.{SparkContext, SparkConf}

object Main extends App {

  val data = List(
    "Bob,123,USA,Pizza,Soda,,Blue",
    "Bob,456,UK,Chocolate,Cheese,Soda,Green",
    "Bob,12,USA,Chocolate,Pizza,Soda,Yellow",
    "Mary,68,USA,Chips,Pasta,Chocolate,Blue")

  val sparkConf = new SparkConf().setMaster("local").setAppName("example")
  val sc = new SparkContext(sparkConf)

  val lineRDD = sc.parallelize(data)

  val pairedRDD = lineRDD.map { line =>
    val fields = line.split(",")
    (fields(0), List(fields(3), fields(4), fields(5)).filter(_ != ""))
  }.filter(_._1 == "Bob")
  /* pairedRDD.collect().foreach(println)
  (Bob,List(Pizza, Soda))
  (Bob,List(Chocolate, Cheese, Soda))
  (Bob,List(Chocolate, Pizza, Soda))
  */

  val flatPairsRDD = pairedRDD.flatMap {
    case (name, foodList) => foodList.map(food => ((name, food), 1))
  }
  /* flatPairsRDD.collect().foreach(println)
  ((Bob,Pizza),1)
  ((Bob,Soda),1)
  ((Bob,Chocolate),1)
  ((Bob,Cheese),1)
  ((Bob,Soda),1)
  ((Bob,Chocolate),1)
  ((Bob,Pizza),1)
  ((Bob,Soda),1)
  */

  val nameFoodSumRDD = flatPairsRDD.reduceByKey((a, b) => a + b)
  /* nameFoodSumRDD.collect().foreach(println)
  ((Bob,Cheese),1)
  ((Bob,Soda),3)
  ((Bob,Pizza),2)
  ((Bob,Chocolate),2)
  */

  val resultsRDD = nameFoodSumRDD.map {
    case ((name, food), count) => (name, (food, count))
  }.groupByKey.map {
    case (name, foodCountList) => (name, foodCountList.toList.sortBy(_._2).reverse.head)
  }

  resultsRDD.collect().foreach(println)
  /*
  (Bob,(Soda,3))
  */

  sc.stop()
}