Apache Spark WordCount method not sorting data - Scala

def wordCount(dataSet: RDD[String]): Map[String, Int] = {
  val counts = dataSet.flatMap(line => line.split(","))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortBy(_._2, ascending = false)
  counts.collectAsMap()
}
This method is not sorting the final result as expected. With
.sortBy(_._2, ascending = false)
the output should be in descending order, but the returned map is still in random order.
Any reason or solution?

The collectAsMap() method internally creates a HashMap, which is not ordered. Use collect or takeOrdered to get the values in sorted order.
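For example, a minimal sketch of both alternatives, reusing the counts RDD from the method above (the top-10 size is only an illustration):
val sortedCounts: Array[(String, Int)] = counts.collect()
// Option 1: collect() keeps the order produced by sortBy; a Map cannot carry
// ordering, but an Array of (word, count) pairs can.

val top10: Array[(String, Int)] =
  counts.takeOrdered(10)(Ordering.by[(String, Int), Int](-_._2))
// Option 2: skip collecting everything and fetch only the N most frequent
// words, ordered by count descending.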

Related

What's the equivalent of RDD aggregateByKey for a Spark dataframe using Scala?

I wrote the code below to calculate correlation for a dataframe using grouping, but I eventually had to use RDD aggregateByKey with a sequential operation and a combiner operation to achieve what I needed. However, I want to implement the same thing using only Spark dataframes and avoid RDDs completely. I tried learning about Spark dataframes and came across the "agg" and "groupBy" functions, but I am not sure how to achieve the same result as with the RDD. Any help here is much appreciated.
val columnIndexes = columns.indices.map(i => i + groupIndexes.length).toArray
// removing rows with nulls in group-by columns, like the MR version
val cleanDF = selectedDF.na.drop("any", groupByColumns)
val allCountersPerGroupRDD: RDD[(immutable.IndexedSeq[Any], Seq[Seq[CovCounter]])] = cleanDF.rdd.map(row =>
    // create key-value pairs
    (groupIndexes.map(ind => row.get(ind)), columnIndexes.map(i => toDouble(row.get(i)))))
  .aggregateByKey(zeroCounters, numPartitions)(
    seqOp = (counters, newValues) => {
      for ((i, j) <- columnHalfPairedIndicesFlattened) {
        counters(i)(j).addIfNotNaN(newValues(i), newValues(j))
      }
      counters
    },
    combOp = (baseCounters, otherCounters) => {
      for ((i, j) <- columnHalfPairedIndicesFlattened) {
        baseCounters(i)(j).merge(otherCounters(i)(j))
      }
      baseCounters
    })
val finalRDD: RDD[Row] = allCountersPerGroupRDD.mapPartitions { iterator =>
  iterator.flatMap { case (groupKeys, counts) =>
    columns.indices.map(ind =>
      Row.fromSeq(groupKeys ++ Seq(columns(ind)) ++ columnPairedIndicesAll(ind).map { case (i, j) =>
        getCovOrCorrFromCounters(i, j, counts, useCorrelation)
      }))
  }
}
val outDF = sparkSession.createDataFrame(finalRDD, outputSchema)
See https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html. You need to make your own UDAF.
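A minimal sketch of that pattern with the typed Aggregator API, assuming Spark 3.x; the column names colA and colB and the Pearson-correlation buffer are illustrative assumptions, not taken from the question's schema:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.{col, expr, udaf}

// Running sums needed to compute a Pearson correlation of two numeric columns.
case class CorrBuf(n: Long, sx: Double, sy: Double, sxx: Double, syy: Double, sxy: Double)

object CorrAgg extends Aggregator[(Double, Double), CorrBuf, Double] {
  def zero: CorrBuf = CorrBuf(0L, 0.0, 0.0, 0.0, 0.0, 0.0)

  def reduce(b: CorrBuf, in: (Double, Double)): CorrBuf = {
    val (x, y) = in
    CorrBuf(b.n + 1, b.sx + x, b.sy + y, b.sxx + x * x, b.syy + y * y, b.sxy + x * y)
  }

  def merge(a: CorrBuf, b: CorrBuf): CorrBuf =
    CorrBuf(a.n + b.n, a.sx + b.sx, a.sy + b.sy, a.sxx + b.sxx, a.syy + b.syy, a.sxy + b.sxy)

  def finish(b: CorrBuf): Double = {
    val n = b.n.toDouble
    val cov  = b.sxy / n - (b.sx / n) * (b.sy / n)
    val varX = b.sxx / n - (b.sx / n) * (b.sx / n)
    val varY = b.syy / n - (b.sy / n) * (b.sy / n)
    cov / math.sqrt(varX * varY)
  }

  def bufferEncoder: Encoder[CorrBuf] = Encoders.product[CorrBuf]
  def outputEncoder: Encoder[Double]  = Encoders.scalaDouble
}

// Register the aggregator and stay entirely in the DataFrame API.
// colA/colB are hypothetical column names; replace with your own pairs.
import sparkSession.implicits._ // provides the Encoder[(Double, Double)] udaf needs
sparkSession.udf.register("group_corr", udaf(CorrAgg))
val outDF = cleanDF
  .groupBy(groupByColumns.map(col): _*)
  .agg(expr("group_corr(colA, colB)").as("corr_colA_colB"))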

How to filter elements from a tuple using a list as a filter

I had a filter that parsed through a list, but then I realised I needed to zip the original list to number each line before filtering, and now I'm not sure how to apply the same filter to the ._2 element of each tuple.
val list = List("def", "var", "val")
val source = Source.fromFile("..\scala.file").getLines.toList
val filtered = source filter (line => list.exists(word => list.contains(word)))
// before
val filtered = (1 to source.length) zip source
filter(line => list.exists(word => list.contains(word)))
// after
I cannot get the function working with tuples.
It is supposed to filter out each tuple whose text doesn't contain any instances of the elements from the list.
val list = List("def", "var", "val")
// build a single regex that matches any line containing one of the keywords
val matcher = list.mkString(".*(", "|", ").*")

io.Source
  .fromFile("../scala.file")
  .getLines
  .zipWithIndex
  .filter(_._1 matches matcher)
  .map { case (txt, idx) => (idx + 1, txt) } // optional: put a 1-based line number first
  .toList
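If you would rather avoid building a regex at all, an equivalent contains-based filter (assuming the intent is "keep lines that contain at least one of the keywords") looks like this:
io.Source
  .fromFile("../scala.file")
  .getLines
  .zipWithIndex
  .filter { case (txt, _) => list.exists(word => txt.contains(word)) }
  .map { case (txt, idx) => (idx + 1, txt) }
  .toList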

Apache Spark - Tweets Processing

Given a huge dataset of tweets, I need to:
extract and count the hashtags;
extract and count the emoticons/emojis;
extract and count the words (lemmas).
So, the first thing that came to my mind is doing something like this:
val tweets = sparkContext.textFile(DATASET).cache

val hashtags = tweets
  .map(extractHashTags)
  .map((_, 1))
  .reduceByKey(_ + _)

val emoticonsEmojis = tweets
  .map(extractEmoticonsEmojis)
  .map((_, 1))
  .reduceByKey(_ + _)

val lemmas = tweets
  .map(extractLemmas)
  .map((_, 1))
  .reduceByKey(_ + _)
But this way each tweet is processed 3 times, right? If so, is there an efficient way to count all these elements separately while processing each tweet only once?
I was thinking of something like this:
sparkContext.textFile(DATASET)
  .map(extractor) // RDD[(List[String], List[String], List[String])]
But this way it becomes a nightmare, also because once I count the words (the third requirement) I need to join the result with another RDD, which is very simple in the first version but not in the second.
Perhaps something like this?
sealed trait TokenType
// case objects are serializable, which matters once these keys go through a shuffle
case object Hashtag extends TokenType
case object Emoji extends TokenType
case object Word extends TokenType

def extractTokens(tweet: String): Seq[(TokenType, String)] = {
  ...
}

val tokenCounts = tweets
  .flatMap(extractTokens)
  .map((_, 1))
  .reduceByKey(_ + _)

val hashtagCounts = tokenCounts.collect { case ((Hashtag, x), count) => (x, count) }
// similar for emojis and words
Using the Dataset API:
val tweets = sparkContext.textFile(DATASET)

// requires `import spark.implicits._` (for .toDF, $ and .as) and
// `import org.apache.spark.sql.functions.lit`
val tokens = tweets.flatMap(extractor) // extractor returns RDD[(String, String)]
  .toDF("type", "token")               // type is one of ("hashtag", "emoticon", "lemma")
  .groupBy("type", "token")
  .count()                             // Dataset[Row] with columns ("type", "token", "count")

val lemmas = tokens
  .where($"type" === lit("lemma"))
  .select("token", "count")
  .as[(String, Long)]
  .rdd // same type as your original `lemmas`, for the future join
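Both answers assume some extractor/extractTokens function. A purely hypothetical sketch of what it could look like; the whitespace tokenisation, the surrogate-pair emoji heuristic and the lowercase "lemmatizer" are placeholder assumptions, not part of either answer:
// Placeholder helpers: swap in a real emoji detector and a real lemmatizer.
def isEmoji(token: String): Boolean =
  token.exists(c => Character.isSurrogate(c)) // rough heuristic: characters outside the BMP
def lemmatize(token: String): String = token.toLowerCase

def extractor(tweet: String): Seq[(String, String)] = {
  val tokens = tweet.split("\\s+").toSeq
  val hashtags  = tokens.filter(t => t.startsWith("#")).map(t => ("hashtag", t))
  val emoticons = tokens.filter(t => isEmoji(t)).map(t => ("emoticon", t))
  val lemmas    = tokens.filterNot(t => t.startsWith("#") || isEmoji(t))
                        .map(t => ("lemma", lemmatize(t)))
  hashtags ++ emoticons ++ lemmas
}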

Use combineByKey to get output as (key, iterable[values])

I am trying to transform an RDD[(key, value)] into an RDD[(key, Iterable[value])], the same output as returned by the groupByKey method.
But since groupByKey is not efficient, I am trying to use combineByKey on the RDD instead; however, it is not working. Below is the code used:
val data = List(
  "abc,2017-10-04,15.2",
  "abc,2017-10-03,19.67",
  "abc,2017-10-02,19.8",
  "xyz,2017-10-09,46.9",
  "xyz,2017-10-08,48.4",
  "xyz,2017-10-07,87.5",
  "xyz,2017-10-04,83.03",
  "xyz,2017-10-03,83.41",
  "pqr,2017-09-30,18.18",
  "pqr,2017-09-27,18.2",
  "pqr,2017-09-26,19.2",
  "pqr,2017-09-25,19.47",
  "abc,2017-07-19,96.60",
  "abc,2017-07-18,91.68",
  "abc,2017-07-17,91.55")

val rdd = sc.parallelize(data)
val rows = rdd.map(line => {
  val row = line.split(",")
  ((row(0), row(1)), row(2))
})

// repartition and sort based on key
val op = rows.repartitionAndSortWithinPartitions(new CustomPartitioner(4))
val temp = op.map(f => (f._1._1, (f._1._2, f._2)))

val mergeCombiners = (t1: (String, List[String]), t2: (String, List[String])) =>
  (t1._1 + t2._1, t1._2.++(t2._2))
val mergeValue = (x: (String, List[String]), y: (String, String)) => {
  val a = x._2.+:(y._2)
  (x._1, a)
}

// createCombiner, mergeValue, mergeCombiners
val x = temp.combineByKey(
  (t1: String, t2: String) => (t1, List(t2)),
  mergeValue,
  mergeCombiners)
temp.combineByKey gives a compile-time error, and I cannot work out why.
If you want output similar to what groupByKey gives you, then you should absolutely use groupByKey and not some other method. reduceByKey, combineByKey, etc. are only more efficient when compared to using groupByKey followed by an aggregation (which would give you the same result as one of the other groupBy methods).
As the wanted result is an RDD[(key, Iterable[value])], building the list yourself or letting groupByKey do it results in the same amount of work, so there is no need to reimplement groupByKey yourself. The problem with groupByKey is not its implementation; it lies in the distributed architecture.
For more information regarding groupByKey and these types of optimizations, I would recommend reading more here.
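For completeness, a minimal sketch of both points, reusing temp and the imports from the question. The groupByKey call yields the wanted RDD[(key, Iterable[value])] directly; and, as an aside, the compile error happens because createCombiner must take a single value of the RDD's value type, not two arguments:
// The desired result in one call:
val grouped: RDD[(String, Iterable[(String, String)])] = temp.groupByKey()

// If combineByKey were still wanted: createCombiner takes ONE value
// (the first value seen for a key), not two.
val combined: RDD[(String, List[(String, String)])] = temp.combineByKey(
  (v: (String, String)) => List(v),                                  // createCombiner
  (acc: List[(String, String)], v: (String, String)) => v :: acc,    // mergeValue
  (a: List[(String, String)], b: List[(String, String)]) => a ::: b  // mergeCombiners
)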

Filtering One RDD based on another RDD using regex

I have two RDDs of the form:
data_wo_header: RDD[String], data_test_wo_header: RDD[String]
scala> data_wo_header.first
res2: String = 1,2,3.5,1112486027
scala> data_test_wo_header.first
res2: String = 1,2
RDD2 is smaller than RDD1. I am trying to filter RDD1 by removing the elements whose regex matches an element of RDD2.
The 1,2 in the above example represents UserID,MovID. Since it's present in the test set, I want the new RDD with that element removed from RDD1.
I have asked a similar question before, but that approach requires an unnecessary split of the RDD.
I am trying to do something of this sort, but it's not working:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  var ratings_train = new ListBuffer[String]()
  data_wo_header.foreach(x => {
    data_test_wo_header.foreach(y => {
      if (x.indexOf(y) == 0) {
        ratings_train += x
      }
    })
  })
  val ratings_train_list = ratings_train.toList
  return ratings_train_list
}
How should I do a regex match and filter based on it?
You can use a broadcast variable to share the contents of rdd2 and then filter rdd1 based on that broadcast value. I replicated your code and this works for me:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  val rdd2array = sparkSession.sparkContext.broadcast(data_test_wo_header.collect())
  val training_set = data_wo_header.filter {
    case (x) => rdd2array.value.filter(y => x.matches(y)).length == 0
  }
  training_set.collect().toList
}
Also, with Scala and Spark I recommend that, whenever possible, you avoid foreach and use a more functional style with the map, flatMap and filter functions.
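One caveat with the snippet above, and this is an assumption about the intended semantics since the test lines like 1,2 are not really regular expressions: x.matches(y) requires the whole line to match the pattern. If the intent is the prefix match from the original x.indexOf(y) == 0 check, a startsWith-based variant of the same filter may be closer to what is wanted:
// Keep only the lines of RDD1 whose UserID,MovID prefix is not in the
// broadcast copy of RDD2; switch back to x.matches(y) if the test file
// really contains regular expressions.
val training_set = data_wo_header.filter { x =>
  // appending "," avoids "1,2" accidentally matching "1,20,..."
  !rdd2array.value.exists(y => x.startsWith(y + ","))
}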