word count(frequency) spark rdd scala - scala

if I have an rdd accross cluster and I want to do the word count
not only count the appear times,
I want to get the frequency, which is defined as count/total count
What is the best and efficient way to do so in scala?
How can I do reduction job and calculate total number at the same time within one workflow?
BTW I know purely word count can be done in this way.
text_file = spark.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
but what is the difference if I use aggregate? in terms of spark job workflow
val result = pairs
.aggregate(Map[String, Int]())((acc, pair) =>
if(acc.contains(pair._1))
acc ++ Map[String, Int]((pair._1, acc(pair._1)+1))
else
acc ++ Map[String, Int]((pair._1, pair._2))
,
(a, b) =>
(a.toSeq ++ b.toSeq)
.groupBy(_._1)
.mapValues(_.map(_._2).reduce(_ + _))
)

You can use this
val total = counts.map(x => x._2).sum()
val freq = counts.map(x => (x._1, x._2/total))
There exists also the concept of Accumulator which is a write-only variable and you could use it to avoid using the sum() action, but your code would need a lot of change.

Related

Word count using map reduce on Seq[String]

I have a Seq which contains randomly generated words.
I want to calculate the occurrence count of each word using map reduce.
Now, I have been able to map the words against a value 1 and grouped them together by words.
val mapValues = ourWords.map(word => (word, 1))
val groupedData = mapValues.groupBy(_._1)
However, I am not sure how to use the reduce function on groupedData to get the count.
I tried this -
groupedData.reduce((x, y) => x._2.reduce((x, y) => x._2) + y._2.reduce((l, r) => l._2))
but it's riddled with errors.

How to reduce shuffling and time taken by Spark while making a map of items?

I am using spark to read a csv file like this :
x, y, z
x, y
x
x, y, c, f
x, z
I want to make a map of items vs their count. This is the code I wrote :
private def genItemMap[Item: ClassTag](data: RDD[Array[Item]], partitioner: HashPartitioner): mutable.Map[Item, Long] = {
val immutableFreqItemsMap = data.flatMap(t => t)
.map(v => (v, 1L))
.reduceByKey(partitioner, _ + _)
.collectAsMap()
val freqItemsMap = mutable.Map(immutableFreqItemsMap.toSeq: _*)
freqItemsMap
}
When I run it, it is taking a lot of time and shuffle space. Is there a way to reduce the time?
I have a 2 node cluster with 2 cores each and 8 partitions. The number of lines in the csv file are 170000.
If you just want to do an unique item count thing, then I suppose you can take the following approach.
val data: RDD[Array[Item]] = ???
val itemFrequency = data
.flatMap(arr =>
arr.map(item => (item, 1))
)
.reduceByKey(_ + _)
Do not provide any partitioner while reducing, otherwise it will cause re-shuffling. Just keep it with the partitioning it already had.
Also... do not collect the distributed data into a local in-memory object like a Map.

How to split a spark dataframe with equal records

I am using df.randomSplit() but it is not splitting into equal rows. Is there any other way I can achieve it?
In my case I needed balanced (equal sized) partitions in order to perform a specific cross validation experiment.
For that you usually:
Randomize the dataset
Apply modulus operation to assign each element to a fold (partition)
After this step you will have to extract each partition using filter, afaik there is still no transformation to separate a single RDD into many.
Here is some code in scala, it only uses standard spark operations so it should be easy to adapt to python:
val npartitions = 3
val foldedRDD =
// Map each instance with random number
.zipWithIndex
.map ( t => (t._1, t._2, new scala.util.Random(t._2*seed).nextInt()) )
// Random ordering
.sortBy( t => (t._1(m_classIndex), t._3) )
// Assign each instance to fold
.zipWithIndex
.map( t => (t._1, t._2 % npartitions) )
val balancedRDDList =
for (f <- 0 until npartitions)
yield foldedRDD.filter( _._2 == f )

How to do data cleansing in Scala

I just started Scala on Spark, so I am not sure if my question is workable or should I turn to other solution/tool:
I have a text file for word counting and sorting, here is the file.
I load the file into HDFS
I then use the following code in Scala to do the counting:
val file = sc.textFile("hdfs://localhost:9000/Peter")
val counts = file.flatMap(line => line.split(" ")).map(p => (p,1)).reduceByKey(_+_).sortByKey(true,1)
counts.saveAsTextFile("Peter_SortedOutput6")
I checked the result on hdfs by hdfs dfs -cat hdfs://localhost:9000/user/root/Peter_SortedOutput5/part-00000
Part of the result is posted here for the convenience of reading:
((For,1)
((not,1)
(1,8)
(10,8)
(11,8)
(12,8)
(13,8)
(14,8)
(15,7)
(16,7)
(17,7)
(18,7)
(19,6)
(2,8)
(20,5)
(21,5)
(22,4)
(23,2)
(24,2)
(25,2)
(3,8)
(4,8)
(5,8)
(6,8)
(7,8)
(8,8)
(9,8)
(Abraham,,1)
(According,1)
(Amen.,4)
(And,19)
(As,5)
(Asia,,1)
(Babylon,,1)
(Balaam,1)
(Be,2)
(Because,1)
First, this is really not what I expect, I want the result showing in the desc order of count.
Second, there are result like the following:
(God,25)
(God's,1)
(God,,9)
(God,),1)
(God.,6)
(God:,2)
(God;,2)
(God?,1)
How to do some cleansing in the split so these occurrences can be grouped into one (God, 47)
Thank you very much.
There is a course BerkeleyX: CS105x Introduction to Apache Spark on edx.org by Berkerly&Databricks. One of the assignment is doing word count.
The steps are
remove punctuation, by replace "[^A-Za-z0-9\s]+" with "", or not include numbers "[^A-Za-z\s]+"
trim all spaces
lower all words
we can add extra step like
remove stop words
Code as follows
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.functions.split
// val reg = raw"[^A-Za-z0-9\s]+" // with numbers
val reg = raw"[^A-Za-z\s]+" // no numbers
val lines = sc.textFile("peter.txt").
map(_.replaceAll(reg, "").trim.toLowerCase).toDF("line")
val words = lines.select(split($"line", " ").alias("words"))
val remover = new StopWordsRemover()
.setInputCol("words")
.setOutputCol("filtered")
val noStopWords = remover.transform(words)
val counts = noStopWords.select(explode($"filtered")).map(word =>(word, 1))
.reduceByKey(_+_)
// from word -> num to num -> word
val mostCommon = counts.map(p => (p._2, p._1)).sortByKey(false, 1)
mostCommon.take(5)
Clean data by use of replaceAll:
val counts = file.flatMap(line => line.trim.toLowerCase.split(" ").replaceAll("[$,?+.;:\'s\\W\\d]", ""));
sort by value in scala API:
.map(item => item.swap) // interchanges position of entries in each tuple
.sortByKey(true, 1) // 1st arg configures ascending sort, 2nd arg configures one task
.map(item => item.swap)
sort by value in python API:
.map(lambda (a, b): (b, a)) \
.sortByKey(1, 1) \ # 1st arg configures ascending sort, 2nd configures 1 task
.map(lambda (a, b): (b, a))
Code should look like this (you may see syntax error, please fix if any):
val file = sc.textFile("hdfs://localhost:9000/Peter")
val counts = file.flatMap(line => line.trim.toLowerCase.split(" ").replaceAll("[$,?+.;:\'s\\W\\d]", ""))
.map(p => (p,1))
.reduceByKey(_+_)
.map(rec => rec.swap)
.sortByKey(true, 1)
.map(rec => rec.swap)
counts.saveAsTextFile("Peter_SortedOutput6")
see scala_regular_expressions - for what [\\W] or [\\d] or [;:',.?] mean.

RDD split and do aggregation on new RDDs

I have an RDD of (String,String,Int).
I want to reduce it based on the first two strings
And Then based on the first String I want to group the (String,Int) and sort them
After sorting I need to group them into small groups each containing n elements.
I have done the code below. The problem is the number of elements in the step 2 is very large for a single key
and the reduceByKey(x++y) takes a lot of time.
//Input
val data = Array(
("c1","a1",1), ("c1","b1",1), ("c2","a1",1),("c1","a2",1), ("c1","b2",1),
("c2","a2",1), ("c1","a1",1), ("c1","b1",1), ("c2","a1",1))
val rdd = sc.parallelize(data)
val r1 = rdd.map(x => ((x._1, x._2), (x._3)))
val r2 = r1.reduceByKey((x, y) => x + y ).map(x => ((x._1._1), (x._1._2, x._2)))
// This is taking long time.
val r3 = r2.mapValues(x => ArrayBuffer(x)).reduceByKey((x, y) => x ++ y)
// from the list I will be doing grouping.
val r4 = r3.map(x => (x._1 , x._2.toList.sorted.grouped(2).toList))
Problem is the "c1" has lot of unique entries like b1 ,b2....million and reduceByKey is killing time because all the values are going to single node.
Is there a way to achieve this more efficiently?
// output
Array((c1,List(List((a1,2), (a2,1)), List((b1,2), (b2,1)))), (c2,List(List((a1,2), (a2,1)))))
There at least few problems with a way you group your data. The first problem is introduced by
mapValues(x => ArrayBuffer(x))
It creates a large amount of mutable objects which provide no additional value since you cannot leverage their mutability in the subsequent reduceByKey
reduceByKey((x, y) => x ++ y)
where each ++ creates a new collection and neither argument can be safely mutated. Since reduceByKey applies map side aggregation situation is even worse and pretty much creates GC hell.
Is there a way to achieve this more efficiently?
Unless you have some deeper knowledge about data distribution which can be used to define smarter partitioner the simplest improvement is to replace mapValues + reduceByKey with simple groupByKey:
val r3 = r2.groupByKey
It should be also possible to use a custom partitioner for both reduceByKey calls and mapPartitions with preservesPartitioning instead of map.
class FirsElementPartitioner(partitions: Int)
extends org.apache.spark.Partitioner {
def numPartitions = partitions
def getPartition(key: Any): Int = {
key.asInstanceOf[(Any, Any)]._1.## % numPartitions
}
}
val r2 = r1
.reduceByKey(new FirsElementPartitioner(8), (x, y) => x + y)
.mapPartitions(iter => iter.map(x => ((x._1._1), (x._1._2, x._2))), true)
// No shuffle required here.
val r3 = r2.groupByKey
It requires only a single shuffle and groupByKey is simply a local operations:
r3.toDebugString
// (8) MapPartitionsRDD[41] at groupByKey at <console>:37 []
// | MapPartitionsRDD[40] at mapPartitions at <console>:35 []
// | ShuffledRDD[39] at reduceByKey at <console>:34 []
// +-(8) MapPartitionsRDD[1] at map at <console>:28 []
// | ParallelCollectionRDD[0] at parallelize at <console>:26 []