Word count using map reduce on Seq[String] - scala

I have a Seq which contains randomly generated words.
I want to calculate the occurrence count of each word using map reduce.
So far, I have been able to map each word to the value 1 and group the resulting pairs by word.
val mapValues = ourWords.map(word => (word, 1))
val groupedData = mapValues.groupBy(_._1)
However, I am not sure how to use the reduce function on groupedData to get the count.
I tried this -
groupedData.reduce((x, y) => x._2.reduce((x, y) => x._2) + y._2.reduce((l, r) => l._2))
but it's riddled with errors.
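For reference, a minimal sketch of the missing step: collapse each group of (word, 1) pairs by summing the 1s, using plain Scala collection operations (not necessarily the only approach):
// Sum the 1s in each group to get the per-word count.
val wordCounts: Map[String, Int] =
  groupedData.map { case (word, pairs) => (word, pairs.map(_._2).reduce(_ + _)) }
// e.g. Seq("a", "b", "a") ends up as Map("a" -> 2, "b" -> 1)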

Related

Get max term and number

I did a MapReduce in Scala that counts the terms in book titles. I want to output both the term and the number, but I only get the number using:
println("max term :" +wordCount.reduce( (a,b)=> ("max", a._2 max b._2))._2)
I was wondering how to also include the term.
Thank you
Example:
("The", 5)
("Of", 8)
("is", 10)
…
My current code gives me the maximum number, but I don't know how to get the term as well.
Initial code:
val inputPR2Q1 = sc.textFile("/root/pagecounts-20160101-000000")
val titlecolumn = inputPR2Q1.map(line => line.split(" ")(1))
val wordCount = titlecolumn.flatMap(line => line.split("_")).map(word => (word,1)).reduceByKey(_ + _);
Here I just take a file containing book titles along with other data, keep the book titles alone, and do a MapReduce to count each term in the titles separately.
Use .sortBy with ascending = false and take(1) on the RDD:
sc.textFile("/root/pagecounts-20160101-000000").
map(line => line.split(" ")(1)).
flatMap(line => line.split("_")).
map(word => (word,1)).
reduceByKey(_ + _).
sortBy(_._2,ascending=false).
take(1)
I would suggest you take a look at the Scaladoc.
You can just use sortBy.
val (maxTerm, count) = wordCount.sortBy(_._2, ascending = false).take(1).head
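If you'd rather keep using reduce as in your current code, a sketch that keeps the whole (term, count) pair instead of replacing the term with "max":
val (maxTerm, maxCount) = wordCount.reduce((a, b) => if (a._2 >= b._2) a else b)
println("max term : " + maxTerm + ", count: " + maxCount)
The comparison keeps whichever pair has the larger count, so the term travels along with its count.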

Scala: write a MapReduce progam to find a word that follows a word [Homework]

I have a homework assignment where I must write a MapReduce program in Scala to find, for each word in the file, which word follows it most often.
For example, for the word "basketball", the word "is" comes next 5 times, "has" 2 times, and "court" 1 time.
In a text file this might show up as:
basketball is..... (this sequence happens 5 times)
basketball has..... (this sequence happens 2 times)
basketball court.... (this sequence happens 1 time)
I am having a hard time conceptually figuring out how to do this.
The idea I have had, but have not been able to successfully implement, is:
Iterate through each word; if the word is "basketball", take the next word and add it to a map. Then reduce by key and sort from highest to lowest.
Unfortunately I do not know how to take the word that comes next in a list of words.
For example, I would like to do something like this:
val lines = spark.textFile("basketball_words_only.txt") // process lines in file
// split into individual words
val words = lines.flatMap(line => line.split(" "))
var listBuff = new ListBuffer[String]() // a list Buffer to hold each following word
val it = Iterator(words)
while (it.hasNext) {
listBuff += it.next().next() // <-- this is what I would like to do
}
val follows = listBuff.map(word => (word, 1))
val count = follows.reduceByKey((x, y) => x + y) // another issue as I cannot reduceByKey with a listBuffer
val sort = count.sortBy(_._2,false,1)
val result2 = sort.collect()
for (i <- 0 to result2.length - 1) {
printf("%s follows %d times\n", result2(i)._1, result2(i)._2);
}
Any help would be appreciated. If I am overthinking this, I am open to different ideas and suggestions.
Here's one way to do it using MLlib's sliding function:
import org.apache.spark.mllib.rdd.RDDFunctions._
val resRDD = textFile.
  flatMap(_.split("""[\s,.;:!?]+""")).      // split lines into words, dropping punctuation
  sliding(2).                               // every pair of consecutive words
  map{ case Array(x, y) => ((x, y), 1) }.
  reduceByKey(_ + _).                       // count each (word, follower) pair
  map{ case ((x, y), c) => (x, y, c) }.
  sortBy( z => (z._1, z._3, z._2), false )  // sort descending by (word, count, follower)
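If what you ultimately want is only the single most frequent follower of each word (rather than the full sorted list), a rough follow-up sketch on top of resRDD could be:
// For each word, keep only the follower with the highest count.
val topFollower = resRDD.
  map{ case (x, y, c) => (x, (y, c)) }.
  reduceByKey( (a, b) => if (a._2 >= b._2) a else b )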

How to split a spark dataframe with equal records

I am using df.randomSplit(), but it is not splitting into equal numbers of rows. Is there any other way I can achieve this?
In my case I needed balanced (equal-sized) partitions in order to perform a specific cross-validation experiment.
For that you usually:
Randomize the dataset
Apply modulus operation to assign each element to a fold (partition)
After this step you will have to extract each partition using filter; as far as I know there is still no transformation to separate a single RDD into many.
Here is some code in Scala; it only uses standard Spark operations, so it should be easy to adapt to Python:
val npartitions = 3

// `data` stands for the source RDD here; `seed` and `m_classIndex` (the position of the
// class label inside each instance) are assumed to be defined elsewhere.
val foldedRDD = data
  // Pair each instance with a deterministic pseudo-random number
  .zipWithIndex
  .map ( t => (t._1, t._2, new scala.util.Random(t._2*seed).nextInt()) )
  // Random ordering, stratified by the class value at m_classIndex
  .sortBy( t => (t._1(m_classIndex), t._3) )
  // Assign each instance to a fold
  .zipWithIndex
  .map( t => (t._1, t._2 % npartitions) )

val balancedRDDList =
  for (f <- 0 until npartitions)
    yield foldedRDD.filter( _._2 == f )
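Since the question itself asks about a DataFrame, here is a rough sketch of the same idea with the DataFrame API; df is the DataFrame from the question, while the rnd/fold column names and the seed 42 are just placeholders:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rand, row_number}

val n = 3
// Shuffle randomly, then assign fold = row_number % n, so fold sizes differ by at most one row.
// Note: Window.orderBy without partitionBy funnels all rows through a single partition to number them.
val withFold = df
  .withColumn("rnd", rand(42L))
  .withColumn("fold", (row_number().over(Window.orderBy(col("rnd"))) - 1) % n)

val splits = (0 until n).map(f => withFold.filter(col("fold") === f).drop("rnd", "fold"))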

word count (frequency) spark rdd scala

If I have an RDD across the cluster and I want to do the word count, I want not only the number of appearances of each word but also its frequency, which is defined as count / total count.
What is the best and most efficient way to do this in Scala? How can I do the reduction job and calculate the total count at the same time, within one workflow?
BTW, I know a plain word count can be done this way:
text_file = spark.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
But what is the difference if I use aggregate instead, in terms of the Spark job workflow?
val result = pairs
  .aggregate(Map[String, Int]())(
    (acc, pair) =>
      if (acc.contains(pair._1))
        acc ++ Map[String, Int]((pair._1, acc(pair._1) + 1))
      else
        acc ++ Map[String, Int]((pair._1, pair._2)),
    (a, b) =>
      (a.toSeq ++ b.toSeq)
        .groupBy(_._1)
        .mapValues(_.map(_._2).reduce(_ + _))
  )
You can use this:
val total = counts.map(x => x._2).sum()          // sum() returns a Double
val freq = counts.map(x => (x._1, x._2 / total)) // so this division is not integer division
There is also the concept of an Accumulator, which is a write-only variable (from the workers' point of view); you could use it to avoid the separate sum() action, but your code would need a lot of changes.
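For completeness, a rough sketch of that Accumulator idea, assuming textFile is the RDD[String] of lines (e.g. from sc.textFile(...)). One caveat: accumulator updates made inside transformations can be applied more than once if tasks are retried, so the total is not guaranteed to be exact:
val totalWords = sc.longAccumulator("totalWords")

val wordCounts = textFile
  .flatMap(_.split(" "))
  .map { w => totalWords.add(1); (w, 1) }  // count every word as it passes through
  .reduceByKey(_ + _)
  .cache()

wordCounts.count()                          // run an action so the accumulator gets populated
val total = totalWords.sum.toDouble
val frequencies = wordCounts.mapValues(_ / total)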

Spark - Sum values of two key-value pair RDDs

I have two files A and B whose contents are as follows:
A
brown i like
big is house
jumps over lazy
B
this is my house
my house is brown
brown is color
I want to count the occurrence of each word in each file separately and then sum the results so as to obtain the count of all words across the two files, i.e. if a word occurs in both files then its final count would be the sum total of its counts in the two files.
Following is the code I have written thus far:
val readme = sc.textFile("A.txt")
val readmesplit = readme.flatMap(line => line.split(" "))
val changes = sc.textFile("B.txt")
val changessplit = changes.flatMap(line => line.split(" "))
val readmeKV = readmesplit.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val changesKV = changessplit.map(x => (x,1)).reduceByKey((x, y) => x + y)
val ans = readmeKV.fullOuterJoin(changesKV).collect()
This code gives me the following output:
(this,(Some(1),None)), (is,(Some(3),Some(1))), (big,(None,Some(1))),
(lazy,(None,Some(1))), (house,(Some(2),Some(1))), (over,(None,Some(1)))...and so on
Now how do I sum the value tuple of each key to obtain the occurrence of each word across both files?
Have you tried using union instead of fullOuterJoin?
val ans = readmesplit.union(changessplit).map(x => (x, 1)).reduceByKey((x, y) => x + y)
Alternatively, keep your fullOuterJoin and sum the two Options in each value:
val totals = readmeKV.fullOuterJoin(changesKV).map {
  case (word, (one, two)) => (word, one.getOrElse(0) + two.getOrElse(0))
}
Just extract the two values, returning 0 if the word is not present in one of the files, and add the results.