Scala: write a MapReduce program to find a word that follows a word [Homework] - scala

I have a homework assignment where I must write a MapReduce program in Scala to find, for each word in a file, the word that follows it most often.
For example, for the word "basketball", the word "is" comes next 5 times, "has" 2 times, and "court" 1 time.
In a text file this might show up as:
basketball is..... (this sequence happens 5 times)
basketball has..... (this sequence happens 2 times)
basketball court.... (this sequence happens 1 time)
I am having a hard time conceptually figuring out how to do this.
The idea I have had, but have not been able to implement successfully, is:
Iterate through each word; if the word is "basketball", take the next word and add it to a map. Reduce by key, and sort from highest to lowest.
Unfortunately, I do not know how to take the word that follows the current one in a list of words.
For example, I would like to do something like this:
val lines = spark.textFile("basketball_words_only.txt") // process lines in file
// split into individual words
val words = lines.flatMap(line => line.split(" "))
var listBuff = new ListBuffer[String]() // a ListBuffer to hold each following word
val it = Iterator(words)
while (it.hasNext) {
  listBuff += it.next().next() // <-- this is what I would like to do
}
val follows = listBuff.map(word => (word, 1))
val count = follows.reduceByKey((x, y) => x + y) // another issue, as I cannot reduceByKey with a ListBuffer
val sort = count.sortBy(_._2, false, 1)
val result2 = sort.collect()
for (i <- 0 to result2.length - 1) {
  printf("%s follows %d times\n", result2(i)._1, result2(i)._2)
}
Any help would be appreciated. If I am overthinking this, I am open to different ideas and suggestions.

Here's one way to do it using MLlib's sliding function:
import org.apache.spark.mllib.rdd.RDDFunctions._

val resRDD = textFile.
  flatMap(_.split("""[\s,.;:!?]+""")).
  sliding(2).
  map { case Array(x, y) => ((x, y), 1) }.
  reduceByKey(_ + _).
  map { case ((x, y), c) => (x, y, c) }.
  sortBy(z => (z._1, z._3, z._2), false)
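If the goal is specifically the single most frequent follower for each word (as in the basketball example), one possible follow-up, shown only as a sketch and not part of the answer above, is to re-key the (first, second, count) triples in resRDD by the first word and reduce, keeping the follower with the highest count:

// keep, for each first word, the follower with the highest count
val topFollower = resRDD.
  map { case (first, second, c) => (first, (second, c)) }.
  reduceByKey((a, b) => if (a._2 >= b._2) a else b)

topFollower.collect().foreach { case (w, (f, c)) =>
  printf("%s is most often followed by %s (%d times)\n", w, f, c)
}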

Related

Word count using map reduce on Seq[String]

I have a Seq which contains randomly generated words.
I want to calculate the occurrence count of each word using map reduce.
Now, I have been able to map the words against a value of 1 and group them together by word.
val mapValues = ourWords.map(word => (word, 1))
val groupedData = mapValues.groupBy(_._1)
However, I am not sure how to use the reduce function on groupedData to get the count.
I tried this -
groupedData.reduce((x, y) => x._2.reduce((x, y) => x._2) + y._2.reduce((l, r) => l._2))
but it's riddled with errors.
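One possible way to finish the reduce step, as a sketch rather than a definitive answer, is to reduce each group's list of (word, 1) pairs down to a single sum per word:

// starting from the map/groupBy steps above
val mapValues = ourWords.map(word => (word, 1))
val groupedData = mapValues.groupBy(_._1)

// reduce each group's (word, 1) pairs down to a count per word
val wordCounts: Map[String, Int] =
  groupedData.map { case (word, pairs) =>
    (word, pairs.map(_._2).reduce(_ + _))
  }
// e.g. for Seq("a", "b", "a") this yields Map(a -> 2, b -> 1)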

Get max term and number

I did a MapReduce in Scala which counts the terms in book titles. I want to output both the term and the number, but I only get the number using:
println("max term :" +wordCount.reduce( (a,b)=> ("max", a._2 max b._2))._2)
I was wondering how I also include the term.
Thank you
Example:
("The", 5)
("Of", 8)
("is", 10)
…
my current code gives me the maximum number but I don't know how to get the term in.
Initial code:
val inputPR2Q1 = sc.textFile("/root/pagecounts-20160101-000000")
val titlecolumn = inputPR2Q1.map(line => line.split(" ")(1))
val wordCount = titlecolumn.flatMap(line => line.split("_")).map(word => (word,1)).reduceByKey(_ + _);
Here I just take a file containing book titles along with other data. I take the book titles alone and do a MapReduce to count each term in the titles separately.
Use .sortBy with ascending=false and take(1) on RDD
sc.textFile("/root/pagecounts-20160101-000000").
map(line => line.split(" ")(1)).
flatMap(line => line.split("_")).
map(word => (word,1)).
reduceByKey(_ + _).
sortBy(_._2,ascending=false).
take(1)
I would suggest you take a look at the Scaladoc.
You can just use sortBy.
val (maxTerm, count) = wordCount.sortBy(_._2, ascending = false).take(1).head
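Another option, shown here only as a sketch building on the wordCount RDD from the question, is to keep the whole (term, count) pair inside the reduce instead of discarding the term, or to use max with an Ordering on the count:

// keep the pair with the larger count at every step of the reduce
val (maxTerm, maxCount) = wordCount.reduce((a, b) => if (a._2 >= b._2) a else b)
println("max term: " + maxTerm + ", count: " + maxCount)

// equivalent, using max with an Ordering on the second element of the pair
val top = wordCount.max()(Ordering.by[(String, Int), Int](_._2))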

How to reduce shuffling and time taken by Spark while making a map of items?

I am using Spark to read a CSV file like this:
x, y, z
x, y
x
x, y, c, f
x, z
I want to make a map of items to their counts. This is the code I wrote:
private def genItemMap[Item: ClassTag](data: RDD[Array[Item]], partitioner: HashPartitioner): mutable.Map[Item, Long] = {
  val immutableFreqItemsMap = data.flatMap(t => t)
    .map(v => (v, 1L))
    .reduceByKey(partitioner, _ + _)
    .collectAsMap()
  val freqItemsMap = mutable.Map(immutableFreqItemsMap.toSeq: _*)
  freqItemsMap
}
When I run it, it takes a lot of time and shuffle space. Is there a way to reduce the time?
I have a 2-node cluster with 2 cores each and 8 partitions. The number of lines in the CSV file is 170,000.
If you just want to do a unique item count, then I suppose you can take the following approach.
val data: RDD[Array[Item]] = ???

val itemFrequency = data
  .flatMap(arr => arr.map(item => (item, 1)))
  .reduceByKey(_ + _)
Do not provide any partitioner while reducing, otherwise it will cause re-shuffling. Just keep it with the partitioning it already had.
Also... do not collect the distributed data into a local in-memory object like a Map.
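For example, as a rough sketch (the take(20) below is arbitrary), downstream work can keep operating on itemFrequency as an RDD and only bring a small sample back to the driver:

// inspect only the most frequent items instead of collecting everything
val mostFrequent = itemFrequency
  .sortBy(_._2, ascending = false)
  .take(20)
mostFrequent.foreach(println)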

Spark - Sum values of two key-value pair RDDs

I have two files A and B whose contents are as following:
A
brown i like
big is house
jumps over lazy
B
this is my house
my house is brown
brown is color
I want to count the occurrence of each word in each file separately and then sum the results so as to obtain the count of all words across the two files, i.e. if a word occurs in both files then its final count is the sum of its counts in the two files.
Following is the code I have written thus far:
val readme = sc.textFile("A.txt")
val readmesplit = readme.flatMap(line => line.split(" "))
val changes = sc.textFile("B.txt")
val changessplit = changes.flatMap(line => line.split(" "))
val readmeKV = readmesplit.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val changesKV = changessplit.map(x => (x,1)).reduceByKey((x, y) => x + y)
val ans = readmeKV.fullOuterJoin(changesKV).collect()
This code gives me the following output:
(this,(Some(1),None)), (is,(Some(3),Some(1))), (big,(None,Some(1))),
(lazy,(None,Some(1))), (house,(Some(2),Some(1))), (over,(None,Some(1)))...and so on
Now how do I sum the value tuple of each key to obtain the occurrence of each word across both files?
Have you tried using union instead of fullOuterJoin?
val ans = readmesplit.union(changessplit).map(x => (x, 1)).reduceByKey((x, y) => x + y)
Alternatively, keep your fullOuterJoin and map over its result (before calling collect()) to sum the two Options:
val totals = readmeKV.fullOuterJoin(changesKV).map {
  case (word, (one, two)) => (word, one.getOrElse(0) + two.getOrElse(0))
}
Just extract the two values, returning 0 if the word is not present in one of the files, and add the results.

Replace every word in a document based on a defined pattern

I have a file with thousands of documents, and I want to read every document and replace each word in it with a (Key, Value) pair; at first, all values are zero (0).
docs file:
Problems installing software weak tech support.
great idea executed perfect.
A perfect solution for anyone who listens to the radio.
...
and I have a score_file containing many words, e.g.:
idea 1
software 1
weak 1
who 1
perfect 1
price -1
...
The output should follow this pattern:
(Problems,0) (installing,1) (software,1) (develop,2) (weak,1) (tech,1) (support,0).
(great,1) (idea,1) (executed,2) (perfect,1).
(perfect,1) (solution,1) (for,0) (anyone,1) (who,1) (listens,1) (to,0) (the,0) (radio,0).
If a word from a document occurs in the score_file, then the values of the word to its left, the word itself, and the word to its right in the document are each increased by 1 or -1, according to that word's score.
I've tried:
val Words_file = sc.textFile("score_file.txt")
val Words_List = sc.broadcast(Words_file.map({ (line) =>
  val Array(a, b) = line.split(" ").map(_.trim)
  (a, b.toInt)
}).collect().toMap)
val Docs_file = sc.textFile("docs_file.txt")
val tokens = Docs_file.map(line => (line.split(" ").map(word => Words_List.value.getOrElse(word, 0)).reduce(_ + _), line.split(" ").filter(Words_List.value.contains(_) == false).mkString(" ")))
val out_Docs = tokens.map(s => if (s._2.length > 3) {s._1 + "," + s._2})
But it scores every document rather than its individual words; how can I generate my desired output?
It is kinda hard to read your code; you seem to use a weird mix of CamelCase and underscores, with vals sometimes starting with uppercase and sometimes not.
I'm not sure I completely got the question, but to get an output where each word in a given line is replaced by itself and the number coming from the other file, this might work:
val sc = new SparkContext()
val wordsFile = sc.textFile("words_file.txt")
val words = sc.broadcast(wordsFile.map(line => {
  val Array(a, b) = line.split(" ").map(_.trim)
  (a, b.toInt)
}).collect().toMap)

val docs = sc.textFile("docs_file.txt")
val tokens = docs.map(line => {
  line.split(" ")
    .map(token => s"(${token}, ${words.value.getOrElse(token, 0)})")
    .mkString(" ")
})
tokens ends up being an RDD[String], just like the input, preserving the lines (documents). I reformatted the code a bit to make it more readable.
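As a small usage sketch (the output path below is just a placeholder), the scored lines could then be inspected or written out like this:

// print a few scored lines and write the full result out
tokens.take(3).foreach(println)
tokens.saveAsTextFile("scored_docs_output")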