Spark - Sum values of two key-value pair RDDs - scala

I have two files A and B whose contents are as follows:
A
brown i like
big is house
jumps over lazy
B
this is my house
my house is brown
brown is color
I want to count the occurrence of each word in each file separately and then sum the results, so as to obtain the count of all words across the two files, i.e. if a word occurs in both files then its final count is the sum of its counts in the two files.
Following is the code I have written thus far:
val readme = sc.textFile("A.txt")
val readmesplit = readme.flatMap(line => line.split(" "))
val changes = sc.textFile("B.txt")
val changessplit = changes.flatMap(line => line.split(" "))
val readmeKV = readmesplit.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val changesKV = changessplit.map(x => (x,1)).reduceByKey((x, y) => x + y)
val ans = readmeKV.fullOuterJoin(changesKV).collect()
This code gives me the following output:
(this,(Some(1),None)), (is,(Some(3),Some(1))), (big,(None,Some(1))),
(lazy,(None,Some(1))), (house,(Some(2),Some(1))), (over,(None,Some(1)))...and so on
Now how do I sum the value tuple of each key to obtain the occurrence of each word in both files?

Have you tried using union instead of fullOuterJoin?
val ans = readmesplit.union(changessplit).map(x => (x,1)).reduceByKey((x, y) => x + y)

val totals = ans.map {
  case (word, (one, two)) => (word, one.getOrElse(0) + two.getOrElse(0))
}
Just extract the two values, returning 0 if the word's not present, and add the result.
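Another option along the same lines (a small sketch, not tested): since readmeKV and changesKV are already reduced per file, you can union those two RDDs and reduce once more instead of joining:
val combined = readmeKV.union(changesKV).reduceByKey(_ + _)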

Related

Word count using map reduce on Seq[String]

I have a Seq which contains randomly generated words.
I want to calculate the occurrence count of each word using map reduce.
Now, I have been able to map the words against a value 1 and grouped them together by words.
val mapValues = ourWords.map(word => (word, 1))
val groupedData = mapValues.groupBy(_._1)
However, I am not sure how to use the reduce function on groupedData to get the count.
I tried this -
groupedData.reduce((x, y) => x._2.reduce((x, y) => x._2) + y._2.reduce((l, r) => l._2))
but it's riddled with errors.
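One way to finish the count from groupedData (a sketch, assuming ourWords is a Seq[String] as described) is to sum the 1s in each group rather than reducing nested tuples:
// groupedData: Map[String, Seq[(String, Int)]]
val wordCounts = groupedData.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// equivalently: groupedData.mapValues(_.size)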

SPARK: sum of elements with the same indexes from RDD[Array[Int]] in spark-rdd

I have three files like:
file1: 1,2,3,4,5
6,7,8,9,10
file2: 11,12,13,14,15
16,17,18,19,20
file3: 21,22,23,24,25
26,27,28,29,30
I have to find the sum of the rows with the same index across the files:
1+2+3+4+5 + 11+12+13+14+15 + 21+22+23+24+25
6+7+8+9+10 + 16+17+18+19+20 + 26+27+28+29+30
I have written the following code in Spark Scala to get the array of sums of all the rows:
val filesRDD = sc.wholeTextFiles("path to folder\\numbers\\*")
// creating RDD[Array[String]]
val linesRDD = filesRDD.map(elem => elem._2.split("\\n"))
// creating RDD[Array[Array[Int]]]
val rdd1 = linesRDD.map(line => line.map(str => str.split(",").map(_.trim.toInt)))
// creating RDD[Array[Int]]
val rdd2 = rdd1.map(elem => elem.map(e => e.sum))
rdd2.collect.foreach(elem => println(elem.mkString(",")))
the output I am getting is:
15,40
65,90
115,140
What I want is to sum 15+65+115 and 40+90+140
Any help is appreciated!
PS:
The files can have a different number of lines, e.g. some with 3 lines and others with 4, and there can be any number of files.
I want to do this using RDDs only, not DataFrames.
You can use reduce to sum up the arrays:
val result = rdd2.reduce((x,y) => (x,y).zipped.map(_ + _))
// result: Array[Int] = Array(195, 270)
and if the files are of different length (e.g. file 3 has only one line 21,22,23,24,25)
val result = rdd2.reduce((x,y) => x.zipAll(y,0,0).map{case (a, b) => a + b})
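For example, if file3 contained only the single line 21,22,23,24,25, rdd2 would hold Array(15, 40), Array(65, 90) and Array(115), and the zipAll version would give:
// result: Array[Int] = Array(195, 130)   // 195 = 15+65+115, 130 = 40+90+0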

calculate cosine similarity in scala

I have a file (tags.csv) that contains UserId, MovieId and tags. I want to use a domain-based method to calculate the cosine similarity between tags, show only the tags relevant to comedy, and measure the similarity of each tag relative to the comedy tag.
My code is:
val rows = sc.textFile("/usr/local/comedy")
val vecData = rows.map(line => Vectors.dense(line.split(", ").map(_.toDouble)))
val mat = new RowMatrix(vecData)
val exact = mat.columnSimilarities()
val approx = mat.columnSimilarities(0.07)
val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }
val MAE = exactEntries.leftOuterJoin(approxEntries).values.map {
  case (u, Some(v)) =>
    math.abs(u - v)
  case (u, None) =>
    math.abs(u)
}.mean()
but this error appear:
java.lang.NumberFormatException: For input string: "[1,898,"black comedy"]"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
What's wrong?
The error message is full of pertinent info.
NumberFormatException: For input string: "[1,898,"black comedy"]"
It looks like the input String isn't being split into separate column data. So .split(", ") isn't doing its job, and it's easy to see why: there are no comma-space sequences to split on.
We could take out the space and split on just the comma, but that would still leave a non-digit [ in the 1st column, and the 3rd column has no digit characters at all.
There are a few different ways to attack this. I'd be tempted to use a regex parser.
val twoNums = "(\\d+),(\\d+),".r.unanchored
val vecData = rows.collect { case twoNums(a, b) =>
  Vectors.dense(Array(a.toDouble, b.toDouble))
}
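Applied to the sample line from the error message, the unanchored pattern should pick out the two leading numbers and silently drop anything that doesn't match (a sketch of what I'd expect, not tested):
// """[1,898,"black comedy"]""" match { case twoNums(a, b) => (a, b) }  // ("1", "898")
// lines without two comma-separated numbers are simply skipped by collect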

Spark - calculate max occurrence per day-event

I have the following RDD[String]:
1:AAAAABAAAAABAAAAABAAABBB
2:BBAAAAAAAAAABBAAAAAAAAAA
3:BBBBBBBBAAAABBAAAAAAAAAA
The first number is supposed to be days and the following characters are events.
I have to calculate the day where each event has the maximum occurrence.
The expected result for this dataset should be:
{ "A" -> Day2 , "B" -> Day3 }
(A is repeated 20 times on day 2 and B 10 times on day 3)
I am splitting the original dataset
val foo = rdd.map(_.split(":")).map(x => (x(0), x(1).split("")) )
What could be the best implementation for count and aggregation?
Any help is appreciated.
This should do the trick:
import org.apache.spark.rdd.RDD
val rdd = sqlContext.sparkContext.makeRDD(Seq(
  "1:AAAAABAAAAABAAAAABAAABBB",
  "2:BBAAAAAAAAAABBAAAAAAAAAA",
  "3:BBBBBBBBAAAABBAAAAAAAAAA"
))
val keys = Seq("A", "B")
val seqOfMaps: RDD[(String, Map[String, Int])] = rdd.map { str =>
  val split = str.split(":")
  (s"Day${split.head}", split(1).groupBy(a => a.toString).mapValues(_.length))
}
keys.map { key =>
  key -> seqOfMaps.mapValues(_.get(key).get).sortBy(a => -a._2).first._1
}.toMap
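Run on the sample data above, this should give the mapping asked for in the question:
// Map(A -> Day2, B -> Day3)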
The processing that needs to be done consists of transforming the data into an RDD shape on which functions like "find the maximum of a list" are easy to apply.
I will try to explain it step by step.
I've used dummy data for the "A" and "B" chars.
The foo rdd from the question is the first step; it gives you an RDD[(String, Array[String])].
Let's extract and count each char from the Array[String]:
val res3 = foo.map{case (d,s)=> (d, s.toList.groupBy(c => c).map{case (x, xs) => (x, xs.size)}.toList)}
(1,List((A,18), (B,6)))
(2,List((A,20), (B,4)))
(3,List((A,14), (B,10)))
Next we will flatMap over values to expand our rdd by char
res3.flatMapValues(list => list)
(3,(A,14))
(3,(B,10))
(1,(A,18))
(2,(A,20))
(2,(B,4))
(1,(B,6))
Rearrange the rdd so that each record is (char, count, day):
res5.map{case (d, (s, c)) => (s, c, d)}
(A,20,2)
(B,4,2)
(A,18,1)
(B,6,1)
(A,14,3)
(B,10,3)
Now we group by char:
res7.groupBy(_._1)
(A,CompactBuffer((A,18,1), (A,20,2), (A,14,3)))
(B,CompactBuffer((B,6,1), (B,4,2), (B,10,3)))
Finally we take the maximum count for each char:
res9.map{case (s, list) => (s, list.maxBy(_._2))}
(B,(B,10,3))
(A,(A,20,2))
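For reference, the same steps can be stitched into a single chain (untested, just combining the snippets above):
val best = foo
  .map { case (d, s) => (d, s.toList.groupBy(c => c).map { case (x, xs) => (x, xs.size) }.toList) }
  .flatMapValues(list => list)
  .map { case (d, (s, c)) => (s, c, d) }
  .groupBy(_._1)
  .map { case (s, list) => (s, list.maxBy(_._2)) }
// (A,(A,20,2)), (B,(B,10,3))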
Hope this helps.
Previous answers are good, but I prefer a solution like this:
val data = Seq(
  "1:AAAAABAAAAABAAAAABAAABBB",
  "2:BBAAAAAAAAAABBAAAAAAAAAA",
  "3:BBBBBBBBAAAABBAAAAAAAAAA"
)
val initialRDD = sparkContext.parallelize(data)
// to tuples like (1,'A',18)
val charCountRDD = initialRDD.flatMap(s => {
  val parts = s.split(":")
  val charCount = parts(1).groupBy(i => i).mapValues(_.length)
  charCount.map(i => (parts(0), i._1, i._2))
})
// group by character, and take max value from grouped collection
val result = charCountRDD.groupBy(i => i._2).map(k => k._2.maxBy(z => z._3))
result.foreach(println(_))
Result is:
(3,B,10)
(2,A,20)
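If you need the { "A" -> Day2, "B" -> Day3 } shape from the original question, the result RDD could then be reshaped and collected (a sketch on top of the answer above):
val asMap = result.map { case (day, ch, _) => (ch.toString, s"Day$day") }.collectAsMap()
// Map(B -> Day3, A -> Day2)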

Scala: write a MapReduce program to find a word that follows a word [Homework]

I have a homework assignment where I must write a MapReduce program in Scala to find, for each word in a file, the word that most frequently follows it.
For example, for the word "basketball", the word "is" comes next 5 times, "has" 2 times, and "court" 1 time.
In a text file this might show up as:
basketball is..... (this sequence happens 5 times)
basketball has..... (this sequence happens 2 times)
basketball court.... (this sequence happens 1 time)
I am having a hard time conceptually figuring out how to do this.
The idea I have had but have not been able to successfully implement is
Iterate through each word, if the word is basketball, take the next word and add it to a map. Reduce by key, and sort from highest to lowest.
Unfortunately I do not know how to take the word that follows the current word in a list of words.
For example, I would like to do something like this:
val lines = spark.textFile("basketball_words_only.txt") // process lines in file
// split into individual words
val words = lines.flatMap(line => line.split(" "))
var listBuff = new ListBuffer[String]() // a list Buffer to hold each following word
val it = Iterator(words)
while (it.hasNext) {
  listBuff += it.next().next() // <-- this is what I would like to do
}
val follows = listBuff.map(word => (word, 1))
val count = follows.reduceByKey((x, y) => x + y) // another issue as I cannot reduceByKey with a listBuffer
val sort = count.sortBy(_._2,false,1)
val result2 = sort.collect()
for (i <- 0 to result2.length - 1) {
  printf("%s follows %d times\n", result2(i)._1, result2(i)._2)
}
Any help would be appreciated. If I am overthinking this, I am open to different ideas and suggestions.
Here's one way to do it using MLlib's sliding function:
import org.apache.spark.mllib.rdd.RDDFunctions._
val resRDD = textFile.
  flatMap(_.split("""[\s,.;:!?]+""")).
  sliding(2).
  map{ case Array(x, y) => ((x, y), 1) }.
  reduceByKey(_ + _).
  map{ case ((x, y), c) => (x, y, c) }.
  sortBy( z => (z._1, z._3, z._2), false )
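resRDD counts every (word, follower) pair; to get, for each word, the single follower with the highest count (what the assignment asks for), one more reduceByKey should do it (untested sketch):
val topFollower = resRDD.
  map{ case (x, y, c) => (x, (y, c)) }.
  reduceByKey( (a, b) => if (a._2 >= b._2) a else b )
// e.g. (basketball, (is, 5)) if "is" follows "basketball" most often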