Word pair co-occurrence - scala

I am trying to count word pairs within the same line.
So if the input is
"How many many people die
How many"
I expect (How,many) = 3, but my program somehow doesn't add the repeated pair from the same line, and the result is (How,many) = 2.
val lines = streamingContext.textFileStream(inputDir)
lines.foreachRDD(rdd => {
  // split each line into words, then emit every 2-word combination as ("a-b", 1)
  val pairs = rdd.map(_.split(" "))
    .flatMap(_.combinations(2).map { pair => (pair.mkString("-"), 1) })
  pairs.saveAsTextFile(output + System.currentTimeMillis())
  val collectedCounts = pairs.collect
  collectedCounts.foreach(c => println(c))
})
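One possible explanation, offered as a sketch rather than a confirmed diagnosis of the streaming job: Scala's `combinations` returns each distinct combination only once, so a line with a repeated word ("many many") still yields (How,many) a single time. The snippet below uses plain Scala collections instead of an RDD and builds pairs by position. The names `PairCounts`, `linePairs` and `countPairs` are made up for illustration.

```scala
object PairCounts {
  // All position-based pairs (i < j) of the words on one line.
  // Unlike combinations(2), a repeated word produces a repeated pair.
  def linePairs(line: String): Seq[(String, String)] = {
    val words = line.split(" ").filter(_.nonEmpty).toIndexedSeq
    for {
      i <- words.indices
      j <- (i + 1) until words.length
    } yield (words(i), words(j))
  }

  // Count every pair over all lines.
  def countPairs(lines: Seq[String]): Map[(String, String), Int] =
    lines.flatMap(line => linePairs(line))
      .groupBy(identity)
      .map { case (pair, occurrences) => (pair, occurrences.size) }
}
```

In a Spark job the same `linePairs` function could be passed to `rdd.flatMap`, followed by `map(p => (p, 1)).reduceByKey(_ + _)`.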


SPARK: sum of elements with the same indexes from RDD[Array[Int]] in spark-rdd

I have three files like:
file1: 1,2,3,4,5
file2: 11,12,13,14,15
file3: 21,22,23,24,25
I have to find the sum of rows from each file:
1+2+3+4+5 + 11+12+13+14+15 + 21+22+23+24+25
6+7+8+9+10 + 16+17+18+19+20 + 26+27+28+29+30
I have written the following code in Spark/Scala to get the array of sums of all the rows:
val filesRDD = sc.wholeTextFiles("path to folder\\numbers\\*")
// creating RDD[Array[String]]
val linesRDD = filesRDD.map(elem => elem._2.split("\\n"))
// creating RDD[Array[Array[Int]]]
val rdd1 = linesRDD.map(line => line.map(str => str.split(",").map(_.trim.toInt)))
// creating RDD[Array[Int]]
val rdd2 = rdd1.map(elem => elem.map(e => e.sum))
rdd2.collect.foreach(elem => println(elem.mkString(",")))
The output I am getting is:
15,40
65,90
115,140
What I want is to sum 15+65+115 and 40+90+140.
Any help is appreciated!
The files can have different numbers of lines (some with 3 lines, others with 4), and there can be any number of files.
I want to do this using RDDs only, not DataFrames.
You can use reduce to sum up the arrays:
val result = rdd2.reduce((x,y) => (x,y).zipped.map(_ + _))
// result: Array[Int] = Array(195, 270)
If the files have different numbers of lines (e.g. file 3 has only one line, 21,22,23,24,25), use zipAll to pad the shorter array with zeros:
val result = rdd2.reduce((x,y) => x.zipAll(y,0,0).map{case (a, b) => a + b})
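To see what the zipAll-based reduce does without a cluster, here is a minimal sketch on plain Scala collections. `IndexSum`, `addPadded` and `totals` are hypothetical names, and the input stands in for the per-file row sums held in rdd2.

```scala
object IndexSum {
  // Element-wise sum of two arrays; the shorter one is padded with zeros.
  def addPadded(x: Array[Int], y: Array[Int]): Array[Int] =
    x.zipAll(y, 0, 0).map { case (a, b) => a + b }

  // Reduce the per-file row sums to a single array of index-wise totals.
  def totals(rowSumsPerFile: Seq[Array[Int]]): Array[Int] =
    rowSumsPerFile.reduce((x, y) => addPadded(x, y))
}
```

Because padding with zero keeps the operation associative and commutative, the same function should be safe to hand to `rdd2.reduce`.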

Scala: write a MapReduce program to find a word that follows a word [Homework]

I have a homework assignment where I must write a MapReduce program in Scala to find, for each word in the file, which word follows it most often.
For example, for the word "basketball", the word "is" comes next 5 times, "has" 2 times, and "court" 1 time.
In a text file this might show up as:
basketball is..... (this sequence happens 5 times)
basketball has..... (this sequence happens 2 times)
basketball court.... (this sequence happens 1 time)
I am having a hard time conceptually figuring out how to do this.
The idea I have had, but have not been able to implement successfully, is:
Iterate through each word; if the word is basketball, take the next word and add it to a map. Reduce by key, and sort from highest to lowest.
Unfortunately, I do not know how to take the next word in a list of words.
For example, I would like to do something like this:
val lines = spark.textFile("basketball_words_only.txt") // process lines in file
// split into individual words
val words = lines.flatMap(line => line.split(" "))
var listBuff = new ListBuffer[String]() // a list buffer to hold each following word
val it = Iterator(words)
while (it.hasNext) {
  listBuff += it.next().next() // <-- this is what I would like to do
}
val follows = listBuff.map(word => (word, 1))
val count = follows.reduceByKey((x, y) => x + y) // another issue as I cannot reduceByKey with a ListBuffer
val sort = count.sortBy(_._2, false, 1)
val result2 = sort.collect()
for (i <- 0 to result2.length - 1) {
  printf("%s follows %d times\n", result2(i)._1, result2(i)._2)
}
Any help would be appreciated. If I am overthinking this, I am open to different ideas and suggestions.
Here's one way to do it using MLlib's sliding function:
import org.apache.spark.mllib.rdd.RDDFunctions._
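// textFile is assumed to be an RDD[String] holding the lines of the input file
val resRDD = textFile.
  // split into words (splitting on whitespace is an assumption)
  flatMap(_.split("\\s+")).
  // consecutive word pairs as Array(x, y)
  sliding(2).
  map{ case Array(x, y) => ((x, y), 1) }.
  reduceByKey(_ + _).
  map{ case ((x, y), c) => (x, y, c) }.
  sortBy( z => (z._1, z._3, z._2), false )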
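The same sliding-window idea can be tried on plain Scala collections before moving it to an RDD. This is only a sketch: `Followers` and `mostFrequentFollower` are hypothetical names, and when two followers are equally frequent, which one is returned is not specified.

```scala
object Followers {
  // For each word, the word that most often comes directly after it, with its count.
  def mostFrequentFollower(words: Seq[String]): Map[String, (String, Int)] = {
    // (word, next, count) for every distinct consecutive pair
    val pairCounts: Seq[(String, String, Int)] =
      words.sliding(2)
        .collect { case Seq(a, b) => (a, b) } // ignores a window with fewer than 2 words
        .toList
        .groupBy(identity)
        .toSeq
        .map { case ((a, b), occurrences) => (a, b, occurrences.size) }

    // keep the highest-count follower for each first word
    pairCounts.groupBy(_._1).map { case (word, candidates) =>
      val (_, next, n) = candidates.maxBy(_._3)
      (word, (next, n))
    }
  }
}
```

For example, for the words of "basketball is fun basketball is hard basketball has rules", the entry for "basketball" is ("is", 2).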

Spark - Sum values of two key-value pair RDDs

I have two files A and B whose contents are as follows:
brown i like
big is house
jumps over lazy
this is my house
my house is brown
brown is color
I want to count the occurrences of each word in each file separately and then sum the results to obtain the count of all words across the two files, i.e. if a word occurs in both files, its final count is the sum of its counts in each file.
Following is the code I have written thus far:
val readme = sc.textFile("A.txt")
val readmesplit = readme.flatMap(line => line.split(" "))
val changes = sc.textFile("B.txt")
val changessplit = changes.flatMap(line => line.split(" "))
val readmeKV = readmesplit.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val changesKV = changessplit.map(x => (x,1)).reduceByKey((x, y) => x + y)
val ans = readmeKV.fullOuterJoin(changesKV).collect()
This code gives me the following output:
(this,(Some(1),None)), (is,(Some(3),Some(1))), (big,(None,Some(1))),
(lazy,(None,Some(1))), (house,(Some(2),Some(1))), (over,(None,Some(1)))...and so on
Now, how do I sum the value tuple of each key to obtain the occurrences of each word in both files?
Have you tried using union instead of fullOuterJoin?
val ans = readmesplit.union(changessplit).map(x => (x,1)).reduceByKey((x, y) => x + y)
Alternatively, keep the fullOuterJoin result from the question and map over it:
val totals = ans.map {
  case (word, (one, two)) => (word, one.getOrElse(0) + two.getOrElse(0))
}
Just extract the two values, using 0 if the word is not present on that side, and add them.
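Both approaches should give the same totals. A minimal sketch on plain Scala maps, with hypothetical names `MergeCounts`, `wordCounts` and `merge`, shows the getOrElse(0) step standing in for the join:

```scala
object MergeCounts {
  // Word counts for one file, given its lines.
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  // Mimics fullOuterJoin followed by getOrElse(0): every key from either side,
  // with a missing side counted as 0.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    (a.keySet ++ b.keySet)
      .map(word => word -> (a.getOrElse(word, 0) + b.getOrElse(word, 0)))
      .toMap
}
```

Counting the concatenated lines of both files directly, which is what the union version does, produces the same map.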

Replace every word in a document based on a defined pattern

I have a file with thousands of documents, and I want to read every document and replace each word in it with a (key, value) pair, where all values start at zero (0).
docs file:
Problems installing software weak tech support.
great idea executed perfect.
A perfect solution for anyone who listens to the radio.
and I have a score_file that contains many words, e.g.:
idea 1
software 1
weak 1
who 1
perfect 1
price -1
The output should follow this pattern:
(Problems,0) (installing,1) (software,1) (develop,2) (weak,1) (tech,1) (support,0).
(great,1) (idea,1) (executed,2) (perfect,1).
(perfect,1) (solution,1) (for,0) (anyone,1) (who,1) (listens,1) (to,0) (the,0) (radio,0).
If a word of the document occurs in the score_file, then the values of (left word, this word, right word) in the document are increased by 1 or -1, according to that word's score.
I've tried:
val Words_file = sc.textFile("score_file.txt")
val Words_List = sc.broadcast(Words_file.map({ (line) =>
  val Array(a,b) = line.split(" ").map(_.trim)
  (a,b.toInt)}).collect().toMap)
val Docs_file = sc.textFile("docs_file.txt")
val tokens = Docs_file.map(line => (
  line.split(" ").map(word => Words_List.value.getOrElse(word, 0)).reduce(_ + _),
  line.split(" ").filter(Words_List.value.contains(_) == false).mkString(" ")))
val out_Docs = tokens.map(s => if (s._2.length > 3) {s._1 + "," + s._2})
But it scores each document, not its words. How can I generate my desired output?
It is kinda hard to read your code: you seem to use a weird mix of CamelCase and underscores, with vals sometimes starting with uppercase and sometimes not.
I'm not sure I completely got the question, but to get an output where each word in a given line is replaced by itself and the number coming from the other file, this might work:
val sc = new SparkContext()
val wordsFile = sc.textFile("words_file.txt")
// broadcast a Map[String, Int] of word -> score
val words = sc.broadcast(wordsFile.map( line => {
  val Array(a, b) = line.split(" ").map(_.trim)
  (a, b.toInt)
}).collect().toMap)
val docs = sc.textFile("docs_file.txt")
// replace each token with "(token, score)", defaulting to 0 for unknown words
val tokens = docs.map(line => {
  line.split(" ")
    .map(token => s"(${token}, ${words.value.getOrElse(token, 0)})")
    .mkString(" ")
})
tokens ends up being just an RDD[String], like the input, preserving lines (documents). I reformatted the code a bit to make it more readable.
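The code above attaches only each word's own score. The neighbour rule described in the question, where a scored word also adds its score to the words to its left and right, might look like the sketch below. It uses plain Scala collections, the names `NeighbourScores` and `scoreTokens` are hypothetical, and it assumes punctuation has already been stripped from the tokens.

```scala
object NeighbourScores {
  // Each token's value is its own score plus the scores of its immediate
  // left and right neighbours; words missing from `scores` count as 0.
  def scoreTokens(tokens: Seq[String], scores: Map[String, Int]): Seq[(String, Int)] = {
    val ts = tokens.toIndexedSeq
    val own = ts.map(t => scores.getOrElse(t, 0))
    ts.indices.map { i =>
      val left = if (i > 0) own(i - 1) else 0
      val right = if (i < own.length - 1) own(i + 1) else 0
      (ts(i), left + own(i) + right)
    }
  }
}
```

With the scores from the question, the tokens of "great idea executed perfect" come out as (great,1) (idea,1) (executed,2) (perfect,1). Inside Spark, `scoreTokens` could be applied per line in `docs.map`, reading the scores from the broadcast map.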

Count occurrences of each word in Apache Spark

val sc = new SparkContext("local[4]", "wc")
val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
val errors = lines.filter(line => line.contains("ERROR"))
// Count all the errors
errors.count()
The above snippet counts the number of lines that contain the word ERROR. Is there a simple function similar to "contains" that would return the number of occurrences of the word instead?
Say the file is gigabytes in size and I want to parallelize the work using a Spark cluster.
Just count the instances per line and sum those together:
val errorCount = lines.map{line => line.split("[\\p{Punct}\\p{Space}]").filter(_ == "ERROR").size}.reduce(_ + _)
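To check the per-line counting logic locally, here is a small sketch on a plain Seq[String] in place of the RDD. `ErrorCount` and `occurrences` are hypothetical names; the split pattern is the one from the answer, so only whole tokens equal to the word are counted.

```scala
object ErrorCount {
  // Split each line on punctuation or whitespace, count tokens equal to `word`,
  // then sum the per-line counts.
  def occurrences(lines: Seq[String], word: String): Int =
    lines.map(_.split("[\\p{Punct}\\p{Space}]").count(_ == word)).sum
}
```

Note that a token such as "ERRORS" is not counted, because the comparison is on the whole token.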