MapReduce example in Scala - scala

I have this problem in Scala for a Homework.
The idea I have had but have not been able to successfully implement is
Iterate through each word, if the word is basketball, take the next word and add it to a map. Reduce by key, and sort from highest to lowest.
Unfortunately I do not know how to take the next next word in a list of words.
For example, i would like to do something like this:
val lines = spark.textFile("basketball_words_only.txt") // process lines in file
// split into individual words
val words = lines.flatMap(line => line.split(" "))
var listBuff = new ListBuffer[String]() // a list Buffer to hold each following word
val it = Iterator(words)
while (it.hasNext) {
listBuff += it.next().next() // <-- this is what I would like to do
}
val follows = listBuff.map(word => (word, 1))
val count = follows.reduceByKey((x, y) => x + y) // another issue as I cannot reduceByKey with a listBuffer
val sort = count.sortBy(_._2,false,1)
val result2 = sort.collect()
for (i <- 0 to result2.length - 1) {
printf("%s follows %d times\n", result1(2)._1, result2(i)._2);
}
Any help would be appreciated

You can get the max count for the first word in all distinct word pairs in a few steps:
Strip punctuations, split content into words which get lowercased
Use sliding(2) to create array of word pairs
Use reduceByKey to count occurrences of distinct word pairs
Use reduceByKey again to capture word pairs with max count for the first word
Sample code as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.rdd.RDDFunctions._
val wordPairCountRDD = sc.textFile("/path/to/textfile").
flatMap( _.split("""[\s,.;:!?]+""") ).
map( _.toLowerCase ).
sliding(2).
map{ case Array(w1, w2) => ((w1, w2), 1) }.
reduceByKey( _ + _ )
val wordPairMaxRDD = wordPairCountRDD.
map{ case ((w1, w2), c) => (w1, (w2, c)) }.
reduceByKey( (acc, x) =>
if (x._2 > acc._2) (x._1, x._2) else acc
).
map{ case (w1, (w2, c)) => ((w1, w2), c) }
[UPDATE]
If you only need the word pair counts to be sorted (in descending order) per your revised requirement, you can skip step 4 and use sortBy on wordPairCountRDD:
wordPairCountRDD.
sortBy( z => (z._2, z._1._1, z._1._2), ascending = false )

This is from https://spark.apache.org/examples.html:
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
As you can see it counts the occurrence of individual words because the key-value pairs are of the form (word, 1). Which part do you need to change to count combinations of words?
This might help you: http://daily-scala.blogspot.com/2009/11/iteratorsliding.html

Well, my text uses "b" instead of "basketball" and "a", "c" for other words.
scala> val r = scala.util.Random
scala> val s = (1 to 20).map (i => List("a", "b", "c")(r.nextInt (3))).mkString (" ")
s: String = c a c b a b a a b c a b b c c a b b c b
The result is gained by split, sliding, filter, map, groupBy, map and sortBy:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}.toList.sortBy (-_._2)
counts: List[(Char, Int)] = List((c,3), (b,2), (a,2))
In small steps, sliding:
scala> val counts = s.split (" ").sliding (2).toList
counts: List[Array[String]] = List(Array(c, a), Array(a, c), Array(c, b), Array(b, a), Array(a, b), Array(b, a), Array(a, a), Array(a, b), Array(b, c), Array(c, a), Array(a, b), Array(b, b), Array(b, c), Array(c, c), Array(c, a), Array(a, b), Array(b, b), Array(b, c), Array(c, b))
filter:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").toList
counts: List[Array[String]] = List(Array(b, a), Array(b, a), Array(b, c), Array(b, b), Array(b, c), Array(b, b), Array(b, c))
map (_(1)) (Array access element 2)
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList
counts: List[String] = List(a, a, c, b, c, b, c)
groupBy (_(0))
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0))
counts: scala.collection.immutable.Map[Char,List[String]] = Map(b -> List(b, b), a -> List(a, a), c -> List(c, c, c))
to size of List:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}
counts: scala.collection.immutable.Map[Char,Int] = Map(b -> 2, a -> 2, c -> 3)
Finally sort descending:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}.toList.sortBy (-_._2)
counts: List[(Char, Int)] = List((c,3), (b,2), (a,2))

Related

Scala different outputs of logically same programs

val a = List(1, 2, 3, 4, 5)
val b = a.grouped(2).filter(_.length == 2).map(x => (x(0), x(1)))
//b.foreach(x => println(x))
val r = b.foldLeft((0, 0)) {
case ((m, n), (x, y)) => {
(m + x, n + y)
}
}
println(r)
The program gives correct output (4, 6) for the above program. But when I uncomment the foreach statement above it outputs (0, 0). What's wrong here?
val b = a.grouped(2).filter(_.length == 2).map(x => (x(0), x(1))), b's type is Iterator:
scala> :type b
Iterator[(Int, Int)]
so when you have iterated b by b.foreach(x => println(x)), after this the current iterator b is empty, Since Iterator only can be iterated once.

How to convert String Iterator into List of Tuples

How can
val s = Iterator("a|b|2","a|c|3")
be converted to
List( (("a" , "b") , 2) , (("a" , "c") , 3)))
This is my current progress :
val v = s.map(m => m.split("|")(0))
How can I parse the String into its constituent parts so can be converted to a List of Tuples ?
You can match on the array returned from split:
val v = s.map(_.split('|') match { case Array(a, b, n) => ((a, b), n.toInt) })

How to use an autoincrement index in for comprehension in Scala

Is it possible to use an autoincrement counter in for comprehensions in Scala?
something like
for (element <- elements; val counter = counter+1) yield NewElement(element, counter)
I believe, that you are looking for zipWithIndex method available on List and other collections. Here is small example of it's usage:
scala> val list = List("a", "b", "c")
list: List[java.lang.String] = List(a, b, c)
scala> list.zipWithIndex
res0: List[(java.lang.String, Int)] = List((a,0), (b,1), (c,2))
scala> list.zipWithIndex.map{case (elem, idx) => elem + " with index " + idx}
res1: List[java.lang.String] = List(a with index 0, b with index 1, c with index 2)
scala> for ((elem, idx) <- list.zipWithIndex) yield elem + " with index " + idx
res2: List[java.lang.String] = List(a with index 0, b with index 1, c with index 2)
A for comprehension is not like a for loop in that the terms are evaluated for each previous term. As an example, look at the results below. I don't think that's what you are looking for:
scala> val elements = List("a", "b", "c", "d")
elements: List[java.lang.String] = List(a, b, c, d)
scala> for (e <- elements; i <- 0 until elements.length) yield (e, i)
res2: List[(java.lang.String, Int)] = List((a,0), (a,1), (a,2), (a,3), (b,0), (b,1), (b,2), (b,3), (c,0), (c,1), (c,2), (c,3), (d,0), (d,1), (d,2), (d,3))
tenshi's answer is probably more on track with your desired result, but I hope this counterexample is useful.

Cartesian product of two lists

Given a map where a digit is associated to several characters
scala> val conversion = Map("0" -> List("A", "B"), "1" -> List("C", "D"))
conversion: scala.collection.immutable.Map[java.lang.String,List[java.lang.String]] =
Map(0 -> List(A, B), 1 -> List(C, D))
I want to generate all possible character sequences based on a sequence of digits. Examples:
"00" -> List("AA", "AB", "BA", "BB")
"01" -> List("AC", "AD", "BC", "BD")
I can do this with for comprehensions
scala> val number = "011"
number: java.lang.String = 011
Create a sequence of possible characters per index
scala> val values = number map { case c => conversion(c.toString) }
values: scala.collection.immutable.IndexedSeq[List[java.lang.String]] =
Vector(List(A, B), List(C, D), List(C, D))
Generate all the possible character sequences
scala> for {
| a <- values(0)
| b <- values(1)
| c <- values(2)
| } yield a+b+c
res13: List[java.lang.String] = List(ACC, ACD, ADC, ADD, BCC, BCD, BDC, BDD)
Here things get ugly and it will only work for sequences of three digits. Is there any way to achieve the same result for any sequence length?
The following suggestion is not using a for-comprehension. But I don't think it's a good idea after all, because as you noticed you'd be tied to a certain length of your cartesian product.
scala> def cartesianProduct[T](xss: List[List[T]]): List[List[T]] = xss match {
| case Nil => List(Nil)
| case h :: t => for(xh <- h; xt <- cartesianProduct(t)) yield xh :: xt
| }
cartesianProduct: [T](xss: List[List[T]])List[List[T]]
scala> val conversion = Map('0' -> List("A", "B"), '1' -> List("C", "D"))
conversion: scala.collection.immutable.Map[Char,List[java.lang.String]] = Map(0 -> List(A, B), 1 -> List(C, D))
scala> cartesianProduct("01".map(conversion).toList)
res9: List[List[java.lang.String]] = List(List(A, C), List(A, D), List(B, C), List(B, D))
Why not tail-recursive?
Note that above recursive function is not tail-recursive. This isn't a problem, as xss will be short unless you have a lot of singleton lists in xss. This is the case, because the size of the result grows exponentially with the number of non-singleton elements of xss.
I could come up with this:
val conversion = Map('0' -> Seq("A", "B"), '1' -> Seq("C", "D"))
def permut(str: Seq[Char]): Seq[String] = str match {
case Seq() => Seq.empty
case Seq(c) => conversion(c)
case Seq(head, tail # _*) =>
val t = permut(tail)
conversion(head).flatMap(pre => t.map(pre + _))
}
permut("011")
I just did that as follows and it works
def cross(a:IndexedSeq[Tree], b:IndexedSeq[Tree]) = {
a.map (p => b.map( o => (p,o))).flatten
}
Don't see the $Tree type that am dealing it works for arbitrary collections too..

Expand a Set[Set[String]] into Cartesian Product in Scala

I have the following set of sets. I don't know ahead of time how long it will be.
val sets = Set(Set("a","b","c"), Set("1","2"), Set("S","T"))
I would like to expand it into a cartesian product:
Set("a&1&S", "a&1&T", "a&2&S", ..., "c&2&T")
How would you do that?
I think I figured out how to do that.
def combine(acc:Set[String], set:Set[String]) = for (a <- acc; s <- set) yield {
a + "&" + s
}
val expanded = sets.reduceLeft(combine)
expanded: scala.collection.immutable.Set[java.lang.String] = Set(b&2&T, a&1&S,
a&1&T, b&1&S, b&1&T, c&1&T, a&2&T, c&1&S, c&2&T, a&2&S, c&2&S, b&2&S)
Nice question. Here's one way:
scala> val seqs = Seq(Seq("a","b","c"), Seq("1","2"), Seq("S","T"))
seqs: Seq[Seq[java.lang.String]] = List(List(a, b, c), List(1, 2), List(S, T))
scala> val seqs2 = seqs.map(_.map(Seq(_)))
seqs2: Seq[Seq[Seq[java.lang.String]]] = List(List(List(a), List(b), List(c)), List(List(1), List(2)), List(List(S), List(T)))
scala> val combined = seqs2.reduceLeft((xs, ys) => for {x <- xs; y <- ys} yield x ++ y)
combined: Seq[Seq[java.lang.String]] = List(List(a, 1, S), List(a, 1, T), List(a, 2, S), List(a, 2, T), List(b, 1, S), List(b, 1, T), List(b, 2, S), List(b, 2, T), List(c, 1, S), List(c, 1, T), List(c, 2, S), List(c, 2, T))
scala> combined.map(_.mkString("&"))
res11: Seq[String] = List(a&1&S, a&1&T, a&2&S, a&2&T, b&1&S, b&1&T, b&2&S, b&2&T, c&1&S, c&1&T, c&2&S, c&2&T)
Came after the batle ;) but another one:
sets.reduceLeft((s0,s1)=>s0.flatMap(a=>s1.map(a+"&"+_)))
Expanding on dsg's answer, you can write it more clearly (I think) this way, if you don't mind the curried function:
def combine[A](f: A => A => A)(xs:Iterable[Iterable[A]]) =
xs reduceLeft { (x, y) => x.view flatMap { y map f(_) } }
Another alternative (slightly longer, but much more readable):
def combine[A](f: (A, A) => A)(xs:Iterable[Iterable[A]]) =
xs reduceLeft { (x, y) => for (a <- x.view; b <- y) yield f(a, b) }
Usage:
combine[String](a => b => a + "&" + b)(sets) // curried version
combine[String](_ + "&" + _)(sets) // uncurried version
Expanding on #Patrick's answer.
Now it's more general and lazier:
def combine[A](f:(A, A) => A)(xs:Iterable[Iterable[A]]) =
xs.reduceLeft { (x, y) => x.view.flatMap {a => y.map(f(a, _)) } }
Having it be lazy allows you to save space, since you don't store the exponentially many items in the expanded set; instead, you generate them on the fly. But, if you actually want the full set, you can still get it like so:
val expanded = combine{(x:String, y:String) => x + "&" + y}(sets).toSet