How to convert String Iterator into List of Tuples - scala

How can
val s = Iterator("a|b|2","a|c|3")
be converted to
List((("a", "b"), 2), (("a", "c"), 3))
This is my current progress :
val v = s.map(m => m.split("|")(0))
How can I parse the String into its constituent parts so that it can be converted to a List of Tuples?

You can match on the array returned from split, then call toList, since map on an Iterator returns another Iterator:
val v = s.map(_.split('|') match { case Array(a, b, n) => ((a, b), n.toInt) }).toList
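A slightly more defensive variant of the same idea, as a sketch: it assumes malformed lines should be skipped rather than raise a MatchError or NumberFormatException. The helper name parseLines is hypothetical and not part of the original answer.

```scala
import scala.util.Try

// Hypothetical helper: parses "x|y|n" lines and silently drops malformed ones.
def parseLines(lines: Iterator[String]): List[((String, String), Int)] =
  lines.flatMap { line =>
    // split('|') takes a Char, so the pipe is a literal delimiter;
    // split("|") would treat the pipe as a regex alternation instead.
    line.split('|') match {
      case Array(a, b, n) => Try(n.toInt).toOption.map(i => ((a, b), i))
      case _              => None
    }
  }.toList

val parsed = parseLines(Iterator("a|b|2", "a|c|3"))
```

Whether dropping bad lines is acceptable depends on the data; failing fast, as the one-liner does, may be preferable.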

Related

How to convert RDD[Array[String]] to RDD[(Int, HashMap[String, List])]?

I have input data:
time, id, counter, value
00.2, 1 , c1 , 0.2
00.2, 1 , c2 , 0.3
00.2, 1 , c1 , 0.1
and I want for every id to create a structure to store counters and values. After thinking about vectors and rejecting them, I came to this:
(id, Hashmap( (counter1, List(Values)), (Counter2, List(Values)) ))
(1, HashMap( (c1,List(0.2, 0.1)), (c2,List(0.3)))
The problem is that I can't convert to a HashMap inside the map transformation, and additionally I don't know if I will be able to reduce the list by counter inside the map.
Does anyone have any idea?
My code is :
val data = inputRdd
.map(y => (y(1).toInt, mutable.HashMap(y(2), List(y(3).toDouble)))).reduceByKey(_++_)
}
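Off the top of my head, untested:
import collection.mutable.HashMap
inputRdd
.map{ case Array(t, id, c, v) => (id.trim.toInt, (c.trim, v.trim)) }
.aggregateByKey(HashMap.empty[String, List[String]])(
// seqOp: prepend the value to the list for its counter (empty list if the counter is new)
{ case (m, (c, v)) => { m(c) = v :: m.getOrElse(c, Nil); m } },
// combOp: merge the per-partition maps, concatenating lists that share a counter
{ case (m1, m2) => { for ((k, vs) <- m2) m1(k) = m1.getOrElse(k, Nil) ::: vs; m1 } }
)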
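To see what the two functions passed to aggregateByKey are expected to do, here is a minimal sketch that runs on plain Scala collections, with no Spark involved. The names Acc, seqOp, combOp, part1 and part2 are made up for illustration, and the two sequences merely pretend to be partitions holding the rows for id 1.

```scala
type Acc = Map[String, List[Double]]

// seqOp: fold one (counter, value) pair into a partition-local accumulator.
def seqOp(acc: Acc, cv: (String, Double)): Acc = {
  val (c, v) = cv
  acc.updated(c, acc.getOrElse(c, Nil) :+ v)
}

// combOp: merge the accumulators built on two different partitions.
def combOp(a: Acc, b: Acc): Acc =
  b.foldLeft(a) { case (m, (c, vs)) => m.updated(c, m.getOrElse(c, Nil) ++ vs) }

// Two pretend partitions with the sample rows for id 1.
val part1 = Seq("c1" -> 0.2, "c2" -> 0.3)
val part2 = Seq("c1" -> 0.1)

val empty  = Map.empty[String, List[Double]]
val merged = combOp(part1.foldLeft(empty)(seqOp), part2.foldLeft(empty)(seqOp))
```

This version uses immutable maps, which avoids mutating the zero value; in Spark, the order of values in each list is not guaranteed across partitions.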
Here's one approach:
val rdd = sc.parallelize(Seq(
("00.2", 1, "c1", 0.2),
("00.2", 1, "c2", 0.3),
("00.2", 1, "c1", 0.1)
))
rdd.
map{ case (t, i, c, v) => (i, (c, v)) }.
groupByKey.mapValues(
_.groupBy(_._1).mapValues(_.map(_._2)).map(identity)
).
collect
// res1: Array[(Int, scala.collection.immutable.Map[String,Iterable[Double]])] = Array(
// (1,Map(c1 -> List(0.2, 0.1), c2 -> List(0.3)))
// )
Note that the final map(identity) is a remedy for the Map#mapValues not serializable problem suggested in this SO answer.
If, as you mentioned, you have inputRdd as
//inputRdd: org.apache.spark.rdd.RDD[Array[String]] = ParallelCollectionRDD[0] at parallelize at ....
Then a simple groupBy and foldLeft on the grouped values should do the trick for you to have the final desired result
val resultRdd = inputRdd.groupBy(_(1))
.mapValues(x => x
.foldLeft(Map.empty[String, List[String]]){(a, b) => {
if(a.keySet.contains(b(2))){
val c = a ++ Map(b(2) -> (a(b(2)) ++ List(b(3))))
c
}
else{
val c = a ++ Map(b(2) -> List(b(3)))
c
}
}}
)
//resultRdd: org.apache.spark.rdd.RDD[(String, scala.collection.immutable.Map[String,List[String]])] = MapPartitionsRDD[3] at mapValues at ...
//(1,Map(c1 -> List(0.2, 0.1), c2 -> List(0.3)))
Changing RDD[(String, scala.collection.immutable.Map[String,List[String]])] to RDD[(Int, HashMap[String,List[String]])] is just a matter of converting the key and the map type, which I hope will be easy for you to do.
I hope the answer is helpful
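As a rough sketch of that last conversion step, shown on a single element rather than on an RDD: the function name convert is hypothetical, and parsing the values to Double is an assumption based on the output the question asks for.

```scala
import scala.collection.mutable

// Stand-in for one element of resultRdd.
val grouped: (String, Map[String, List[String]]) =
  ("1", Map("c1" -> List("0.2", "0.1"), "c2" -> List("0.3")))

// Hypothetical helper: parse the id to Int, the values to Double,
// and copy the immutable Map into a mutable HashMap.
def convert(kv: (String, Map[String, List[String]])): (Int, mutable.HashMap[String, List[Double]]) = {
  val (id, counters) = kv
  val parsed = counters.map { case (c, vs) => c -> vs.map(_.trim.toDouble) }
  (id.trim.toInt, mutable.HashMap(parsed.toSeq: _*))
}

val converted = convert(grouped)
```

On the RDD itself this would presumably be applied with resultRdd.map(convert).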

MapReduce example in Scala

I have this problem in Scala for a Homework.
The idea I have had but have not been able to successfully implement is
Iterate through each word, if the word is basketball, take the next word and add it to a map. Reduce by key, and sort from highest to lowest.
Unfortunately I do not know how to take the next word in a list of words.
For example, I would like to do something like this:
val lines = spark.textFile("basketball_words_only.txt") // process lines in file
// split into individual words
val words = lines.flatMap(line => line.split(" "))
var listBuff = new ListBuffer[String]() // a list Buffer to hold each following word
val it = Iterator(words)
while (it.hasNext) {
listBuff += it.next().next() // <-- this is what I would like to do
}
val follows = listBuff.map(word => (word, 1))
val count = follows.reduceByKey((x, y) => x + y) // another issue as I cannot reduceByKey with a listBuffer
val sort = count.sortBy(_._2,false,1)
val result2 = sort.collect()
for (i <- 0 to result2.length - 1) {
printf("%s follows %d times\n", result2(i)._1, result2(i)._2);
}
Any help would be appreciated
You can get the max count for the first word in all distinct word pairs in a few steps:
1. Strip punctuation and split the content into lowercased words
2. Use sliding(2) to create an array of word pairs
3. Use reduceByKey to count occurrences of distinct word pairs
4. Use reduceByKey again to capture the word pair with the max count for each first word
Sample code as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.rdd.RDDFunctions._
val wordPairCountRDD = sc.textFile("/path/to/textfile").
flatMap( _.split("""[\s,.;:!?]+""") ).
map( _.toLowerCase ).
sliding(2).
map{ case Array(w1, w2) => ((w1, w2), 1) }.
reduceByKey( _ + _ )
val wordPairMaxRDD = wordPairCountRDD.
map{ case ((w1, w2), c) => (w1, (w2, c)) }.
reduceByKey( (acc, x) =>
if (x._2 > acc._2) (x._1, x._2) else acc
).
map{ case (w1, (w2, c)) => ((w1, w2), c) }
[UPDATE]
If you only need the word pair counts to be sorted (in descending order) per your revised requirement, you can skip step 4 and use sortBy on wordPairCountRDD:
wordPairCountRDD.
sortBy( z => (z._2, z._1._1, z._1._2), ascending = false )
This is from https://spark.apache.org/examples.html:
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
As you can see it counts the occurrence of individual words because the key-value pairs are of the form (word, 1). Which part do you need to change to count combinations of words?
This might help you: http://daily-scala.blogspot.com/2009/11/iteratorsliding.html
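For experimenting without a Spark cluster, the sliding idea can be tried on an ordinary string first. This is only a sketch: the sample sentence is made up, and the tie-breaking by word in the sort is an assumption.

```scala
// Made-up sample text standing in for the contents of the input file.
val text = "I like basketball games and basketball shoes but basketball games most"

// Lowercase, split on whitespace and punctuation, drop empty tokens.
val words = text.toLowerCase.split("""[\s,.;:!?]+""").filter(_.nonEmpty)

// sliding(2) yields adjacent word pairs; keep the word after each "basketball".
val followers = words.sliding(2).collect { case Array("basketball", next) => next }.toList

// Count each follower, then sort by descending count (ties broken by word).
val followCounts = followers
  .groupBy(identity)
  .map { case (w, ws) => (w, ws.size) }
  .toList
  .sortBy { case (w, n) => (-n, w) }
```

On an RDD the counting step would be done with map and reduceByKey rather than groupBy, as in the Spark snippets above.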
Well, my text uses "b" instead of "basketball" and "a", "c" for other words.
scala> val r = scala.util.Random
scala> val s = (1 to 20).map (i => List("a", "b", "c")(r.nextInt (3))).mkString (" ")
s: String = c a c b a b a a b c a b b c c a b b c b
The result is gained by split, sliding, filter, map, groupBy, map and sortBy:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}.toList.sortBy (-_._2)
counts: List[(Char, Int)] = List((c,3), (b,2), (a,2))
In small steps, sliding:
scala> val counts = s.split (" ").sliding (2).toList
counts: List[Array[String]] = List(Array(c, a), Array(a, c), Array(c, b), Array(b, a), Array(a, b), Array(b, a), Array(a, a), Array(a, b), Array(b, c), Array(c, a), Array(a, b), Array(b, b), Array(b, c), Array(c, c), Array(c, a), Array(a, b), Array(b, b), Array(b, c), Array(c, b))
filter:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").toList
counts: List[Array[String]] = List(Array(b, a), Array(b, a), Array(b, c), Array(b, b), Array(b, c), Array(b, b), Array(b, c))
map (_(1)) (Array access element 2)
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList
counts: List[String] = List(a, a, c, b, c, b, c)
groupBy (_(0))
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0))
counts: scala.collection.immutable.Map[Char,List[String]] = Map(b -> List(b, b), a -> List(a, a), c -> List(c, c, c))
to size of List:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}
counts: scala.collection.immutable.Map[Char,Int] = Map(b -> 2, a -> 2, c -> 3)
Finally sort descending:
scala> val counts = s.split (" ").sliding (2).filter (_(0) == "b").map (_(1)).toList.groupBy (_(0)).map { case (c: Char, l: List[String]) => (c, l.size)}.toList.sortBy (-_._2)
counts: List[(Char, Int)] = List((c,3), (b,2), (a,2))

Could anyone explain this code?

According to this link: https://github.com/amplab/training/blob/ampcamp6/machine-learning/scala/solution/MovieLensALS.scala
I don't understand the point of:
val numUsers = ratings.map(_._2.user).distinct.count
val numMovies = ratings.map(_._2.product).distinct.count
_._2.[user|product] , what does that mean?
That is accessing the tuple elements. The following example might explain it better.
val xs = List(
(1, "Foo"),
(2, "Bar")
)
xs.map(_._1) // => List(1,2)
xs.map(_._2) // => List("Foo", "Bar")
// An equivalent way to write this
xs.map(e => e._1)
xs.map(e => e._2)
// Perhaps a better way is
xs.collect {case (a, b) => a} // => List(1,2)
xs.collect {case (a, b) => b} // => List("Foo", "Bar")
ratings is a collection of tuples: (timestamp % 10, Rating(userId, movieId, rating)). The first underscore in _._2.user refers to the current element being processed by the map function, which here is a tuple (a pair of values). For a pair t you can refer to its first and second elements with the shorthand notation t._1 and t._2. So _._2 selects the second element of the tuple currently being processed, the Rating, and .user or .product then reads a field of that Rating.
val ratings = sc.textFile(movieLensHomeDir + "/ratings.dat").map { line =>
val fields = line.split("::")
// format: (timestamp % 10, Rating(userId, movieId, rating))
(fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
}
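The same access pattern can be reproduced on a plain Seq. In this sketch the Rating case class is a local stand-in for the MLlib class, and the sample values are invented.

```scala
// Local stand-in for org.apache.spark.mllib.recommendation.Rating.
case class Rating(user: Int, product: Int, rating: Double)

// Pairs of (timestamp % 10, Rating), mirroring the shape of the ratings RDD.
val sampleRatings = Seq(
  (3L, Rating(1, 10, 4.0)),
  (7L, Rating(1, 20, 3.5)),
  (3L, Rating(2, 10, 5.0))
)

// _._2 picks the Rating out of each pair; .user reads its user field.
val numUsers = sampleRatings.map(_._2.user).distinct.size

// The same thing for products, written with an explicit pattern match.
val numMovies = sampleRatings.map { case (_, r) => r.product }.distinct.size
```

The pattern-match form names the parts of the tuple, which some readers find easier to follow than chained underscores.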

How to insert elements in list from another list using scala?

I have two lists -
A = (("192.168.1.1","private","Linux_server","str1"),
("192.168.1.2","private","Linux_server","str2"))
B = ("A","B")
I want following output
outputList = (("192.168.1.1","private","Linux_server","str1", "A"),
("192.168.1.2","private","Linux_server","str2","B"))
I want to insert second list element into first list as list sequence.
Two lists size will be always same.
How do I get above output using scala??
The short answer:
A = (A zip B).map({ case (x, y) => x :+ y })
Some compiling code to be more explicit:
val a = List(
List("192.168.1.1", "private", "Linux_server", "str1"),
List("192.168.1.2", "private", "Linux_server", "str2")
)
val b = List("A", "B")
val c = List(
List("192.168.1.1", "private", "Linux_server", "str1", "A"),
List("192.168.1.2", "private", "Linux_server", "str2", "B")
)
assert((a zip b).map({ case (x, y) => x :+ y }) == c)
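Note that zip silently truncates to the shorter list. Since the question states the sizes always match, a sketch that makes this assumption explicit could look as follows; the helper name appendEach is hypothetical.

```scala
// Hypothetical helper: append the i-th extra element to the i-th row,
// failing loudly if the two lists differ in length.
def appendEach[T](rows: List[List[T]], extras: List[T]): List[List[T]] = {
  require(rows.size == extras.size, "both lists must have the same length")
  rows.zip(extras).map { case (row, extra) => row :+ extra }
}

val rows = List(
  List("192.168.1.1", "private", "Linux_server", "str1"),
  List("192.168.1.2", "private", "Linux_server", "str2")
)
val extended = appendEach(rows, List("A", "B"))
```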

Cartesian product of two lists

Given a map where a digit is associated to several characters
scala> val conversion = Map("0" -> List("A", "B"), "1" -> List("C", "D"))
conversion: scala.collection.immutable.Map[java.lang.String,List[java.lang.String]] =
Map(0 -> List(A, B), 1 -> List(C, D))
I want to generate all possible character sequences based on a sequence of digits. Examples:
"00" -> List("AA", "AB", "BA", "BB")
"01" -> List("AC", "AD", "BC", "BD")
I can do this with for comprehensions
scala> val number = "011"
number: java.lang.String = 011
Create a sequence of possible characters per index
scala> val values = number map { case c => conversion(c.toString) }
values: scala.collection.immutable.IndexedSeq[List[java.lang.String]] =
Vector(List(A, B), List(C, D), List(C, D))
Generate all the possible character sequences
scala> for {
| a <- values(0)
| b <- values(1)
| c <- values(2)
| } yield a+b+c
res13: List[java.lang.String] = List(ACC, ACD, ADC, ADD, BCC, BCD, BDC, BDD)
Here things get ugly and it will only work for sequences of three digits. Is there any way to achieve the same result for any sequence length?
The following suggestion does not use a for-comprehension with one generator per digit. I don't think that is a good idea, because, as you noticed, it ties you to a fixed length of the cartesian product.
scala> def cartesianProduct[T](xss: List[List[T]]): List[List[T]] = xss match {
| case Nil => List(Nil)
| case h :: t => for(xh <- h; xt <- cartesianProduct(t)) yield xh :: xt
| }
cartesianProduct: [T](xss: List[List[T]])List[List[T]]
scala> val conversion = Map('0' -> List("A", "B"), '1' -> List("C", "D"))
conversion: scala.collection.immutable.Map[Char,List[java.lang.String]] = Map(0 -> List(A, B), 1 -> List(C, D))
scala> cartesianProduct("01".map(conversion).toList)
res9: List[List[java.lang.String]] = List(List(A, C), List(A, D), List(B, C), List(B, D))
Why not tail-recursive?
Note that the recursive function above is not tail-recursive. This isn't a problem, as xss will be short unless you have a lot of singleton lists in xss, because the size of the result grows exponentially with the number of non-singleton elements of xss.
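If the recursion is a concern, one alternative is to build the product with foldRight instead. This is a different formulation from the recursive function above, offered only as a sketch; the helper name expand is hypothetical, and it assumes every digit in the input is a key of conversion (an unknown digit would throw NoSuchElementException).

```scala
// Cartesian product via foldRight: start from a single empty combination
// and prepend every element of each list, working from the last list backwards.
def cartesianProduct[T](xss: List[List[T]]): List[List[T]] =
  xss.foldRight(List(List.empty[T])) { (xs, acc) =>
    for (x <- xs; rest <- acc) yield x :: rest
  }

val conversion = Map('0' -> List("A", "B"), '1' -> List("C", "D"))

// Hypothetical helper: all character sequences for a string of digits.
def expand(number: String): List[String] =
  cartesianProduct(number.toList.map(conversion)).map(_.mkString)

val expanded = expand("011")
```

The result has the same ordering as the three-generator for-comprehension in the question, but works for any number of digits.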
I could come up with this:
val conversion = Map('0' -> Seq("A", "B"), '1' -> Seq("C", "D"))
def permut(str: Seq[Char]): Seq[String] = str match {
case Seq() => Seq.empty
case Seq(c) => conversion(c)
case Seq(head, tail @ _*) =>
val t = permut(tail)
conversion(head).flatMap(pre => t.map(pre + _))
}
permut("011")
I just did that as follows and it works
def cross(a:IndexedSeq[Tree], b:IndexedSeq[Tree]) = {
a.map (p => b.map( o => (p,o))).flatten
}
Ignore the Tree type that I am dealing with; it works for arbitrary collections too.
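A generic version of the same pairwise product, written as a for-comprehension, might look like this. It is a sketch with hypothetical type parameters A and B in place of Tree.

```scala
// Pair every element of as with every element of bs.
def cross[A, B](as: Seq[A], bs: Seq[B]): Seq[(A, B)] =
  for (a <- as; b <- bs) yield (a, b)

val pairs = cross(Seq(1, 2), Seq("x", "y"))
```

The for-comprehension desugars to flatMap and map, so it is equivalent to the map followed by flatten shown above.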