Scala flatMap to get the correct data structure

I have parsed the data and generated the following RDD:
x [RDD] = (458817,(CompactBuffer(20),CompactBuffer((837063182,0,1433142639864), (676690466,0,1433175090184), (4642913327036075112,1,1433177284025), (464291332,1,1433182403135), (4642913327036075112,0,1433185531150),
(464291332,0,1433186067803), (4642913327036075112,1,1433186266561), (851805971,0,1433190829047),
(6376558263039679112,1,1433203286945), (837063182,0,1433226615856), (8403476884799939112,0,1433287740066),
(764990231,0,1433289484047), (4642913327036075112,0,1433351165901), (464291332,1,1433351892238),
(4642913327036075112,0,1433374808826), (584492430,1,1433436093253))))
Here I am only showing one record from the RDD. My goal is to get the following RDD, where the first element is attached to each inner tuple:
(458817,837063182,0,1433142639864)
(458817,676690466,0,1433175090184)
(458817,464291332,1,1433177284025)
(458817,464291332,1,1433182403135)
(458817,464291332,0,1433185531150)
(458817,464291332,0,1433186067803)
(458817,464291332,1,1433186266561)
(458817,851805971,0,1433190829047)
(458817,637655826,1,1433203286945)
(458817,837063182,0,1433226615856)
By doing a flatMap I lose the first element and don't get access to it:
val r = x.map(l => l._2).flatMap(x => x._2).map(x => (x._1, x._2, x._3))

This would probably give you the wanted result:
val r = for {
  el <- Seq(x._1)
  (el1, el2, el3) <- x._2._2
} yield (el, el1, el2, el3)
Lift the first element to a Sequence to use it in the for expression.
Pull out the second CompactBuffer and yield the wanted tuples.

This gave me the exact structure I wanted.
val s = x.map(rec => rec._2._2.map(y => (rec._1, y._1, y._2, y._3))).flatMap(k => k)
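For reference, the same result can be produced in one pass with flatMap and pattern matching. This is a sketch assuming the record shape shown above (a key paired with two CompactBuffers, the second holding 3-tuples); the names key, id, flag, and ts are illustrative:

// Destructure each record and prepend the key to every inner tuple.
val s = x.flatMap { case (key, (_, events)) =>
  events.map { case (id, flag, ts) => (key, id, flag, ts) }
}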


Scala Map from Text File

Looking to create a Scala map from a text file. A sample of the text file (a few lines of it) can be seen below:
Alabama (9),Democratic:849624,Republican:1441170,Libertarian:25176,Others:7312
Alaska (3),Democratic:153778,Republican:189951,Libertarian:8897,Others:6904
Arizona (11),Democratic:1672143,Republican:1661686,Libertarian:51465,Green:1557,Others:475
I have been given the map buffer as follows:
var mapBuffer: Map[String, List[(String, Int)]] = Map()
Note the party values are separated by a colon.
I am trying to read the file contents and store the data in a map structure, where each line of the file is used to construct a map entry with the state as the key and a list of tuples as the value. The type of the structure should be Map[String, List[(String, Int)]].
Essentially I'm just trying to create a map of each line from the file, but I can't quite get it right. I tried the below with no luck - I think that 'val lines' should be an array rather than an iterator.
val stream : InputStream = getClass.getResourceAsStream("")
val lines: Iterator[String] = scala.io.Source.fromInputStream(stream).getLines
var map: Map[String, List[(String, Int)]] = lines
  .map(_.split(","))
  .map(line => (line(0).toString, line(1).toList))
  .toMap
This appears to do the job. (Scala 2.13.x)
val stateVotes =
  util.Using(io.Source.fromFile("votes.txt")) {
    val PartyVotes = "([^:]+):(\\d+)".r
    _.getLines()
      .map(_.split(",").toList)
      .toList
      .groupMapReduce(_.head)(_.tail.collect {
        case PartyVotes(p, v) => (p, v.toInt)
      })(_ ++ _)
  } // file is auto-closed
//stateVotes: Try[Map[String,List[(String, Int)]]] = Success(
// Map(Alabama (9) -> List((Democratic,849624), (Republican,1441170), (Libertarian,25176), (Others,7312))
// , Arizona (11) -> List((Democratic,1672143), (Republican,1661686), (Libertarian,51465), (Green,1557), (Others,475))
// , Alaska (3) -> List((Democratic,153778), (Republican,189951), (Libertarian,8897), (Others,6904))))
In this case the number following the state name is preserved. That can be changed.
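For instance, a minimal sketch of that change on plain lists (assuming the "Name (N)" format shown in the sample), keying on the state name alone by trimming everything from the opening parenthesis:

val sample = List(
  "Alabama (9),Democratic:849624,Republican:1441170",
  "Alaska (3),Democratic:153778,Republican:189951"
)
val PartyVotes = "([^:]+):(\\d+)".r
val byState: Map[String, List[(String, Int)]] =
  sample
    .map(_.split(",").toList)
    .groupMapReduce(_.head.takeWhile(_ != '(').trim)(_.tail.collect {
      case PartyVotes(p, v) => (p, v.toInt)
    })(_ ++ _)
// byState: Map(Alabama -> List((Democratic,849624), (Republican,1441170)),
//              Alaska -> List((Democratic,153778), (Republican,189951)))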
No, an iterator is fine (better than a list, actually); you just need to split the values too, to create those tuples.
lines
  .map(_.split(","))
  .map { l =>
    l.head -> l.tail.toList
      .map(_.split(":"))
      .collect { case Array(a, b) => a -> b.toInt } // split returns an Array, so match on Array, not Seq
  }
  .toMap
An alternative that looks a little more aesthetic to my eye is converting to a map early and then using mapValues (I personally much prefer short lambdas). The downside is that mapValues is lazy, so you end up having to call .toMap twice to force it in the end:
lines
  .map(_.split(","))
  .map(l => l.head -> l.tail.toList)
  .toMap
  .mapValues(_.map(_.split(":")))
  .mapValues(_.collect { case Array(a, b) => a -> b.toInt })
  .toMap

How to Reduce by key in "Scala" [Not In Spark]

I am trying to reduce by key in Scala. Is there any method to reduce values based on their keys in Scala? [I know we can do it with the reduceByKey method in Spark, but how do we do the same in plain Scala?]
The input data is:
import scala.io.Source

val File = Source.fromFile("C:/Users/svk12/git/data/retail_db/order_items/part-00000")
  .getLines()
  .toList

val map = File.map(x => x.split(","))
  .map(x => (x(1), x(4)))

map.take(10).foreach(println)
After the above step I am getting this result:
(2,250.0)
(2,129.99)
(4,49.98)
(4,299.95)
(4,150.0)
(4,199.92)
(5,299.98)
(5,299.95)
Expected result:
(2,379.99)
(5,499.93)
.......
Starting in Scala 2.13, you can use the groupMapReduce method, which is (as its name suggests) equivalent to a groupBy followed by mapValues and a reduce step:
io.Source.fromFile("file.txt")
.getLines.to(LazyList)
.map(_.split(','))
.groupMapReduce(_(1))(_(4).toDouble)(_ + _)
The groupMapReduce stage:
groups the split arrays by their 2nd element (_(1)) (group part of groupMapReduce)
maps each array within a group to its 5th element, cast to a Double (_(4).toDouble) (map part of groupMapReduce)
reduces the values within each group (_ + _) by summing them (reduce part of groupMapReduce).
This is a one-pass version of what could otherwise be written as:
seq.groupBy(_(1)).mapValues(_.map(_(4).toDouble).reduce(_ + _))
Also note the conversion from Iterator to LazyList in order to use a collection which provides groupMapReduce (we don't use a Stream since, starting in Scala 2.13, LazyList is the recommended replacement for Stream).
It looks like you want the sum of some values from a file. One problem is that file contents are strings, so you have to convert each String to a numeric type before it can be summed.
These are the steps you might use.
io.Source.fromFile("so.txt") //open file
.getLines() //read line-by-line
.map(_.split(",")) //each line is Array[String]
.toSeq //to something that can groupBy()
.groupBy(_(1)) //now is Map[String,Array[String]]
.mapValues(_.map(_(4).toInt).sum) //now is Map[String,Int]
.toSeq //un-Map it to (String,Int) tuples
.sorted //presentation order
.take(10) //sample
.foreach(println) //report
This will, of course, throw if any file data is not in the required format.
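If throwing on bad input is not acceptable, the parse can be made defensive. A minimal sketch (Scala 2.13, using toDoubleOption; the filename and field positions follow the answer above):

val sums: Map[String, Double] =
  io.Source.fromFile("so.txt")
    .getLines()
    .map(_.split(","))
    .filter(_.length > 4)                                  // need at least 5 fields
    .flatMap(a => a(4).toDoubleOption.map(v => a(1) -> v)) // skip non-numeric amounts
    .toSeq
    .groupMapReduce(_._1)(_._2)(_ + _)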
There is nothing built-in, but you can write it like this:
def reduceByKey[A, B](items: Traversable[(A, B)])(f: (B, B) => B): Map[A, B] = {
  var result = Map.empty[A, B]
  items.foreach { case (a, b) =>
    result += (a -> result.get(a).map(b1 => f(b1, b)).getOrElse(b))
  }
  result
}
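For example, applied to the sample pairs from the question (parsing the values to Double first is an assumption about the intended types):

val pairs = List(("2", 250.0), ("2", 129.99), ("4", 49.98), ("4", 299.95))
reduceByKey(pairs)(_ + _) // Map(2 -> 379.99, 4 -> 349.93), modulo floating-point rounding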
There is some space to optimize this (e.g. use mutable maps), but the general idea remains the same.
Another approach, more declarative but less efficient (it creates several intermediate collections; it can be rewritten, but with a loss of clarity):
def reduceByKey[A, B](items: Traversable[(A, B)])(f: (B, B) => B): Map[A, B] = {
  items
    .groupBy { case (a, _) => a }
    .mapValues(_.map { case (_, b) => b }.reduce(f))
    // mapValues returns a view; view.force turns it back into a realized map
    .view.force
}
First group the tuples by key (the first element here) and then reduce. The following code will work:
val reducedList = map.groupBy(_._1).map(l => (l._1, l._2.map(_._2.toDouble).reduce(_ + _))) // .toDouble: the values are still Strings, and String's + would concatenate
print(reducedList)
Here is another solution using foldLeft:
val File: List[String] = ???

File.map(x => x.split(","))
  .map(x => (x(1), x(4).toDouble))
  .foldLeft(Map.empty[String, Double]) { case (state, (key, value)) =>
    state.updated(key, state.getOrElse(key, 0.0) + value)
  }
  .toSeq
  .sortBy(_._1)
  .take(10)
  .foreach(println)

How to convert (key,array(value)) to (key,value) in Spark

I have an RDD like below:
val rdd1 = sc.parallelize(Array((1,Array((3,4),(4,5))),(2,Array((4,2),(4,4),(3,9)))))
which is RDD[(Int, Array[(Int, Int)])]. I want to get a result like RDD[(Int, (Int, Int))] by some operation such as flatMap or something else. In this example, the result should be:
(1,(3,4))
(1,(4,5))
(2,(4,2))
(2,(4,4))
(2,(3,9))
I am quite new to Spark, so what could I do to achieve this?
Thanks a lot.
You can use flatMap in your case like this:
val newRDD: RDD[(Int, (Int, Int))] = rdd1
  .flatMap { case (k, values) => values.map(v => (k, v)) }
Assuming the RDD is rdd1, use the code below to get the data as you want:
rdd1.flatMap(x => x._2.map(y => (x._1,y)))
The inner map inside the flatMap reads x._2, which is the array, and takes each value of the array in turn as y. flatMap then emits them as separate items. x._1 is the key of each record.
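For intuition, the same flattening works on plain Scala collections outside Spark; a small sketch:

val local = List((1, Array((3, 4), (4, 5))), (2, Array((4, 2), (4, 4), (3, 9))))
local.flatMap { case (k, vs) => vs.map(v => (k, v)) }
// List((1,(3,4)), (1,(4,5)), (2,(4,2)), (2,(4,4)), (2,(3,9)))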

Using contains in Scala - exception

I am encountering this error:
java.lang.ClassCastException: scala.collection.immutable.$colon$colon cannot be cast to [Ljava.lang.Object;
whenever I try to use "contains" to find if a string is inside an array. Is there a more appropriate way of doing this? Or, am I doing something wrong? (I am fairly new to Scala)
Here is the code:
val matches = Set[JSONObject]()
val config = new SparkConf()
val sc = new SparkContext("local", "SparkExample", config)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val ebay = sqlContext.read.json("/Users/thomassquires/Downloads/products.json")
val catalogue = sqlContext.read.json("/Users/thomassquires/Documents/catalogue2.json")
val eins = ebay
  .map(item => (item.getAs[String]("ID"), Option(item.getAs[Set[Row]]("itemSpecifics"))))
  .filter(item => item._2.isDefined)
  .map(item => (item._1, item._2.get.find(x => x.getAs[String]("k") == "EAN")))
  .filter(x => x._2.isDefined)
  .map(x => (x._1, x._2.get.getAs[String]("v")))
  .collect()

def catEins = catalogue
  .map(r => (r.getAs[String]("_id"), Option(r.getAs[Array[String]]("item_model_number"))))
  .filter(r => r._2.isDefined)
  .map(r => (r._1, r._2.get))
  .collect()
def matched = for(ein <- eins) yield (ein._1, catEins.filter(z => z._2.contains(ein._2)))
The exception occurs on the last line. I have tried a few different variants.
My data structures are one List[Tuple2[String, String]] and one List[Tuple2[String, Array[String]]]. I need to find the zero or more matches from the second list that contain the string.
Thanks
Long story short (there is still a part that eludes me here*), you're using the wrong types. getAs is implemented as fieldIndex (String => Int) followed by get (Int => Any) followed by asInstanceOf.
Since Spark doesn't use Array or Set but WrappedArray to store array column data, calls like getAs[Array[String]] or getAs[Set[Row]] are not valid. If you want specific types you should use either getAs[Seq[T]] or getSeq[T] and then convert your data to the desired type with toSet / toArray.
* See Why wrapping a generic method call with Option defers ClassCastException?
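A hedged sketch of that fix applied to the question's code, reading the array column as a Seq and converting afterwards (column names are copied from the question and not verified against the real schema):

def catEins = catalogue
  .map(r => (r.getAs[String]("_id"), Option(r.getAs[Seq[String]]("item_model_number"))))
  .filter(r => r._2.isDefined)
  .map(r => (r._1, r._2.get.toArray))
  .collect()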

Access a tuple inside a tuple for an anonymous map job in Spark

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String, String) and, since they aren't unique, I want to look at how many times each (String, String) combination occurs, so I use countByValue like so:
val PairCount = Pairs.countByValue().toSeq
which gives me output of the form ((String, String), Long), where the Long is the number of times that the (String, String) tuple appeared.
These Strings can be repeated in different combinations and I essentially want to run word count on this PairCount variable so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output this spits out is String1->1, String2->1, String3->1, etc.
How do I output a key-value pair from a map job in this case, where the key is one of the String values from the inner tuple and the value is the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String, Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap { case ((x, y), n) => Seq(x -> n) }.toMap
for each unique x I get x->1
for example the above line of code generates mom->1 dad->1 even if the tuples out of the flatMap included (mom,30) (dad,59) (mom,2) (dad,14) in which case I would expect toMap to provide mom->30, dad->59 mom->2 dad->14. However, I'm new to scala so I might be misinterpreting the functionality.
how can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
If I understand the question correctly, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res: RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
  Seq(s1 -> n, s2 -> n)
}
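As a side note on the toMap behavior observed in the question: toMap keeps only the last value seen for each key, so duplicates are dropped rather than combined. A minimal sketch on plain collections (groupMapReduce requires Scala 2.13):

val s = Seq("mom" -> 30, "dad" -> 59, "mom" -> 2, "dad" -> 14)
s.toMap                             // Map(mom -> 2, dad -> 14)
s.groupMapReduce(_._1)(_._2)(_ + _) // Map(mom -> 32, dad -> 73)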
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help you. By the way, the code above is incorrect; I missed the fact that countByValue returns a map, and not an RDD:
val pairs = sc.parallelize(
  List(
    "mom" -> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
  )
)
// don't use countByValue; if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a, b) => Seq(a -> 1, b -> 1) }.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occurs, over keys and values:
val wordPairCount = pairs.flatMap { case (a, b) =>
  if (a == b)
    Seq(a -> 1)
  else
    Seq(a -> 1, b -> 1)
}.reduceByKey(_ + _)
wordPairCount.take(10)
To get the histograms for the (String, String) RDD I used this code:
val Hist_X = histogram.map(x => x._1 -> 1.0).reduceByKey(_ + _).collect().toMap
val Hist_Y = histogram.map(x => x._2 -> 1.0).reduceByKey(_ + _).collect().toMap
val Hist_XY = histogram.map(x => x -> 1.0).reduceByKey(_ + _)
where histogram is the (String, String) RDD.