Spark: How to write different group values from RDD to different files? - scala

I need to write values with key 1 to file1.txt and values with key 2 to file2.txt:
val ar = Array (1 -> 1, 1 -> 2, 1 -> 3, 1 -> 4, 1 -> 5, 2 -> 6, 2 -> 7, 2 -> 8, 2 -> 9)
val distAr = sc.parallelize(ar)
val grk = distAr.groupByKey()
How to do this without iterating over the collection grk twice?

We write data from different customers to different tables, which is essentially the same use case. The common pattern we use is something like this:
val customers:List[String] = ???
customers.foreach{customer => rdd.filter(record => belongsToCustomer(record,customer)).saveToFoo()}
This probably does not fulfill the wish of 'not iterating over the rdd twice (or n times)', but filter is a cheap operation to do in a parallel distributed environment and it works, so I think it does comply with the 'general Spark way' of doing things.
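Applied to the question's two keys, a minimal sketch (assuming plain text output is acceptable; note that saveAsTextFile writes a directory of part files rather than a single file, and the paths are just placeholders) could look like this:
val keys = Seq(1, 2)
keys.foreach { k =>
  distAr
    .filter { case (key, _) => key == k }  // keep only the records for this key
    .values                                // drop the key, keep the values
    .saveAsTextFile(s"file$k.txt")         // placeholder output path per key
}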

Scala: how to check if map has values greater than x?

I was wondering what's the most efficient way to check if a map has a value equal to or greater than a certain threshold:
for example, if I get the following map:
val x = Map(2 -> 2, 3 -> 1, 4 -> 1, 7 -> 1)
at the moment I wrote:
x.values.max > 2
Is there a more efficient way?
How to check if the Map contains a specific value instead?
You can use scala.collection.Iterator.exists to check whether there exists at least one item of a collection which satisfies a given predicate. To simplify iterating over the Map, you can use scala.collection.Map.valuesIterator:
x.valuesIterator.exists(_ > 2) //=> false
x.valuesIterator.exists(_ > 1) //=> true
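As a side note, exists stops at the first matching value, whereas x.values.max always traverses the whole map. For the follow-up question of checking for a specific value, Iterator also provides contains (a small sketch using the same x as above):
x.valuesIterator.contains(2) //=> true
x.valuesIterator.contains(5) //=> false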

I get the error "This RDD lacks a SparkContext" when I invoke transformations or actions [duplicate]

I get the SPARK-5063 error message on the line with println:
d.foreach { x =>
  for (i <- 0 until x.length) println(m.lookup(x(i)))
}
d is RDD[Array[String]] and m is RDD[(String, String)]. Is there any way to print the way I want? Or how can I convert d from RDD[Array[String]] to Array[String]?
SPARK-5063 relates to better error messages when trying to nest RDD operations, which is not supported.
It's a usability issue, not a functional one. The root cause is the nesting of RDD operations and the solution is to break that up.
Here we are effectively trying to join dRDD and mRDD. If mRDD is large, an rdd.join would be the recommended way; otherwise, if mRDD is small, i.e. it fits in the memory of each executor, we can collect it, broadcast it and do a 'map-side' join.
JOIN
A simple join would go like this:
val rdd = sc.parallelize(Seq(Array("one", "two", "three"), Array("four", "five", "six")))
val map = sc.parallelize(Seq("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six" -> 6))
val flat = rdd.flatMap(_.toSeq).keyBy(x => x)       // RDD[(String, String)]
val res  = flat.join(map).map { case (k, v) => v }  // RDD[(String, Int)]
If we would like to use a broadcast, we first need to collect the values of the resolution table locally in order to broadcast them to all executors. NOTE: the RDD to be broadcast MUST fit in the memory of the driver as well as of each executor.
Map-side JOIN with Broadcast variable
val rdd = sc.parallelize(Seq(Array("one", "two", "three"), Array("four", "five", "six")))
val map = sc.parallelize(Seq("one" -> 1, "two" -> 2, "three" -> 3, "four" -> 4, "five" -> 5, "six" -> 6))
val bcTable = sc.broadcast(map.collectAsMap)        // the small table, collected and broadcast
val res2 = rdd.flatMap { arr => arr.map(elem => (elem, bcTable.value(elem))) }
This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation.
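Applied back to the question, a minimal sketch (assuming d: RDD[Array[String]] and m: RDD[(String, String)] as described, and that m is small enough to collect) could look like this:
val bcM = sc.broadcast(m.collectAsMap())             // collect the small lookup RDD and broadcast it
val looked = d.flatMap(_.toSeq)                      // flatten each Array[String] into its elements
              .map(key => (key, bcM.value.get(key))) // local lookup (Option), no nested RDD call
              .collect()
looked.foreach(println)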

Scala - map function - only returns the last element of a Map

I am new to Scala and trying out the map function on a Map.
Here is my Map:
scala> val map1 = Map ("abc" -> 1, "efg" -> 2, "hij" -> 3)
map1: scala.collection.immutable.Map[String,Int] =
Map(abc -> 1, efg -> 2, hij -> 3)
Here is a map function and the result:
scala> val result1 = map1.map(kv => (kv._1.toUpperCase, kv._2))
result1: scala.collection.immutable.Map[String,Int] =
Map(ABC -> 1, EFG -> 2, HIJ -> 3)
Here is another map function and the result:
scala> val result1 = map1.map(kv => (kv._1.length, kv._2))
result1: scala.collection.immutable.Map[Int,Int] = Map(3 -> 3)
The first map function returns all the members as expected; however, the second map function returns only the last member of the Map. Can someone explain why this is happening?
Thanks in advance!
In Scala, a Map cannot have duplicate keys. When you add a new key -> value pair to a Map, if that key already exists, you overwrite the previous value. If you're creating maps from functional operations on collections, then you're going to end up with the value corresponding to the last instance of each unique key. In the example you wrote, each string key of the original map map1 has the same length, and so all your string keys produce the same integer key 3 for result1. What's happening under the hood to calculate result1 is:
A new, empty map is created
You map "abc" -> 1 to 3 -> 3 and add it to the map. Result now contains 1 -> 3.
You map "efg" -> 2 to 3 -> 2 and add it to the map. Since the key is the same, you overwrite the existing value for key = 3. Result now contains 2 -> 3.
You map "hij" -> 3 to 3 -> 3 and add it to the map. Since the key is the same, you overwrite the existing value for key = 3. Result now contains 3 -> 3.
Return the result, which is Map(3 -> 3)`.
Note: I made a simplifying assumption that the order of the elements in the map iterator is the same as the order you wrote in the declaration. The order is determined by hash bin and will probably not match the order you added elements, so don't build anything that relies on this assumption.
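If you actually want to keep every value when keys collide, a small sketch using groupBy instead of map would be:
// group the original entries by the new key instead of overwriting
val byLength: Map[Int, List[Int]] =
  map1.groupBy(_._1.length).map { case (k, kvs) => k -> kvs.values.toList }
// e.g. Map(3 -> List(1, 2, 3)): all three values survive under the shared key (order not guaranteed)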

Scala comprehension from input

I am new to Scala and I am having trouble constructing a Map from inputs.
Here is my problem :
I am getting an input for elevator information. It consists of n lines, each of which has the elevatorFloor number and the elevatorPosition on the floor.
Example:
0 5
1 3
4 5
So here I have 3 elevators: the first one is on floor 0 at position 5, the second one on floor 1 at position 3, etc.
Is there a way in Scala to put it in a Map without using var ?
What I get so far is a Vector of all the elevators' information :
val elevators = {
  for {
    i <- 0 until n
    j <- readLine split " "
  } yield j.toInt
}
I would like to be able to split the lines into two variables "elevatorFloor" and "elevatorPos" and group them in a data structure (my guess is that a Map would be the appropriate choice). I would like to get something looking like:
elevators: SomeDataStructure[Int,Int] = ( 0->5, 1 -> 3, 4 -> 5)
I would like to clarify that I know I could write Java-ish code, initialise a Map and then add the values to it, but I am trying to keep as close to functional programming as possible.
Thanks for the help or comments
You can do:
import scala.io.Source

val res: Map[Int, Int] =
  Source.fromFile("myfile.txt")
    .getLines
    .map { line =>
      val Array(floor, position) = line.split(' ')
      floor.toInt -> position.toInt
    }.toMap
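Since the question reads from standard input with readLine rather than a file, a closely equivalent sketch (assuming n lines follow on stdin) would be:
val elevators: Map[Int, Int] =
  (0 until n).map { _ =>
    val Array(floor, position) = scala.io.StdIn.readLine().split(' ')
    floor.toInt -> position.toInt
  }.toMap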

In Scala, how do we aggregate an Array to determine the count per key and the percentage vs the total?

I am trying to find an efficient way to find the following:
Int1 = 1 or 0, Int2 = 1..k (where k = 3) and Double = 1.0
I want to find how many 1s or 0s there are for every k.
I need to find the percentage of the result of 3 over the total size of the Array.
Input is :
val clusterAndLabel = sc.parallelize(Array((0, 0), (0, 0), (1, 0), (1, 1), (2, 1), (2, 1), (2, 0)))
So in this example:
I have: (0,0) = 2, (0,1) = 0
I have: (1,0) = 1, (1,1) = 1
I have: (2,1) = 2, (2,0) = 1
Total is 7 instances
I was thinking of doing some aggregation, but I am stuck on the thought that they are both considered a 2-key join.
If you want to find how many 1s and 0s there are, you can do:
val rdd = clusterAndLabel.map(x => (x,1)).reduceByKey(_+_)
this will give you an RDD[((Int,Int),Int)] containing exactly what you described, meaning: [((0,0),2), ((1,0),1), ((1,1),1), ((2,1),2), ((2,0),1)]. If you really want them gathered by their first key, you can add this line:
val rdd2 = rdd.map(x => (x._1._1, (x._1._2, x._2))).groupByKey()
this will yield an RDD[(Int, Iterable[(Int,Int)])] which will look like what you described, i.e.: [(0, [(0,2)]), (1, [(0,1),(1,1)]), (2, [(1,2),(0,1)])].
If you need the number of instances, it looks like (at least in your example) clusterAndLabel.count() should do the work.
I don't really understand question 3; I can see two possibilities:
you want to know how many keys have 3 occurrences. To do so, you can start from the object I called rdd (no need for the groupByKey line) and do:
val rdd3 = rdd.map(x => (x._2,1)).reduceByKey(_+_)
this will yield an RDD[(Int,Int)] which is kind of a frequency RDD: the key is the number of occurrences and the value is how many times this key is hit. Here it would look like: [(1,3),(2,2)]. So if you want to know how many pairs occur 3 times, you just do rdd3.filter(_._1==3).collect() (which will be an array of size 0, but if it's not empty then it'll have one value and it will be your answer).
you want to know how many times the first key 3 occurs (once again 0 in your example). Then you start from rdd2 and do:
val rdd3 = rdd2.map(x=>(x._1,x._2.size)).filter(_._1==3).collect()
once again it will yield either an empty array or an array of size 1 containing how many elements have a 3 for their first key. Note that you can do it directly if you don't need to display rdd2; you can just do:
val rdd4 = rdd.map(x => (x._1._1,1)).reduceByKey(_+_).filter(_._1==3).collect()
(for performance, you might also want to do the filter before the reduceByKey!)
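For the 'percentage vs total' part of the question, a minimal sketch building on the same counts (and using count() for the total, as mentioned above) could be:
val counts  = clusterAndLabel.map(x => (x, 1)).reduceByKey(_ + _)  // ((cluster, label), count)
val total   = clusterAndLabel.count().toDouble                     // 7 instances in the example
val withPct = counts.mapValues(c => (c, 100.0 * c / total))        // (count, percentage of all instances)
withPct.collect().foreach(println)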