How to set Map values in spark/scala

I am new to Spark/Scala development. I am trying to build a Map in Spark using Scala, but iterating over the map afterwards prints nothing:
def createMap(): Map[String, Int] = {
  var tMap: Map[String, Int] = Map()
  val tDF = spark.sql("select a, b, c from temp")
  for (x <- tDF) {
    val k = x.getAs[Long](0) + "|" + x.getAs[Long](1)
    val v = x.getAs[Int](2)
    tMap += (k -> v)
    println(k -> v) // ---------- This prints values
  }
  println("Hellllooooooooo1")
  for ((k, v) <- tMap) println("key = " + k + ", value= " + v) // ------ This prints nothing
  println("Hellllooooooooo2")
  return tMap
}
Please suggest.

user8598832 shows how to do it properly (for some value of properly). The reason your approach doesn't work is that you're adding (k, v) to the map in an executor, but the println occurs in the driver, which generally won't see the map(s) in the executor(s) (to the extent that it might, that's just an artifact of running in local mode, not in a distributed mode).

The "right" (if collecting to driver is ever right) way to do it:
import org.apache.spark.sql.functions._
tDF.select(concat_ws("|", col("a"), col("b")), col("c")).as[(String, Int)].rdd.collectAsMap
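For completeness, here's a sketch of createMap rewritten that way (assuming a SparkSession named spark is in scope, and that column c fits in an Int):

import org.apache.spark.sql.functions._

def createMap(): Map[String, Int] = {
  import spark.implicits._ // provides the Encoder needed by .as[(String, Int)]
  spark.sql("select a, b, c from temp")
    .select(concat_ws("|", col("a"), col("b")), col("c"))
    .as[(String, Int)]
    .rdd
    .collectAsMap() // runs on the driver, so the result is visible there
    .toMap          // collectAsMap returns a scala.collection.Map; make it immutable
}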

Related

Spark - aggregateByKey Type mismatch error

I am trying to find the problem behind this. I am trying to find the maximum Marks for each student using aggregateByKey.
import spark.implicits._ // for toDF on a local Seq

val data = Seq(("R1","M",22),("R1","E",25),("R1","F",29),
               ("R2","M",20),("R2","E",32),("R2","F",52))
  .toDF("Name","Subject","Marks")
def seqOp = (acc: Int, ele: (String, Int)) => if (acc > ele._2) acc else ele._2
def combOp = (acc: Int, acc1: Int) => if (acc > acc1) acc else acc1
val r = data.rdd.map{ case (t1,t2,t3) => (t1,(t2,t3)) }.aggregateByKey(0)(seqOp,combOp)
I am getting an error saying that aggregateByKey expects (Int, (Any, Any)) but the actual type is (Int, (String, Int)).
Your map function is incorrect, since you have a Row as input, not a Tuple3.
Fix the last line like this:
val r = data.rdd.map { r =>
  val t1 = r.getAs[String](0)
  val t2 = r.getAs[String](1)
  val t3 = r.getAs[Int](2)
  (t1, (t2, t3))
}.aggregateByKey(0)(seqOp, combOp)
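Equivalently, you could pattern match on the Row instead of calling getAs positionally (a sketch of the same logic):

import org.apache.spark.sql.Row

val r = data.rdd.map { case Row(t1: String, t2: String, t3: Int) =>
  (t1, (t2, t3))
}.aggregateByKey(0)(seqOp, combOp)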

Expand a Sequence grouped by keys into a list of ungrouped sequences with keys attached in Scala

I have the following object in Scala:
List[(String,Map[String, Seq[(Int, Double)]])]
And I would like to convert it into a sequence of individual rows where each row of the sequence has 4 terms: (String,String,Int,Double).
For instance, if I had the following data:
List(
  ("SuperGroup1", Map("SubGroup1" -> Seq((17,24.1),(38,39.2)))),
  ("SuperGroup1", Map("SubGroup2" -> Seq((135,302.3),(938,887.4))))
)
I want to turn it into:
Seq(
  ("SuperGroup1","SubGroup1",17,24.1),
  ("SuperGroup1","SubGroup1",38,39.2),
  ("SuperGroup1","SubGroup2",135,302.3),
  ("SuperGroup1","SubGroup2",938,887.4)
)
I figure you can use flatMap or something like that, but I am not exactly sure how that would work. I see that RDDs have a function called flatMapValues, but what about for just a standard list/map combination like I have?
Given
val x = List(
  ("SuperGroup1",Map("SubGroup1" -> Seq((17,24.1),(38,39.2)))),
  ("SuperGroup1",Map("SubGroup2" -> Seq((135,302.3),(938,887.4))))
)
you obtain the list that you want from
for ((s, m) <- x; (k, vs) <- m; (i, f) <- vs) yield (s, k, i, f)
Result:
List(
  (SuperGroup1,SubGroup1,17,24.1),
  (SuperGroup1,SubGroup1,38,39.2),
  (SuperGroup1,SubGroup2,135,302.3),
  (SuperGroup1,SubGroup2,938,887.4)
)
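Since you mention flatMap: the for comprehension above is just sugar for nested flatMap/map calls, so the explicit equivalent looks like this:

x.flatMap { case (s, m) =>
  m.flatMap { case (k, vs) =>
    vs.map { case (i, f) => (s, k, i, f) }
  }
}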
Given the types you have and the following input:
val input = Seq(
  ("SuperGroup1", Map("SubGroup1" -> Seq((17,24.1),(38,39.2)))),
  ("SuperGroup1", Map("SubGroup2" -> Seq((135,302.3),(938,887.4))))
)
This would transform your input into the form you expect:
input.flatMap { superGroupBox =>
  superGroupBox._2.toSeq.flatMap { subGroupBox =>
    subGroupBox._2.map(numericTuple =>
      (superGroupBox._1, subGroupBox._1, numericTuple._1, numericTuple._2))
  }
}

Failed to print count result in for loop

I have been trying to increment a count inside a foreach, but the resulting key just starts with a pair of parentheses instead of the count. I am just printing out the keys of the map here.
var count = 0
xs.foreach(x => (myMap += ((count+=1).toString+","+java.util.UUID.randomUUID.toString -> x)))
Output:
(),901e9926-be1e-4dc4-b3e3-6c3b2feea2c4
Expected output:
1,901e9926-be1e-4dc4-b3e3-6c3b2feea2c4
Within your foreach, count += 1 would be of type Unit. If I understand your question correctly, the example below (using an arbitrary xs collection) might be what you're looking for:
val xs = List("a", "b", "c", "d")
var count = 0
var myMap = Map[String, String]()
xs.foreach { x =>
  count += 1
  myMap += ((count.toString + "," + java.util.UUID.randomUUID.toString) -> x)
}
myMap.keys
// res1: Iterable[String] = Set(
// 1,bd971c44-b9d0-41a0-b59f-3acbf2e0dee0, 2,5459eed9-309d-4f9c-afd7-10aced9df2a0,
// 3,5816ea42-d8ed-4beb-8b30-0376d0674700, 4,30f6f22f-1e6d-4eec-86af-5bc6734d5196
// )
In case you want a more idiomatic approach, using zip for the count and foldLeft for the Map aggregation would produce a similar result:
val myMap = Map[String, String]()
val resultMap = xs.zip(Stream from 1).foldLeft(myMap) { (m, x) =>
  m + ((x._2.toString + "," + java.util.UUID.randomUUID.toString) -> x._1)
}
What you are printing here is actually (count += 1).toString. In Scala, an assignment like this evaluates to Unit, which is printed as (). That's why you see () and not the value of count. If you check the count variable afterwards you will see that it has been incremented as expected.
Additionally, what you are trying to do could be expressed in a better way, e.g. you could do:
val myMap = xs.zipWithIndex.map(x => (x._2 + 1) + "," + java.util.UUID.randomUUID -> x._1).toMap
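For example, with xs = List("a", "b", "c") the keys come out as 1,<uuid>, 2,<uuid>, 3,<uuid> (with fresh UUIDs on each run): zipWithIndex counts from 0, and the + 1 shifts the count to start at 1.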

Finding values within broadcast variable

I want to join two sets by applying a broadcast variable. I am trying to implement the first suggestion from "Spark: what's the best strategy for joining a 2-tuple-key RDD with single-key RDD?"
val emp_newBC = sc.broadcast(emp_new.collectAsMap())
val joined = emp.mapPartitions({ iter =>
  val m = emp_newBC.value
  for {
    ((t, w)) <- iter
    if m.contains(t)
  } yield ((w + '-' + m.get(t).get), 1)
}, preservesPartitioning = true)
However, as mentioned here: broadcast variable fails to take all data, I need to use collect() rather than collectAsMap(). I tried to adjust my code as below:
val emp_newBC = sc.broadcast(emp_new.collect())
val joined = emp.mapPartitions({ iter =>
  val m = emp_newBC.value
  for {
    ((t, w)) <- iter
    if m.contains(t)
    amk = m.indexOf(t)
  } yield ((w + '-' + emp_newBC.value(amk)), 1) // yield ((t, w), (m.get(t).get)) //((w + '-' + m.get(t).get),1)
}, preservesPartitioning = true)
But it seems m.contains(t) never matches. How can I remedy this?
Thanks in advance.
How about something like this?
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
val joined = emp.mapPartitions(iter => for {
  (k, v1) <- iter
  v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))
Regarding your code... As far as I understand, emp_new is an RDD[(String, String)]. When it is collected you get an Array[(String, String)]. When you use
((t, w)) <- iter
t is a String so m.contains(t) will always return false.
Another problem I see is preservesPartitioning = true inside mapPartitions. There are a few possible scenarios:
- emp is partitioned and you want joined to be partitioned as well. Since you change the key from t to some new value, partitioning cannot be preserved and the resulting RDD has to be repartitioned. If you use preservesPartitioning = true, the output RDD will end up with wrong partitions.
- emp is partitioned but you don't need partitioning for joined. There is no reason to use preservesPartitioning.
- emp is not partitioned. Setting preservesPartitioning has no effect.

Applying operation to corresponding elements of Array

I want to sum the corresponding elements of the list and multiply the results, while keeping the label associated with the array element, so that
("a",Array((0.5,1.0),(0.667,2.0)))
becomes:
(a , (0.5 + 0.667) * (1.0 + 2.0))
Here is my code expressing this for a single array element:
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
//> data : Array[(String, Array[(Double, Double)])] = Array((a,Array((0.5,1.0),
//| (0.667,2.0))), (b,Array((0.6,2.0), (0.6,2.0))))
val v1 = (data(0)._1, data(0)._2.map(m => m._1).sum)
//> v1 : (String, Double) = (a,1.167)
val v2 = (data(0)._1, data(0)._2.map(m => m._2).sum)
//> v2 : (String, Double) = (a,3.0)
val total = (v1._1 , (v1._2 * v2._2)) //> total : (String, Double) = (a,3.5010000000000003)
I just want to apply this function to all elements of the array, so that val data above becomes:
Map[String, Double] = Map(a -> 3.5010000000000003, b -> 4.8)
But I'm not sure how to combine the above code into a single function that maps over all the array elements.
Update: the inner Array can be of variable length, so this is also valid:
val data = Array(("a",Array((0.5,1.0,2.0),(0.667,2.0,1.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
Pattern matching is your friend! You can use it for tuples and arrays. If there are always two elements in the inner array, you can do it this way:
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
data.map {
  case (s, Array((x1, x2), (x3, x4))) => s -> (x1 + x3) * (x2 + x4)
}
// Array[(String, Double)] = Array((a,3.5010000000000003), (b,4.8))
res6.toMap
// scala.collection.immutable.Map[String,Double] = Map(a -> 3.5010000000000003, b -> 4.8)
If the inner elements are variable length, you could do it this way (a for comprehension instead of explicit maps):
for {
  (s, tuples) <- data
  sum1 = tuples.map(_._1).sum
  sum2 = tuples.map(_._2).sum
} yield s -> sum1 * sum2
Note that while this is a very clear solution, it's not the most efficient possible, because we're iterating over the tuples twice. You could use a fold instead, but it would be much harder to read (for me anyway. :)
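For reference, a single-pass version with foldLeft might look like this (a sketch; judge the readability for yourself):

for {
  (s, tuples) <- data
  (sum1, sum2) = tuples.foldLeft((0.0, 0.0)) {
    case ((acc1, acc2), (x1, x2)) => (acc1 + x1, acc2 + x2)
  }
} yield s -> sum1 * sum2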
Finally, note that .sum will produce zero on an empty collection. If that's not what you want, you could do this instead:
val emptyDefault = 1.0 // Or whatever, depends on your use case
for {
  (s, tuples) <- data
  sum1 = tuples.map(_._1).reduceLeftOption(_ + _).getOrElse(emptyDefault)
  sum2 = tuples.map(_._2).reduceLeftOption(_ + _).getOrElse(emptyDefault)
} yield s -> sum1 * sum2
You can use the Algebird numeric library:
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
import com.twitter.algebird.Operators._
def sumAndProduct(a: Array[(Double, Double)]) = {
  val sums = a.reduceLeft((m, n) => m + n)
  sums._1 * sums._2
}
data.map{ case (x, y) => (x, sumAndProduct(y)) }
// Array((a,3.5010000000000003), (b,4.8))
It will work fine for variable-size arrays as well.
val data = Array(("a",Array((0.5,1.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
// Array((a,0.5), (b,4.8))
Like this? Does your array always have only 2 pairs?
val m = data map ({case (label,Array(a,b)) => (label, (a._1 + b._1) * (a._2 + b._2)) })
m.toMap
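For the data above this yields Map(a -> 3.5010000000000003, b -> 4.8), matching the other answers.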