Sum values of PairRDD - scala

I have an RDD of type:
dataset :org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionRDD[26]
which is equivalent to (Pedro, 0.0833), (Hello, 0.001828), ...
I'd like to sum all the values (0.0833 + 0.001828 + ...), but I can't find a proper
solution.

Given your input data, you can do any of the following:
// example
val datasets = sc.parallelize(List(("Pedro", 0.0833), ("Hello", 0.001828)))
datasets.map(_._2).sum()
// res3: Double = 0.085128
// or
datasets.map(_._2).reduce(_ + _)
// res4: Double = 0.085128
// or even
datasets.values.sum()
// res5: Double = 0.085128

Like this?
dataset.map(_._2).reduce((x, y) => x + y)
Breakdown: map each tuple to just its Double value, then reduce the RDD by summing.
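One difference between the variants above shows up on empty input: reduce needs at least one element, while sum and a fold with an explicit zero do not. The sketch below uses a plain Scala List as a stand-in for the RDD, which is an assumption made so that it runs without a Spark context; the names pairs, totalViaFold, emptySum and emptyReduce are illustrative only. As far as I know, RDD.reduce also fails on an empty RDD, and RDD.fold(0.0)(_ + _) is the counterpart of the foldLeft shown here.

```scala
import scala.util.Try

// Plain List standing in for the RDD[(String, Double)] from the question.
val pairs: List[(String, Double)] = List(("Pedro", 0.0833), ("Hello", 0.001828))

// Drop the keys, then add up the values.
val total: Double = pairs.map(_._2).sum

// Same result with an explicit zero value; also defined for empty input.
val totalViaFold: Double = pairs.map(_._2).foldLeft(0.0)(_ + _)

// Empty input: sum gives 0.0, while reduce throws because it has no element to start from.
val empty: List[(String, Double)] = Nil
val emptySum: Double = empty.map(_._2).sum
val emptyReduce: Try[Double] = Try(empty.map(_._2).reduce(_ + _))
```

If the input can be empty, sum or a fold is the safer choice; reduce is fine when at least one element is guaranteed.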

Related

Compare two columns that each hold a list of n values and save the result as a list in Scala

I need to perform a comparison (such as greater than or less than) between two columns, each of which holds a list of n values (the values are timestamps), and the result should also be a list.
How can I do this operation?
Input:
Date1 Date2
["2016-11-24 12:06:47"] ["2017-10-04 03:30:23"]
["null"] []
["2017-01-25 10:07:25","2018-01-25 10:07:25"] ["2017-09-15 03:30:16","2017-09-15 03:30:16"]
Output should be:
Result
["Less"]
["Not Okay"]
["Less","Great"]
You wrote: "I need to perform comparison operation".
It seems you are looking for the compareTo method:
scala> "a".compareTo("b")
res: Int = -1
scala> "a".compareTo("a")
res: Int = 0
scala> "b".compareTo("a")
res: Int = 1
Using the first example mentioned:
val date1 = "2016-11-24 12:06:47"
val date2 = "2017-10-04 03:30:23"
scala> date1.compareTo(date2)
res: Int = -1
If we ignore the "Not Okay" case for a moment, we could implement the "Less" and "Great" cases with a function like the one below. Note that compareTo on strings only guarantees the sign of its result, not the exact values -1 and 1, so the match tests for any negative number:
def compareLexicographically(s1: String, s2: String): String = s1.compareTo(s2) match {
  case n if n < 0 => "Less"
  case _ => "Great"
}
Looking at your example, I am assuming the rows are tuples of lists of Strings:
val rows: List[(List[String], List[String])] =
  List((
    List("2016-11-24 12:06:47"),
    List("2017-10-04 03:30:23")
  ),
  (
    List("2017-01-25 10:07:25", "2018-01-25 10:07:25"),
    List("2017-09-15 03:30:16", "2017-09-15 03:30:16")
  ))
I would first zip the elements from the two columns to get a List[(String, String)]:
rows.flatMap(r => r._1.zip(r._2))
Then simply map with compareLexicographically:
scala> rows.flatMap(r => r._1.zip(r._2)).map((compareLexicographically _).tupled)
res: List[String] = List(Less, Less, Great)
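The answer above sets the "Not Okay" case aside and flattens all rows into one list. A possible sketch that keeps one result list per row and covers "Not Okay" follows. It rests on assumptions that the question does not confirm: a row counts as "Not Okay" when the two lists differ in length, are empty, or contain the literal string "null", and equal timestamps get a hypothetical "Equal" label. The helper names compareLabel and compareRow are made up for illustration. Comparing the strings directly is only meaningful because the timestamps are zero-padded and share the same fixed-width format.

```scala
// Maps the sign of compareTo to a label. compareTo on strings may return any
// negative or positive Int, so only the sign is inspected.
def compareLabel(s1: String, s2: String): String = {
  val c = s1.compareTo(s2)
  if (c < 0) "Less" else if (c > 0) "Great" else "Equal" // "Equal" is an assumed label
}

// Assumption: a row is "Not Okay" when the lists differ in length, are empty,
// or contain the literal string "null".
def compareRow(dates1: List[String], dates2: List[String]): List[String] = {
  val invalid =
    dates1.isEmpty || dates1.length != dates2.length ||
      (dates1 ++ dates2).contains("null")
  if (invalid) List("Not Okay")
  else dates1.zip(dates2).map { case (a, b) => compareLabel(a, b) }
}

// The three rows from the question.
val rows: List[(List[String], List[String])] = List(
  (List("2016-11-24 12:06:47"), List("2017-10-04 03:30:23")),
  (List("null"), Nil),
  (List("2017-01-25 10:07:25", "2018-01-25 10:07:25"),
   List("2017-09-15 03:30:16", "2017-09-15 03:30:16"))
)

// One result list per input row.
val result: List[List[String]] = rows.map { case (d1, d2) => compareRow(d1, d2) }
// result: List(List(Less), List(Not Okay), List(Less, Great))
```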

Finding length of string in Scala

I am new to Scala.
I have a list of strings:
List[String]("alpha", "gamma", "omega", "zeta", "beta")
I want to count all the strings with length 4,
i.e. I want to get the output 2.
You can do it like this:
val data = List("alpha", "gamma", "omega", "zeta", "beta")
data.filter(x => x.length == 4).size
res8: Int = 2
You can just use the count function, which avoids building the intermediate filtered list:
val list = List[String]("alpha", "gamma", "omega", "zeta", "beta")
println(list.count(x => x.length == 4))
// prints 2
I hope the answer is helpful

ReduceLeft with Vector of pairs?

I have a Vector of Pairs
Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))
and I want to return the pair with the maximum first element.
I've tried to do it like this:
data reduceLeft[(Int, Int)]((y:(Int, Int),z:(Int,Int))=>y._1 max z._1)
and
data reduceLeft((y:(Int, Int),z:(Int,Int))=>y._1 max z._1)
but there is a type mismatch error, and I can't understand what is wrong with this code.
Why use reduceLeft?
The default max method works well here, because tuples are ordered by their first element and then by their second:
scala> val v = Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))
v: scala.collection.immutable.Vector[(Int, Int)] = Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))
scala> v.max
res1: (Int, Int) = (25,5)
If you want reduceLeft instead:
v.reduceLeft((x, y) => if (x._1 >= y._1) x else y)
Your error is that the function must return a tuple, not an Int:
y._1 max z._1
Here max is applied to two Ints, so it returns an Int.
max works great in this example. However, if you are wondering how to do this using reduceLeft, here it is:
val v = Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))
v.reduceLeft( ( x:(Int, Int), y:(Int,Int) ) => if(y._1 > x._1) y else x)
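To make the type mismatch concrete: the function passed to reduceLeft has to return the same kind of value it consumes, here a pair. The sketch below also shows maxBy, a standard collection method that compares only the projected first element, and reduceLeftOption for a Vector that might be empty. The names byFirst, viaReduce and emptyResult are illustrative only.

```scala
val v = Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))

// maxBy compares only the projected value, here the first element of each pair.
val byFirst: (Int, Int) = v.maxBy(_._1)

// reduceLeft must return a pair, not an Int.
// Using >= keeps the earlier pair when the first elements tie.
val viaReduce: (Int, Int) = v.reduceLeft((x, y) => if (x._1 >= y._1) x else y)

// Variant for a possibly empty Vector: None instead of an exception.
val emptyResult: Option[(Int, Int)] =
  Vector.empty[(Int, Int)].reduceLeftOption((x, y) => if (x._1 >= y._1) x else y)
```

Unlike v.max, which falls back to the second element when first elements are equal, the reduceLeft version makes the tie-breaking rule explicit in the comparison.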

How to find max value in pair RDD?

I have a Spark pair RDD of (key, count), as below:
Array[(String, Int)] = Array((a,1), (b,2), (c,1), (d,3))
How can I find the key with the highest count using the Spark Scala API?
EDIT: the datatype of the pair RDD is org.apache.spark.rdd.RDD[(String, Int)]
Use the Array.maxBy method:
val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val maxKey = a.maxBy(_._2)
// maxKey: (String, Int) = (d,3)
or RDD.max with a custom Ordering:
val maxKey2 = rdd.max()(new Ordering[Tuple2[String, Int]]() {
  override def compare(x: (String, Int), y: (String, Int)): Int =
    Ordering[Int].compare(x._2, y._2)
})
Use takeOrdered(1)(Ordering[Int].reverse.on(_._2)):
val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val rdd = sc.parallelize(a)
val maxKey = rdd.takeOrdered(1)(Ordering[Int].reverse.on(_._2))
// maxKey: Array[(String, Int)] = Array((d,3))
Quoting the note from RDD.takeOrdered:
This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
For PySpark:
Let a be the pair RDD with String keys and integer values; then
a.max(lambda x:x[1])
returns the key-value pair with the maximum value. The max function orders elements by the return value of the lambda.
Here a is a pair RDD with elements such as ('key', int), and x[1] refers to the integer part of each element.
Note that without a key function, max compares the tuples themselves, so it orders by key first and returns the pair with the largest key.
Documentation is available at https://spark.apache.org/docs/1.5.0/api/python/pyspark.html#pyspark.RDD.max
Spark RDDs are more efficient time-wise when they are kept as RDDs rather than collected into Arrays:
strinIntTuppleRDD.reduce((x, y) => if(x._2 > y._2) x else y)

Applying operation to corresponding elements of Array

I want to sum the corresponding elements of the tuples in each array and multiply the resulting sums, while keeping the label associated with each array, so that
("a",Array((0.5,1.0),(0.667,2.0)))
becomes:
(a , (0.5 + 0.667) * (1.0 + 2.0))
Here is my code to express this for a single array element:
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
//> data : Array[(String, Array[(Double, Double)])] = Array((a,Array((0.5,1.0),
//| (0.667,2.0))), (b,Array((0.6,2.0), (0.6,2.0))))
val v1 = (data(0)._1, data(0)._2.map(m => m._1).sum)
//> v1 : (String, Double) = (a,1.167)
val v2 = (data(0)._1, data(0)._2.map(m => m._2).sum)
//> v2 : (String, Double) = (a,3.0)
val total = (v1._1 , (v1._2 * v2._2)) //> total : (String, Double) = (a,3.5010000000000003)
I just want to apply this function to all elements of the array, so that val "data" above becomes:
Map[(String, Double)] = ((a,3.5010000000000003),(b,4.8))
But I'm not sure how to combine the above code into a single function that maps over all the array elements.
Update: the inner Array can be of variable length, so this is also valid:
val data = Array(("a",Array((0.5,1.0,2.0),(0.667,2.0,1.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
Pattern matching is your friend! You can use it for tuples and arrays. If there are always two elements in the inner array, you can do it this way:
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
data.map {
  case (s, Array((x1, x2), (x3, x4))) => s -> (x1 + x3) * (x2 + x4)
}
// Array[(String, Double)] = Array((a,3.5010000000000003), (b,4.8))
res6.toMap
// scala.collection.immutable.Map[String,Double] = Map(a -> 3.5010000000000003, b -> 4.8)
If the inner arrays are of variable length, you could do it this way (a for-comprehension instead of explicit maps):
for {
  (s, tuples) <- data
  sum1 = tuples.map(_._1).sum
  sum2 = tuples.map(_._2).sum
} yield s -> sum1 * sum2
Note that while this is a very clear solution, it's not the most efficient one possible, because we're iterating over the tuples twice. You could use a fold instead, but it would be much harder to read (for me, anyway).
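For reference, the single-pass fold mentioned above might look roughly like this. It is only a sketch: the helper name sumsProduct and the result name totals are illustrative, and the fold starts from (0.0, 0.0), so an empty inner array yields 0.0 just as .sum would.

```scala
val data = Array(
  ("a", Array((0.5, 1.0), (0.667, 2.0))),
  ("b", Array((0.6, 2.0), (0.6, 2.0)))
)

// Accumulate both column sums in one traversal of the inner array.
def sumsProduct(tuples: Array[(Double, Double)]): Double = {
  val (sum1, sum2) = tuples.foldLeft((0.0, 0.0)) {
    case ((acc1, acc2), (x, y)) => (acc1 + x, acc2 + y)
  }
  sum1 * sum2
}

// Apply the helper to every labelled array and collect the results by label.
val totals: Map[String, Double] =
  data.map { case (label, tuples) => label -> sumsProduct(tuples) }.toMap
```

Whether the extra pattern matching is worth it depends on how large the inner arrays are; for a handful of tuples the two-pass version is easier to read.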
Finally, note that .sum will produce zero on an empty collection. If that's not what you want, you could do this instead:
val emptyDefault = 1.0 // Or whatever, depends on your use case
for {
  (s, tuples) <- data
  sum1 = tuples.map(_._1).reduceLeftOption(_ + _).getOrElse(emptyDefault)
  sum2 = tuples.map(_._2).reduceLeftOption(_ + _).getOrElse(emptyDefault)
} yield s -> sum1 * sum2
You can use the Algebird numeric library, whose operators allow tuples to be added element-wise:
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
import com.twitter.algebird.Operators._
def sumAndProduct(a: Array[(Double, Double)]) = {
  // m + n adds the two tuples component by component
  val sums = a.reduceLeft((m, n) => m + n)
  sums._1 * sums._2
}
data.map{ case (x, y) => (x, sumAndProduct(y)) }
// Array((a,3.5010000000000003), (b,4.8))
It works for variable-size arrays as well:
val data = Array(("a",Array((0.5,1.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
// Array((a,0.5), (b,4.8))
Like this? Does your array always have only 2 pairs?
val m = data map ({case (label,Array(a,b)) => (label, (a._1 + b._1) * (a._2 + b._2)) })
m.toMap