I have an RDD of type:
dataset :org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionRDD[26]
Which is equivalent to (Pedro, 0.0833), (Hello, 0.001828) ...
I'd like to sum all the values: 0.0833 + 0.001828 + ...

Considering your input data, you can do the following:
// example
val datasets = sc.parallelize(List(("Pedro", 0.0833), ("Hello", 0.001828)))
datasets.map(_._2).sum()
// res3: Double = 0.085128
// or
datasets.map(_._2).reduce(_ + _)
// res4: Double = 0.085128
// or even
datasets.values.sum()
// res5: Double = 0.085128

Like this?
dataset.map(_._2).reduce((x, y) => x + y)
Breakdown: map each tuple to just its Double value, then reduce the RDD by summing.
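The same map-then-reduce shape can be tried without a cluster. The sketch below assumes a plain Scala List as a stand-in for the RDD, purely so the snippet runs without Spark; `SumValues` and `sumValues` are illustrative names, not part of any API. It folds from 0.0 so that an empty input yields 0.0, whereas `reduce` throws on an empty collection.

```scala
object SumValues {
  // Stand-in for the RDD's contents (assumption: a local List, so no SparkContext is needed).
  val dataset: List[(String, Double)] = List(("Pedro", 0.0833), ("Hello", 0.001828))

  // Keep only the Double of each pair, then add the values starting from 0.0.
  def sumValues(pairs: List[(String, Double)]): Double =
    pairs.map(_._2).foldLeft(0.0)(_ + _)
}
```

On a real RDD the corresponding call would probably be `dataset.map(_._2).fold(0.0)(_ + _)`, which likewise returns the zero value for an empty RDD.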


Compare two list-valued columns element by element and save the result as a list in Scala

I need to perform a comparison operation (such as greater than or less than) on two columns, each of which holds a list of n values (the values are timestamps), and my result should also be a list.
How can I do this operation?
Date1 Date2
["2016-11-24 12:06:47"] ["2017-10-04 03:30:23"]
["null"] []
["2017-01-25 10:07:25","2018-01-25 10:07:25"] ["2017-09-15 03:30:16","2017-09-15 03:30:16"]
Output should be:
["Not Okay"]
I need to perform comparison operation
It seems you are looking for the compareTo method:
scala> "a".compareTo("b")
res: Int = -1
scala> "a".compareTo("a")
res: Int = 0
scala> "b".compareTo("a")
res: Int = 1
Using the first example mentioned:
val date1 = "2016-11-24 12:06:47"
val date2 = "2017-10-04 03:30:23"
scala> date1.compareTo(date2)
res: Int = -1
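One detail worth checking before matching on the result: `compareTo` on strings is not limited to -1, 0 and 1. A small sketch (the object and value names are illustrative) showing the raw results and one way to reduce them to a sign:

```scala
object CompareToDemo {
  // One-character strings: the result is the difference of the character codes.
  val twoApart: Int = "a".compareTo("c")

  // Timestamps that first differ at the month digit ('1' versus '9').
  val months: Int = "2017-01-25 10:07:25".compareTo("2017-09-15 03:30:16")

  // Reduce any result to -1, 0 or 1 when only the direction matters.
  def sign(s1: String, s2: String): Int = Integer.signum(s1.compareTo(s2))
}
```

Here `twoApart` is -2 and `months` is -8, so code that tests for exactly -1 would misclassify both.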
If we ignore the "Not Okay" case for a moment, we could implement the "Less" and "Great" cases with a function like the one below. Note that compareTo may return any negative or positive Int, not only -1 or 1, so the match tests the sign:
def compareLexicographically(s1: String, s2: String): String = s1.compareTo(s2) match {
  case n if n < 0 => "Less"
  case _ => "Great"
}
Looking at your example, I am assuming the rows are tuples of lists of Strings:
val rows: List[(List[String], List[String])] = List(
  (List("2016-11-24 12:06:47"),
   List("2017-10-04 03:30:23")),
  (List("2017-01-25 10:07:25", "2018-01-25 10:07:25"),
   List("2017-09-15 03:30:16", "2017-09-15 03:30:16"))
)
I would first zip the elements from the columns to get List[(String, String)]:
rows.flatMap(r => r._1.zip(r._2))
Then a simple map with compareLexicographically:
scala> rows.flatMap(r => r._1.zip(r._2)).map((compareLexicographically _).tupled)
res: List[String] = List(Less, Less, Great)
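The "Not Okay" case set aside above could be handled per row. The sketch below is only one possible reading of the question: it assumes a row is "Not Okay" when either list is empty, the lists differ in length, or any value (such as "null") fails to parse as a `yyyy-MM-dd HH:mm:ss` timestamp. `CompareColumns` and `compareRow` are hypothetical names, and the values are parsed with `java.time` instead of being compared as raw strings.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object CompareColumns {
  // Assumed timestamp layout, taken from the sample rows in the question.
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // None for anything that is not a valid timestamp, e.g. the string "null".
  private def parse(s: String): Option[LocalDateTime] =
    Try(LocalDateTime.parse(s, fmt)).toOption

  // Returns one label per pair, or List("Not Okay") when the row is unusable
  // (assumed meaning: empty list, length mismatch, or unparseable value).
  def compareRow(date1: List[String], date2: List[String]): List[String] = {
    val parsed1 = date1.map(parse)
    val parsed2 = date2.map(parse)
    val usable = date1.nonEmpty && date1.length == date2.length &&
      (parsed1 ++ parsed2).forall(_.isDefined)
    if (!usable) List("Not Okay")
    else parsed1.flatten.zip(parsed2.flatten).map {
      case (a, b) => if (a.isBefore(b)) "Less" else "Great"
    }
  }
}
```

Applied to the three sample rows, `compareRow` would yield `List("Less")`, `List("Not Okay")` and `List("Less", "Great")`. As in the function above, equal timestamps fall into "Great".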

Finding length of string in Scala

I am a newbie to Scala.
I have a list of strings:
List[String]("alpha", "gamma", "omega", "zeta", "beta")
I want to count all the strings with length = 4,
i.e. I want to get output = 2.
You can do it like this:
val data = List("alpha", "gamma", "omega", "zeta", "beta")
data.filter(x => x.length == 4).size
res8: Int = 2
You can just use the count function:
val list = List[String] ("alpha", "gamma", "omega", "zeta", "beta")
println(list.count(x => x.length == 4))
//2 is printed
I hope the answer is helpful
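For comparison, a small sketch of the `count` approach wrapped in a helper, plus a tally of every length at once. The names `CountByLength`, `countOfLength` and `lengthHistogram` are made up for illustration.

```scala
object CountByLength {
  val words: List[String] = List("alpha", "gamma", "omega", "zeta", "beta")

  // count walks the list once and builds no intermediate collection,
  // unlike filter(...).size, which first materialises the filtered list.
  def countOfLength(xs: List[String], n: Int): Int = xs.count(_.length == n)

  // Tally every length at once: Map(length -> number of strings).
  def lengthHistogram(xs: List[String]): Map[Int, Int] =
    xs.groupBy(_.length).map { case (len, group) => len -> group.size }
}
```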

ReduceLeft with Vector of pairs?

I have a Vector of Pairs
Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))
and I want to return the pair with the maximum first element.
I've tried to do it like this:
data reduceLeft[(Int, Int)]((y:(Int, Int),z:(Int,Int))=>y._1 max z._1)
data reduceLeft((y:(Int, Int),z:(Int,Int))=>y._1 max z._1)
but there is a type mismatch error and I can't understand what is wrong with this code.
Why use reduceLeft?
The default max method works very well here:
scala> val v = Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))
v: scala.collection.immutable.Vector[(Int, Int)] = Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))
scala> v.max
res1: (Int, Int) = (25,5)
If you want reduceLeft instead:
v.reduceLeft( (x, y) => if (x._1 >= y._1) x else y )
Your error is that the function passed to reduceLeft must return a tuple, not an Int:
y._1 max z._1
Here max is applied to two Ints, so it returns an Int.
max works great in this example. However, if you are wondering how to do this using reduceLeft here it is:
val v = Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))
v.reduceLeft( ( x:(Int, Int), y:(Int,Int) ) => if(y._1 > x._1) y else x)
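To make the difference between the approaches concrete, here is a hedged sketch (object and value names are illustrative) that puts `maxBy`, `reduceLeft` and the default tuple `max` side by side. The default ordering on pairs also consults the second element when the first elements tie, which may or may not be what you want.

```scala
object MaxPair {
  val v: Vector[(Int, Int)] =
    Vector((9,1), (16,2), (21,3), (24,4), (25,5), (24,6), (21,7), (16,8), (9,9), (0,10))

  // Compare on the first element only.
  val byFirst: (Int, Int) = v.maxBy(_._1)

  // Same idea via reduceLeft: the function returns a whole pair, so the types line up.
  val viaReduce: (Int, Int) = v.reduceLeft((x, y) => if (x._1 >= y._1) x else y)

  // Default tuple ordering falls back to the second element when the firsts are equal.
  val tieBreak: (Int, Int) = Vector((24, 4), (24, 6)).max
}
```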

How to find max value in pair RDD?

I have a spark pair RDD (key, count) as below
Array[(String, Int)] = Array((a,1), (b,2), (c,1), (d,3))
How to find the key with highest count using spark scala API?
EDIT: datatype of pair RDD is org.apache.spark.rdd.RDD[(String, Int)]
Use Array.maxBy method:
val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val maxKey = a.maxBy(_._2)
// maxKey: (String, Int) = (d,3)
or RDD.max:
val maxKey2 = rdd.max()(new Ordering[Tuple2[String, Int]]() {
  override def compare(x: (String, Int), y: (String, Int)): Int =
    Ordering[Int].compare(x._2, y._2)
})
Use takeOrdered(1)(Ordering[Int].reverse.on(_._2)):
val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val rdd = sc.parallelize(a)
val maxKey = rdd.takeOrdered(1)(Ordering[Int].reverse.on(_._2))
// maxKey: Array[(String, Int)] = Array((d,3))
Quoting the note from RDD.takeOrdered:
This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
For Pyspark:
Let a be the pair RDD with String keys and integer values; then
a.max(lambda x:x[1])
returns the key-value pair with the maximum value. The max function orders the elements by the return value of the lambda.
Here a is a pair RDD with elements such as ('key', int), and x[1] refers to the integer part of the element.
Note that max with no key function compares the pairs themselves, so it orders by key first and returns the pair with the largest key.
Documentation is available at
Spark RDDs are more time-efficient when they are left as RDDs and not turned into Arrays:
strinIntTuppleRDD.reduce((x, y) => if(x._2 > y._2) x else y)
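The anonymous `Ordering` used with `RDD.max` earlier can be written more compactly with `Ordering.by`. The sketch below runs on a local Array so it needs no SparkContext (an assumption made for runnability); since `RDD.max` takes an implicit `Ordering[T]`, the same `byCount` value should be usable as `rdd.max()(byCount)`. The names `MaxByCount` and `byCount` are illustrative.

```scala
object MaxByCount {
  val pairs: Array[(String, Int)] = Array(("a", 1), ("b", 2), ("c", 1), ("d", 3))

  // Order pairs by their count (second element) only.
  val byCount: Ordering[(String, Int)] = Ordering.by[(String, Int), Int](_._2)

  // Local stand-in for rdd.max()(byCount).
  val maxPair: (String, Int) = pairs.max(byCount)
}
```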

Applying operation to corresponding elements of Array

I want to sum the corresponding elements of the list and multiply the results, while keeping the label associated with the array element, so
("a", Array((0.5,1.0),(0.667,2.0)))
becomes:
(a , (0.5 + 0.667) * (1.0 + 2.0))
Here is my code to express this for a single array element :
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
//> data : Array[(String, Array[(Double, Double)])] = Array((a,Array((0.5,1.0),
//| (0.667,2.0))), (b,Array((0.6,2.0), (0.6,2.0))))
val v1 = (data(0)._1, data(0)._2.map(m => m._1).sum)
//> v1 : (String, Double) = (a,1.167)
val v2 = (data(0)._1, data(0)._2.map(m => m._2).sum)
//> v2 : (String, Double) = (a,3.0)
val total = (v1._1 , (v1._2 * v2._2)) //> total : (String, Double) = (a,3.5010000000000003)
I just want to apply this function to all elements of the array, so that val "data" above becomes:
Map[(String, Double)] = ((a,3.5010000000000003),(b,4.8))
But I'm not sure how to combine the above code into a single function that maps over all the array elements.
Update : the inner Array can be of variable length so this is also valid :
val data = Array(("a",Array((0.5,1.0,2.0),(0.667,2.0,1.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
Pattern matching is your friend! You can use it for tuples and arrays. If there are always two elements in the inner array, you can do it this way:
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
val result = data.map {
  case (s, Array((x1, x2), (x3, x4))) => s -> (x1 + x3) * (x2 + x4)
}
// Array[(String, Double)] = Array((a,3.5010000000000003), (b,4.8))
result.toMap
// scala.collection.immutable.Map[String,Double] = Map(a -> 3.5010000000000003, b -> 4.8)
If the inner elements are variable length, you could do it this way (a for comprehension instead of explicit maps):
for {
  (s, tuples) <- data
  sum1 = tuples.map(_._1).sum
  sum2 = tuples.map(_._2).sum
} yield s -> sum1 * sum2
Note that while this is a very clear solution, it's not the most efficient possible, because we're iterating over the tuples twice. You could use a fold instead, but it would be much harder to read (for me anyway. :)
Finally, note that .sum will produce zero on an empty collection. If that's not what you want, you could do this instead:
val emptyDefault = 1.0 // Or whatever, depends on your use case
for {
  (s, tuples) <- data
  sum1 = tuples.map(_._1).reduceOption(_ + _).getOrElse(emptyDefault)
  sum2 = tuples.map(_._2).reduceOption(_ + _).getOrElse(emptyDefault)
} yield s -> sum1 * sum2
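The single-pass fold mentioned above could look roughly like this. It is a sketch under the assumption that every inner element is a pair of Doubles; `SinglePass` and `sumsProduct` are illustrative names. Both sums are accumulated in one traversal of each inner array.

```scala
object SinglePass {
  def sumsProduct(data: Array[(String, Array[(Double, Double)])]): Array[(String, Double)] =
    data.map { case (label, tuples) =>
      // Carry both running sums through one foldLeft over the inner array.
      val (sum1, sum2) = tuples.foldLeft((0.0, 0.0)) {
        case ((acc1, acc2), (x, y)) => (acc1 + x, acc2 + y)
      }
      label -> sum1 * sum2
    }
}
```

As with `.sum`, an empty inner array gives 0.0 for both sums here, so the product is 0.0.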
You can use the Algebird numeric library:
val data = Array(("a",Array((0.5,1.0),(0.667,2.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
import com.twitter.algebird.Operators._
def sumAndProduct(a: Array[(Double, Double)]) = {
val sums = a.reduceLeft((m, n) => m + n)
sums._1 * sums._2
}
data.map { case (x, y) => (x, sumAndProduct(y)) }
// Array((a,3.5010000000000003), (b,4.8))
It will work fine for variable size array as well.
val data = Array(("a",Array((0.5,1.0))), ("b",Array((0.6,2.0), (0.6,2.0))))
data.map { case (x, y) => (x, sumAndProduct(y)) }
// Array((a,0.5), (b,4.8))
Like this? Does your array always have only 2 pairs?
val m = data map ({case (label,Array(a,b)) => (label, (a._1 + b._1) * (a._2 + b._2)) })