add values from struct keys spark - scala

I have the following code:
val df2 = df.withColumn("col", expr("transform(col, x -> struct(x.amt as amt))"))
Output: [{"amt": 10000}, {"amt": 20000}]
I want to add up all the values for the amt key, so I am collecting all the values into a list as below:
df.withColumn("list_val", expr("transform(col, x -> x.amt)"))
Output: [10000,20000]
To sum the values, I have the following code, but I am getting the error "cannot resolve aggregate":
.withColumn("amount", aggregate($"list_val", lit(0), (x, y) => (x + y)))
How do I fix this code or is there any better way to add the values?

In Spark 2.4, aggregate is only available as a SQL function, so it has to be used inside a Spark SQL expr. It is also better to add a type cast to ensure there is no type mismatch:
df.withColumn("amount", expr("aggregate(list_val, 0, (x, y) -> (x + int(y)))"))
// for float values; for double, replace "float" with "double"
df.withColumn("amount", expr("aggregate(list_val, float(0), (x, y) -> (x + float(y)))"))
In the Scala API (available since Spark 3.0) that would be:
df.withColumn("amount", aggregate($"list_val", lit(0), (x, y) => x + y.cast("int")))
df.withColumn("amount", aggregate($"list_val", lit(0f), (x, y) => x + y.cast("float")))
df.withColumn("amount", aggregate($"list_val", lit(0.0), (x, y) => x + y.cast("double")))
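For a quick end-to-end check, here is a minimal sketch (assuming a spark-shell style session with a SparkSession named spark in scope; the column name list_val comes from the question):
import org.apache.spark.sql.functions.expr
import spark.implicits._

val df = Seq(Seq(10000, 20000)).toDF("list_val")
df.withColumn("amount", expr("aggregate(list_val, 0, (x, y) -> x + int(y))")).show()
// +--------------+------+
// |      list_val|amount|
// +--------------+------+
// |[10000, 20000]| 30000|
// +--------------+------+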

Related

How to sum up pair elements individually in Scala

I have the following method to sum up the pair elements in an array of pairs. I am new to Scala and feel like there must be a better way than the following piece of code.
def accumulate(results: Array[(Int, Int)]): (Int, Int) = {
  var x: Int = 0
  var y: Int = 0
  for (elem <- results) {
    x = x + elem._1
    y = y + elem._2
  }
  (x, y)
}
Yes, you can use foldLeft.
(BTW, I would also use List, instead of Array)
results.foldLeft((0, 0)) {
  case ((accX, accY), (x, y)) =>
    (accX + x, accY + y)
}
All of the operations in scala.collection.ArrayOps are available on Array[T]. In particular, you can unzip an array of pairs into a pair of arrays
val (xs, ys) = results.unzip
Summing a container is a standard use of fold
val x = xs.fold(0)(_ + _)
val y = ys.fold(0)(_ + _)
And then you can return the pair of values
(x, y)
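Putting those pieces together, a complete version of the method (equivalent to the foldLeft solution above) might look like:
def accumulate(results: Array[(Int, Int)]): (Int, Int) = {
  val (xs, ys) = results.unzip           // split the pairs into two arrays
  (xs.fold(0)(_ + _), ys.fold(0)(_ + _)) // sum each array and repack
}

accumulate(Array((1, 2), (3, 4))) // (4, 6)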
https://scalafiddle.io/sf/meEKv6T/0 has a complete working example.

scala parallel collections not consistent

I am getting inconsistent answers from the following code, which I find odd.
import scala.math.pow
val p = 2
val a = Array(1,2,3)
println(a.par
  .aggregate("0")((x, y) => s"$y pow $p; ", (x, y) => x + y))
for (i <- 1 to 100) {
  println(a.par
    .aggregate(0.0)((x, y) => pow(y, p), (x, y) => x + y) == 14)
}
a.map(x => pow(x, p)).sum
In this code, the a.par ... aggregate computes either 14 or 10. Can anyone explain why it computes inconsistently?
In your "seqop" function, that is, the first function you pass to aggregate, you define how elements are combined within the same partition. Your function looks like this:
(x, y) => pow(y, p)
The problem is that you never accumulate the results within a partition; instead, you throw away your accumulator x. Every time you get 10 as a result, the calculation 2^2 = 4 was dropped.
If you change your function to take the accumulated value into account, you will get 14 every time:
(x, y) => x + pow(y, p)
The correct way to use aggregate is:
a.par.aggregate(0.0)(
  (acc, value) => acc + pow(value, 2),
  (acc1, acc2) => acc1 + acc2
)
By using (x, y) => pow(y, 2), you did not add the item to the accumulator but simply replaced the accumulator with pow(y, 2).
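To see the fix in action, here is a small self-contained check (a sketch; note that on Scala 2.13+ the .par conversion requires the separate scala-parallel-collections module):
import scala.math.pow

val a = Array(1, 2, 3)
// With the accumulator used correctly, every run yields 14.0:
val allConsistent = (1 to 100).forall { _ =>
  a.par.aggregate(0.0)((acc, y) => acc + pow(y, 2), (acc1, acc2) => acc1 + acc2) == 14.0
}
println(allConsistent) // prints true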

Scala - error: type not found

I am a newbie in Scala and I have an error that I cannot understand. Here is my array of Int (numbers from 1 to 100):
val rdd = sc.parallelize(1 to 100)
Next I wrote a function that returns the MAX value of my array:
rdd.reduce((x, y) => x > y ? x : y)
But I always get this error:
<console>:30: error: not found: type y
rdd.reduce((x, y) => x > y ? x : y)
^
I don't really know what the error means, so I can't find a solution. But if I use my function like this, it works:
rdd.reduce((x, y) => if(x > y) x else y)
Thank you for your answers!
There is no ? : operator in Scala, use if instead:
rdd.reduce((x, y) => if (x > y) x else y)
Or use max instead of building it on your own:
rdd.reduce((x, y) => x max y)
Or with _ syntax for anonymous function:
rdd.reduce(_ max _)
Or avoid implementing max over the collection yourself:
rdd.max
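For reference, the same alternatives work on an ordinary Scala collection, no Spark needed:
val xs = (1 to 100).toList
xs.reduce((x, y) => if (x > y) x else y) // 100
xs.reduce(_ max _)                       // 100
xs.max                                   // 100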

Scala: apply Map to a list of tuples

Very simple question: I want to do something like this:
var arr1: Array[Double] = ...
var arr2: Array[Double] = ...
var arr3: Array[(Double,Double)] = arr1.zip(arr2)
arr3.foreach(x => {if (x._1 > threshold) {x._2 = x._2 * factor}})
I tried a lot of different syntax variations, but all of them failed. How can I solve this? It cannot be very difficult ...
Thanks!
There are multiple approaches to solve this. Consider, for instance, the use of collect, which delivers a new collection arr4, as follows:
val arr4 = arr3.collect {
  case (x, y) if x > threshold => (x, y * factor)
  case v => v
}
Or with a for comprehension like this:
for ((x, y) <- arr3)
  yield (x, if (x > threshold) y * factor else y)
I think you want to do something like
scala> val arr1 = Array(1.1, 1.2)
arr1: Array[Double] = Array(1.1, 1.2)
scala> val arr2 = Array(1.1, 1.2)
arr2: Array[Double] = Array(1.1, 1.2)
scala> val arr3 = arr1.zip(arr2)
arr3: Array[(Double, Double)] = Array((1.1,1.1), (1.2,1.2))
scala> arr3.filter(_._1 > 1.1).map(_._2 * 2)
res0: Array[Double] = Array(2.4)
I think there are two problems:
You're using foreach, which returns Unit, where you want to use map, which returns an Array[B].
You're trying to update an immutable value, when you want to return a new, updated value. This is the difference between _._2 = _._2 * factor and _._2 * factor.
To filter out the values not meeting the threshold:
arr1.zip(arr2).filter(_._1 > threshold).map(_._2 * factor)
To keep all values, but only multiply the ones meeting the threshold:
arr1.zip(arr2).map {
  case (x, y) if x > threshold => y * factor
  case (_, y) => y
}
You can do it with this:
arr3.map(x => if (x._1 > threshold) (x._1, x._2 * factor) else x)
How about this?
arr3.map { case (x1, x2) =>             // extract the first and second value
  if (x1 > threshold) (x1, x2 * factor) // if the first value is greater than the threshold, 'change' x2
  else (x1, x2)                         // otherwise leave it as it is
}
Scala is generally functional, which means you do not change values but create new ones. For example, you do not write x._2 = …, since a tuple is immutable (you cannot change it); you create a new tuple instead.
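For example:
val t = (1.0, 2.0)
// t._2 = 3.0                  // does not compile: tuples are immutable
val t2 = (t._1, t._2 * 2)      // build a new tuple instead
val t3 = t.copy(_2 = t._2 * 2) // equivalent, using Tuple2's copy method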
This will do what you need.
arr3.map(x => if (x._1 > threshold) (x._1, x._2 * factor) else x)
The key here is that you can return a tuple from the map lambda expression by putting two values into (..).
Edit: if you want to change every element of the array in place, without creating a new array, you need to do the following:
arr3.indices.foreach(i => if (arr3(i)._1 > threshold) arr3(i) = (arr3(i)._1, arr3(i)._2 * factor))

syntax explanation for pattern matching a list in scala

I was reading this blog post and I was not able to understand a part of the code.
object O {
  def maximum(x: List[Int]): Int = x match {
    case Nil => sys.error("maximum undefined for empty list")
    case x :: y :: ys => maximum((if (x > y) x else y) :: ys)
    case x :: _ => x
  }
}
Please explain the line maximum((if (x > y) x else y) :: ys).
How can the if condition be part of the call to maximum?
I understand that the if condition is not exactly a parameter.
In Scala, if is an expression, not a statement.
Try this in the REPL:
scala> val x=1; val y=0
x: Int = 1
y: Int = 0
scala> val test=if(x > y) x else y
test: Int = 1
The if evaluates to 1, and 1 is assigned to test. In Java, this if could be expressed with the conditional (ternary) operator (x > y) ? x : y.
Now, you have a function called maximum that takes a List[Int] as a parameter.
maximum((if (x > y) x else y) :: ys) calls maximum recursively with a list obtained by prepending whichever of x and y the if evaluates to (the larger of the two) to ys.
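For example, tracing maximum(List(3, 1, 4)) step by step:
// List(3, 1, 4) matches x :: y :: ys with x = 3, y = 1, ys = List(4)
//   if (3 > 1) evaluates to 3, so we recurse on 3 :: List(4) = List(3, 4)
// List(3, 4)    matches x :: y :: ys with x = 3, y = 4, ys = Nil
//   if (3 > 4) evaluates to 4, so we recurse on 4 :: Nil = List(4)
// List(4)       matches x :: _, so the result is 4
O.maximum(List(3, 1, 4)) // 4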