I have the following code:
val df2 = df.withColumn("col", expr("transform(col, x -> struct(x.amt as amt))"))
Output: [{"amt": 10000}, {"amt": 20000}]
I want to add up all the values for the amt key, so I collect them into a list as below:
df.withColumn("list_val", expr("transform(col, x -> x.amt)"))
Output: [10000,20000]
To sum the values, I have the following code, but I get the error "cannot resolve aggregate":
.withColumn("amount", aggregate($"list_val", lit(0), (x, y) => (x + y)))
How do I fix this code or is there any better way to add the values?
In Spark 2.4, aggregate must be used inside a Spark SQL expr. It is also better to add a type cast so that the accumulator and the element types match:
df.withColumn("amount", expr("aggregate(list_val, 0, (x, y) -> (x + int(y)))"))
// for float type; for double type, replace "float" with "double"
df.withColumn("amount", expr("aggregate(list_val, float(0), (x, y) -> (x + float(y)))"))
In the Scala API (aggregate is available in org.apache.spark.sql.functions from Spark 3.0) that would be
df.withColumn("amount", aggregate($"list_val", lit(0), (x, y) => x + y.cast("int")))
df.withColumn("amount", aggregate($"list_val", lit(0f), (x, y) => x + y.cast("float")))
df.withColumn("amount", aggregate($"list_val", lit(0.0), (x, y) => x + y.cast("double")))
Very simple question: I want to do something like this:
var arr1: Array[Double] = ...
var arr2: Array[Double] = ...
var arr3: Array[(Double,Double)] = arr1.zip(arr2)
arr3.foreach(x => {if (x._1 > threshold) {x._2 = x._2 * factor}})
I tried a lot of different syntax versions, but all of them failed. How can I solve this? It cannot be very difficult ...
Thanks!
There are multiple approaches. Consider, for instance, collect, which returns a new array arr4 and leaves arr3 unchanged:
val arr4 = arr3.collect {
case (x, y) if x > threshold => (x ,y * factor)
case v => v
}
Or with a for comprehension:
for ((x, y) <- arr3)
yield (x, if (x > threshold) y * factor else y)
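To check that the collect version and the for comprehension agree, here is a minimal self-contained sketch. The sample arrays, the values of threshold and factor, and the object name ScaleAboveThreshold are assumptions chosen only for illustration.

```scala
object ScaleAboveThreshold {
  // Sample inputs (assumed for illustration only).
  val threshold = 1.0
  val factor = 2.0
  val arr3: Array[(Double, Double)] =
    Array(0.5, 1.5, 2.0).zip(Array(10.0, 20.0, 30.0))

  // collect with a catch-all case keeps every element, so it behaves like map here.
  val viaCollect: Array[(Double, Double)] = arr3.collect {
    case (x, y) if x > threshold => (x, y * factor)
    case v                       => v
  }

  // The same transformation written as a for comprehension.
  val viaFor: Array[(Double, Double)] =
    for ((x, y) <- arr3) yield (x, if (x > threshold) y * factor else y)

  def main(args: Array[String]): Unit = {
    println(viaCollect.mkString(", "))
    println(viaFor.mkString(", "))
  }
}
```

Both versions build a new array; arr3 itself is not modified.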
I think you want to do something like
scala> val arr1 = Array(1.1, 1.2)
arr1: Array[Double] = Array(1.1, 1.2)
scala> val arr2 = Array(1.1, 1.2)
arr2: Array[Double] = Array(1.1, 1.2)
scala> val arr3 = arr1.zip(arr2)
arr3: Array[(Double, Double)] = Array((1.1,1.1), (1.2,1.2))
scala> arr3.filter(_._1> 1.1).map(_._2*2)
res0: Array[Double] = Array(2.4)
I think there are two problems:
You're using foreach, which returns Unit, where you want to use map, which returns an Array[B].
You're trying to update an immutable value, when you want to return a new, updated value. This is the difference between _._2 = _._2 * factor and _._2 * factor.
To filter the values not meeting the threshold:
arr1.zip(arr2).filter(_._1 > threshold).map(_._2 * factor)
To keep all values, but only multiply the ones meeting the threshold:
arr1.zip(arr2).map {
case (x, y) if x > threshold => y * factor
case (_, y) => y
}
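The difference between the two variants above shows up in the length of the result. The following sketch uses assumed sample values for arr1, arr2, threshold and factor; the object name FilterVersusMap is hypothetical.

```scala
object FilterVersusMap {
  // Sample inputs (assumed for illustration only).
  val threshold = 1.0
  val factor = 10.0
  val arr1 = Array(0.5, 1.5, 2.5)
  val arr2 = Array(1.0, 2.0, 3.0)

  // Drops the pairs whose first element is not above the threshold.
  val filtered: Array[Double] =
    arr1.zip(arr2).filter(_._1 > threshold).map(_._2 * factor)

  // Keeps every second element, scaling only those whose partner is above the threshold.
  val kept: Array[Double] = arr1.zip(arr2).map {
    case (x, y) if x > threshold => y * factor
    case (_, y)                  => y
  }

  def main(args: Array[String]): Unit = {
    println(filtered.mkString(", ")) // two elements
    println(kept.mkString(", "))     // three elements
  }
}
```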
You can do it with this,
arr3.map(x => if (x._1 > threshold) (x._1, x._2 * factor) else x)
How about this?
arr3.map { case (x1, x2) => // extract first and second value
if (x1 > threshold) (x1, x2 * factor) // if the first value is greater than the threshold, 'change' x2
else (x1, x2) // otherwise leave it as it is
}
Scala is generally functional, which means you do not change values but create new ones. For example, you do not write x._2 = …, since a tuple is immutable (you can't change it); you create a new tuple instead.
This will do what you need.
arr3.map(x => if (x._1 > threshold) (x._1, x._2 * factor) else x)
The key here is that you can return a tuple from the map lambda expression by putting two values into (..).
Edit: If you want to change every element of the array without creating a new array, assign a new tuple to each index instead:
arr3.indices.foreach(i => if (arr3(i)._1 > threshold) arr3(i) = (arr3(i)._1, arr3(i)._2 * factor))
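An in-place update works because the array itself is mutable even though each tuple is not: a slot can be overwritten with a new tuple. Here is a minimal sketch; the helper name scaleInPlace, its parameters and the sample values are assumptions, not part of the original question.

```scala
object InPlaceUpdate {
  // Overwrites the slots of arr whose first component exceeds threshold.
  // (Hypothetical helper, named for illustration.)
  def scaleInPlace(arr: Array[(Double, Double)], threshold: Double, factor: Double): Unit =
    for (i <- arr.indices) {
      val (x, y) = arr(i)
      if (x > threshold) arr(i) = (x, y * factor) // new tuple, same array
    }

  def main(args: Array[String]): Unit = {
    val arr3 = Array((0.5, 10.0), (1.5, 20.0)) // assumed sample data
    scaleInPlace(arr3, threshold = 1.0, factor = 2.0)
    println(arr3.mkString(", "))
  }
}
```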
I don't yet understand the aggregate function.
For example, having:
val x = List(1,2,3,4,5,6)
val y = x.par.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1), (x,y) => (x._1 + y._1, x._2 + y._2))
The result will be: (21,6)
Well, I think that (x,y) => (x._1 + y._1, x._2 + y._2) is to get the result in parallel, for example it will be (1 + 2, 1 + 1) and so on.
But exactly this part that leaves me confused:
(x, y) => (x._1 + y, x._2 + 1)
Why x._1 + y? And is x._2 equal to 0 here?
Thanks in advance.
First of all, thanks to Diego's reply, which helped me connect the dots in understanding the aggregate() function.
Let me confess that I couldn't sleep last night properly because I couldn't get how aggregate() works internally, I'll get good sleep tonight definitely :-)
Let's start understanding it
val result = List(1,2,3,4,5,6,7,8,9,10).par.aggregate((0, 0))(
(x, y) => (x._1 + y, x._2 + 1),
(x, y) => (x._1 + y._1, x._2 + y._2)
)
result: (Int, Int) = (55,10)
The aggregate function has 3 parts:
initial value of the accumulator: the tuple (0,0) here
seqop: it works like foldLeft, starting from the initial value (0,0)
combop: it combines the results generated through parallelization (this part was difficult for me to understand)
Let's understand all 3 parts independently :
part-1 : Initial tuple (0,0)
aggregate() starts with the initial value of the accumulator x, which is (0,0) here. The first tuple element x._1, initially 0, is used to compute the sum; the second tuple element x._2 is used to count the elements in the list.
part-2 : (x, y) => (x._1 + y, x._2 + 1)
If you know how foldLeft works in Scala, this part should be easy to understand. The function above works just like foldLeft on our List(1,2,3,4...10).
Iteration# (x._1 + y, x._2 + 1)
1 (0+1, 0+1)
2 (1+2, 1+1)
3 (3+3, 2+1)
4 (6+4, 3+1)
. ....
. ....
10 (45+10, 9+1)
Thus, after all 10 iterations, you'll get the result (55,10).
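The iteration table above can be reproduced with the sequential collection methods foldLeft and scanLeft. This is only a sketch of what seqop does within a single partition; the names SeqOpAsFold, seqOp and steps are chosen here for illustration.

```scala
object SeqOpAsFold {
  // The first function passed to aggregate: (accumulator, element) => accumulator.
  val seqOp: ((Int, Int), Int) => (Int, Int) =
    (acc, elem) => (acc._1 + elem, acc._2 + 1)

  val numbers: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

  // scanLeft keeps every intermediate accumulator, starting with (0, 0).
  val steps: List[(Int, Int)] = numbers.scanLeft((0, 0))(seqOp)

  // foldLeft keeps only the final accumulator.
  val result: (Int, Int) = numbers.foldLeft((0, 0))(seqOp)

  def main(args: Array[String]): Unit = {
    steps.foreach(println)
    println(result)
  }
}
```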
If you understand this part, the rest is easy. For me, the hardest thing to understand was this: if all the required computation is already finished, what is the use of the second function, i.e. combOp? Stay tuned :-)
part 3 : (x,y) =>(x._1 + y._1, x._2 + y._2)
Well, this third part is combOp, which combines the results generated by different threads during parallelization. Remember, we used 'par' in our code to enable parallel computation over the list:
List(1,2,3,4,5,6,7,8,9,10).par.aggregate(....)
Apache Spark uses an aggregate function of the same shape to do parallel computation over RDDs.
Let's assume that our List(1,2,3,4,5,6,7,8,9,10) is being computed by 3 threads in parallel. Each thread works on a partial list, and then aggregate()'s combOp combines the results of each thread's computation using the code below:
(x,y) =>(x._1 + y._1, x._2 + y._2)
Original list : List(1,2,3,4,5,6,7,8,9,10)
Thread1 starts computing on a partial list, say (1,2,3,4); Thread2 computes (5,6,7,8) and Thread3 computes the partial list (9,10).
At the end of computation, Thread-1 result will be (10,4), Thread-2 result will be (26,4) and Thread-3 result will be (19,2).
At the end of parallel computation, we'll have ((10,4),(26,4),(19,2))
Iteration# (x._1 + y._1, x._2 + y._2)
1 (0+10, 0+4)
2 (10+26, 4+4)
3 (36+19, 8+2)
which is (55,10).
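The three-thread scenario above can be imitated without parallel collections. The sketch below uses grouped(4) as a stand-in for the split into partial lists; that split is an assumption made for illustration, since a real parallel collection decides its own partitioning. The object name CombOpSimulation is hypothetical.

```scala
object CombOpSimulation {
  val seqOp: ((Int, Int), Int) => (Int, Int) =
    (acc, elem) => (acc._1 + elem, acc._2 + 1)

  val combOp: ((Int, Int), (Int, Int)) => (Int, Int) =
    (left, right) => (left._1 + right._1, left._2 + right._2)

  val numbers: List[Int] = (1 to 10).toList

  // Stand-in for three threads: chunks (1,2,3,4), (5,6,7,8) and (9,10),
  // each folded independently with seqOp starting from (0, 0).
  val partials: List[(Int, Int)] =
    numbers.grouped(4).map(_.foldLeft((0, 0))(seqOp)).toList

  // combOp merges the partial results into the final (sum, count).
  val combined: (Int, Int) = partials.reduce(combOp)

  def main(args: Array[String]): Unit = {
    println(partials)
    println(combined)
  }
}
```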
Finally, let me reiterate that seqOp's job is to compute the sum of all the elements of the list and the number of elements in it, whereas the combine function's job is to combine the partial results generated during parallelization.
I hope the explanation above helps you understand aggregate().
From the documentation:
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
Aggregates the results of applying an operator to subsequent elements.
This is a more general form of fold and reduce. It has similar
semantics, but does not require the result to be a supertype of the
element type. It traverses the elements in different partitions
sequentially, using seqop to update the result, and then applies
combop to results from different partitions. The implementation of
this operation may operate on an arbitrary number of collection
partitions, so combop may be invoked an arbitrary number of times.
For example, one might want to process some elements and then produce
a Set. In this case, seqop would process an element and append it to
the list, while combop would concatenate two lists from different
partitions together. The initial value z would be an empty set.
pc.aggregate(Set[Int]())(_ += process(_), _ ++ _)
Another example is calculating geometric mean from a collection of doubles (one would typically require big doubles for this).
B: the type of accumulated results
z: the initial value for the accumulated result of the partition - this will typically be the neutral element for the seqop operator (e.g. Nil for list concatenation or 0 for summation) and may be evaluated more than once
seqop: an operator used to accumulate results within a partition
combop: an associative operator used to combine results from different partitions
In your example B is a Tuple2[Int, Int]. The method seqop then takes a single element from the list, scoped as y, and updates the aggregate B to (x._1 + y, x._2 + 1). So it increments the second element in the tuple. This effectively puts the sum of elements into the first element of the tuple and the number of elements into the second element of the tuple.
The method combop then takes the results from each parallel execution thread and combines them. Combination by addition provides the same results as if it were run on the list sequentially.
Using B as a tuple is likely the confusing piece of this. You can break the problem down into two sub problems to get a better idea of what this is doing. res0 is the first element in the result tuple, and res1 is the second element in the result tuple.
// Sums all elements in parallel.
scala> x.par.aggregate(0)((x, y) => x + y, (x, y) => x + y)
res0: Int = 21
// Counts all elements in parallel.
scala> x.par.aggregate(0)((x, y) => x + 1, (x, y) => x + y)
res1: Int = 6
aggregate takes 3 parameters: a seed value, a computation function and a combination function.
What it does is basically split the collection in a number of threads, compute partial results using the computation function and then combine all these partial results using the combination function.
From what I can tell, your example function will return a pair (a, b) where a is the sum of the values in the list, b is the number of values in the list. Indeed, (21, 6).
How does this work? The seed value is the (0,0) pair. For an empty list, we have a sum of 0 and a number of items 0, so this is correct.
Your computation function takes an (Int, Int) pair x, which is your partial result, and an Int y, which is the next value in the list. This is your:
(x, y) => (x._1 + y, x._2 + 1)
Indeed, the result that we want is to increase the left element of x (the accumulator) by y, and the right element of x (the counter) by 1 for each y.
Your combination function takes an (Int, Int) pair x and an (Int, Int) pair y, which are your two partial results from different parallel computations, and combines them together as:
(x,y) => (x._1 + y._1, x._2 + y._2)
Indeed, we sum independently the left parts of the pairs and right parts of the pairs.
Your confusion comes from the fact that x and y in the first function ARE NOT the same x and y of the second function. In the first function, you have x of the type of the seed value, and y of the type of the collection elements, and you return a result of the type of x. In the second function, your two parameters are both of the same type of your seed value.
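The type difference between the two functions is easier to see when the element type and the seed type are clearly different. The sketch below is an assumed example, not taken from the question: the elements are Strings and the accumulator is an Int holding the total length. Partitions are again imitated with grouped, and all names are hypothetical.

```scala
object OperatorTypes {
  val words: List[String] = List("fold", "reduce", "aggregate")

  // seqOp: (seed type, element type) => seed type
  val seqOp: (Int, String) => Int = (total, word) => total + word.length

  // combOp: (seed type, seed type) => seed type
  val combOp: (Int, Int) => Int = (left, right) => left + right

  // Two imitated partitions: ("fold", "reduce") and ("aggregate").
  val partials: List[Int] = words.grouped(2).map(_.foldLeft(0)(seqOp)).toList

  val total: Int = partials.reduce(combOp)

  def main(args: Array[String]): Unit = {
    println(partials)
    println(total)
  }
}
```

Swapping the two functions would not compile here, because combOp never sees a String.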
Hope it's clearer now!
Adding to Rashmit's answer.
combOp is called only if the collection is processed in parallel mode.
See the example below:
val listP: ParSeq[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).par
val aggregateOp1 = listP.aggregate[String]("Aggregate-")((a, b) => a + b, (s1, s2) => {
println("Combiner called , if collections is processed parallel mode")
s1 + "," + s2
})
println(aggregateOp1)
Output (one possible result; the grouping depends on how the collection was split between threads): Aggregate-1,Aggregate-2,Aggregate-3,Aggregate-45,Aggregate-6,Aggregate-7,Aggregate-8,Aggregate-910
val list: Seq[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val aggregateOp2 = list.aggregate[String]("Aggregate-")((a, b) => a + b, (s1, s2) => {
println("Combiner called , if collections is processed parallel mode")
s1 + "," + s2
})
println(aggregateOp2)
OP : Aggregate-12345678910
In the example above, the combiner operation is called only when the collection is processed in parallel.
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
Breaking that down a little :
aggregate(accumulator)(accumulator+first_elem_of_list, (seq1,seq2)=>seq1+seq2)
Now looking at the example:
val x = List(1,2,3,4,5,6)
val y = x.par.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1), (x,y) => (x._1 + y._1, x._2 + y._2))
Here:
Accumulator is (0,0)
Defined list is x
First elem of x is 1
So for each iteration, we are taking the accumulator and adding the elements of x to position 1 of the accumulator to get the sum and increasing position 2 of the accumulator by 1 to get the count. (y is the elements of the list)
(x, y) => (x._1 + y, x._2 + 1)
Now, since this is a parallel implementation, the first portion will give rise to a list of tuples like (3,2) (7,2) and (11,2). index 1 = Sum, index 2 = count of elements used to generate sum. Now the second portion comes into play. The elements of each sequence are added in a reduce fashion.
(x,y) =>(x._1 + y._1, x._2 + y._2)
Rewriting with more meaningful variable names:
val arr = Array(1,2,3,4,5,6)
arr.par.aggregate((0,0))((accumulator,list_elem)=>(accumulator._1+list_elem, accumulator._2+1), (seq1, seq2)=> (seq1._1+seq2._1, seq1._2+seq2._2))
Why can (1 :: xs) be inserted?
1 is cons'd onto the beginning of the list xs.
So List(3,2,1) becomes List(1,3,2,1), but what is the significance of (1 :: xs)?
I'm having trouble understanding how this works :
def product(xs : List[Int]) = (1 :: xs) reduceLeft((x , y) => x * y)
The method signature does not describe a prefix operand (in this case (1 :: xs)):
def reduceLeft[B >: A](f: (B, A) => B): B =
(1 :: xs) is not a prefix operand.
You are actually adding 1 before your list xs.
So product(List(3,2,1)) becomes List(1,3,2,1) reduceLeft((x,y) => x * y).
The reduceLeft function takes the two leftmost elements and replaces them with the result of your function (x,y) => x * y.
In your case
List(1,3,2,1) => takes (1,3) and replaces them with 1 * 3 = 3; new list: List(3,2,1)
List(3,2,1) => takes (3,2) and replaces them with 3 * 2 = 6; new list: List(6,1)
Finally it takes (6,1) and gets the final result 6.
As multiplying by one has no effect on the product, we add the number 1 before the list to avoid an error if the list is empty.
Remove that and try product(List()) and you will see. If the list has at least one element, (1 :: xs) has no effect on the result.
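The empty-list behaviour can be checked with a short sketch. unsafeProduct and foldProduct are hypothetical names introduced here for comparison; foldLeft with a seed of 1 is shown as an alternative way to get the same result, not as the approach from the question.

```scala
import scala.util.Try

object ProductDemo {
  // The version from the question: 1 is prepended, so the list is never empty.
  def product(xs: List[Int]): Int = (1 :: xs).reduceLeft((x, y) => x * y)

  // Without the leading 1, reduceLeft throws on an empty list.
  def unsafeProduct(xs: List[Int]): Int = xs.reduceLeft((x, y) => x * y)

  // Alternative: foldLeft takes the neutral element 1 as an explicit seed.
  def foldProduct(xs: List[Int]): Int = xs.foldLeft(1)(_ * _)

  def main(args: Array[String]): Unit = {
    println(product(List(3, 2, 1)))
    println(product(Nil))
    println(Try(unsafeProduct(Nil)).isFailure)
    println(foldProduct(Nil))
  }
}
```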
I believe you understand cons just fine. With xs = List(3,2,1), (1 :: xs) is simply another way to express List(1,3,2,1), on which you then invoke reduceLeft.
As for a better understanding of reduceLeft, I recently blogged this exact topic.
That's not a prefix operand--it's a method invocation on a List instance. The method reduceLeft is being called on the List (1 :: xs).
(1 :: xs) reduceLeft((x , y) => x * y)
can also be written as
(1 :: xs).reduceLeft((x , y) => x * y)
Or, even more explicitly:
val myList = (1 :: xs)
myList.reduceLeft((x , y) => x * y)