I have a spark DataFrame (df) which looks like this:
+---+---+----+----+
| c1| c2|  c3|  c4|
+---+---+----+----+
|  1|  5|null|   7|
|  1|  5|   4|   8|
|  1|  3|null|  11|
|  1|  3|null|null|
|  2|  6|  23|  17|
|  2|  6|   7|   3|
|  2|  3|null|  11|
|  2|  3|null|  17|
+---+---+----+----+
I want to aggregate using (c1,c2) as key and have average of c3 and c4, so that I have this:
+---+---+----+----+
| c1| c2|  c3|  c4|
+---+---+----+----+
|  1|  5|   4| 7.5|
|  1|  3|null|  11|
|  2|  6|  15|  10|
|  2|  3|null|  14|
+---+---+----+----+
So, essentially I am ignoring the null values.
My half-baked code looks something like this:
val df1 = df.
  // just working on c3 for time being
  map(x => ((x.getInt(0), x.getInt(1)), x.getDouble(3))).
  reduceByKey(
    (x, y) => {
      var temp = 0
      var sum = 0.0
      var flag = false
      if (x == null) {
        if (y != null) {
          temp = temp + 1
          sum = y
          flag = true
        }
      } else {
        if (y == null) {
          temp = temp + 1
          sum = x
        } else {
          temp = temp + 1
          sum = x + y
          flag = true
        }
      }
      if (flag == false) {
        null
      } else {
        sum / temp
      }
    }
  )
Obviously, the above code is not working. Any help to make the code work is very much appreciated.
Edit 1: The answer given by @zero232 is a solution. However, it is not "the solution" I am looking for. My interest is in understanding how to deal with null values when writing a custom function for reduceByKey(). I am re-asking the question below:
I want to aggregate using (c1, c2) as key and compute the root mean square [{sum(a_i^2)}^0.5] (or some other function that is not available in Spark, for that matter) of c3 and c4 while ignoring the null values, so that I have this:
+---+---+-----+-----+
| c1| c2|   c3|   c4|
+---+---+-----+-----+
|  1|  5|    4|10.63|
|  1|  3| null|   11|
|  2|  6|24.04|17.26|
|  2|  3| null|20.24|
+---+---+-----+-----+
Just groupBy and use mean:
df.groupBy("c1", "c2").mean("c3", "c4")
or agg
df.groupBy("c1", "c2").agg(avg("c3"), avg("c4"))
Typically, all primitive functions on DataFrames handle null values correctly. For the root-mean-square case from the edit, you can build the aggregation from the primitives that are available:
import org.apache.spark.sql.functions._
def rms(c: String) = sqrt(avg(pow(col(c), 2))).alias(s"rms($c)")
df.groupBy("c1", "c2").agg(rms("c3"), rms("c4"))
If you want to ignore nulls with RDDs, just filter them out before you apply the reduction:
somePairRDD.filter(_._2 != null)
  .foldByKey(someDefaultValue)(someReducingFunction)
or convert values to Option and use pattern matching:
somePairRDD.mapValues(Option(_)).reduceByKey {
  case (Some(x), Some(y)) => doSomething(x, y)
  case (Some(x), _)       => doSomething(x)
  case (_, Some(y))       => doSomething(y)
  case _                  => someDefault
}
or use map / flatMap / getOrElse and other standard tools to handle undefined values.
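To address the edit directly, here is a hedged sketch (my own illustration, not the answer's code) of pushing the null handling into a custom reduceByKey aggregation for the root mean square of c3 per (c1, c2): drop the null values up front and reduce over (sum of squares, count) pairs instead of raw values. The getters assume c1 and c2 are integers and c3 is a double; adjust them to the actual schema.

// RMS of c3 per (c1, c2), ignoring nulls, with a custom reduction.
val rmsC3 = df.rdd
  .flatMap { row =>
    // Skip rows where c3 is null; otherwise emit ((c1, c2), (c3^2, 1)).
    if (row.isNullAt(2)) None
    else {
      val v = row.getDouble(2)
      Some(((row.getInt(0), row.getInt(1)), (v * v, 1L)))
    }
  }
  .reduceByKey { case ((sq1, n1), (sq2, n2)) => (sq1 + sq2, n1 + n2) }
  .mapValues { case (sumSq, n) => math.sqrt(sumSq / n) }

Note that keys whose values are all null simply disappear from the result instead of mapping to null; keeping them would require carrying Options (or a zero count) through the reduction instead of filtering first.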
Related
I have a dataframe with different columns; what I am trying to do is compute the mean of these columns while ignoring null values. For example:
+--------+-------+--------+-----+
| Baller | Power | Vision | KXD |
+--------+-------+--------+-----+
| John   |     5 |   null |  10 |
| Bilbo  |     5 |      3 |   2 |
+--------+-------+--------+-----+
The output has to be:
+--------+-------+--------+-----+------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+--------+-----+------+
| John   |     5 |   null |  10 |  7.5 |
| Bilbo  |     5 |      3 |   2 | 3.33 |
+--------+-------+--------+-----+------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I get null values:
+--------+-------+--------+-----+------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+--------+-----+------+
| John   |     5 |   null |  10 | null |
| Bilbo  |     5 |      3 |   2 | 3.33 |
+--------+-------+--------+-----+------+
You can explode the columns and do a group by + mean, then join back to the original dataframe using the Baller column:
val result = df.join(
df.select(
col("Baller"),
explode(array(col("Power"), col("Vision"), col("KXD")))
).groupBy("Baller").agg(mean("col").as("MEAN")),
Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+
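Alternatively, here is a hedged sketch that stays column-wise and avoids the join, along the lines of the original attempt: treat nulls as 0 in the sum, but divide by the number of non-null columns per row instead of by a_cols.length. The column names are the ones from the question; nothing else is assumed.

import org.apache.spark.sql.functions._

val aCols = Seq(col("Power"), col("Vision"), col("KXD"))

// The sum treats nulls as 0; the divisor counts only the non-null columns in each row.
val sumNonNull   = aCols.map(c => coalesce(c, lit(0))).reduce(_ + _)
val countNonNull = aCols.map(c => when(c.isNotNull, lit(1)).otherwise(lit(0))).reduce(_ + _)

// Spark's `/` returns a double; if every column is null the divisor is 0 and MEAN stays null.
val avgCalc = df.withColumn("MEAN", sumNonNull / countNonNull)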
Input
+---+---+---+---+
| id|  a|  b|  c|
+---+---+---+---+
|  1|  1|  0|  1|
+---+---+---+---+

Output
+---+---+---+---+---+---+---+
| id|  a|  b|  c|a_b|a_c|b_c|
+---+---+---+---+---+---+---+
|  1|  1|  0|  1|  0|  1|  0|
+---+---+---+---+---+---+---+
Basically, I have a sequence of pairs, Seq((a,b), (a,c), (b,c)), and their values for the new columns will be col(a)*col(b), col(a)*col(c), col(b)*col(c).
I know how to add them to a DataFrame, but I am not able to make a transform of return type DataFrame => DataFrame.
Is this what you want?
Take a look at the API page. You will save yourself some time :)
import spark.implicits._ // needed for toDF and the $"col" syntax; assumes a SparkSession named spark

val df = Seq((1, 1, 0, 1))
  .toDF("id", "a", "b", "c")
  .withColumn("a_b", $"a" * $"b")
  .withColumn("a_c", $"a" * $"c")
  .withColumn("b_c", $"b" * $"c")
Output:
+---+---+---+---+---+---+---+
| id| a| b| c|a_b|a_c|b_c|
+---+---+---+---+---+---+---+
| 1| 1| 0| 1| 0| 1| 0|
+---+---+---+---+---+---+---+
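As for the DataFrame => DataFrame transform part of the question, one hedged way to sketch it is to fold over the sequence of pairs; the pairs value and the addProducts helper below are names introduced here for illustration:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val pairs = Seq(("a", "b"), ("a", "c"), ("b", "c"))

// Returns a DataFrame => DataFrame that adds one product column per pair.
def addProducts(pairs: Seq[(String, String)]): DataFrame => DataFrame =
  df => pairs.foldLeft(df) { case (acc, (x, y)) =>
    acc.withColumn(s"${x}_$y", col(x) * col(y))
  }

// Usable with Dataset.transform:
val withProducts = df.transform(addProducts(pairs))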
This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I'm not sure of a good way to phrase the question, but an example will help. Here is the dataframe that I have with the columns: name, type, and count:
+------+------+-------+
| Name | Type | Count |
+------+------+-------+
| a | 0 | 5 |
| a | 1 | 4 |
| a | 5 | 5 |
| a | 4 | 5 |
| a | 2 | 1 |
| b | 0 | 2 |
| b | 1 | 4 |
| b | 3 | 5 |
| b | 4 | 5 |
| b | 2 | 1 |
| c | 0 | 5 |
| c | ... | ... |
+------+------+-------+
I want to get a new dataframe structured like this where the Type column values have become new columns:
+------+---+-----+---+---+---+---+
| Name | 0 | 1 | 2 | 3 | 4 | 5 | <- Number columns are types from input
+------+---+-----+---+---+---+---+
| a | 5 | 4 | 1 | 0 | 5 | 5 |
| b | 2 | 4 | 1 | 5 | 5 | 0 |
| c | 5 | ... | | | | |
+------+---+-----+---+---+---+---+
The columns here are [Name,0,1,2,3,4,5].
You can do this by using the pivot function in Spark.
val df2 = df.groupBy("Name").pivot("Type").sum("Count")
Here, if the name and type are the same for two rows, the count values are simply added together, but other aggregations are possible as well.
Resulting dataframe when using the example data in the question:
+----+---+----+----+----+----+----+
|Name| 0| 1| 2| 3| 4| 5|
+----+---+----+----+----+----+----+
| c| 5|null|null|null|null|null|
| b| 2| 4| 1| 5| 5|null|
| a| 5| 4| 1|null| 5| 5|
+----+---+----+----+----+----+----+
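One hedged follow-up: the desired table in the question shows 0 rather than null for missing Name/Type combinations. If that matters, the nulls produced by the pivot can be filled afterwards:

// Fill the nulls left by the pivot with 0 so the result matches the zero-filled table above.
val df3 = df2.na.fill(0)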
I need to "extract" some data contained in an Iterable[MyObject] (it was a RDD[MyObject] before a groupBy).
My initial RDD[MyObject] :
|-----------|---------|----------|
| startCity | endCity | Customer |
|-----------|---------|----------|
| Paris | London | ID | Age |
| | |----|-----|
| | | 1 | 1 |
| | |----|-----|
| | | 2 | 1 |
| | |----|-----|
| | | 3 | 50 |
|-----------|---------|----------|
| Paris | London | ID | Age |
| | |----|-----|
| | | 5 | 40 |
| | |----|-----|
| | | 6 | 41 |
| | |----|-----|
| | | 7 | 2 |
|-----------|---------|----|-----|
| New-York | Paris | ID | Age |
| | |----|-----|
| | | 9 | 15 |
| | |----|-----|
| | | 10| 16 |
| | |----|-----|
| | | 11| 46 |
|-----------|---------|----|-----|
| New-York | Paris | ID | Age |
| | |----|-----|
| | | 13| 7 |
| | |----|-----|
| | | 14| 9 |
| | |----|-----|
| | | 15| 60 |
|-----------|---------|----|-----|
| Barcelona | London | ID | Age |
| | |----|-----|
| | | 17| 66 |
| | |----|-----|
| | | 18| 53 |
| | |----|-----|
| | | 19| 11 |
|-----------|---------|----|-----|
I need to count them by age range, grouped by (startCity, endCity).
The final result should be:
|-----------|---------|-------------|
| startCity | endCity | Customer |
|-----------|---------|-------------|
| Paris | London | Range| Count|
| | |------|------|
| | |0-2 | 3 |
| | |------|------|
| | |3-18 | 0 |
| | |------|------|
| | |19-99 | 3 |
|-----------|---------|-------------|
| New-York | Paris | Range| Count|
| | |------|------|
| | |0-2 | 0 |
| | |------|------|
| | |3-18 | 3 |
| | |------|------|
| | |19-99 | 2 |
|-----------|---------|-------------|
| Barcelona | London | Range| Count|
| | |------|------|
| | |0-2 | 0 |
| | |------|------|
| | |3-18 | 1 |
| | |------|------|
| | |19-99 | 2 |
|-----------|---------|-------------|
At the moment I'm doing this by counting the same data 3 times (first with the 0-2 range, then 10-20, then 21-99).
Like:
// given an ite: Iterable[MyObject]
ite.count(x => x.age match {
  case Some(age) => age >= 0 && age < 2
  case None      => false
})
It works and gives me an Int, but I think it is not efficient at all since I have to count many times. What's the best way to do this, please?
Thanks
EDIT: The Customer object is a case class.
def computeRange(age: Int) =
  if (age <= 2)
    "0-2"
  else if (age <= 10)
    "2-10"
  // etc, you get the idea
Then, with an RDD of case class MyObject(id : String, age : Int)
rdd
  .map(x => computeRange(x.age) -> 1)
  .reduceByKey(_ + _)
Edit:
If you need to group by some columns, you can do it this way, provided that you have an RDD[(SomeColumns, Iterable[MyObject])]. The following lines would give you a map that associates each "range" to its number of occurrences.
def computeMapOfOccurrences(list: Iterable[MyObject]): Map[String, Int] =
  list
    .map(_.age)
    .map(computeRange)
    .groupBy(x => x)
    .mapValues(_.size)

val result1 = rdd
  .mapValues(computeMapOfOccurrences(_))
And if you need to flatten your data, you can write:
val result2 = result1
  .flatMapValues(_.toSeq)
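One hedged caveat on this approach: the groupBy-based count only produces entries for the ranges that actually occur in a group, whereas the desired output also lists ranges with a count of 0. Here is a small sketch of how those could be filled in; allRanges is a name introduced here for illustration and must list every label computeRange can return:

// Ensure every range label appears, with 0 for ranges that never occur in a group.
val allRanges = Seq("0-2", "2-10" /* ..., the remaining labels returned by computeRange */)

val result3 = result1.mapValues { counts =>
  allRanges.map(r => r -> counts.getOrElse(r, 0))
}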
Assuming that your Customer object is a case class, as below:
case class Customer(ID: Int, Age: Int)
And your RDD[MyObject] is an RDD of a case class, as below:
case class MyObject(startCity: String, endCity: String, customer: List[Customer])
So, using the above case classes, the input that you have in table format would look like this:
MyObject(Paris,London,List(Customer(1,1), Customer(2,1), Customer(3,50)))
MyObject(Paris,London,List(Customer(5,40), Customer(6,41), Customer(7,2)))
MyObject(New-York,Paris,List(Customer(9,15), Customer(10,16), Customer(11,46)))
MyObject(New-York,Paris,List(Customer(13,7), Customer(14,9), Customer(15,60)))
MyObject(Barcelona,London,List(Customer(17,66), Customer(18,53), Customer(19,11)))
You've also mentioned that after grouping you have an Iterable[MyObject], which is equivalent to the step below:
val groupedRDD = rdd.groupBy(myobject => (myobject.startCity, myobject.endCity))
// groupedRDD: org.apache.spark.rdd.RDD[((String, String), Iterable[MyObject])] = ShuffledRDD[2] at groupBy at worksheetTest.sc:23
The next step is to use mapValues to iterate through the Iterable[MyObject], count the ages belonging to each range, and finally convert to the output you require, as below:
val finalResult = groupedRDD.mapValues(x => {
  val rangeAge = Map("0-2" -> 0, "3-18" -> 0, "19-99" -> 0)
  val list = x.flatMap(y => y.customer.map(z => z.Age)).toList
  updateCounts(list, rangeAge).map(x => CustomerOut(x._1, x._2)).toList
})
where updateCounts is a recursive function
def updateCounts(ageList: List[Int], map: Map[String, Int]): Map[String, Int] = ageList match {
  case head :: tail =>
    if (head >= 0 && head < 3) {
      updateCounts(tail, map ++ Map("0-2" -> (map("0-2") + 1)))
    } else if (head >= 3 && head < 19) {
      updateCounts(tail, map ++ Map("3-18" -> (map("3-18") + 1)))
    } else {
      updateCounts(tail, map ++ Map("19-99" -> (map("19-99") + 1)))
    }
  case Nil => map
}
and CustomerOut is another case class
case class CustomerOut(Range: String, Count: Int)
so the finalResult is as below
((Barcelona,London),List(CustomerOut(0-2,0), CustomerOut(3-18,1), CustomerOut(19-99,2)))
((New-York,Paris),List(CustomerOut(0-2,0), CustomerOut(3-18,4), CustomerOut(19-99,2)))
((Paris,London),List(CustomerOut(0-2,3), CustomerOut(3-18,0), CustomerOut(19-99,3)))
I have, in a pyspark dataframe, a column with values 1, -1 and 0, representing the events "startup", "shutdown" and "other" of an engine. I want to build a column with the state of the engine, with 1 when the engine is on and 0 when it's off, something like:
+---+-----+-----+
|seq|event|state|
+---+-----+-----+
| 1 | 1 | 1 |
| 2 | 0 | 1 |
| 3 | -1 | 0 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |
| 6 | 1 | 1 |
| 7 | -1 | 0 |
+---+-----+-----+
If 1s and -1s are always alternated, this can easily be done with a window function such as
sum('event').over(Window.orderBy('seq'))
It can happen, though, that I have some spurious 1s or -1s, which I want to ignore if I am already in state 1 or 0, respectively. I want thus to be able to do something like:
+---+-----+-----+
|seq|event|state|
+---+-----+-----+
| 1 | 1 | 1 |
| 2 | 0 | 1 |
| 3 | 1 | 1 |
| 4 | -1 | 0 |
| 5 | 0 | 0 |
| 6 | -1 | 0 |
| 7 | 1 | 1 |
+---+-----+-----+
That would require a "saturated" sum function that never goes above 1 or below 0, or some other approach which I am not able to imagine at the moment.
Does somebody have any ideas?
You can achieve the desired result by using the last function to fill forward the most recent state change.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = (spark.createDataFrame([
(1, 1),
(2, 0),
(3, -1),
(4, 0)
], ["seq", "event"]))
w = Window.orderBy('seq')
#replace zeros with nulls so they can be ignored easily.
df = df.withColumn('helperCol',F.when(df.event != 0,df.event))
#fill statechanges forward in a new column.
df = df.withColumn('state',F.last(df.helperCol,ignorenulls=True).over(w))
#replace -1 values with 0
df = df.replace(-1,0,['state'])
df.show()
This produces:
+---+-----+---------+-----+
|seq|event|helperCol|state|
+---+-----+---------+-----+
| 1| 1| 1| 1|
| 2| 0| null| 1|
| 3| -1| -1| 0|
| 4| 0| null| 0|
+---+-----+---------+-----+
helperCol need not be added to the dataframe; I have only included it to make the process more readable.