I'm trying to find the max of a column grouped by Spark partition ID, but I get the wrong value when applying the max function. Here is the code:
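import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, max, spark_partition_id}
// uuid() is a user-defined helper (not shown) that yields a unique column name
val partitionCol = uuid()
val localRankCol = "test"
// Tag each row with the ID of the Spark partition it currently sits in
val taggedDF = df.withColumn(partitionCol, spark_partition_id())
// Rank rows within each partition ID, ordered by sortExprs
val windowSpec = Window.partitionBy(partitionCol).orderBy(sortExprs: _*)
val rankDF = taggedDF.withColumn(localRankCol, dense_rank().over(windowSpec))
// Global max of the rank column across all rows
val rankRangeDF = rankDF.agg(max(localRankCol))
rankRangeDF.show(false)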
sortExprs applies an ascending sort on the sales column.
The result with some dummy data is shown below (partitionCol is the 5th column):
+--------------+------+-----+---------------------------------+--------------------------------+----+
|title |region|sales|r6bea781150fa46e3a0ed761758a50dea|5683151561af407282380e6cf25f87b5|test|
+--------------+------+-----+---------------------------------+--------------------------------+----+
|Die Hard |US |100.0|1 |0 |1 |
|Rambo |US |100.0|1 |0 |1 |
|Die Hard |AU |200.0|1 |0 |2 |
|House of Cards|EU |400.0|1 |0 |3 |
|Summer Break |US |400.0|1 |0 |3 |
|Rambo |EU |100.0|1 |1 |1 |
|Summer Break |APAC |200.0|1 |1 |2 |
|Rambo |APAC |300.0|1 |1 |3 |
|House of Cards|US |500.0|1 |1 |4 |
+--------------+------+-----+---------------------------------+--------------------------------+----+
+---------+
|max(test)|
+---------+
|5 |
+---------+
The "test" column has a maximum value of 4, but 5 is returned.
Create a unique run ID and append it to the output DataFrame each time the Spark Scala code runs
Below is my output DataFrame. I want to add one more column for the run ID. Can anyone help?
+--------+-------------------------------+---+
|order_id|Diff |id |
+--------+-------------------------------+---+
|12 |order_status |1 |
|1 |order_customer_id order_status |1 |
|68885 |New row in DataFrame 2 |1 |
|68886 |New row in DataFrame 2 |1 |
|2 |order_customer_id |1 |
|12 |order_status |2 |
|1 |order_customer_id order_status |2 |
|68885 |New row in DataFrame 2 |2 |
|68886 |New row in DataFrame 2 |2 |
|2 |order_customer_id |2 |
+--------+-------------------------------+---+
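One possible approach, sketched in plain Python rather than Spark: generate a single ID when the run starts and attach that same value to every output row. The names add_run_id and runid are assumptions for illustration. In Spark the analogous step would be adding a constant column with withColumn and lit, using a UUID string created once on the driver.

```python
import uuid

def add_run_id(rows, run_id=None):
    """Return copies of the rows with the same 'runid' value added to each."""
    # Generate the ID once per run so every row shares it
    run_id = run_id or uuid.uuid4().hex
    return [dict(row, runid=run_id) for row in rows]

# A few rows taken from the output table above
output_rows = [
    {"order_id": 12, "Diff": "order_status", "id": 1},
    {"order_id": 1, "Diff": "order_customer_id order_status", "id": 1},
    {"order_id": 68885, "Diff": "New row in DataFrame 2", "id": 1},
]

tagged = add_run_id(output_rows)
print(tagged[0]["runid"])
```

Calling add_run_id again in a later run produces a different ID, so rows from different runs can be told apart.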
I need to filter a dataframe with the below criteria.
I have 2 columns 4Wheel(Subaru, Toyota, GM, null/empty) and 2Wheel(Yamaha, Harley, Indian, null/empty).
I have to filter on 4Wheel with values (Subaru, Toyota); if 4Wheel is empty/null, then filter on 2Wheel with values (Yamaha, Harley) instead.
I couldn't find this type of filtering in different examples. I am new to spark/scala, so could not get enough idea to implement this.
Thanks,
Barun.
You can use Spark SQL's built-in function when to check whether a column is null or empty, and filter accordingly:
import org.apache.spark.sql.functions.{col, when}
dataframe.filter(
  // If 4Wheel is null or empty, decide on 2Wheel; otherwise decide on 4Wheel
  when(col("4Wheel").isNull || col("4Wheel").equalTo(""),
    col("2Wheel").isin("Yamaha", "Harley")
  ).otherwise(
    col("4Wheel").isin("Subaru", "Toyota")
  )
)
So if you have the following input:
+---+------+------+
|id |4Wheel|2Wheel|
+---+------+------+
|1 |Toyota|null |
|2 |Subaru|null |
|3 |GM |null |
|4 |null |Yamaha|
|5 | |Yamaha|
|6 |null |Harley|
|7 | |Harley|
|8 |null |Indian|
|9 | |Indian|
|10 |null |null |
+---+------+------+
You get the following filtered output:
+---+------+------+
|id |4Wheel|2Wheel|
+---+------+------+
|1 |Toyota|null |
|2 |Subaru|null |
|4 |null |Yamaha|
|5 | |Yamaha|
|6 |null |Harley|
|7 | |Harley|
+---+------+------+
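To make the branching explicit, here is the same rule as a small plain-Python sketch (not Spark code), run against the input table above. The helper name keep is hypothetical. Python's None stands in for SQL null; for row 10 both approaches drop the row.

```python
# (id, 4Wheel, 2Wheel) rows copied from the input table above
rows = [
    (1, "Toyota", None), (2, "Subaru", None), (3, "GM", None),
    (4, None, "Yamaha"), (5, "", "Yamaha"), (6, None, "Harley"),
    (7, "", "Harley"), (8, None, "Indian"), (9, "", "Indian"),
    (10, None, None),
]

def keep(four_wheel, two_wheel):
    """Mirror of when(...).otherwise(...): fall back to 2Wheel if 4Wheel is missing."""
    if four_wheel is None or four_wheel == "":
        return two_wheel in ("Yamaha", "Harley")
    return four_wheel in ("Subaru", "Toyota")

filtered_ids = [row_id for row_id, fw, tw in rows if keep(fw, tw)]
print(filtered_ids)  # [1, 2, 4, 5, 6, 7]
```

The surviving ids match the filtered table shown above.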
I have a dataset in which I want to replace the result column based on the least value of quantity within each (id, date) group.
id,date,quantity,result
1,2016-01-01,245,1
1,2016-01-01,345,3
1,2016-01-01,123,2
1,2016-01-02,120,5
2,2016-01-01,567,1
2,2016-01-01,568,1
2,2016-01-02,453,1
Here is the expected output: within each (id, date) group, every row's result is replaced by the result of the row with the least quantity. The ordering of the rows doesn't matter.
id,date,quantity,result
1,2016-01-01,245,2
1,2016-01-01,345,2
1,2016-01-01,123,2
1,2016-01-02,120,5
2,2016-01-01,567,1
2,2016-01-01,568,1
2,2016-01-02,453,1
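Use a Window partitioned by id and date. First keep result only on the row whose quantity equals the group minimum (every other row becomes null), then spread that value across the whole group with max, which ignores nulls.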
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy('id', 'date')
df.withColumn('result', f.when(f.col('quantity') == f.min('quantity').over(w), f.col('result'))) \
.withColumn('result', f.max('result').over(w)).show(10, False)
+---+----------+--------+------+
|id |date |quantity|result|
+---+----------+--------+------+
|1 |2016-01-02|120 |5 |
|1 |2016-01-01|245 |2 |
|1 |2016-01-01|345 |2 |
|1 |2016-01-01|123 |2 |
|2 |2016-01-02|453 |1 |
|2 |2016-01-01|567 |1 |
|2 |2016-01-01|568 |1 |
+---+----------+--------+------+
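The same two steps can be traced in plain Python (a sketch of the logic only, not PySpark). The function name replace_result_with_min_quantity_result is made up for illustration.

```python
from collections import defaultdict

# (id, date, quantity, result) rows copied from the question
rows = [
    (1, "2016-01-01", 245, 1),
    (1, "2016-01-01", 345, 3),
    (1, "2016-01-01", 123, 2),
    (1, "2016-01-02", 120, 5),
    (2, "2016-01-01", 567, 1),
    (2, "2016-01-01", 568, 1),
    (2, "2016-01-02", 453, 1),
]

def replace_result_with_min_quantity_result(rows):
    # Group rows by (id, date), like Window.partitionBy('id', 'date')
    groups = defaultdict(list)
    for row in rows:
        groups[(row[0], row[1])].append(row)

    replaced = []
    for row_id, date, quantity, _ in rows:
        group = groups[(row_id, date)]
        min_quantity = min(r[2] for r in group)
        # Keep result only where quantity is the group minimum, then take the max
        new_result = max(r[3] for r in group if r[2] == min_quantity)
        replaced.append((row_id, date, quantity, new_result))
    return replaced

replaced = replace_result_with_min_quantity_result(rows)
print([r[3] for r in replaced])  # [2, 2, 2, 5, 1, 1, 1]
```

If several rows tie for the minimum quantity, taking the max picks the largest of their results, which is also what the windowed max does.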
I'm looking to select unique values from each column of a table and output the results into a single table. Take the following example table:
+------+---------------+------+---------------+
|col1 |col2 |col_3 |col_4 |
+------+---------------+------+---------------+
|1 |"apples" |A |"red" |
|2 |"bananas" |A |"red" |
|3 |"apples" |B |"blue" |
+------+---------------+------+---------------+
The ideal output would be:
+------+---------------+------+---------------+
|col1 |col2 |col_3 |col_4 |
+------+---------------+------+---------------+
|1 |"apples" |A |"red" |
|2 |"bananas" |B |"blue" |
|3 | | | |
+------+---------------+------+---------------+
Thank you!
Edit: My actual table has many more columns, so ideally the SQL query can be done via a SELECT * as opposed to 4 individual select queries within the FROM statement.
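To pin down the expected result, here is a minimal plain-Python sketch of the transformation (not SQL, and not tied to any particular database): collect the distinct values of each column in first-seen order, then line them up row by row, padding the shorter columns. The function name distinct_per_column is hypothetical. It loops over whatever columns the table has, so nothing is written out per column by hand.

```python
from itertools import zip_longest

# The example table, stored column-wise
table = {
    "col1": [1, 2, 3],
    "col2": ["apples", "bananas", "apples"],
    "col_3": ["A", "A", "B"],
    "col_4": ["red", "red", "blue"],
}

def distinct_per_column(table):
    """Return (column names, rows) with each column reduced to its distinct values."""
    # dict.fromkeys drops duplicates while keeping first-seen order
    distinct = {name: list(dict.fromkeys(values)) for name, values in table.items()}
    # Pad shorter columns with None so every output row has all columns
    rows = list(zip_longest(*distinct.values(), fillvalue=None))
    return list(distinct), rows

columns, rows = distinct_per_column(table)
for row in rows:
    print(row)
```

The values in one output row are unrelated to each other; each column is simply an independent list of distinct values, as in the ideal output above.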