How to select all columns in a Spark SQL query with an aggregation function - Scala

Hi, I am new to Spark SQL.
I have a query like this:
val highvalueresult = averageDF.select($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID", $"RSSI_Weight_avg").groupBy("tagShortID", "Timestamp").agg(max($"RSSI_Weight_avg").alias("maxAvgValue"))
This prints only 3 columns.
tagShortID,Timestamp,maxAvgValue
But I want to display all the columns along with this new column. Any help or suggestion would be appreciated.

One alternative, usually a good fit for your specific case, is to use window functions, because it avoids the need to join back with the original data:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window.partitionBy("tagShortID", "Timestamp")
val result = averageDF.withColumn("maxAvgValue", max($"RSSI_Weight_avg").over(windowSpec))
You can find a good article explaining the window functions functionality in Spark.
Please note that it requires either Spark 2+ or a HiveContext in Spark versions 1.4–1.6.

Here is a simple example with the column names you have.
This is your averageDF DataFrame with dummy data:
+----------+---------+---------------+---------+--------+---------------+
|tagShortID|Timestamp|ListenerShortID|rootOrgID|subOrgID|RSSI_Weight_avg|
+----------+---------+---------------+---------+--------+---------------+
|         2|        2|              2|        2|       2|              2|
|         2|        2|              2|        2|       2|              2|
|         2|        2|              2|        2|       2|              2|
|         1|        1|              1|        1|       1|              1|
|         1|        1|              1|        1|       1|              1|
+----------+---------+---------------+---------+--------+---------------+
After you do a groupBy and aggregation
val highvalueresult = averageDF.select($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID", $"RSSI_Weight_avg").groupBy("tagShortID", "Timestamp").agg(max($"RSSI_Weight_avg").alias("maxAvgValue"))
This does not return all the columns you selected, because after a groupBy and aggregation only the grouping columns and the aggregated column are returned, as below:
+----------+---------+-----------+
|tagShortID|Timestamp|maxAvgValue|
+----------+---------+-----------+
|         2|        2|          2|
|         1|        1|          1|
+----------+---------+-----------+
To get all the columns, you need to join these two DataFrames:
averageDF.join(highvalueresult, Seq("tagShortID", "Timestamp"))
and the final result will be
+----------+---------+---------------+---------+--------+---------------+-----------+
|tagShortID|Timestamp|ListenerShortID|rootOrgID|subOrgID|RSSI_Weight_avg|maxAvgValue|
+----------+---------+---------------+---------+--------+---------------+-----------+
|         2|        2|              2|        2|       2|              2|          2|
|         2|        2|              2|        2|       2|              2|          2|
|         2|        2|              2|        2|       2|              2|          2|
|         1|        1|              1|        1|       1|              1|          1|
|         1|        1|              1|        1|       1|              1|          1|
+----------+---------+---------------+---------+--------+---------------+-----------+
I hope this clears your confusion.

Related

How to do a groupBy by a given column but still keep all the rows of the original DataFrame?

I want to do a groupBy and aggregate by a given column in PySpark but I still want to keep all the rows from the original DataFrame.
For example, let's say we have the following DataFrame and we want to take the max of the "value" column; then we would get the result below.
Original DataFrame
+--+-----+
|id|value|
+--+-----+
| 1| 1|
| 1| 2|
| 2| 3|
| 2| 4|
+--+-----+
Result
+--+-----+---+
|id|value|max|
+--+-----+---+
| 1| 1| 2|
| 1| 2| 2|
| 2| 3| 4|
| 2| 4| 4|
+--+-----+---+
You can do it simply by joining the aggregated DataFrame with the original DataFrame:
import pyspark.sql.functions as F

aggregated_df = (
    df
    .groupby('id')
    .agg(F.max('value').alias('max'))
)
max_value_df = (
    df
    .join(aggregated_df, 'id')
)
Use a window function:
from pyspark.sql import Window
import pyspark.sql.functions as F

df.withColumn('max', F.max('value').over(Window.partitionBy('id'))).show()
+---+-----+---+
| id|value|max|
+---+-----+---+
| 1| 1| 2|
| 1| 2| 2|
| 2| 3| 4|
| 2| 4| 4|
+---+-----+---+

Efficient code for imputation of negative values using pyspark

I am working on a data set which contains item-wise, date-wise information about the quantity sold of that particular item. However, there are some negative values in the 'quantity sold' column which I intend to impute. The logic here would be to replace such negative values with the mode of the quantity sold for each item at the date level. I have already computed the count of each distinct value of the quantity sold and obtained the maximum quantity sold of a particular item on each given date. However, I am unable to find a function that would replace the negative values with the max qty sold for each item * date combination. I am relatively new to PySpark. Which would be the best approach to use in this case?
Based on the limited information you provided, you can try something like this:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql import Window
sc = SparkContext.getOrCreate()
sql = SQLContext(sc)
input_list = [
(1,10,"2019-11-07")
,(1,5,"2019-11-07")
,(1,5,"2019-11-07")
,(1,5,"2019-11-08")
,(1,6,"2019-11-08")
,(1,7,"2019-11-09")
,(1,7,"2019-11-09")
,(1,8,"2019-11-09")
,(1,8,"2019-11-09")
,(1,8,"2019-11-09")
,(1,-10,"2019-11-09")
,(2,10,"2019-11-07")
,(2,3,"2019-11-07")
,(2,9,"2019-11-07")
,(2,9,"2019-11-08")
,(2,-10,"2019-11-08")
,(2,5,"2019-11-09")
,(2,5,"2019-11-09")
,(2,2,"2019-11-09")
,(2,2,"2019-11-09")
,(2,2,"2019-11-09")
,(2,-10,"2019-11-09")
]
sparkDF = sql.createDataFrame(input_list,['product_id','sold_qty','date'])
sparkDF = sparkDF.withColumn('date',F.to_date(F.col('date'), 'yyyy-MM-dd'))
Mode Implementation
#### Mode Implementation
modeDF = sparkDF.groupBy('date', 'sold_qty')\
.agg(F.count(F.col('sold_qty')).alias('mode_count'))\
.select(F.col('date'),F.col('sold_qty').alias('mode_sold_qty'),F.col('mode_count'))
window = Window.partitionBy("date").orderBy(F.desc("mode_count"))
#### Keeping only the most frequent value per date
modeDF = modeDF\
.withColumn('order', F.row_number().over(window))\
.where(F.col('order') == 1)
Merging back with Base DataFrame to impute
sparkDF = sparkDF.join(modeDF
,sparkDF['date'] == modeDF['date']
,'inner'
).select(sparkDF['*'],modeDF['mode_sold_qty'],modeDF['mode_count'])
sparkDF = sparkDF.withColumn('imputed_sold_qty',F.when(F.col('sold_qty') < 0,F.col('mode_sold_qty'))\
.otherwise(F.col('sold_qty')))
>>> sparkDF.show(100)
+----------+--------+----------+-------------+----------+----------------+
|product_id|sold_qty|      date|mode_sold_qty|mode_count|imputed_sold_qty|
+----------+--------+----------+-------------+----------+----------------+
|         1|       7|2019-11-09|            2|         3|               7|
|         1|       7|2019-11-09|            2|         3|               7|
|         1|       8|2019-11-09|            2|         3|               8|
|         1|       8|2019-11-09|            2|         3|               8|
|         1|       8|2019-11-09|            2|         3|               8|
|         1|     -10|2019-11-09|            2|         3|               2|
|         2|       5|2019-11-09|            2|         3|               5|
|         2|       5|2019-11-09|            2|         3|               5|
|         2|       2|2019-11-09|            2|         3|               2|
|         2|       2|2019-11-09|            2|         3|               2|
|         2|       2|2019-11-09|            2|         3|               2|
|         2|     -10|2019-11-09|            2|         3|               2|
|         1|       5|2019-11-08|            9|         1|               5|
|         1|       6|2019-11-08|            9|         1|               6|
|         2|       9|2019-11-08|            9|         1|               9|
|         2|     -10|2019-11-08|            9|         1|               9|
|         1|      10|2019-11-07|            5|         2|              10|
|         1|       5|2019-11-07|            5|         2|               5|
|         1|       5|2019-11-07|            5|         2|               5|
|         2|      10|2019-11-07|            5|         2|              10|
|         2|       3|2019-11-07|            5|         2|               3|
|         2|       9|2019-11-07|            5|         2|               9|
+----------+--------+----------+-------------+----------+----------------+
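Since the question asks for the mode for each item * date combination, the same steps can be adapted to also group by product_id. This is only a minimal sketch, reusing sparkDF, F, Window and the column names from above, and starting again from the original (pre-join) sparkDF:
modeDF = sparkDF.groupBy('product_id', 'date', 'sold_qty')\
    .agg(F.count(F.col('sold_qty')).alias('mode_count'))\
    .select(F.col('product_id'), F.col('date'),
            F.col('sold_qty').alias('mode_sold_qty'), F.col('mode_count'))
#### Keep only the most frequent value per product and date
window = Window.partitionBy('product_id', 'date').orderBy(F.desc('mode_count'))
modeDF = modeDF.withColumn('order', F.row_number().over(window))\
    .where(F.col('order') == 1).drop('order')
#### Join on both keys, then impute exactly as above
sparkDF = sparkDF.join(modeDF, ['product_id', 'date'], 'inner')\
    .withColumn('imputed_sold_qty',
                F.when(F.col('sold_qty') < 0, F.col('mode_sold_qty'))
                 .otherwise(F.col('sold_qty')))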

Pyspark: How to group rows into N groups?

I am performing a df.groupBy().apply() in my PySpark script and want to create a custom column that groups all my rows into N groups (as even as possible, so rows/N per group). That way, I can ensure the number of groups sent to my UDF function every time the script runs.
How can I do this using pyspark?
If you need an exact split, then you need windowing
import pyspark.sql.functions as F
from pyspark.sql import Window
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5)],schema=['col1','col2','col3','col4'])
w=Window.orderBy(F.lit(1))
tst_mod = tst.withColumn("id",(F.row_number().over(w))%3) # 3 is the number of groups in this example
tst_mod.show()
+----+----+----+----+---+
|col1|col2|col3|col4| id|
+----+----+----+----+---+
| 5| 3| 7| 5| 1|
| 3| 2| 5| 4| 2|
| 5| 3| 7| 5| 0|
| 7| 3| 9| 5| 1|
| 1| 2| 3| 4| 2|
| 7| 3| 9| 5| 0|
| 1| 2| 3| 4| 1|
| 5| 3| 7| 5| 2|
| 7| 3| 9| 5| 0|
| 1| 2| 3| 4| 1|
| 3| 2| 5| 4| 2|
| 5| 3| 7| 5| 0|
| 3| 2| 5| 4| 1|
| 7| 3| 9| 5| 2|
| 3| 2| 5| 4| 0|
| 1| 2| 3| 4| 1|
+----+----+----+----+---+
tst_mod.groupby('id').count().show()
+---+-----+
| id|count|
+---+-----+
| 1| 6|
| 2| 5|
| 0| 5|
+---+-----+
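As a side note, an exact N-way split can also be expressed with the built-in ntile window function, which assigns each row to one of N buckets as evenly as possible. A minimal sketch, reusing tst and the window w from above (3 is just the example number of groups, and tst_ntile is an illustrative name):
# ntile numbers the buckets 1..N instead of 0..N-1
tst_ntile = tst.withColumn("id", F.ntile(3).over(w))
tst_ntile.groupby("id").count().show()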
If you are okay with an approximately even random split, then you can try a technique called salting:
import pyspark.sql.functions as F
from pyspark.sql import Window
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5)],schema=['col1','col2','col3','col4'])
tst_salt= tst.withColumn("salt", F.rand(seed=10)*3)
If you floor the salt column to an integer and group by it, you will get groups whose sizes are roughly equal (the split is uniformly random, not exact).
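A minimal sketch of that last step, reusing the tst_salt DataFrame from above (grp and tst_groups are just illustrative names):
# Group sizes are only approximately equal, since F.rand draws independently for every row
tst_groups = tst_salt.withColumn("grp", F.floor(F.col("salt")).cast("int"))
tst_groups.groupby("grp").count().show()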

How to union 2 dataframe without creating additional rows?

I have 2 DataFrames and I wanted to do .filter($"item" === "a") while keeping all the "S/N" values.
I tried the following, but it ended up with additional rows when I use union. Is there a way to union 2 DataFrames without creating additional rows?
var DF1 = Seq(
("1","a",2),
("2","a",3),
("3","b",3),
("4","b",4),
("5","a",2)).
toDF("S/N","item", "value")
var DF2 = Seq(
("1","a",2),
("2","a",3),
("3","b",3),
("4","b",4),
("5","a",2)).
toDF("S/N","item", "value")
DF2 = DF2.filter($"item"==="a")
val DF3 = DF1.withColumn("item",lit(0)).withColumn("value",lit(0))
DF1.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 3| b| 3|
| 4| b| 4|
| 5| a| 2|
+---+----+-----+
DF2.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 5| a| 2|
+---+----+-----+
DF3.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| 0| 0|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 0|
| 5| 0| 0|
+---+----+-----+
DF2.union(DF3).show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 5| a| 2|
| 1| 0| 0|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 0|
| 5| 0| 0|
+---+----+-----+
Left outer join your S/Ns with filtered dataframe, then use coalesce to get rid of nulls:
val DF3 = DF1.select("S/N")
val DF4 = (DF3.join(DF2, Seq("S/N"), joinType="leftouter")
.withColumn("item", coalesce($"item", lit(0)))
.withColumn("value", coalesce($"value", lit(0))))
DF4.show
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 3| 0| 0|
| 4| 0| 0|
| 5| a| 2|
+---+----+-----+

Filtering rows based on subsequent row values in spark dataframe [duplicate]

I have a dataframe(spark):
id value
3 0
3 1
3 0
4 1
4 0
4 0
I want to create a new dataframe:
3 0
3 1
4 1
I need to remove all the rows after the first value of 1 for each id. I tried with window functions on a Spark DataFrame (Scala), but couldn't find a solution. It seems I am going in the wrong direction.
I am looking for a solution in Scala. Thanks.
Output using monotonically_increasing_id
scala> val data = Seq((3,0),(3,1),(3,0),(4,1),(4,0),(4,0)).toDF("id", "value")
data: org.apache.spark.sql.DataFrame = [id: int, value: int]
scala> val dataWithIndex = data.withColumn("idx", monotonically_increasing_id())
scala> val minIdx = dataWithIndex.filter($"value" === 1).groupBy($"id").agg(min($"idx")).toDF("r_id", "min_idx")
minIdx: org.apache.spark.sql.DataFrame = [r_id: int, min_idx: bigint]
scala> dataWithIndex.join(minIdx,($"r_id" === $"id") && ($"idx" <= $"min_idx")).select($"id", $"value").show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 4| 1|
+---+-----+
This solution won't work if we apply a sort transformation to the original DataFrame, because then monotonically_increasing_id() is generated based on the original DF rather than the sorted DF. I had missed that requirement before.
All suggestions are welcome.
One way is to use monotonically_increasing_id() and a self-join:
val data = Seq((3,0),(3,1),(3,0),(4,1),(4,0),(4,0)).toDF("id", "value")
data.show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 3| 0|
| 4| 1|
| 4| 0|
| 4| 0|
+---+-----+
Now we generate a column named idx with an increasing Long:
val dataWithIndex = data.withColumn("idx", monotonically_increasing_id())
// dataWithIndex.cache()
Now we get the min(idx) for each id where value = 1:
val minIdx = dataWithIndex
.filter($"value" === 1)
.groupBy($"id")
.agg(min($"idx"))
.toDF("r_id", "min_idx")
Now we join the min(idx) back to the original DataFrame:
dataWithIndex.join(
minIdx,
($"r_id" === $"id") && ($"idx" <= $"min_idx")
).select($"id", $"value").show
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 3| 1|
| 4| 1|
+---+-----+
Note: monotonically_increasing_id() generates its value based on the partition of the row. This value may change each time dataWithIndex is re-evaluated. In my code above, because of lazy evaluation, it's only when I call the final show that monotonically_increasing_id() is evaluated.
If you want to force the value to stay the same, for example so you can use show to evaluate the above step-by-step, uncomment this line above:
// dataWithIndex.cache()
Hi, I found a solution using Window and a self-join.
val data = Seq((3,0,2),(3,1,3),(3,0,1),(4,1,6),(4,0,5),(4,0,4),(1,0,7),(1,1,8),(1,0,9),(2,1,10),(2,0,11),(2,0,12)).toDF("id", "value","sorted")
scala> data.show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 1| 3|
| 3| 0| 1|
| 4| 1| 6|
| 4| 0| 5|
| 4| 0| 4|
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
val sort_df=data.sort($"sorted")
scala> sort_df.show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 1|
| 3| 0| 2|
| 3| 1| 3|
| 4| 0| 4|
| 4| 0| 5|
| 4| 1| 6|
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
val window = Window.partitionBy("id").orderBy("sorted")
val sort_idx = sort_df.select($"*", row_number().over(window).as("count_index"))
val minIdx = sort_idx.filter($"value" === 1).groupBy("id").agg(min("count_index")).toDF("idx", "min_idx")
val result_id = sort_idx.join(minIdx, ($"id" === $"idx") && ($"count_index" <= $"min_idx"))
result_id.show
+---+-----+------+-----------+---+-------+
| id|value|sorted|count_index|idx|min_idx|
+---+-----+------+-----------+---+-------+
|  1|    0|     7|          1|  1|      2|
|  1|    1|     8|          2|  1|      2|
|  2|    1|    10|          1|  2|      1|
|  3|    0|     1|          1|  3|      3|
|  3|    0|     2|          2|  3|      3|
|  3|    1|     3|          3|  3|      3|
|  4|    0|     4|          1|  4|      3|
|  4|    0|     5|          2|  4|      3|
|  4|    1|     6|          3|  4|      3|
+---+-----+------+-----------+---+-------+
Still looking for a more optimized solution. Thanks.
You can simply use groupBy like this
val df2 = df1.groupBy("id","value").count().select("id","value")
Here your df1 is
id value
3 0
3 1
3 0
4 1
4 0
4 0
And the resulting DataFrame, df2, is:
id value
3 0
3 1
4 1
4 0
Use the isin method and filter as below:
val data = Seq((3,0,2),(3,1,3),(3,0,1),(4,1,6),(4,0,5),(4,0,4),(1,0,7),(1,1,8),(1,0,9),(2,1,10),(2,0,11),(2,0,12)).toDF("id", "value","sorted")
val idFilter = List(1, 2)
data.filter($"id".isin(idFilter:_*)).show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 1| 0| 7|
| 1| 1| 8|
| 1| 0| 9|
| 2| 1| 10|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+
For example, filter based on value:
val valFilter = List(0)
data.filter($"value".isin(valFilter:_*)).show
+---+-----+------+
| id|value|sorted|
+---+-----+------+
| 3| 0| 2|
| 3| 0| 1|
| 4| 0| 5|
| 4| 0| 4|
| 1| 0| 7|
| 1| 0| 9|
| 2| 0| 11|
| 2| 0| 12|
+---+-----+------+