Efficient code for imputation of negative values using pyspark - pyspark

I am working on a data set that contains item-wise, date-wise information about the quantity sold of each item. However, there are some negative values in the 'quantity sold' column which I intend to impute. The logic would be to replace each negative value with the mode of the quantity sold for that item on that date. I have already computed the count of each distinct value of the quantity sold and obtained the quantity with the maximum count (the mode) for each item on each date. However, I am unable to find a function that would replace the negative values with that value for each item * date combination. I am relatively new to PySpark. What would be the best approach in this case?

Based on the limited information you provided, you can try something like this:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql import Window
sc = SparkContext.getOrCreate()
sql = SQLContext(sc)
input_list = [
(1,10,"2019-11-07")
,(1,5,"2019-11-07")
,(1,5,"2019-11-07")
,(1,5,"2019-11-08")
,(1,6,"2019-11-08")
,(1,7,"2019-11-09")
,(1,7,"2019-11-09")
,(1,8,"2019-11-09")
,(1,8,"2019-11-09")
,(1,8,"2019-11-09")
,(1,-10,"2019-11-09")
,(2,10,"2019-11-07")
,(2,3,"2019-11-07")
,(2,9,"2019-11-07")
,(2,9,"2019-11-08")
,(2,-10,"2019-11-08")
,(2,5,"2019-11-09")
,(2,5,"2019-11-09")
,(2,2,"2019-11-09")
,(2,2,"2019-11-09")
,(2,2,"2019-11-09")
,(2,-10,"2019-11-09")
]
sparkDF = sql.createDataFrame(input_list,['product_id','sold_qty','date'])
sparkDF = sparkDF.withColumn('date',F.to_date(F.col('date'), 'yyyy-MM-dd'))
Mode Implementation
#### Count how often each sold_qty occurs on each date
modeDF = sparkDF.groupBy('date', 'sold_qty')\
.agg(F.count(F.col('sold_qty')).alias('mode_count'))\
.select(F.col('date'),F.col('sold_qty').alias('mode_sold_qty'),F.col('mode_count'))
window = Window.partitionBy("date").orderBy(F.desc("mode_count"))
#### Filtering out the most occurred value
modeDF = modeDF\
.withColumn('order', F.row_number().over(window))\
.where(F.col('order') == 1)
Merging back with Base DataFrame to impute
sparkDF = sparkDF.join(modeDF
,sparkDF['date'] == modeDF['date']
,'inner'
).select(sparkDF['*'],modeDF['mode_sold_qty'],modeDF['mode_count'])
sparkDF = sparkDF.withColumn('imputed_sold_qty',F.when(F.col('sold_qty') < 0,F.col('mode_sold_qty'))\
.otherwise(F.col('sold_qty')))
>>> sparkDF.show(100)
+----------+--------+----------+-------------+----------+----------------+
|product_id|sold_qty| date|mode_sold_qty|mode_count|imputed_sold_qty|
+----------+--------+----------+-------------+----------+----------------+
| 1| 7|2019-11-09| 2| 3| 7|
| 1| 7|2019-11-09| 2| 3| 7|
| 1| 8|2019-11-09| 2| 3| 8|
| 1| 8|2019-11-09| 2| 3| 8|
| 1| 8|2019-11-09| 2| 3| 8|
| 1| -10|2019-11-09| 2| 3| 2|
| 2| 5|2019-11-09| 2| 3| 5|
| 2| 5|2019-11-09| 2| 3| 5|
| 2| 2|2019-11-09| 2| 3| 2|
| 2| 2|2019-11-09| 2| 3| 2|
| 2| 2|2019-11-09| 2| 3| 2|
| 2| -10|2019-11-09| 2| 3| 2|
| 1| 5|2019-11-08| 9| 1| 5|
| 1| 6|2019-11-08| 9| 1| 6|
| 2| 9|2019-11-08| 9| 1| 9|
| 2| -10|2019-11-08| 9| 1| 9|
| 1| 10|2019-11-07| 5| 2| 10|
| 1| 5|2019-11-07| 5| 2| 5|
| 1| 5|2019-11-07| 5| 2| 5|
| 2| 10|2019-11-07| 5| 2| 10|
| 2| 3|2019-11-07| 5| 2| 3|
| 2| 9|2019-11-07| 5| 2| 9|
+----------+--------+----------+-------------+----------+----------------+
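Note that the code above takes the mode per date only, whereas the question asks for it per item and date. As a cross-check on a small sample collected to the driver, the per-(product_id, date) logic might be sketched in plain Python as below. The helper name impute_negatives is hypothetical, and two choices are assumptions rather than part of the original answer: negative quantities are excluded when counting, and ties are broken by first occurrence.

```python
from collections import Counter, defaultdict

def impute_negatives(rows):
    """rows: list of (product_id, sold_qty, date) tuples.
    Returns the rows with each negative sold_qty replaced by the mode of the
    non-negative quantities for the same (product_id, date)."""
    counts = defaultdict(Counter)
    for product_id, qty, date in rows:
        if qty >= 0:  # assumption: negatives never count towards the mode
            counts[(product_id, date)][qty] += 1

    imputed = []
    for product_id, qty, date in rows:
        if qty < 0:
            counter = counts[(product_id, date)]
            # A group holding only negative values has no mode; use None there
            qty = counter.most_common(1)[0][0] if counter else None
        imputed.append((product_id, qty, date))
    return imputed

# The 2019-11-09 rows from the sample data above
sample = [
    (1, 7, "2019-11-09"), (1, 7, "2019-11-09"),
    (1, 8, "2019-11-09"), (1, 8, "2019-11-09"), (1, 8, "2019-11-09"),
    (1, -10, "2019-11-09"),
    (2, 5, "2019-11-09"), (2, 5, "2019-11-09"),
    (2, 2, "2019-11-09"), (2, 2, "2019-11-09"), (2, 2, "2019-11-09"),
    (2, -10, "2019-11-09"),
]
result = impute_negatives(sample)
print(result[5])   # (1, 8, '2019-11-09')
print(result[11])  # (2, 2, '2019-11-09')
```

In the Spark version, the same effect would presumably come from adding product_id to the groupBy, the window partition and the join condition.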

Related

create a new column to increment value when value resets to 1 in another column in pyspark

In a PySpark DataFrame, consider a column like [1,2,3,4,1,2,1,1,2,3,1,2,1,1,2].
I want to create a new column that increments each time the value resets to 1.
Expected output is [1,1,1,1,2,2,3,4,4,4,5,5,6,7,7].
I am a bit new to PySpark; if anyone can help me, it would be great.
I have written the logic as below:
def sequence(row_num):
    results = [1, ]
    flag = 1
    for col in range(0, len(row_num)-1):
        if row_num[col][0]>=row_num[col+1][0]:
            flag+=1
        results.append(flag)
    return results
but I am not able to pass a column through a UDF. Please help me with this.
Your Dataframe:
df = spark.createDataFrame(
[
('1','a'),
('2','b'),
('3','c'),
('4','d'),
('1','e'),
('2','f'),
('1','g'),
('1','h'),
('2','i'),
('3','j'),
('1','k'),
('2','l'),
('1','m'),
('1','n'),
('2','o')
], ['group','label']
)
+-----+-----+
|group|label|
+-----+-----+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
| 1| e|
| 2| f|
| 1| g|
| 1| h|
| 2| i|
| 3| j|
| 1| k|
| 2| l|
| 1| m|
| 1| n|
| 2| o|
+-----+-----+
You can create a flag and use a window function to calculate the cumulative sum. There is no need to use a UDF:
from pyspark.sql import Window as W
from pyspark.sql import functions as F
w = W.partitionBy().orderBy('label').rowsBetween(W.unboundedPreceding, 0)
df\
.withColumn('Flag', F.when(F.col('group') == 1, 1).otherwise(0))\
.withColumn('Output', F.sum('Flag').over(w))\
.show()
+-----+-----+----+------+
|group|label|Flag|Output|
+-----+-----+----+------+
| 1| a| 1| 1|
| 2| b| 0| 1|
| 3| c| 0| 1|
| 4| d| 0| 1|
| 1| e| 1| 2|
| 2| f| 0| 2|
| 1| g| 1| 3|
| 1| h| 1| 4|
| 2| i| 0| 4|
| 3| j| 0| 4|
| 1| k| 1| 5|
| 2| l| 0| 5|
| 1| m| 1| 6|
| 1| n| 1| 7|
| 2| o| 0| 7|
+-----+-----+----+------+
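The flag-plus-cumulative-sum idea can be checked without a Spark session. The following is a minimal plain-Python sketch; reset_counter is a hypothetical name used only for illustration. It applies the same rule to the column from the question, with the list order standing in for the window's orderBy('label').

```python
from itertools import accumulate

def reset_counter(values, reset_value=1):
    """Running count of how many times reset_value has been seen so far."""
    # Flag is 1 where the value resets, 0 otherwise (the 'Flag' column above)
    flags = [1 if v == reset_value else 0 for v in values]
    # Cumulative sum of the flags (the 'Output' column above)
    return list(accumulate(flags))

group = [1, 2, 3, 4, 1, 2, 1, 1, 2, 3, 1, 2, 1, 1, 2]
output = reset_counter(group)
print(output)  # [1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7, 7]
```

The result only makes sense if the rows have a well-defined order, which is why the Spark version needs a column to order the window by.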

Pyspark: How to group rows into N groups?

I am performing a df.groupBy().apply() in my pyspark script and want to create a custom column that groups all my rows into N groups (as even as possible, so rows/N rows each). That way, I can control the number of groups sent to my UDF every time the script runs.
How can I do this using pyspark?
If you need an exact split, then you need windowing:
import pyspark.sql.functions as F
from pyspark.sql import Window
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5)],schema=['col1','col2','col3','col4'])
w=Window.orderBy(F.lit(1))
tst_mod = tst.withColumn("id",(F.row_number().over(w))%3) # 3 is the number of groups in this example
tst_mod.show()
+----+----+----+----+---+
|col1|col2|col3|col4| id|
+----+----+----+----+---+
| 5| 3| 7| 5| 1|
| 3| 2| 5| 4| 2|
| 5| 3| 7| 5| 0|
| 7| 3| 9| 5| 1|
| 1| 2| 3| 4| 2|
| 7| 3| 9| 5| 0|
| 1| 2| 3| 4| 1|
| 5| 3| 7| 5| 2|
| 7| 3| 9| 5| 0|
| 1| 2| 3| 4| 1|
| 3| 2| 5| 4| 2|
| 5| 3| 7| 5| 0|
| 3| 2| 5| 4| 1|
| 7| 3| 9| 5| 2|
| 3| 2| 5| 4| 0|
| 1| 2| 3| 4| 1|
+----+----+----+----+---+
tst_mod.groupby('id').count().show()
+---+-----+
| id|count|
+---+-----+
| 1| 6|
| 2| 5|
| 0| 5|
+---+-----+
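Why the modulo of a row number gives an (almost) exact split can be seen with a small plain-Python sketch. The name assign_groups is hypothetical; it numbers rows from 1, as row_number() does, and takes the remainder.

```python
from collections import Counter

def assign_groups(n_rows, n_groups):
    """Group id for each of n_rows rows, using row_number % n_groups."""
    # Row numbers start at 1, matching row_number() over a window
    return [row_number % n_groups for row_number in range(1, n_rows + 1)]

ids = assign_groups(16, 3)
sizes = Counter(ids)
print(sorted(sizes.items()))  # [(0, 5), (1, 6), (2, 5)]
```

Group sizes differ by at most one, which matches the counts shown above for 16 rows and 3 groups.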
If you are OK with an approximately even (uniformly random) split, then you can try a technique called salting
import pyspark.sql.functions as F
from pyspark.sql import Window
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5)],schema=['col1','col2','col3','col4'])
# floor turns the uniform random number in [0, 3) into an integer group id 0, 1 or 2
tst_salt= tst.withColumn("salt", F.floor(F.rand(seed=10)*3))
If you groupby the column salt, you will get groups of roughly equal size, since the salt is uniformly distributed

How to replace for loop in python with map transformation in pyspark where we want to compare previous row and current row with multiple conditions

I've hit a roadblock while applying a map function on a PySpark DataFrame and need help getting past it.
The actual problem is more complicated, but let me simplify it with the example below, which uses dictionaries and a for loop; I need the equivalent solution in PySpark.
Here is the Python code on dummy data. I want the same in PySpark, as a map transformation, with when clauses over a window, or any other way.
Problem: I have a PySpark DataFrame whose column names are the keys of the dictionaries below, and I want to add/modify the section column with the same logic that the for loop applies in this example.
# xyz stands for placeholder values
record=[
    {'id':xyz,'SN':xyz,'miles':xyz,'feet':xyz,'MP':xyz,'section':xyz},
    {'id':xyz,'SN':xyz,'miles':xyz,'feet':xyz,'MP':xyz,'section':xyz},
    {'id':xyz,'SN':xyz,'miles':xyz,'feet':xyz,'MP':xyz,'section':xyz}
]
last_rec=None
section=0
for cur_rec in record:
    if last_rec is not None:
        if (last_rec['id'] != cur_rec['id'] or last_rec['SN'] != cur_rec['SN']):
            section+=1
        elif (last_rec['miles'] == cur_rec['miles'] and abs(last_rec['feet'] - cur_rec['feet']) > 1):
            section+=1
        elif (last_rec['MP'] == 555 and cur_rec['MP'] != 555):
            section+=1
        elif (abs(last_rec['miles'] - cur_rec['miles']) > 1):
            section+=1
    cur_rec['section']= section
    last_rec = cur_rec
Your window function is a cumulative sum of a boolean variable.
Let's start with a sample dataframe:
import numpy as np
record_df = spark.createDataFrame(
[list(x) for x in zip(*[np.random.randint(0, 10, 100).tolist() for _ in range(5)])],
['id', 'SN', 'miles', 'feet', 'MP'])
record_df.show()
+---+---+-----+----+---+
| id| SN|miles|feet| MP|
+---+---+-----+----+---+
| 9| 5| 7| 5| 1|
| 0| 6| 3| 7| 5|
| 8| 2| 7| 3| 5|
| 0| 2| 6| 5| 8|
| 0| 8| 9| 1| 5|
| 8| 5| 1| 6| 0|
| 0| 3| 9| 0| 3|
| 6| 4| 9| 0| 8|
| 5| 8| 8| 1| 0|
| 3| 0| 9| 9| 9|
| 1| 1| 2| 7| 0|
| 1| 3| 7| 7| 6|
| 4| 9| 5| 5| 5|
| 3| 6| 0| 0| 0|
| 5| 5| 5| 9| 3|
| 8| 3| 7| 8| 1|
| 7| 1| 3| 1| 8|
| 3| 1| 5| 2| 5|
| 6| 2| 3| 5| 6|
| 9| 4| 5| 9| 1|
+---+---+-----+----+---+
A cumulative sum is an ordered window function, therefore we'll need to use monotonically_increasing_id to give an order to our rows:
import pyspark.sql.functions as psf
record_df = record_df.withColumn(
'rn',
psf.monotonically_increasing_id())
For the boolean variable we'll need to use lag:
from pyspark.sql import Window
w = Window.orderBy('rn')
record_df = record_df.select(
record_df.columns
+ [psf.lag(c).over(w).alias('prev_' + c) for c in ['id', 'SN', 'miles', 'feet', 'MP']])
Since all the conditions yield the same result on section, they can be combined into a single or clause:
clause = (psf.col("prev_id") != psf.col("id")) | (psf.col("prev_SN") != psf.col("SN")) \
| ((psf.col("prev_miles") == psf.col("miles")) & (psf.abs(psf.col("prev_feet") - psf.col("feet")) > 1)) \
| ((psf.col("prev_MP") == 555) & (psf.col("MP") != 555)) \
| (psf.abs(psf.col("prev_miles") - psf.col("miles")) > 1)
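# The first row has no previous row, so the clause is null there; count it as 0
record_df = record_df.withColumn("tmp", psf.coalesce(clause.cast('int'), psf.lit(0)))
And finally for the cumulative sum: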
record_df = record_df.withColumn("section", psf.sum("tmp").over(w))
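To check the window-based result on a small sample, the same rule can be written as a plain-Python reference. This is only a sketch: assign_sections is a hypothetical name and the sample rows are made up for illustration. It folds the four conditions into one boolean and keeps a running total, mirroring the lag plus cumulative sum above.

```python
def assign_sections(records):
    """records: list of dicts with keys id, SN, miles, feet, MP, in row order.
    Returns the section number for each record."""
    sections = []
    section = 0
    prev = None
    for cur in records:
        if prev is not None:
            # Same four conditions as the or clause built from the lag columns
            changed = (
                prev["id"] != cur["id"]
                or prev["SN"] != cur["SN"]
                or (prev["miles"] == cur["miles"]
                    and abs(prev["feet"] - cur["feet"]) > 1)
                or (prev["MP"] == 555 and cur["MP"] != 555)
                or abs(prev["miles"] - cur["miles"]) > 1
            )
            section += int(changed)  # running sum of the boolean flag
        sections.append(section)
        prev = cur
    return sections

rows = [
    {"id": 1, "SN": 1, "miles": 10, "feet": 0, "MP": 555},
    {"id": 1, "SN": 1, "miles": 10, "feet": 1, "MP": 555},  # nothing changes
    {"id": 1, "SN": 1, "miles": 10, "feet": 5, "MP": 555},  # feet jump > 1
    {"id": 1, "SN": 1, "miles": 11, "feet": 5, "MP": 100},  # MP leaves 555
    {"id": 2, "SN": 1, "miles": 11, "feet": 5, "MP": 100},  # id changes
    {"id": 2, "SN": 1, "miles": 14, "feet": 5, "MP": 100},  # miles jump > 1
]
print(assign_sections(rows))  # [0, 0, 1, 2, 3, 4]
```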

How to select all columns in spark sql query in aggregation function

Hi, I am new to Spark SQL.
I have a query like this:
val highvalueresult = averageDF.select($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID", $"RSSI_Weight_avg").groupBy("tagShortID", "Timestamp").agg(max($"RSSI_Weight_avg").alias("maxAvgValue"))
This prints only 3 columns.
tagShortID,Timestamp,maxAvgValue
But I want to display all the columns along with this column. Any help or suggestion would be appreciated.
One alternative, usually good for your specific case, is to use window functions, because this avoids the need to join with the original data:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window.partitionBy("tagShortID", "Timestamp")
val result = averageDF.withColumn("maxAvgValue", max($"RSSI_Weight_avg").over(windowSpec))
You can find here a good article explaining the Window Functions functionality in Spark.
Please note that it requires either Spark 2+ or a HiveContext in Spark versions 1.4 ~ 1.6.
Here is the simple example with the column name you have
This is your averageDF dataframe with dummy data
+----------+---------+---------------+---------+--------+---------------+
|tagShortID|Timestamp|ListenerShortID|rootOrgID|subOrgID|RSSI_Weight_avg|
+----------+---------+---------------+---------+--------+---------------+
| 2| 2| 2| 2| 2| 2|
| 2| 2| 2| 2| 2| 2|
| 2| 2| 2| 2| 2| 2|
| 1| 1| 1| 1| 1| 1|
| 1| 1| 1| 1| 1| 1|
+----------+---------+---------------+---------+--------+---------------+
After you apply a groupBy and aggregation
val highvalueresult = averageDF.select($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID", $"RSSI_Weight_avg").groupBy("tagShortID", "Timestamp").agg(max($"RSSI_Weight_avg").alias("maxAvgValue"))
This did not return all the columns you selected because, after groupBy and aggregation, only the grouping columns and the result column are returned, as below
+----------+---------+-----------+
|tagShortID|Timestamp|maxAvgValue|
+----------+---------+-----------+
| 2| 2| 2|
| 1| 1| 1|
+----------+---------+-----------+
To get all the columns you need to join these two dataframes
averageDF.join(highvalueresult, Seq("tagShortID", "Timestamp"))
and the final result will be
+----------+---------+---------------+---------+--------+---------------+-----------+
|tagShortID|Timestamp|ListenerShortID|rootOrgID|subOrgID|RSSI_Weight_avg|maxAvgValue|
+----------+---------+---------------+---------+--------+---------------+-----------+
| 2| 2| 2| 2| 2| 2| 2|
| 2| 2| 2| 2| 2| 2| 2|
| 2| 2| 2| 2| 2| 2| 2|
| 1| 1| 1| 1| 1| 1| 1|
| 1| 1| 1| 1| 1| 1| 1|
+----------+---------+---------------+---------+--------+---------------+-----------+
I hope this clears your confusion.
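Both answers (the window function and the join) produce the same thing: every original row, with its group's maximum attached. A plain-Python sketch of that result is below. The helper with_group_max and the sample values are made up for illustration and are not taken from the question.

```python
def with_group_max(rows, key_fields, value_field, out_field):
    """Return a copy of each row with the maximum of value_field
    over the row's group (defined by key_fields) added as out_field."""
    maxima = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        maxima[key] = max(row[value_field], maxima.get(key, row[value_field]))
    # Every input row is kept; only one extra field is added
    return [
        {**row, out_field: maxima[tuple(row[f] for f in key_fields)]}
        for row in rows
    ]

rows = [
    {"tagShortID": 2, "Timestamp": 2, "ListenerShortID": 7, "RSSI_Weight_avg": 3},
    {"tagShortID": 2, "Timestamp": 2, "ListenerShortID": 8, "RSSI_Weight_avg": 5},
    {"tagShortID": 1, "Timestamp": 1, "ListenerShortID": 9, "RSSI_Weight_avg": 1},
]
result = with_group_max(
    rows, ["tagShortID", "Timestamp"], "RSSI_Weight_avg", "maxAvgValue"
)
print([r["maxAvgValue"] for r in result])  # [5, 5, 1]
```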

spark sql conditional maximum

I have a tall table which contains up to 10 values per group. How can I transform this table into a wide format, i.e. add 2 columns that hold the value smaller than or equal to a threshold?
I want to find the maximum per group, but it needs to be smaller than a specified value, like:
min(max('value1), lit(5)).over(Window.partitionBy('grouping))
However, min() will only work for a column and not for the Scala value which is returned from the inner function?
The problem can be described as:
Seq(Seq(1,2,3,4).max,5).min
Where Seq(1,2,3,4) is returned by the window.
How can I formulate this in spark sql?
edit
E.g.
+--------+-----+---------+
|grouping|value|something|
+--------+-----+---------+
| 1| 1| first|
| 1| 2| second|
| 1| 3| third|
| 1| 4| fourth|
| 1| 7| 7|
| 1| 10| 10|
| 21| 1| first|
| 21| 2| second|
| 21| 3| third|
+--------+-----+---------+
created by
case class MyThing(grouping: Int, value:Int, something:String)
val df = Seq(MyThing(1,1, "first"), MyThing(1,2, "second"), MyThing(1,3, "third"),MyThing(1,4, "fourth"),MyThing(1,7, "7"), MyThing(1,10, "10"),
MyThing(21,1, "first"), MyThing(21,2, "second"), MyThing(21,3, "third")).toDS
Where
df
.withColumn("somethingAtLeast5AndMaximum5", max('value).over(Window.partitionBy('grouping)))
.withColumn("somethingAtLeast6OupToThereshold2", max('value).over(Window.partitionBy('grouping)))
.show
returns
+--------+-----+---------+----------------------------+-------------------------+
|grouping|value|something|somethingAtLeast5AndMaximum5| somethingAtLeast6OupToThereshold2 |
+--------+-----+---------+----------------------------+-------------------------+
| 1| 1| first| 10| 10|
| 1| 2| second| 10| 10|
| 1| 3| third| 10| 10|
| 1| 4| fourth| 10| 10|
| 1| 7| 7| 10| 10|
| 1| 10| 10| 10| 10|
| 21| 1| first| 3| 3|
| 21| 2| second| 3| 3|
| 21| 3| third| 3| 3|
+--------+-----+---------+----------------------------+-------------------------+
Instead, I would rather formulate:
lit(Seq(max('value).asInstanceOf[java.lang.Integer], new java.lang.Integer(2)).min).over(Window.partitionBy('grouping))
But that does not work as max('value) is not a scalar value.
Expected output should look like
+--------+-----+---------+----------------------------+-------------------------+
|grouping|value|something|somethingAtLeast5AndMaximum5|somethingAtLeast6OupToThereshold2|
+--------+-----+---------+----------------------------+-------------------------+
| 1| 4| fourth| 4| 7|
| 21| 1| first| 3| NULL|
+--------+-----+---------+----------------------------+-------------------------+
edit2
When trying a pivot
df.groupBy("grouping").pivot("value").agg(first('something)).show
+--------+-----+------+-----+------+----+----+
|grouping| 1| 2| 3| 4| 7| 10|
+--------+-----+------+-----+------+----+----+
| 1|first|second|third|fourth| 7| 10|
| 21|first|second|third| null|null|null|
+--------+-----+------+-----+------+----+----+
The second part of the problem remains that some columns might not exist or be null.
When aggregating to arrays:
df.groupBy("grouping").agg(collect_list('value).alias("value"), collect_list('something).alias("something"))
+--------+-------------------+--------------------+
|grouping| value| something|
+--------+-------------------+--------------------+
| 1|[1, 2, 3, 4, 7, 10]|[first, second, t...|
| 21| [1, 2, 3]|[first, second, t...|
+--------+-------------------+--------------------+
The values are already next to each other, but the right values need to be selected. This is probably still more efficient than a join or window function.
It would be easier to do this in two separate steps: calculate max over Window, and then use when...otherwise on the result to produce min(x, 5):
df.withColumn("tmp", max('value1).over(Window.partitionBy('grouping)))
.withColumn("result", when('tmp > lit(5), 5).otherwise('tmp))
EDIT: some example data to clarify this:
val df = Seq((1, 1),(1, 2),(1, 3),(1, 4),(2, 7),(2, 8))
.toDF("grouping", "value1")
df.withColumn("result", max('value1).over(Window.partitionBy('grouping)))
.withColumn("result", when('result > lit(5), 5).otherwise('result))
.show()
// +--------+------+------+
// |grouping|value1|result|
// +--------+------+------+
// | 1| 1| 4| // 4, because Seq(Seq(1,2,3,4).max,5).min = 4
// | 1| 2| 4|
// | 1| 3| 4|
// | 1| 4| 4|
// | 2| 7| 5| // 5, because Seq(Seq(7,8).max,5).min = 5
// | 2| 8| 5|
// +--------+------+------+
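The two-step logic (group maximum first, then cap it) can be verified against the example data with a few lines of plain Python. The function name capped_group_max is hypothetical. Spark also has a least column function, which may allow the cap to be written as a single expression instead of when...otherwise.

```python
def capped_group_max(rows, cap):
    """rows: list of (grouping, value) pairs.
    Returns {grouping: min(max(values in group), cap)}."""
    group_max = {}
    for grouping, value in rows:
        # Step 1: maximum per group (the window max above)
        group_max[grouping] = max(value, group_max.get(grouping, value))
    # Step 2: cap each group's maximum at the threshold
    return {g: min(m, cap) for g, m in group_max.items()}

rows = [(1, 1), (1, 2), (1, 3), (1, 4), (2, 7), (2, 8)]
print(capped_group_max(rows, cap=5))  # {1: 4, 2: 5}
```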