Fill the data based on the date ranges in spark - scala

I have a sample dataset, and I want to fill in the missing dates with quantity 0, based on a start date and end date (from 2016-01-01 to 2016-01-08).
id,date,quantity
1,2016-01-03,10
1,2016-01-04,20
1,2016-01-06,30
1,2016-01-07,20
2,2016-01-02,10
2,2016-01-03,10
2,2016-01-04,20
2,2016-01-06,20
2,2016-01-07,20
Based on the solution from the link below, I was able to implement a partial solution.
Filling missing dates in spark dataframe column
Can someone please suggest how to fill every date from start_date to end_date for each id, including the dates before the first and after the last existing record?
id,date,quantity
1,2016-01-01,0
1,2016-01-02,0
1,2016-01-03,10
1,2016-01-04,20
1,2016-01-05,0
1,2016-01-06,30
1,2016-01-07,20
1,2016-01-08,0
2,2016-01-01,0
2,2016-01-02,10
2,2016-01-03,10
2,2016-01-04,20
2,2016-01-05,0
2,2016-01-06,20
2,2016-01-07,20
2,2016-01-08,0

From Spark 2.4 onwards, use the sequence function to generate all dates from 2016-01-01 to 2016-01-08.
Then join to the original dataframe and use coalesce to get the quantity and id values. (A Scala sketch of the same approach follows the PySpark update below.)
Example:
from pyspark.sql.functions import *

df1=spark.sql("select explode(sequence(date('2016-01-01'),date('2016-01-08'),INTERVAL 1 DAY)) as date").\
withColumn("quantity",lit(0)).\
withColumn("id",lit(1))
df1.show()
#+----------+--------+---+
#| date|quantity| id|
#+----------+--------+---+
#|2016-01-01| 0| 1|
#|2016-01-02| 0| 1|
#|2016-01-03| 0| 1|
#|2016-01-04| 0| 1|
#|2016-01-05| 0| 1|
#|2016-01-06| 0| 1|
#|2016-01-07| 0| 1|
#|2016-01-08| 0| 1|
#+----------+--------+---+
df.show()
#+---+----------+--------+
#| id| date|quantity|
#+---+----------+--------+
#| 1|2016-01-03| 10|
#| 1|2016-01-04| 20|
#| 1|2016-01-06| 30|
#| 1|2016-01-07| 20|
#+---+----------+--------+
from pyspark.sql.functions import *
from pyspark.sql.types import *
exprs=['date']+[coalesce(col(f"df.{f}"),col(f"df1.{f}")).alias(f) for f in df1.columns if f not in ['date']]
df1.\
alias("df1").\
join(df.alias("df"),['date'],'left').\
select(*exprs).\
orderBy("date").\
show()
#+----------+--------+---+
#| date|quantity| id|
#+----------+--------+---+
#|2016-01-01| 0| 1|
#|2016-01-02| 0| 1|
#|2016-01-03| 10| 1|
#|2016-01-04| 20| 1|
#|2016-01-05| 0| 1|
#|2016-01-06| 30| 1|
#|2016-01-07| 20| 1|
#|2016-01-08| 0| 1|
#+----------+--------+---+
Update:
df=spark.createDataFrame([(1,'2016-01-03',10),(1,'2016-01-04',20),(1,'2016-01-06',30),(1,'2016-01-07',20),(2,'2016-01-02',10),(2,'2016-01-03',10),(2,'2016-01-04',20),(2,'2016-01-06',20),(2,'2016-01-07',20)],["id","date","quantity"])
df1=df.selectExpr("id").distinct().selectExpr("id","explode(sequence(date('2016-01-01'),date('2016-01-08'),INTERVAL 1 DAY)) as date").withColumn("quantity",lit(0))
from pyspark.sql.functions import *
from pyspark.sql.types import *
exprs=[coalesce(col(f"df.{f}"),col(f"df1.{f}")).alias(f) for f in df1.columns]
df2=df1.alias("df1").join(df.alias("df"),(col("df1.date") == col("df.date"))& (col("df1.id") == col("df.id")),'left').select(*exprs)
df2.orderBy("id","date").show()
#+---+----------+--------+
#| id| date|quantity|
#+---+----------+--------+
#| 1|2016-01-01| 0|
#| 1|2016-01-02| 0|
#| 1|2016-01-03| 10|
#| 1|2016-01-04| 20|
#| 1|2016-01-05| 0|
#| 1|2016-01-06| 30|
#| 1|2016-01-07| 20|
#| 1|2016-01-08| 0|
#| 2|2016-01-01| 0|
#| 2|2016-01-02| 10|
#| 2|2016-01-03| 10|
#| 2|2016-01-04| 20|
#| 2|2016-01-05| 0|
#| 2|2016-01-06| 20|
#| 2|2016-01-07| 20|
#| 2|2016-01-08| 0|
#+---+----------+--------+

If you want to explicitly fill the null values with 0, then fillna also works well.
import pyspark.sql.functions as f

df2 = df.select('id').distinct() \
    .withColumn('date', f.expr('''explode(sequence(date('2016-01-01'), date('2016-01-08'), INTERVAL 1 days)) as date'''))
df2.join(df, ['id', 'date'], 'left').fillna(0).orderBy('id', 'date').show(20, False)
+---+----------+--------+
|id |date |quantity|
+---+----------+--------+
|1 |2016-01-01|0 |
|1 |2016-01-02|0 |
|1 |2016-01-03|10 |
|1 |2016-01-04|20 |
|1 |2016-01-05|0 |
|1 |2016-01-06|30 |
|1 |2016-01-07|20 |
|1 |2016-01-08|0 |
|2 |2016-01-01|0 |
|2 |2016-01-02|10 |
|2 |2016-01-03|10 |
|2 |2016-01-04|20 |
|2 |2016-01-05|0 |
|2 |2016-01-06|20 |
|2 |2016-01-07|20 |
|2 |2016-01-08|0 |
+---+----------+--------+

Related

Get % of rows that have a unique value by id

I have a pyspark dataframe that looks like this
import pandas as pd

spark.createDataFrame(
    pd.DataFrame({'ch_id': [1, 1, 1, 1, 1,
                            2, 2, 2, 2],
                  'e_id': [0, 0, 1, 2, 2,
                           0, 0, 1, 1],
                  'seg': ['h', 's', 's', 'a', 's',
                          'h', 's', 's', 'h']})
).show()
+-----+----+---+
|ch_id|e_id|seg|
+-----+----+---+
| 1| 0| h|
| 1| 0| s|
| 1| 1| s|
| 1| 2| a|
| 1| 2| s|
| 2| 0| h|
| 2| 0| s|
| 2| 1| s|
| 2| 1| h|
+-----+----+---+
I would like, for every ch_id, to get:
the % of e_id values for which there is only one unique value of seg
The output would look like this:
+-----+-------+
|ch_id|%_major|
+-----+-------+
|    1|   66.6|
|    2|    0.0|
+-----+-------+
How could I achieve that in pyspark?

After joining two dataframes, pick all columns from one dataframe on the basis of the primary key

I have two dataframes, and I need to update the records in df1 based on the new updates available in df2, in pyspark.
DF1:
df1=spark.createDataFrame([(1,2),(2,3),(3,4)],["id","val1"])
+---+----+
| id|val1|
+---+----+
| 1| 2|
| 2| 3|
| 3| 4|
+---+----+
DF2:
df2=spark.createDataFrame([(1,4),(2,5)],["id","val1"])
+---+----+
| id|val1|
+---+----+
| 1| 4|
| 2| 5|
+---+----+
Then I'm trying to join the two dataframes:
join_con=(df1["id"] == df2["id"])
jdf=df1.join(df2,join_con,"left")
+---+----+----+----+
| id|val1| id|val1|
+---+----+----+----+
| 1| 2| 1| 4|
| 3| 4|null|null|
| 2| 3| 2| 5|
+---+----+----+----+
Now, I want to pick all columns from df2 if df2["id"] is not null, otherwise pick all columns of df1.
something like:
jdf.filter(df2["id"].isNull()).select(df1["*"]) \
   .union(jdf.filter(df2["id"].isNotNull()).select(df2["*"]))
so resultant DF can be:
+---+----+
| id|val1|
+---+----+
| 1| 4|
| 2| 5|
| 3| 4|
+---+----+
Can someone please help with this?
Your selection expression can take each column from df2 whenever the joined df2 row exists (its id is not null), and fall back to the df1 column otherwise.
from pyspark.sql import functions as F

df1 = spark.createDataFrame([(1, 2), (2, 3), (3, 4), (4, 1)], ["id", "val1"])
df2 = spark.createDataFrame([(1, 4), (2, 5), (4, None)], ["id", "val1"])
jdf = df1.join(df2, df1["id"] == df2["id"], "left")

selection_expr = [F.when(df2["id"].isNotNull(), df2[c]).otherwise(df1[c]).alias(c) for c in df2.columns]
jdf.select(selection_expr).show()
"""
+---+----+
| id|val1|
+---+----+
| 1| 4|
| 2| 5|
| 3| 4|
| 4|null|
+---+----+
"""
Try the coalesce function, as it returns the first non-null value.
from pyspark.sql.functions import coalesce

expr = zip(df2.columns, df1.columns)
e1 = [coalesce(df2[f[0]], df1[f[1]]).alias(f[0]) for f in expr]
jdf.select(*e1).show()
#+---+----+
#| id|val1|
#+---+----+
#| 1| 4|
#| 2| 5|
#| 3| 4|
#+---+----+

Set literal value over Window if condition suited Spark Scala

I need to check a condition over a window:
- If the column IND_DEF is 20, then I want to change the value of the premium column for the whole window (partition) to which this record belongs, and set it to 1.
My initial Dataframe looks like this:
+--------+----+-------+-----+-------+
|policyId|name|premium|state|IND_DEF|
+--------+----+-------+-----+-------+
| 1| BK| null| KT| 40|
| 1| AK| -31| null| 30|
| 1| VZ| null| IL| 20|
| 2| VK| 32| LI| 7|
| 2| CK| 25| YNZ| 10|
| 2| CK| 0| null| 5|
| 2| VK| 30| IL| 25|
+--------+----+-------+-----+-------+
And I want to achieve this:
+--------+----+-------+-----+-------+
|policyId|name|premium|state|IND_DEF|
+--------+----+-------+-----+-------+
| 1| BK| 1| KT| 40|
| 1| AK| 1| null| 30|
| 1| VZ| 1| IL| 20|
| 2| VK| 32| LI| 7|
| 2| CK| 25| YNZ| 10|
| 2| CK| 0| null| 5|
| 2| VK| 30| IL| 25|
+--------+----+-------+-----+-------+
I am trying the following code, but it does not work...
val df_946 = Seq[(Int, String, Integer, String, Int)](
  (1, "VZ", null, "IL", 20), (1, "AK", -31, null, 30), (1, "BK", null, "KT", 40),
  (2, "CK", 0, null, 5), (2, "CK", 25, "YNZ", 10), (2, "VK", 30, "IL", 25), (2, "VK", 32, "LI", 7)
).toDF("policyId", "name", "premium", "state", "IND_DEF").orderBy("policyId")

val winSpec = Window.partitionBy("policyId").orderBy("policyId")

val df_947 = df_946.withColumn("premium",
  when(col("IND_DEF") === 20, lit(1).over(winSpec)).otherwise(col("premium")))
You can generate an array of IND_DEF values via collect_list for each window partition and recreate column premium based on the array_contains condition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  (1, None, 40),
  (1, Some(-31), 30),
  (1, None, 20),
  (2, Some(32), 7),
  (2, Some(30), 10)
).toDF("policyId", "premium", "IND_DEF")

val win = Window.partitionBy($"policyId")

df.
  withColumn("indList", collect_list($"IND_DEF").over(win)).
  withColumn("premium", when(array_contains($"indList", 20), 1).otherwise($"premium")).
  drop($"indList").
  show
// +--------+-------+-------+
// |policyId|premium|IND_DEF|
// +--------+-------+-------+
// | 1| 1| 40|
// | 1| 1| 30|
// | 1| 1| 20|
// | 2| 32| 7|
// | 2| 30| 10|
// +--------+-------+-------+
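As a variation on the approach above (my own sketch, not part of the original answer), the same effect can be achieved without collecting the whole list, by flagging partitions that contain a 20 with a conditional max over the same window; this reuses df and win from the snippet above:
df.
  // has20 is 1 for every row whose policyId partition contains IND_DEF == 20
  withColumn("has20", max(when($"IND_DEF" === 20, 1).otherwise(0)).over(win)).
  withColumn("premium", when($"has20" === 1, lit(1)).otherwise($"premium")).
  drop("has20").
  show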

how to join dataframes with some similar values and multiple keys / scala

I am having trouble producing the following table. The first two tables are my source tables, which I would like to join; the third table is the result I would like to get.
I tried an outer join with the keys "ID" and "date", but the result is not the same as in this example. The problem is that some def_ values in each table have the same date, and I would like to get them in the same row.
I used the following join:
val df_result = df_1.join(df_2, Seq("ID", "date"), "outer")
df_1
+----+-----+-----------+
|ID |def_a| date |
+----+-----+-----------+
| 01| 1| 2019-01-31|
| 02| 1| 2019-12-31|
| 03| 1| 2019-11-30|
| 01| 1| 2019-10-31|
df_2
+----+-----+-----+-----------+
|ID |def_b|def_c|date |
+----+-----+-----+-----------+
| 01| 1| 0| 2017-01-31|
| 02| 1| 1| 2019-12-31|
| 03| 1| 1| 2018-11-30|
| 03| 0| 1| 2019-11-30|
| 01| 1| 1| 2018-09-30|
| 02| 1| 1| 2018-08-31|
| 01| 1| 1| 2018-07-31|
result
+----+-----+-----+-----+-----------+
|ID |def_a|def_b|def_c|date |
+----+-----+-----+-----+-----------+
| 01| 1| 0| 0| 2019-01-31|
| 02| 1| 1| 1| 2019-12-31|
| 03| 1| 0| 1| 2019-11-30|
| 01| 1| 0| 0| 2019-10-31|
| 01| 0| 1| 0| 2017-01-31|
| 03| 0| 1| 1| 2018-11-30|
| 01| 0| 1| 1| 2018-09-30|
| 02| 0| 1| 1| 2018-08-31|
| 01| 0| 1| 1| 2018-07-31|
I would be grateful for any help.
Hope the following code is helpful:
df_result
  .groupBy("ID", "date")
  .agg(
    max("def_a").as("def_a"),
    max("def_b").as("def_b"),
    max("def_c").as("def_c")
  )
  .na.fill(0)
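For reference, here is a self-contained sketch of the whole flow under the column names shown in the question (my own reconstruction, not the answerer's code). If (ID, date) is already unique within each source table, the outer join followed by na.fill(0) alone reproduces the expected result, and the groupBy/max step above is only needed to collapse duplicates:
import org.apache.spark.sql.functions._
import spark.implicits._

// toy frames mirroring the question's source tables
val df_1 = Seq(
  ("01", 1, "2019-01-31"), ("02", 1, "2019-12-31"),
  ("03", 1, "2019-11-30"), ("01", 1, "2019-10-31")
).toDF("ID", "def_a", "date")

val df_2 = Seq(
  ("01", 1, 0, "2017-01-31"), ("02", 1, 1, "2019-12-31"),
  ("03", 1, 1, "2018-11-30"), ("03", 0, 1, "2019-11-30"),
  ("01", 1, 1, "2018-09-30"), ("02", 1, 1, "2018-08-31"),
  ("01", 1, 1, "2018-07-31")
).toDF("ID", "def_b", "def_c", "date")

// outer join on both keys keeps every (ID, date) from either side;
// the nulls introduced for the missing def_ columns are then replaced with 0
val df_result = df_1.join(df_2, Seq("ID", "date"), "outer").na.fill(0)
df_result.orderBy("ID", "date").show()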

Pass Distinct value of one Dataframe into another Dataframe

I want to take the distinct values of a column from DataFrame A and pass them into DataFrame B's explode function to create repeated rows (in DataFrame B) for each distinct value.
distinctSet = targetDf.select('utilityId').distinct()
utilisationFrequencyTable = utilisationFrequencyTable.withColumn("utilityId", psf.explode(assign_utilityId()))
Function
assign_utilityId = psf.udf(
    lambda id: [x for x in id],
    ArrayType(LongType()))
How to pass distinctSet values to assign_utilityId
Update
+---------+
|utilityId|
+---------+
| 101|
| 101|
| 102|
+---------+
+-----+------+--------+
|index|status|timeSlot|
+-----+------+--------+
| 0| SUN| 0|
| 0| SUN| 1|
I want to take the unique values from DataFrame 1 and create a new column in DataFrame 2, like this:
+-----+------+--------+---------+
|index|status|timeSlot|utilityId|
+-----+------+--------+---------+
|    0|   SUN|       0|      101|
|    0|   SUN|       1|      101|
|    0|   SUN|       0|      102|
|    0|   SUN|       1|      102|
+-----+------+--------+---------+
We don't need a UDF for this. I have tried it with some sample input; please check:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1,),(2,),(3,),(2,),(3,)],['col1'])
>>> df.show()
+----+
|col1|
+----+
| 1|
| 2|
| 3|
| 2|
| 3|
+----+
>>> df1 = spark.createDataFrame([(1,2),(2,3),(3,4)],['col1','col2'])
>>> df1.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
| 2| 3|
| 3| 4|
+----+----+
>>> dist_val = df.select(F.collect_set('col1').alias('val')).first()['val']
>>> dist_val
[1, 2, 3]
>>> df1 = df1.withColumn('col3',F.array([F.lit(x) for x in dist_val]))
>>> df1.show()
+----+----+---------+
|col1|col2| col3|
+----+----+---------+
| 1| 2|[1, 2, 3]|
| 2| 3|[1, 2, 3]|
| 3| 4|[1, 2, 3]|
+----+----+---------+
>>> df1.select("*",F.explode('col3').alias('expl_col')).drop('col3').show()
+----+----+--------+
|col1|col2|expl_col|
+----+----+--------+
| 1| 2| 1|
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 1|
| 2| 3| 2|
| 2| 3| 3|
| 3| 4| 1|
| 3| 4| 2|
| 3| 4| 3|
+----+----+--------+
df = sqlContext.createDataFrame(sc.parallelize([(101,),(101,),(102,)]),['utilityId'])
df2 = sqlContext.createDataFrame(sc.parallelize([(0,'SUN',0),(0,'SUN',1)]),['index','status','timeSlot'])
rdf = df.distinct()
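# no join condition below, so this is a cross join: every (index, status, timeSlot) row is paired with every distinct utilityId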
>>> df2.join(rdf).show()
+-----+------+--------+---------+
|index|status|timeSlot|utilityId|
+-----+------+--------+---------+
| 0| SUN| 0| 101|
| 0| SUN| 0| 102|
| 0| SUN| 1| 101|
| 0| SUN| 1| 102|
+-----+------+--------+---------+