Fill the data based on the date ranges in spark - scala

I have a sample dataset, and I want to fill in the missing dates with quantity 0, based on a start date and end date (from 2016-01-01 to 2016-01-08).
id,date,quantity
1,2016-01-03,10
1,2016-01-04,20
1,2016-01-06,30
1,2016-01-07,20
2,2016-01-02,10
2,2016-01-03,10
2,2016-01-04,20
2,2016-01-06,20
2,2016-01-07,20
Based on the solution from the link below, I was able to implement a partial solution.
Filling missing dates in spark dataframe column
Can someone please suggest how to fill every date from start_date to end_date for each id, including the dates before the first and after the last existing record?
id,date,quantity
1,2016-01-01,0
1,2016-01-02,0
1,2016-01-03,10
1,2016-01-04,20
1,2016-01-05,0
1,2016-01-06,30
1,2016-01-07,20
1,2016-01-08,0
2,2016-01-01,0
2,2016-01-02,10
2,2016-01-03,10
2,2016-01-04,20
2,2016-01-05,0
2,2016-01-06,20
2,2016-01-07,20
2,2016-01-08,0

From Spark 2.4 onwards, use the sequence function to generate all dates from 2016-01-01 to 2016-01-08.
Then join to the original dataframe and use coalesce to get the quantity and id values. (A Scala sketch of the same approach follows the PySpark update below.)
Example:
from pyspark.sql.functions import *

df1=spark.sql("select explode(sequence(date('2016-01-01'),date('2016-01-08'),INTERVAL 1 DAY)) as date").\
withColumn("quantity",lit(0)).\
withColumn("id",lit(1))
df1.show()
#+----------+--------+---+
#| date|quantity| id|
#+----------+--------+---+
#|2016-01-01| 0| 1|
#|2016-01-02| 0| 1|
#|2016-01-03| 0| 1|
#|2016-01-04| 0| 1|
#|2016-01-05| 0| 1|
#|2016-01-06| 0| 1|
#|2016-01-07| 0| 1|
#|2016-01-08| 0| 1|
#+----------+--------+---+
df.show()
#+---+----------+--------+
#| id| date|quantity|
#+---+----------+--------+
#| 1|2016-01-03| 10|
#| 1|2016-01-04| 20|
#| 1|2016-01-06| 30|
#| 1|2016-01-07| 20|
#+---+----------+--------+
from pyspark.sql.functions import *
from pyspark.sql.types import *
exprs=['date']+[coalesce(col(f"df.{f}"),col(f"df1.{f}")).alias(f) for f in df1.columns if f not in ['date']]
df1.\
alias("df1").\
join(df.alias("df"),['date'],'left').\
select(*exprs).\
orderBy("date").\
show()
#+----------+--------+---+
#| date|quantity| id|
#+----------+--------+---+
#|2016-01-01| 0| 1|
#|2016-01-02| 0| 1|
#|2016-01-03| 10| 1|
#|2016-01-04| 20| 1|
#|2016-01-05| 0| 1|
#|2016-01-06| 30| 1|
#|2016-01-07| 20| 1|
#|2016-01-08| 0| 1|
#+----------+--------+---+
Update:
df=spark.createDataFrame([(1,'2016-01-03',10),(1,'2016-01-04',20),(1,'2016-01-06',30),(1,'2016-01-07',20),(2,'2016-01-02',10),(2,'2016-01-03',10),(2,'2016-01-04',20),(2,'2016-01-06',20),(2,'2016-01-07',20)],["id","date","quantity"])
df1=df.selectExpr("id").distinct().selectExpr("id","explode(sequence(date('2016-01-01'),date('2016-01-08'),INTERVAL 1 DAY)) as date").withColumn("quantity",lit(0))
from pyspark.sql.functions import *
from pyspark.sql.types import *
exprs=[coalesce(col(f"df.{f}"),col(f"df1.{f}")).alias(f) for f in df1.columns]
df2=df1.alias("df1").join(df.alias("df"),(col("df1.date") == col("df.date"))& (col("df1.id") == col("df.id")),'left').select(*exprs)
df2.orderBy("id","date").show()
#+---+----------+--------+
#| id| date|quantity|
#+---+----------+--------+
#| 1|2016-01-01| 0|
#| 1|2016-01-02| 0|
#| 1|2016-01-03| 10|
#| 1|2016-01-04| 20|
#| 1|2016-01-05| 0|
#| 1|2016-01-06| 30|
#| 1|2016-01-07| 20|
#| 1|2016-01-08| 0|
#| 2|2016-01-01| 0|
#| 2|2016-01-02| 10|
#| 2|2016-01-03| 10|
#| 2|2016-01-04| 20|
#| 2|2016-01-05| 0|
#| 2|2016-01-06| 20|
#| 2|2016-01-07| 20|
#| 2|2016-01-08| 0|
#+---+----------+--------+

If you want to explicitly fill the null values with 0, then fillna also works well.
import pyspark.sql.functions as f

df2 = df.select('id').distinct() \
    .withColumn('date', f.expr('''explode(sequence(date('2016-01-01'), date('2016-01-08'), INTERVAL 1 days)) as date'''))
df2.join(df, ['id', 'date'], 'left').fillna(0).orderBy('id', 'date').show(20, False)
+---+----------+--------+
|id |date |quantity|
+---+----------+--------+
|1 |2016-01-01|0 |
|1 |2016-01-02|0 |
|1 |2016-01-03|10 |
|1 |2016-01-04|20 |
|1 |2016-01-05|0 |
|1 |2016-01-06|30 |
|1 |2016-01-07|20 |
|1 |2016-01-08|0 |
|2 |2016-01-01|0 |
|2 |2016-01-02|10 |
|2 |2016-01-03|10 |
|2 |2016-01-04|20 |
|2 |2016-01-05|0 |
|2 |2016-01-06|20 |
|2 |2016-01-07|20 |
|2 |2016-01-08|0 |
+---+----------+--------+

Related

Get % of rows that have a unique value by id

I have a pyspark dataframe that looks like this
import pandas as pd

spark.createDataFrame(
    pd.DataFrame({'ch_id': [1, 1, 1, 1, 1,
                            2, 2, 2, 2],
                  'e_id': [0, 0, 1, 2, 2,
                           0, 0, 1, 1],
                  'seg': ['h', 's', 's', 'a', 's',
                          'h', 's', 's', 'h']})
).show()
+-----+----+---+
|ch_id|e_id|seg|
+-----+----+---+
| 1| 0| h|
| 1| 0| s|
| 1| 1| s|
| 1| 2| a|
| 1| 2| s|
| 2| 0| h|
| 2| 0| s|
| 2| 1| s|
| 2| 1| h|
+-----+----+---+
I would like, for every ch_id, to get:
the % of e_id values for which there is only one unique value of seg
The output would look like this:
+-----+-------+
|ch_id|%_major|
+-----+-------+
|    1|   66.6|
|    2|    0.0|
+-----+-------+
How could I achieve that in pyspark?

After joining two dataframes, pick all columns from one dataframe on the basis of the primary key

I have two dataframes, and I need to update the records in df1 based on the new updates available in df2, in pyspark.
DF1:
df1=spark.createDataFrame([(1,2),(2,3),(3,4)],["id","val1"])
+---+----+
| id|val1|
+---+----+
| 1| 2|
| 2| 3|
| 3| 4|
+---+----+
DF2:
df2=spark.createDataFrame([(1,4),(2,5)],["id","val1"])
+---+----+
| id|val1|
+---+----+
| 1| 4|
| 2| 5|
+---+----+
Then I'm trying to join the two dataframes:
join_con=(df1["id"] == df2["id"])
jdf=df1.join(df2,join_con,"left")
+---+----+----+----+
| id|val1| id|val1|
+---+----+----+----+
| 1| 2| 1| 4|
| 3| 4|null|null|
| 2| 3| 2| 5|
+---+----+----+----+
Now, I want to pick all columns from df2 if df2["id"] is not null, otherwise pick all columns of df1.
something like:
jdf.filter(df2["id"].isNull()).select(df1["*"]) \
   .union(jdf.filter(df2["id"].isNotNull()).select(df2["*"]))
so resultant DF can be:
+---+----+
| id|val1|
+---+----+
| 1| 4|
| 2| 5|
| 3| 4|
+---+----+
Can someone please help with this?
Your selection expression can take each column from df2 whenever the joined df2 row exists (its id is not null), and fall back to the df1 column otherwise.
from pyspark.sql import functions as F

df1 = spark.createDataFrame([(1, 2), (2, 3), (3, 4), (4, 1)], ["id", "val1"])
df2 = spark.createDataFrame([(1, 4), (2, 5), (4, None)], ["id", "val1"])
jdf = df1.join(df2, df1["id"] == df2["id"], "left")

selection_expr = [F.when(df2["id"].isNotNull(), df2[c]).otherwise(df1[c]).alias(c) for c in df2.columns]
jdf.select(selection_expr).show()
"""
+---+----+
| id|val1|
+---+----+
| 1| 4|
| 2| 5|
| 3| 4|
| 4|null|
+---+----+
"""
Try the coalesce function, as it returns the first non-null value.
from pyspark.sql.functions import coalesce

expr = zip(df2.columns, df1.columns)
e1 = [coalesce(df2[f[0]], df1[f[1]]).alias(f[0]) for f in expr]
jdf.select(*e1).show()
#+---+----+
#| id|val1|
#+---+----+
#| 1| 4|
#| 2| 5|
#| 3| 4|
#+---+----+

Set literal value over Window if condition suited Spark Scala

I need to check a condition over a window:
- If the column IND_DEF is 20, then I want to change the value of the premium column for the whole window (partition) to which this record belongs, and set it to 1.
My initial Dataframe looks like this:
+--------+----+-------+-----+-------+
|policyId|name|premium|state|IND_DEF|
+--------+----+-------+-----+-------+
| 1| BK| null| KT| 40|
| 1| AK| -31| null| 30|
| 1| VZ| null| IL| 20|
| 2| VK| 32| LI| 7|
| 2| CK| 25| YNZ| 10|
| 2| CK| 0| null| 5|
| 2| VK| 30| IL| 25|
+--------+----+-------+-----+-------+
And I want to achieve this:
+--------+----+-------+-----+-------+
|policyId|name|premium|state|IND_DEF|
+--------+----+-------+-----+-------+
| 1| BK| 1| KT| 40|
| 1| AK| 1| null| 30|
| 1| VZ| 1| IL| 20|
| 2| VK| 32| LI| 7|
| 2| CK| 25| YNZ| 10|
| 2| CK| 0| null| 5|
| 2| VK| 30| IL| 25|
+--------+----+-------+-----+-------+
I am trying the following code, but it does not work...
val df_946 = Seq[(Int, String, Integer, String, Int)](
  (1, "VZ", null, "IL", 20), (1, "AK", -31, null, 30), (1, "BK", null, "KT", 40),
  (2, "CK", 0, null, 5), (2, "CK", 25, "YNZ", 10), (2, "VK", 30, "IL", 25), (2, "VK", 32, "LI", 7)
).toDF("policyId", "name", "premium", "state", "IND_DEF").orderBy("policyId")

val winSpec = Window.partitionBy("policyId").orderBy("policyId")

val df_947 = df_946.withColumn("premium",
  when(col("IND_DEF") === 20, lit(1).over(winSpec)).otherwise(col("premium")))
You can generate an array of IND_DEF values via collect_list for each window partition and recreate column premium based on the array_contains condition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  (1, None, 40),
  (1, Some(-31), 30),
  (1, None, 20),
  (2, Some(32), 7),
  (2, Some(30), 10)
).toDF("policyId", "premium", "IND_DEF")

val win = Window.partitionBy($"policyId")

df.
  withColumn("indList", collect_list($"IND_DEF").over(win)).
  withColumn("premium", when(array_contains($"indList", 20), 1).otherwise($"premium")).
  drop($"indList").
  show
// +--------+-------+-------+
// |policyId|premium|IND_DEF|
// +--------+-------+-------+
// | 1| 1| 40|
// | 1| 1| 30|
// | 1| 1| 20|
// | 2| 32| 7|
// | 2| 30| 10|
// +--------+-------+-------+
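As a variation on the approach above (my own sketch, not part of the original answer), the same effect can be achieved without collecting the whole list, by flagging partitions that contain a 20 with a conditional max over the same window; this reuses df and win from the snippet above:
df.
  // has20 is 1 for every row whose policyId partition contains IND_DEF == 20
  withColumn("has20", max(when($"IND_DEF" === 20, 1).otherwise(0)).over(win)).
  withColumn("premium", when($"has20" === 1, lit(1)).otherwise($"premium")).
  drop("has20").
  show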

how to join dataframes with some similar values and multiple keys / scala

I am having trouble producing the following table. The first two tables are my source tables, which I would like to join; the third table is the result I would like to get.
I tried an outer join with the keys "ID" and "date", but the result is not the same as in this example. The problem is that some def_ values in each table have the same date, and I would like to get them in the same row.
I used the following join:
val df_result = df_1.join(df_2, Seq("ID", "date"), "outer")
df_1
+----+-----+-----------+
|ID |def_a| date |
+----+-----+-----------+
| 01| 1| 2019-01-31|
| 02| 1| 2019-12-31|
| 03| 1| 2019-11-30|
| 01| 1| 2019-10-31|
df_2
+----+-----+-----+-----------+
|ID |def_b|def_c|date |
+----+-----+-----+-----------+
| 01| 1| 0| 2017-01-31|
| 02| 1| 1| 2019-12-31|
| 03| 1| 1| 2018-11-30|
| 03| 0| 1| 2019-11-30|
| 01| 1| 1| 2018-09-30|
| 02| 1| 1| 2018-08-31|
| 01| 1| 1| 2018-07-31|
result
+----+-----+-----+-----+-----------+
|ID |def_a|def_b|def_c|date |
+----+-----+-----+-----+-----------+
| 01| 1| 0| 0| 2019-01-31|
| 02| 1| 1| 1| 2019-12-31|
| 03| 1| 0| 1| 2019-11-30|
| 01| 1| 0| 0| 2019-10-31|
| 01| 0| 1| 0| 2017-01-31|
| 03| 0| 1| 1| 2018-11-30|
| 01| 0| 1| 1| 2018-09-30|
| 02| 0| 1| 1| 2018-08-31|
| 01| 0| 1| 1| 2018-07-31|
I would be grateful for any help.
Hope the following code is helpful:
df_result
  .groupBy("ID", "date")
  .agg(
    max("def_a").as("def_a"),
    max("def_b").as("def_b"),
    max("def_c").as("def_c")
  )
  .na.fill(0)
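For reference, here is a self-contained sketch of the whole flow under the column names shown in the question (my own reconstruction, not the answerer's code). If (ID, date) is already unique within each source table, the outer join followed by na.fill(0) alone reproduces the expected result, and the groupBy/max step above is only needed to collapse duplicates:
import org.apache.spark.sql.functions._
import spark.implicits._

// toy frames mirroring the question's source tables
val df_1 = Seq(
  ("01", 1, "2019-01-31"), ("02", 1, "2019-12-31"),
  ("03", 1, "2019-11-30"), ("01", 1, "2019-10-31")
).toDF("ID", "def_a", "date")

val df_2 = Seq(
  ("01", 1, 0, "2017-01-31"), ("02", 1, 1, "2019-12-31"),
  ("03", 1, 1, "2018-11-30"), ("03", 0, 1, "2019-11-30"),
  ("01", 1, 1, "2018-09-30"), ("02", 1, 1, "2018-08-31"),
  ("01", 1, 1, "2018-07-31")
).toDF("ID", "def_b", "def_c", "date")

// outer join on both keys keeps every (ID, date) from either side;
// the nulls introduced for the missing def_ columns are then replaced with 0
val df_result = df_1.join(df_2, Seq("ID", "date"), "outer").na.fill(0)
df_result.orderBy("ID", "date").show()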

Pass Distinct value of one Dataframe into another Dataframe

I want to take the distinct values of a column from DataFrame A and pass them into DataFrame B's explode function to create repeated rows (in DataFrame B) for each distinct value.
distinctSet = targetDf.select('utilityId').distinct()
utilisationFrequencyTable = utilisationFrequencyTable.withColumn("utilityId", psf.explode(assign_utilityId()))
Function
assign_utilityId = psf.udf(
    lambda id: [x for x in id],
    ArrayType(LongType()))
How to pass distinctSet values to assign_utilityId
Update
+---------+
|utilityId|
+---------+
| 101|
| 101|
| 102|
+---------+
+-----+------+--------+
|index|status|timeSlot|
+-----+------+--------+
| 0| SUN| 0|
| 0| SUN| 1|
I want to take the unique values from DataFrame 1 and create a new column in DataFrame 2, like this:
+-----+------+--------+---------+
|index|status|timeSlot|utilityId|
+-----+------+--------+---------+
|    0|   SUN|       0|      101|
|    0|   SUN|       1|      101|
|    0|   SUN|       0|      102|
|    0|   SUN|       1|      102|
+-----+------+--------+---------+
We don't need a UDF for this. I have tried it with some sample input; please check:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([(1,),(2,),(3,),(2,),(3,)],['col1'])
>>> df.show()
+----+
|col1|
+----+
| 1|
| 2|
| 3|
| 2|
| 3|
+----+
>>> df1 = spark.createDataFrame([(1,2),(2,3),(3,4)],['col1','col2'])
>>> df1.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
| 2| 3|
| 3| 4|
+----+----+
>>> dist_val = df.select(F.collect_set('col1').alias('val')).first()['val']
>>> dist_val
[1, 2, 3]
>>> df1 = df1.withColumn('col3',F.array([F.lit(x) for x in dist_val]))
>>> df1.show()
+----+----+---------+
|col1|col2| col3|
+----+----+---------+
| 1| 2|[1, 2, 3]|
| 2| 3|[1, 2, 3]|
| 3| 4|[1, 2, 3]|
+----+----+---------+
>>> df1.select("*",F.explode('col3').alias('expl_col')).drop('col3').show()
+----+----+--------+
|col1|col2|expl_col|
+----+----+--------+
| 1| 2| 1|
| 1| 2| 2|
| 1| 2| 3|
| 2| 3| 1|
| 2| 3| 2|
| 2| 3| 3|
| 3| 4| 1|
| 3| 4| 2|
| 3| 4| 3|
+----+----+--------+
df = sqlContext.createDataFrame(sc.parallelize([(101,),(101,),(102,)]),['utilityId'])
df2 = sqlContext.createDataFrame(sc.parallelize([(0,'SUN',0),(0,'SUN',1)]),['index','status','timeSlot'])
rdf = df.distinct()
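# no join condition below, so this is a cross join: every (index, status, timeSlot) row is paired with every distinct utilityId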
>>> df2.join(rdf).show()
+-----+------+--------+---------+
|index|status|timeSlot|utilityId|
+-----+------+--------+---------+
| 0| SUN| 0| 101|
| 0| SUN| 0| 102|
| 0| SUN| 1| 101|
| 0| SUN| 1| 102|
+-----+------+--------+---------+