Merge two tables with respect to dates (date & period) using pyspark - pyspark

Is there a way to merge two tables in pyspark with respect to a date, where one table holds events linked to a date and the other holds information valid over a period with a start date and an end date?
There are similar topics for Python (pandas/numpy), but none for pyspark, such as the approach (using numpy) in this answer. My goal is not to retrieve a single value but all of the information available in my right table.
In this example, I would like to enrich df1, based on the id, with all the information available in df2 for that id where the event_date falls between start_period and end_period.
import datetime

df1 = spark.createDataFrame([
    (1, 'a', datetime.datetime(2021, 1, 1)),
    (1, 'b', datetime.datetime(2021, 1, 5)),
    (1, 'c', datetime.datetime(2021, 1, 24)),
    (2, 'd', datetime.datetime(2021, 1, 10)),
    (2, 'e', datetime.datetime(2021, 1, 15))], ['id', 'event', 'event_date'])
df2 = spark.createDataFrame([
    (1, 'Xxz45', 'XX013', datetime.datetime(2021, 1, 1), datetime.datetime(2021, 1, 10)),
    (1, 'Xasz', 'XX014', datetime.datetime(2021, 1, 11), datetime.datetime(2021, 1, 22)),
    (1, 'Xbbd', 'XX015', datetime.datetime(2021, 1, 23), datetime.datetime(2021, 1, 26)),
    (1, 'Xaaq', 'XX016', datetime.datetime(2021, 1, 27), datetime.datetime(2021, 1, 31))], ['id', 'info1', 'info2', 'start_period', 'end_period'])
[EDIT] The expected output would be (joining on id and on event_date falling within the period):
df_results = spark.createDataFrame([
    (1, 'a', datetime.datetime(2021, 1, 1), 'Xxz45', 'XX013'),
    (1, 'b', datetime.datetime(2021, 1, 5), 'Xxz45', 'XX013'),
    (1, 'c', datetime.datetime(2021, 1, 24), 'Xbbd', 'XX015'),
    (2, 'd', datetime.datetime(2021, 1, 10), None, None),
    (2, 'e', datetime.datetime(2021, 1, 15), None, None)], ['id', 'event', 'event_date', 'info1', 'info2'])

You can left join df1 with df2 on the condition start_period <= event_date <= end_period:
from pyspark.sql import functions as F

(df1
    .join(df2, on=[df1['id'] == df2['id'],
                   (df1['event_date'] >= df2['start_period']) & (df1['event_date'] <= df2['end_period'])],
          how='left')
    .drop(df2['id'])
    .drop('start_period', 'end_period')
    .show()
)
# Output
# +---+-----+-------------------+-----+-----+
# | id|event| event_date|info1|info2|
# +---+-----+-------------------+-----+-----+
# | 1| a|2021-01-01 00:00:00|Xxz45|XX013|
# | 1| b|2021-01-05 00:00:00|Xxz45|XX013|
# | 1| c|2021-01-24 00:00:00| Xbbd|XX015|
# | 2| d|2021-01-10 00:00:00| null| null|
# | 2| e|2021-01-15 00:00:00| null| null|
# +---+-----+-------------------+-----+-----+

What you can do is write a UDF that creates a new column in df2 from start_period and end_period, with values like
[
datetime.datetime(2021,1,1),
datetime.datetime(2021,1,2),
datetime.datetime(2021,1,3),
datetime.datetime(2021,1,4),
datetime.datetime(2021,1,5),
datetime.datetime(2021,1,6),
datetime.datetime(2021,1,7),
datetime.datetime(2021,1,8),
datetime.datetime(2021,1,9),
datetime.datetime(2021,1,10)
]
After that you can explode this column and get a row for every date in the list. Finally, you can do an ordinary join between df1 and df2.
I did not check whether there is a built-in function to create the list of dates from the interval.
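As it turns out, Spark 2.4+ does ship a built-in for exactly this: F.sequence can expand each interval into an array of dates, which can then be exploded and joined on. A minimal sketch of that idea (untested here; note it multiplies df2 by the number of days in each period, so the range join above is usually cheaper):
from pyspark.sql import functions as F

# expand each period in df2 into one row per calendar day
df2_expanded = (df2
    .withColumn('event_date', F.explode(F.sequence(F.to_date('start_period'), F.to_date('end_period'))))
    .drop('start_period', 'end_period'))

# cast df1's timestamp to a date so the equality join lines up with the generated dates
result = (df1
    .withColumn('event_date', F.to_date('event_date'))
    .join(df2_expanded, on=['id', 'event_date'], how='left'))
result.show()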

Related

Spark Dataframe Combine 2 Columns into Single Column, with Additional Identifying Column

I'm trying to split and then combine 2 DataFrame columns into 1, with another column identifying which column it originated from. Here is the code to generate the sample DF
val data = Seq(("1", "in1,in2,in3", null), ("2","in4,in5","ex1,ex2,ex3"), ("3", null, "ex4,ex5"), ("4", null, null))
val df = spark.sparkContext.parallelize(data).toDF("id", "include", "exclude")
This is the sample DF
+---+-----------+-----------+
| id| include| exclude|
+---+-----------+-----------+
| 1|in1,in2,in3| null|
| 2| in4,in5|ex1,ex2,ex3|
| 3| null| ex4,ex5|
| 4| null| null|
+---+-----------+-----------+
which I'm trying to transform into
+---+----+---+
| id|type|col|
+---+----+---+
| 1|incl|in1|
| 1|incl|in2|
| 1|incl|in3|
| 2|incl|in4|
| 2|incl|in5|
| 2|excl|ex1|
| 2|excl|ex2|
| 2|excl|ex3|
| 3|excl|ex4|
| 3|excl|ex5|
+---+----+---+
EDIT: Should mention that the data inside each of the cells in the example DF is just for visualization, and doesn't need to have the form in1,ex1, etc.
I can get it to work with union, as so:
df.select($"id", lit("incl").as("type"), explode(split(col("include"), ",")))
.union(
df.select($"id", lit("excl").as("type"), explode(split(col("exclude"), ",")))
)
but I was wondering if this was possible to do without using union.
The approach I am thinking of is: combine the include and exclude columns into one, apply the explode function, keep only the non-empty values, and finally derive the type with a case statement.
This might be a long process. Roughly, in Spark SQL (assuming the DataFrame is registered as a temp view named df):
WITH cte AS (SELECT id, concat_ws(',', include, exclude) AS outputcol FROM df),
ctes AS (SELECT id, explode(split(outputcol, ',')) AS finalcol FROM cte)
SELECT id,
       CASE WHEN finalcol LIKE 'in%' THEN 'incl' ELSE 'excl' END AS type,
       finalcol
FROM ctes
WHERE finalcol != ''
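If the goal is simply to avoid union, another rough option is a single stack expression that emits both columns in one pass before exploding. Sketched here in pyspark syntax (the Scala selectExpr equivalent is analogous; vals is just an illustrative intermediate column name):
from pyspark.sql import functions as F

result = (df
    # emit one row per (type, raw string) pair without a union
    .selectExpr("id", "stack(2, 'incl', include, 'excl', exclude) as (type, vals)")
    .where(F.col('vals').isNotNull())
    .select('id', 'type', F.explode(F.split('vals', ',')).alias('col')))
result.show()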

Ambiguous Column in DataFrame Join - Unable to Alias or Call

I am coming to Databricks from a SQL background and working with some dataframe samples for joins and basic transformations, and I am having trouble isolating the correct dataframe.column for other transformations after the join.
For DF1, I have 3 columns: user_id, user_ts, email. For DF2, I have two columns: email, converted.
Below is how I have the logic for the join. This works and returns 5 columns; however, there are two email columns in the schema
df3 = (df1
.join(df2, df1.email == df2.email, "outer")
)
I am trying to do a basic transformation on the df2 email column as part of the dataframe chain, but I receive the error:
"Cannot resolve column name "df2.email" among (user_id, user_ts, email, email, converted)"
df3 = (df1
.join(df2, df1.email == df2.email, "outer")
.na.fill(False,["df2.email"])
)
If I remove the df2 prefix from the fill(), I get an error that the columns are ambiguous.
How can I specify which column I want to transform when it has the same name as another column? In SQL, I would just qualify the column with a table alias, but this doesn't seem to be how pyspark is best used.
Suggestions?
If you want to avoid having both key columns in the join result and get a single combined key column, you can pass a list of key column names as an argument to the join() method.
If you want to retain the key column from both dataframes, you have to rename one of the columns before doing the transformation, otherwise Spark will throw an ambiguous-column error.
df1 = spark.createDataFrame([(1, 'abc#gmail.com'),(2,'def#gmail.com')],["id1", "email"])
df2 = spark.createDataFrame([(1, 'abc#gmail.com'),(2,'ghi#gmail.com')],["id2", "email"])
df1.join(df2,['email'], 'outer').show()
'''
+-------------+----+----+
| email| id1| id2|
+-------------+----+----+
|def#gmail.com| 2|null|
|ghi#gmail.com|null| 2|
|abc#gmail.com| 1| 1|
+-------------+----+----+'''
df1.join(df2,df1['email'] == df2['email'], 'outer').show()
'''
+----+-------------+----+-------------+
| id1| email| id2| email|
+----+-------------+----+-------------+
| 2|def#gmail.com|null| null|
|null| null| 2|ghi#gmail.com|
| 1|abc#gmail.com| 1|abc#gmail.com|
+----+-------------+----+-------------+'''
df1.join(df2,df1['email'] == df2['email'], 'outer') \
.select('id1', 'id2', df1['email'], df2['email'].alias('email2')) \
.na.fill('False','email2').show()
'''
+----+----+-------------+-------------+
| id1| id2| email| email2|
+----+----+-------------+-------------+
| 2|null|def#gmail.com| False|
|null| 2| null|ghi#gmail.com|
| 1| 1|abc#gmail.com|abc#gmail.com|
+----+----+-------------+-------------+ '''
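Another option, mentioned above but not shown: rename the clashing column before the join so the result never contains two columns with the same name. A rough sketch against the column names from the question (email_df2 is just an illustrative name):
# rename df2's email up front so the joined result has no duplicate names
df2_renamed = df2.withColumnRenamed('email', 'email_df2')

df3 = df1.join(df2_renamed, df1['email'] == df2_renamed['email_df2'], 'outer')

# 'email_df2' can now be referenced unambiguously in later transformations
df3.select('user_id', 'user_ts', 'email', 'email_df2', 'converted').show()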

to_date gives null on format yyyyww (202001 and 202053)

I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except for week '202001' and '202053', example:
import pyspark.sql.functions as F

df = spark.createDataFrame([
    (1, "202001"),
    (2, "202002"),
    (3, "202003"),
    (4, "202052"),
    (5, "202053")
], ['id', 'week_year'])

df.withColumn("date", F.to_date(F.col("week_year"), "yyyyw")).show()
I can't figure out what the error is or how to fix these weeks. How can I convert weeks 202001 and 202053 to a valid date?
Dealing with ISO week in Spark is indeed a headache - in fact this functionality was deprecated (removed?) in Spark 3. I think using Python datetime utilities within a UDF is a more flexible way to do this.
import datetime
import pyspark.sql.functions as F

@F.udf('date')
def week_year_to_date(week_year):
    # the '1' is for specifying the first day of the week
    return datetime.datetime.strptime(week_year + '1', '%G%V%u')

df = spark.createDataFrame([
    (1, "202001"),
    (2, "202002"),
    (3, "202003"),
    (4, "202052"),
    (5, "202053")
], ['id', 'week_year'])

df.withColumn("date", week_year_to_date('week_year')).show()
+---+---------+----------+
| id|week_year| date|
+---+---------+----------+
| 1| 202001|2019-12-30|
| 2| 202002|2020-01-06|
| 3| 202003|2020-01-13|
| 4| 202052|2020-12-21|
| 5| 202053|2020-12-28|
+---+---------+----------+
Based on mck's answer, this is the solution I ended up using for Python version 3.5.2:
import datetime
from dateutil.relativedelta import relativedelta
import pyspark.sql.functions as F

@F.udf('date')
def week_year_to_date(week_year):
    # the '1' is for specifying the first day of the week
    return datetime.datetime.strptime(week_year + '1', '%Y%W%w') - relativedelta(weeks=1)

df = spark.createDataFrame([
    (9, "201952"),
    (1, "202001"),
    (2, "202002"),
    (3, "202003"),
    (4, "202052"),
    (5, "202053")
], ['id', 'week_year'])

df.withColumn("date", week_year_to_date('week_year')).show()
Since '%G%V%u' was only added in Python 3.6, I had to subtract a week from the date to get the correct dates.
The following does not use a udf, but instead a more efficient, vectorized pandas_udf:
import pandas as pd
import pyspark.sql.functions as F

@F.pandas_udf('date')
def week_year_to_date(week_year: pd.Series) -> pd.Series:
    return pd.to_datetime(week_year + '1', format='%G%V%u')

df.withColumn('date', week_year_to_date('week_year')).show()
# +---+---------+----------+
# | id|week_year| date|
# +---+---------+----------+
# | 1| 202001|2019-12-30|
# | 2| 202002|2020-01-06|
# | 3| 202003|2020-01-13|
# | 4| 202052|2020-12-21|
# | 5| 202053|2020-12-28|
# +---+---------+----------+

Extract week day number from string column (datetime stamp) in spark api

I am new to the Spark API. I am trying to extract the weekday number from a string column, say col_date, holding a datetime stamp such as '13AUG15:09:40:15', and to add another integer column weekday. I have not been able to do this successfully.
The approach below worked for me, using a 'one line' udf - similar to, but different from, the one above:
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dayofweek').getOrCreate()
set up the dataframe:
df = spark.createDataFrame(
[(1, "2018-05-12")
,(2, "2018-05-13")
,(3, "2018-05-14")
,(4, "2018-05-15")
,(5, "2018-05-16")
,(6, "2018-05-17")
,(7, "2018-05-18")
,(8, "2018-05-19")
,(9, "2018-05-20")
], ("id", "date"))
set up the udf:
from pyspark.sql.functions import udf,desc
from datetime import datetime
weekDay = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%w'))
df = df.withColumn('weekDay', weekDay(df['date'])).sort(desc("date"))
results:
df.show()
+---+----------+-------+
| id| date|weekDay|
+---+----------+-------+
| 9|2018-05-20| 0|
| 8|2018-05-19| 6|
| 7|2018-05-18| 5|
| 6|2018-05-17| 4|
| 5|2018-05-16| 3|
| 4|2018-05-15| 2|
| 3|2018-05-14| 1|
| 2|2018-05-13| 0|
| 1|2018-05-12| 6|
+---+----------+-------+
Well, this is quite simple.
This simple function does all the work and returns the weekday as a number (Sunday = 0, Monday = 1, ..., Saturday = 6):
from datetime import datetime

# get the weekday number from the timestamp string
def toWeekDay(x):
    # v = datetime.strptime(datetime.fromtimestamp(int(x)).strftime("%Y %m %d %H"), "%Y %m %d %H").strftime('%w')  # from a unix timestamp
    v = datetime.strptime(x, '%d%b%y:%H:%M:%S').strftime('%w')
    return v

days = ['13AUG15:09:40:15', '27APR16:20:04:35']  # create example dates
days = sc.parallelize(days)  # for example purposes - transform python list to RDD so we can do it in a 'Spark [parallel] way'
days.take(2)  # to see what's in the RDD
> ['13AUG15:09:40:15', '27APR16:20:04:35']
result = days.map(lambda x: toWeekDay(x))  # apply the function toWeekDay to each element of the RDD
result.take(2)  # let's see the results
> ['4', '3']
Please see Python documentation for further details on datetime processing.
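For what it's worth, recent Spark versions can also do this without a Python UDF at all. A rough sketch (untested; note that the built-in dayofweek numbers days 1 = Sunday through 7 = Saturday, unlike %w above, and the handling of upper-case month abbreviations such as 'AUG' can depend on your Spark version):
from pyspark.sql import functions as F

df = spark.createDataFrame([('13AUG15:09:40:15',), ('27APR16:20:04:35',)], ['col_date'])

df = df.withColumn(
    'weekday',
    F.dayofweek(F.to_timestamp('col_date', 'ddMMMyy:HH:mm:ss'))  # 1 = Sunday, ..., 7 = Saturday
)
df.show()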

How to avoid duplicate columns after join?

I have two dataframes with the following columns:
df1.columns
// Array(ts, id, X1, X2)
and
df2.columns
// Array(ts, id, Y1, Y2)
After I do
val df_combined = df1.join(df2, Seq(ts,id))
I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I would have expected the common columns to be dropped. Is there something additional that needs to be done?
The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.
Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.
Here is the left dataframe:
val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))
val left = llist.toDF("firstname","lastname","date","duration")
left.show()
/*
+---------+--------+----------+--------+
|firstname|lastname| date|duration|
+---------+--------+----------+--------+
| bob| b|2015-01-13| 4|
| alice| a|2015-04-23| 10|
+---------+--------+----------+--------+
*/
Here is the right dataframe:
val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")
right.show()
/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
| alice| a| 100|
| bob| b| 23|
+---------+--------+------+
*/
Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").
The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:
left.join(right, left("firstname")===right("firstname") &&
left("lastname")===right("lastname")).show
/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname| date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
| bob| b|2015-01-13| 4| bob| b| 23|
| alice| a|2015-04-23| 10| alice| a| 100|
+---------+--------+----------+--------+---------+--------+------+
*/
The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:
left.join(right, Seq("firstname", "lastname")).show
/*
+---------+--------+----------+--------+------+
|firstname|lastname| date|duration|upload|
+---------+--------+----------+--------+------+
| bob| b|2015-01-13| 4| 23|
| alice| a|2015-04-23| 10| 100|
+---------+--------+----------+--------+------+
*/
This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this:
SELECT * FROM a JOIN b ON joinExprs
If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate, you can access these columns through the parent DataFrames:
val a: DataFrame = ???
val b: DataFrame = ???
val joinExprs: Column = ???
a.join(b, joinExprs).select(a("id"), b("foo"))
// drop equivalent
a.alias("a").join(b.alias("b"), joinExprs).drop(b("id")).drop(a("foo"))
or use aliases:
// As for now aliases don't work with drop
a.alias("a").join(b.alias("b"), joinExprs).select($"a.id", $"b.foo")
For equi-joins there exists a special shortcut syntax which takes either a sequence of strings:
val usingColumns: Seq[String] = ???
a.join(b, usingColumns)
or as single string
val usingColumn: String = ???
a.join(b, usingColumn)
which keeps only one copy of the columns used in the join condition.
I have been stuck with this for a while, and only recently did I come up with a solution that is quite easy.
Say a is
scala> val a = Seq(("a", 1), ("b", 2)).toDF("key", "vala")
a: org.apache.spark.sql.DataFrame = [key: string, vala: int]
scala> a.show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
and
scala> val b = Seq(("a", 1)).toDF("key", "valb")
b: org.apache.spark.sql.DataFrame = [key: string, valb: int]
scala> b.show
+---+----+
|key|valb|
+---+----+
| a| 1|
+---+----+
and I can do this to select only the value in dataframe a:
scala> a.join(b, a("key") === b("key"), "left").select(a.columns.map(a(_)) : _*).show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
You can simply use this
df1.join(df2, Seq("ts","id"),"TYPE-OF-JOIN")
Here TYPE-OF-JOIN can be
left
right
inner
fullouter
For example, I have two dataframes like this:
// df1
word count1
w1 10
w2 15
w3 20
// df2
word count2
w1 100
w2 150
w5 200
If you do fullouter join then the result looks like this
df1.join(df2, Seq("word"),"fullouter").show()
word count1 count2
w1 10 100
w2 15 150
w3 20 null
w5 null 200
try this,
val df_combined = df1.join(df2, df1("ts") === df2("ts") && df1("id") === df2("id")).drop(df2("ts")).drop(df2("id"))
This is normal behavior coming from SQL. What I do for this is:
Drop or rename the source columns
Do the join
Drop the renamed column, if any
Here I am replacing the "fullname" column:
Some code in Java:
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data/year=%d/month=%d/day=%d", year, month, day))
.drop("fullname")
.registerTempTable("data_original");
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data_v2/year=%d/month=%d/day=%d", year, month, day))
.registerTempTable("data_v2");
this
.sqlContext
.sql(etlQuery)
.repartition(1)
.write()
.mode(SaveMode.Overwrite)
.parquet(outputPath);
Where the query is:
SELECT
d.*,
concat_ws('_', product_name, product_module, name) AS fullname
FROM
{table_source} d
LEFT OUTER JOIN
{table_updates} u ON u.id = d.id
This is something you can do only with Spark I believe (drop column from list), very very helpful!
Inner join is the default join in Spark; below is the simple syntax for it:
leftDF.join(rightDF, "commonColName")
For other joins you can follow the syntax below:
leftDF.join(rightDF, Seq("commonCol1", "commonCol2"), "joinType")
If the column names are not common, then:
leftDF.join(rightDF, leftDF.col("x") === rightDF.col("y"), "joinType")
Best practice is to make the column names different in both DataFrames before joining them, and to drop the duplicate afterwards.
df1.columns = ['id', 'age', 'income']
df2.columns = ['id', 'age_group']
df1.join(df2, on=df1.id == df2.id, how='inner').write.saveAsTable('table_name')
will fail with an error about duplicate columns.
Try this instead:
df2_id_renamed = df2.withColumnRenamed('id', 'id_2')
df1.join(df2_id_renamed, on=df1.id == df2_id_renamed.id_2, how='inner').drop('id_2')
If anyone is using Spark SQL and wants to achieve the same thing, you can use the USING clause in the join query:
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df1 = List((1, 4, 3), (5, 2, 4), (7, 4, 5)).toDF("c1", "c2", "C3")
val df2 = List((1, 4, 3), (5, 2, 4), (7, 4, 10)).toDF("c1", "c2", "C4")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 using (c1, c2)").show(false)
/*
+---+---+---+---+
|c1 |c2 |C3 |C4 |
+---+---+---+---+
|1 |4 |3 |3 |
|5 |2 |4 |4 |
|7 |4 |5 |10 |
+---+---+---+---+
*/
After I've joined multiple tables together, I run them through a simple function to rename columns in the DF if it encounters duplicates. Alternatively, you could drop these duplicate columns too.
Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = deDupeDfCols(NamesAndDates, '_')
NamesAndDates.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where deDupeDfCols is defined as:
def deDupeDfCols(df, separator=''):
    newcols = []
    for col in df.columns:
        if col not in newcols:
            newcols.append(col)
        else:
            for i in range(2, 1000):
                if (col + separator + str(i)) not in newcols:
                    newcols.append(col + separator + str(i))
                    break
    return df.toDF(*newcols)
The resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Id2', 'Date', 'Description2'].
Apologies this answer is in Python - I'm not familiar with Scala, but this was the question that came up when I Googled this problem and I'm sure Scala code isn't too different.