I have been trying to convert the below SAS code into PySpark syntax, but I haven't been able to figure out the dates.
inner join (select var1, max(date) as max_date
from table
group by var1) as recent
on a.var1 = recent.var1 and a.date = recent.max_date
For self joins in Spark, it's recommended to use aliases on both sides.
from pyspark.sql import functions as F

df = (table.alias('a')
      .join(
          table.groupBy('var1').agg(F.max('date').alias('date')).alias('recent'),
          (F.col("a.var1") == F.col("recent.var1")) & (F.col("a.date") == F.col("recent.date")),
          'inner')
      )
If you don't need the duplicated columns, remove them by selecting everything from the first dataframe:
from pyspark.sql import functions as F

df = (table.alias('a')
      .join(
          table.groupBy('var1').agg(F.max('date').alias('date')).alias('recent'),
          (F.col("a.var1") == F.col("recent.var1")) & (F.col("a.date") == F.col("recent.date")),
          'inner')
      .select('a.*')
      )
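If you only need the most recent row per var1, a window function is another option. This is just a sketch, assuming table has the same var1 and date columns as above; note that row_number keeps a single row per group, while the join above keeps every row that ties on the max date (use rank instead if ties matter).
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rank rows within each var1 group, newest date first
w = Window.partitionBy('var1').orderBy(F.col('date').desc())
df = (table
      .withColumn('rn', F.row_number().over(w))
      .filter(F.col('rn') == 1)
      .drop('rn'))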
I received help with the following PySpark code to prevent errors when doing a Merge in Databricks, see here:
Databricks Error: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table conflicting way
I was wondering if I could get help to modify the code to drop NULLs.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
df2 = partdf.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("Id")))
df3 = df2.filter("rn = 1").drop("rn")
Thanks
The code you are using does not completely delete the rows where P_key is null: the row number is also computed within the null partition, so the null row that gets rn = 1 is not deleted.
You can use df.na.drop instead to get the required result.
df.na.drop(subset=["P_key"]).show(truncate=False)
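If you also want the de-duplication from your snippet, one way (a minimal sketch, assuming the same partdf with P_key and Id columns) is to drop the null keys first and then apply the window:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# drop rows where P_key is null, then keep one row per P_key ordered by Id
df3 = (partdf.na.drop(subset=["P_key"])
       .withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("Id")))
       .filter("rn = 1")
       .drop("rn"))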
Alternatively, to make your original approach work, add a row with the least possible unique id value, store that id in a variable, use the same code, and add an additional condition in the filter as shown below.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number,when,col
df = spark.read.option("header",True).csv("dbfs:/FileStore/sample1.csv")
#adding row with least possible id value.
dup_id = '0'
new_row = spark.createDataFrame([[dup_id,'','x','x']], schema = ['id','P_key','c1','c2'])
#replacing empty string with null for P_key
new_row = new_row.withColumn('P_key',when(col('P_key')=='',None).otherwise(col('P_key')))
df = df.union(new_row) #row added
#code to remove duplicates
df2 = df.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("id")))
df2.show(truncate=False)
#additional condition to remove the added dummy row by its id
df3 = df2.filter((df2.rn == 1) & (df2.id != dup_id)).drop("rn")
df3.show()
I tried to join two dataframes in spark-shell.
One of the dataframes has 15000 records and the other has 14000 rows.
I tried a left outer join and an inner join of these dataframes, but the result has a count of 29000 rows.
How is that happening?
The code which I tried is given below.
val joineddf = finaldf.as("df1").join(cosmos.as("df2"), $"df1.BatchKey" === $"df2.BatchKey", "left_outer").select(($"df1.*"),col("df2.BatchKey").as("B2"))
val joineddf = finaldf.as("df1").join(cosmos.as("df2"), $"df1.BatchKey" === $"df2.BatchKey", "inner").select(($"df1.*"),col("df2.BatchKey").as("B2"))
Both of the methods above resulted in a dataframe whose count is the sum of both dataframes.
I even tried the method below, but I am still getting the same result.
finaldf.createOrReplaceTempView("df1")
cosmos.createOrReplaceTempView("df2")
val test = spark.sql("""SELECT df1.*, df2.* FROM df1 LEFT OUTER JOIN df2 ON trim(df1.BatchKey) == trim(df2.BatchKey)""")
If I try to add more conditions to the join, then the count increases again.
How do I get the exact result for a left outer join?
Here the max count should be 15000.
Antony,
Can you try performing the join below:
val joineddf = finaldf.join(cosmos.select("BatchKey"), Seq("BatchKey"), "left_outer")
Here I'm not using any alias, and only the BatchKey column is selected from cosmos before the join. Note that a left outer join can only return more than 15000 rows if some BatchKey values are duplicated in cosmos, so it is worth checking for duplicates there as well.
I have two pyspark dataframes with the same schema, as below:
df_source:
id, name, age
df_target:
id, name, age
"id" is the primary column in both tables and the rest are attribute columns.
I am accepting the primary and attribute column lists from the user as below:
primary_columns = ["id"]
attribute_columns = ["name","age"]
I need to join the above two dataframes dynamically, as below:
df_update = df_source.join(df_target, (df_source["id"] == df_target["id"]) & ((df_source["name"] != df_target["name"]) | (df_source["age"] != df_target["age"])) ,how="inner").select([df_source[col] for col in df_source.columns])
How can I build this join condition dynamically in pyspark, since the number of attribute and primary key columns can change as per the user input? Please help.
IIUC, you can achieve the desired output using an inner join on the primary_columns and a where clause that loops over the attribute_columns.
Since the two DataFrames have the same column names, use alias to differentiate column names after the join.
from functools import reduce
from pyspark.sql.functions import col

df_update = df_source.alias("s")\
    .join(df_target.alias("t"), on=primary_columns, how="inner")\
    .where(
        reduce(
            lambda a, b: a | b,
            [col("s." + c) != col("t." + c) for c in attribute_columns]
        )
    )\
    .select("s.*")
Use reduce to apply the bitwise OR operation across the inequality conditions built from the columns in attribute_columns.
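For example, with attribute_columns = ["name", "age"], the reduce call builds a condition equivalent to the hand-written one in the question (shown here only as an illustration):
from functools import reduce
from pyspark.sql.functions import col

attribute_columns = ["name", "age"]

# builds the condition (s.name != t.name) | (s.age != t.age)
diff_condition = reduce(
    lambda a, b: a | b,
    [col("s." + c) != col("t." + c) for c in attribute_columns]
)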
I am doing an inner join on spark dataframes, as a conversion of the following SQL query:
SELECT DISTINCT a.aid,a.DId,a.BM,a.BY,b.TO FROM GetRaw a
INNER JOIN DF_SD b WHERE a.aid = b.aid AND a.DId = b.DId AND a.BM = b.BM AND a.BY = b.BY
I am converting it as:
val Pr = DF_SD.select("aid","DId","BM","BY","TO").distinct()
.join(GetRaw,GetRaw.("aid") <=> DF_SD("aid")
&& GetRaw.("DId") <=> DF_SD("DId")
&& DF_SD,GetRaw.("BM") <=> DF_SD("BM")
&& DF_SD,GetRaw.("BY") <=> DF_SD("BY"))
My output table contains the columns
"aid","DId","BM","BY","TO","aid","DId","BM","BY"
Can anyone point out where I am going wrong?
Just use select and distinct after the join:
val Pr = DF_SD.join(GetRaw,Seq("aid","DId","BM","BY"))
.select("aid","DId","BM","BY","TO").distinct
You can pass the join column names as a Seq, which is the correct way of handling this problem; it avoids the duplicated join columns.
Please see https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
val Pr = DF_SD.join(GetRaw,Seq("aid","DId","BM","BY"))
.dropDuplicates() // optional: if you want to drop duplicate rows from the dataframe
Pr.show();
I am doing a join of 2 dataframes and want to select all the columns of the left frame, for example:
val join_df = first_df.join(second_df, first_df("id") === second_df("id") , "left_outer")
In the above I want to do select first_df.*. How can I select all columns of one frame in the join?
With alias:
first_df.alias("fst").join(second_df, Seq("id"), "left_outer").select("fst.*")
We can also do it with a leftsemi join. A leftsemi join returns only the columns of the left dataframe, keeping the rows that have a match in the right dataframe (so columns from df2 are not available afterwards).
Here we join the two dataframes df1 and df2 on the column col1.
df1.join(df2, df1.col("col1").equalTo(df2.col("col1")), "leftsemi")
Suppose you:
Want to use the DataFrame syntax.
Want to select all columns from df1 but only a couple from df2.
This is cumbersome to list out explicitly due to the number of columns in df1.
Then, you might do the following:
val selectColumns = df1.columns.map(df1(_)) ++ Array(df2("field1"), df2("field2"))
df1.join(df2, df1("key") === df2("key")).select(selectColumns:_*)
Just to add one possibility: without using an alias, I was able to do that in pyspark with
first_df.join(second_df, "id", "left_outer").select( first_df["*"] )
Not sure if it applies here, but I hope it helps.
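If you also need a couple of columns from the right side, the same pattern extends. A sketch, where other_col is just a placeholder column name:
# everything from first_df plus one column from second_df
first_df.join(second_df, "id", "left_outer").select(first_df["*"], second_df["other_col"])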