Are left join and right_outer join the same if the tables are positioned differently, in PySpark?

I have 2 dataframes in PySpark,
df1 = spark.createDataFrame([
    ("s1", "artist1"),
    ("s2", "artist2"),
    ("s3", "artist3"),
], ['song_id', 'artist'])
df1.show()

df2 = spark.createDataFrame([
    ("s1", "2"),
    ("s1", "3"),
    ("s4", "4"),
    ("s4", "5")
], ['song_id', 'duration'])
df2.show()
Output:
+-------+-------+
|song_id| artist|
+-------+-------+
| s1|artist1|
| s2|artist2|
| s3|artist3|
+-------+-------+
+-------+--------+
|song_id|duration|
+-------+--------+
|     s1|       2|
|     s1|       3|
|     s4|       4|
|     s4|       5|
+-------+--------+
I apply right_outer and left join on these 2 dataframes and they both seem to give me the same result:
df1.join(df2, on="song_id", how="right_outer").show()
df2.join(df1, on="song_id", how="left").show()
Output:
+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
| s1|artist1| 2|
| s1|artist1| 3|
| s4| null| 4|
| s4| null| 5|
+-------+-------+--------+
+-------+--------+-------+
|song_id|duration| artist|
+-------+--------+-------+
| s1| 2|artist1|
| s1| 3|artist1|
| s4| 4| null|
| s4| 5| null|
+-------+--------+-------+
I am not sure how to use these 2 joins effectively.
What is the difference between these 2 joins?

The left and right joins give results based on the order of the tables relative to the join keyword.
Left/leftouter/left_outer joins are all the same: they show the whole left table plus the matching records of the right table.
Right/rightouter/right_outer joins are all the same: they show the whole right table plus the matching records of the left table.
In the code
df1.join(df2, on="song_id", how="right_outer").show()
df1 is the left table (dataframe) and df2 is the right table, and the join type is right_outer, hence it shows all the rows of df2 and the matching rows of df1.
Similarly in
df2.join(df1, on="song_id", how="left").show()
df2 is the left table and df1 is the right table and the join type is left, so it shows all records of df2 and matching records of df1.
Hence both pieces of code show the same result.
df1.join(df2, on="song_id", how="right_outer").show()
df1.join(df2, on="song_id", how="left").show()
In the above code, I have placed df1 as the left table in both queries.
And here is the result:
+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
|     s1|artist1|       2|
|     s1|artist1|       3|
|     s4|   null|       4|
|     s4|   null|       5|
+-------+-------+--------+
+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
|     s1|artist1|       2|
|     s1|artist1|       3|
|     s2|artist2|    null|
|     s3|artist3|    null|
+-------+-------+--------+
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join
You can use this for reference.
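If you want to check that the first two calls really return the same rows and only the column order differs, one way is to align the columns and diff the two results. A minimal sketch, assuming the df1 and df2 defined in the question and Spark 2.4+ for exceptAll:
cols = ["song_id", "artist", "duration"]
right_result = df1.join(df2, on="song_id", how="right_outer").select(*cols)
left_result = df2.join(df1, on="song_id", how="left").select(*cols)

# both differences are empty, so the two joins return exactly the same rows
print(right_result.exceptAll(left_result).count())  # 0
print(left_result.exceptAll(right_result).count())  # 0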

Related

How to select elements of a column of a dataframe with respect to a column of another dataframe?

How can I use two dataframes and select elements of df2, if a column in df1 is included in a column in df2, and NA otherwise.
df2:
name
summer
winter
water
play
df1:
col1
play ground
winter cold
something
work
output:
col1         name
play ground  play
winter cold  winter
something    NA
work         NA
from pyspark.sql.functions import explode, split

# Create a match column by splitting col1 into words
df1 = df1.alias('df1').withColumn('col_new', explode(split('col1', r'\s')))
new = (df1.join(df2, how='left', on=df1.col_new == df2.name)  # merge on the common word
       .drop('col_new')                     # drop the match column introduced
       .orderBy([df2.name.desc(), 'name'])  # order the df
       .drop_duplicates(['col1'])           # eliminate duplicates
      ).show()
+-----------+------+
| col1| name|
+-----------+------+
|play ground| play|
| something| null|
|winter cold|winter|
| work| null|
+-----------+------+
A simpler approach is to join directly on a contains condition:
df = df1.join(df2, on=[df1.col1.contains(df2.name)], how='left')
df.show(truncate=False)
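To get the literal NA values shown in the question instead of null, the same join can be followed by a fillna on the matched column; a small sketch, assuming the df produced by the two lines above:
# replace the unmatched nulls in 'name' with the string 'NA'
df.fillna("NA", subset=["name"]).show(truncate=False)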
df1 = spark.createDataFrame([("play ground",),("winter cold",),("something",),("work",)], ['col1',])
df2 = spark.createDataFrame([("summer",),("winter",),("play bc",),("play",)], ['name',])
df1 = df1.withColumn('common_word', explode(split(col('col1'), '\s')))
# Also split & explode Column 'name' of df2.
df2 = df2.withColumn('common_word', explode(split(col('name'), '\s')))
(
df1
.join(df2, ['common_word'], "left")
.sort('col1')
.fillna("NA")
.show()
)
+-----------+-----------+-------+
|common_word| col1| name|
+-----------+-----------+-------+
| ground|play ground| NA|
| play|play ground|play bc|
| play|play ground| play|
| something| something| NA|
| cold|winter cold| NA|
| winter|winter cold| winter|
| work| work| NA|
+-----------+-----------+-------+

Spark Dataframe Combine 2 Columns into Single Column, with Additional Identifying Column

I'm trying to split and then combine 2 DataFrame columns into 1, with another column identifying which column it originated from. Here is the code to generate the sample DF
val data = Seq(("1", "in1,in2,in3", null), ("2","in4,in5","ex1,ex2,ex3"), ("3", null, "ex4,ex5"), ("4", null, null))
val df = spark.sparkContext.parallelize(data).toDF("id", "include", "exclude")
This is the sample DF
+---+-----------+-----------+
| id| include| exclude|
+---+-----------+-----------+
| 1|in1,in2,in3| null|
| 2| in4,in5|ex1,ex2,ex3|
| 3| null| ex4,ex5|
| 4| null| null|
+---+-----------+-----------+
which I'm trying to transform into
+---+----+---+
| id|type|col|
+---+----+---+
| 1|incl|in1|
| 1|incl|in2|
| 1|incl|in3|
| 2|incl|in4|
| 2|incl|in5|
| 2|excl|ex1|
| 2|excl|ex2|
| 2|excl|ex3|
| 3|excl|ex4|
| 3|excl|ex5|
+---+----+---+
EDIT: Should mention that the data inside each of the cells in the example DF is just for visualization, and doesn't need to have the form in1,ex1, etc.
I can get it to work with union, as so:
df.select($"id", lit("incl").as("type"), explode(split(col("include"), ",")))
.union(
df.select($"id", lit("excl").as("type"), explode(split(col("exclude"), ",")))
)
but I was wondering if this was possible to do without using union.
The approach I am thinking of: combine the include and exclude columns into one, apply the explode function, keep only the non-null values, and finally derive the type with a case expression.
This might be a long process.
WITH cte AS (SELECT id, concat_ws(',', include, exclude) AS outputcol FROM SQL),
ctes AS (SELECT id, explode(split(outputcol, ',')) AS finalcol FROM cte)
SELECT id, CASE WHEN finalcol LIKE 'in%' THEN 'incl' ELSE 'excl' END AS type, finalcol FROM ctes
WHERE finalcol IS NOT NULL AND finalcol != ''
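As the EDIT in the question points out, the real values do not necessarily start with in/ex, so the LIKE 'in%' step may not be usable. A variant of the same club-and-explode idea that avoids union without relying on the value prefix is to explode an array of (type, value) structs; a rough PySpark sketch (the struct field names here are just illustrative):
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("1", "in1,in2,in3", None), ("2", "in4,in5", "ex1,ex2,ex3"),
     ("3", None, "ex4,ex5"), ("4", None, None)],
    ["id", "include", "exclude"])

result = (df
    # pair each source column with its label, then explode the pairs
    .select("id", F.explode(F.array(
        F.struct(F.lit("incl").alias("type"), F.col("include").alias("vals")),
        F.struct(F.lit("excl").alias("type"), F.col("exclude").alias("vals")),
    )).alias("x"))
    .where(F.col("x.vals").isNotNull())
    # split the comma-separated string into individual values
    .select("id", "x.type", F.explode(F.split("x.vals", ",")).alias("col")))
result.show()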

How to filter columns in one table based on the same columns in another table using Spark

I need to filter columns in one table (fixTablehb004_p) based on the same columns in another table (filtredTable109_p)
I first wanted to use this code:
val filtredTablehb004_p = fixTablehb004_p
.where($"servizio_rap" === filtredTable109_p.col("servizio_rap"))
.where($"filiale_rap" === filtredTable109_p.col("filiale_rap"))
.where($"codice_rap" === filtredTable109_p.col("codice_rap"))
But it gave an error.
Then I tried the code based on this Stack Overflow question, and I got this code. The problem is that there are extra columns. I know I can use drop(columnName), but I want to ask whether I'm doing it right and whether there is a better option.
val filtredTablehb004_p = sparkSession.sql("SELECT * FROM fixTablehb004_p " +
"JOIN filtredTable109_p " +
"ON fixTablehb004_p.servizio_rap = filtredTable109_p.servizio_rap AND " +
"fixTablehb004_p.filiale_rap = filtredTable109_p.filiale_rap AND " +
"fixTablehb004_p.codice_rap = filtredTable109_p.codice_rap ")
Let's take 2 sample dataframes and see how we can select the required columns or avoid duplicate key column names in the joined output dataframe.
USING DATAFRAME API:
val df1 = Seq(("A1", "A2", 1), ("A3", "A4", 2), ("A1", "A3", 3))
.toDF("c1", "c2", "c3")
val df2 = Seq(("A1", "A2", 10), ("A3", "A4", 11))
.toDF("c1", "c2", "c4")
df1.createOrReplaceTempView("tab1")
df2.createOrReplaceTempView("tab2")
If the column names you use in the join condition are the same in both dataframes, the output dataframe will have duplicate columns. To avoid this, you can pass all those columns as a Seq to join().
df1.join(df2, Seq("c1", "c2")).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| A1| A2| 1| 10|
| A3| A4| 2| 11|
+---+---+---+---+
To select the required columns from a specific dataframe, you can use the syntax below:
df1.join(df2, Seq("c1", "c2")).select('c1, 'c2, df1("c3")).show()
// OR
df1.join(df2, df1("c1") === df2("c1") && df1("c2") === df2("c2"))
.select(df1("c1"), df1("c2"), df1("c3")).show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| A1| A2| 1|
| A3| A4| 2|
+---+---+---+
USING SQL API:
spark.sql(
"""
|SELECT t2.c1, t2.c2, t2.c4 FROM tab1 t1
|JOIN tab2 t2 ON t1.c1 = t2.c1 AND t1.c2 = t2.c2
|""".stripMargin).show()
//OR
spark.sql(
"""
|SELECT c1, c2, t2.c4 FROM tab1 t1
|JOIN tab2 t2 USING(c1, c2)
|""".stripMargin).show()
+---+---+---+
| c1| c2| c4|
+---+---+---+
| A1| A2| 10|
| A3| A4| 11|
+---+---+---+
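If the goal is only to keep the rows of the first table that have a match in the second, without bringing in any of its columns, a left semi join does that directly. A small sketch against the tab1/tab2 views registered above (the DataFrame API equivalent would be a join with "left_semi"):
spark.sql(
    """
    SELECT t1.* FROM tab1 t1
    LEFT SEMI JOIN tab2 t2 ON t1.c1 = t2.c1 AND t1.c2 = t2.c2
    """).show()
This returns only c1, c2, c3 for the tab1 rows that have a match, so there are no duplicate columns to drop.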

Comparing DataFrames in Spark

I have 2 dataframes
df1
+----------+----------------+--------------------+--------------+-------------+
| WEEK|DIM1 |DIM2 |T1 | T2 |
+----------+----------------+--------------------+--------------+-------------+
|2016-04-02| 14| NULL| 9874| 880 |
|2016-04-30| 14|FR | 9875| 13 |
|2017-06-10| 15| PQR| 9867| 57721 |
+----------+----------------+--------------------+--------------+-------------+
df2
+----------+----------------+--------------------+--------------+-------------+
| WEEK|DIM1 |DIM2 |T1 | T2 |
+----------+----------------+--------------------+--------------+-------------+
|2016-04-02| 14| NULL| 9879| 820 |
|2016-04-30| 14|FR | 9785| 9 |
|2017-06-10| 15| XYZ| 9967| 57771 |
+----------+----------------+--------------------+--------------+-------------+
I want to write a comparator in Spark which compares T1 and T2 in both dataframes, joined on WEEK, DIM1, DIM2, where T1 and T2 in df1 should be greater than T1 and T2 in df2 by 3. I want to return all rows which do not match that criterion, together with the difference in T1 and T2 between the dataframes. I also want the rows present in df1 but not in df2, and vice versa, for the combination WEEK, DIM1, DIM2.
The output should be like this
+----------+----------------+--------------------+--------------+-------------+------------------+-----------------+
| WEEK|DIM1 |DIM2 |T1_DIFF | T2_DIFF | Present_In_DF1 | Present_In_DF2|
+----------+----------------+--------------------+--------------+-------------+------------------+-----------------+
|2016-04-30| 14|FR | 90| 4 | Y | Y |
|2017-06-10| 15|PQR | 9867| 57721 | Y | N |
|2017-06-10| 15|XYZ | 9967| 57771 | N | Y |
+----------+----------------+--------------------+--------------+-------------+------------------+-----------------+
What is the best way to go about this?
I have implemented the following but do not know how to proceed after this:
val df1 = Seq(
("2016-04-02", "14", "NULL", 9874, 880), ("2016-04-30", "14", "FR", 9875, 13), ("2017-06-10", "15", "PQR", 9867, 57721)
).toDF("WEEK", "DIM1", "DIM2","T1","T2")
val df2 = Seq(
("2016-04-02", "14", "NULL", 9879, 820), ("2016-04-30", "14", "FR", 9785, 9), ("2017-06-10", "15", "XYZ", 9967, 57771)
).toDF("WEEK", "DIM1", "DIM2","T1","T2")
import org.apache.spark.sql.functions._
val joined = df1.as("l").join(df2.as("r"), Seq("WEEK", "DIM1", "DIM2"), "fullouter")
The joined dataframe looks like this:
+----------+----+----+----+-----+----+-----+
| WEEK|DIM1|DIM2| T1| T2| T1| T2|
+----------+----+----+----+-----+----+-----+
|2016-04-02| 14|NULL|9874| 880|9879| 820|
|2017-06-10| 15| PQR|9867|57721|null| null|
|2017-06-10| 15| XYZ|null| null|9967|57771|
|2016-04-30| 14| FR|9875| 13|9785| 9|
+----------+----+----+----+-----+----+-----+
I do not know how to proceed after this in a good way; I am relatively new to Scala.
One easy solution could be to join df1 and df2 with WEEK as the unique key. In the joined data you need to keep all the columns from df1 and df2.
Then you can do a map operation on the dataframe to produce the rest of the columns.
Something like
df1.createOrReplaceTempTable("df1")
df2.createOrReplaceTempTable("df2")
val df = spark.sql("select df1.*, df2.DIM1 as df2_DIM1, df2.DIM2 as df2_DIM2, df2.T1 as df2_T1, df2.T2 as df2_T2 from df1 join df2 on df1.WEEK = df2.WEEK")
// Now map on the dataframe to produce the diff dataframe
// Or you can use the SQL to do that.
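To make that concrete, here is a rough PySpark sketch of the remaining step, starting from the full outer join on WEEK, DIM1, DIM2 that the question already builds; the final filter is just one reading of the "greater by 3" rule and may need adjusting:
from pyspark.sql import functions as F

df1 = spark.createDataFrame(
    [("2016-04-02", "14", "NULL", 9874, 880), ("2016-04-30", "14", "FR", 9875, 13),
     ("2017-06-10", "15", "PQR", 9867, 57721)],
    ["WEEK", "DIM1", "DIM2", "T1", "T2"])
df2 = spark.createDataFrame(
    [("2016-04-02", "14", "NULL", 9879, 820), ("2016-04-30", "14", "FR", 9785, 9),
     ("2017-06-10", "15", "XYZ", 9967, 57771)],
    ["WEEK", "DIM1", "DIM2", "T1", "T2"])

joined = df1.alias("l").join(df2.alias("r"), ["WEEK", "DIM1", "DIM2"], "fullouter")

compared = joined.select(
    "WEEK", "DIM1", "DIM2",
    # if a key exists on only one side, fall back to that side's raw value
    F.coalesce(F.col("l.T1") - F.col("r.T1"), F.col("l.T1"), F.col("r.T1")).alias("T1_DIFF"),
    F.coalesce(F.col("l.T2") - F.col("r.T2"), F.col("l.T2"), F.col("r.T2")).alias("T2_DIFF"),
    F.when(F.col("l.T1").isNotNull(), "Y").otherwise("N").alias("Present_In_DF1"),
    F.when(F.col("r.T1").isNotNull(), "Y").otherwise("N").alias("Present_In_DF2"))

# keep rows missing on either side, or rows where df1 exceeds df2 by more than 3 on both measures
compared.where(
    (F.col("Present_In_DF1") == "N") | (F.col("Present_In_DF2") == "N") |
    ((F.col("T1_DIFF") > 3) & (F.col("T2_DIFF") > 3))).show()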

How to pivot dataset?

I use Spark 2.1.
I have some data in a Spark Dataframe, which looks like below:
**ID**  **type**  **val**
1       t1        v1
1       t11       v11
2       t2        v2
I want to pivot this data using either Spark Scala (preferably) or Spark SQL so that the final output looks like below:
**ID**  **t1**  **t11**  **t2**
1       v1      v11
2                        v2
You can use groupBy.pivot:
import org.apache.spark.sql.functions.first
df.groupBy("ID").pivot("type").agg(first($"val")).na.fill("").show
+---+---+---+---+
| ID| t1|t11| t2|
+---+---+---+---+
| 1| v1|v11| |
| 2| | | v2|
+---+---+---+---+
Note: depending on the actual data, i.e. how many values there are for each combination of ID and type, you might choose a different aggregation function.
Here's one way to do it:
val df = Seq(
(1, "T1", "v1"),
(1, "T11", "v11"),
(2, "T2", "v2")
).toDF(
"id", "type", "val"
).as[(Int, String, String)]
val df2 = df.groupBy("id").pivot("type").agg(concat_ws(",", collect_list("val")))
df2.show
+---+---+---+---+
| id| T1|T11| T2|
+---+---+---+---+
| 1| v1|v11| |
| 2| | | v2|
+---+---+---+---+
Note that if there are different vals associated with a given type, they will be grouped (comma-delimited) under the type in df2.
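For example, a quick PySpark sketch with a duplicated (id, type) pair (the extra "v1b" row is invented for illustration):
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, "T1", "v1"), (1, "T1", "v1b"), (1, "T11", "v11"), (2, "T2", "v2")],
    ["id", "type", "val"])

# the T1 cell for id 1 becomes "v1,v1b"
df.groupBy("id").pivot("type").agg(F.concat_ws(",", F.collect_list("val"))).show()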
This one should work
val seq = Seq((1,"t1","v1"),(1,"t11","v11"),(2,"t2","v2"))
val df = seq.toDF("id","type","val")
val pivotedDF = df.groupBy("id").pivot("type").agg(first("val"))
pivotedDF.show
Output:
+---+----+----+----+
| id| t1| t11| t2|
+---+----+----+----+
| 1| v1| v11|null|
| 2|null|null| v2|
+---+----+----+----+