Range joins in PySpark

Given two dataframes, I need to filter the records in df2 based on the ranges for matching ids in df1. I was wondering if there is a better (faster) way than the naive approach shown below. In my use case, df1 has 100 million records and df2 has over a billion records.
df1 = spark.createDataFrame(pd.DataFrame([["A",10,20],["B",5,8]],
columns=["id","start_dt_int","end_dt_int"]))
df2 = spark.createDataFrame(pd.DataFrame([["A",15],["A",25],["B",9]],
columns=["id","enc_dt_int"]))
comb = [df1.id==df2.id, df1.start_dt_int<=df2.enc_dt_int, df2.enc_dt_int<=df1.end_dt_int]
df2.join(df1, on=comb, how="leftsemi").show()
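As a sanity check on what the left-semi join should return, the same logic on the toy data can be written in plain Python (a sketch independent of Spark, just to pin down the expected output):

```python
# Toy data from the question, as plain Python tuples
df1 = [("A", 10, 20), ("B", 5, 8)]      # (id, start_dt_int, end_dt_int)
df2 = [("A", 15), ("A", 25), ("B", 9)]  # (id, enc_dt_int)

# leftsemi semantics: keep a df2 row if at least one df1 row
# has the same id and a range containing enc_dt_int
kept = [(i, enc) for (i, enc) in df2
        if any(j == i and s <= enc <= e for (j, s, e) in df1)]
print(kept)  # [('A', 15)]
```

Only ("A", 15) falls inside a matching range, so that is the one row any correct approach must return.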

Let's try a range join using Spark SQL.
First, register the dataframes as temp views:
df2.createOrReplaceTempView('df2')
df1.createOrReplaceTempView('df1')
Run the range join, then drop the unwanted columns:
spark.sql("""SELECT *
FROM df2
JOIN df1 ON (df2.id = df1.id)
and df2.enc_dt_int BETWEEN df1.start_dt_int AND df1.end_dt_int""").select([df1.id, 'enc_dt_int']).show()
Output
+---+----------+
| id|enc_dt_int|
+---+----------+
| A| 15|
+---+----------+

Related

How to concat two dataframes in which one is having record and other one is empty in pyspark?

I need help concatenating two dataframes where one is empty and the other has data. Could you please tell me how to do this in pyspark?
In pandas I would use the following (suppose df2 is empty and df1 has some records):
df2 = pd.concat([df2, df1])
But how to perform this operation in pyspark?
df1:
+--------------------+----------+---------+
| Programname|Projectnum| Drug|
+--------------------+----------+---------+
|Non-Oncology Phar...|SR0480-000|Invokamet|
+--------------------+----------+---------+
df2:
++
||
++
++
I tried many options; one worked for me.
To concat df2 to df1, I first had to create the structure of df2 to match df1, then use union for the concatenation.
df2 = sqlContext.createDataFrame(sc.emptyRDD(), df1.schema)
df2 = df2.union(df1)
result:
df2:
+--------------------+----------+---------+
| Programname|Projectnum| Drug|
+--------------------+----------+---------+
|Non-Oncology Phar...|SR0480-000|Invokamet|
+--------------------+----------+---------+
You can use the union method:
df = df1.union(df2)
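For comparison, the pandas pattern from the question behaves the same way once the empty frame is given the same columns (a minimal pandas sketch, outside Spark; newer pandas versions may warn about concatenating empty frames):

```python
import pandas as pd

df1 = pd.DataFrame(
    [["Non-Oncology Phar...", "SR0480-000", "Invokamet"]],
    columns=["Programname", "Projectnum", "Drug"])
# an empty frame with the same columns plays the role of the empty df2
df2 = pd.DataFrame(columns=df1.columns)

result = pd.concat([df2, df1], ignore_index=True)
print(result.shape)  # (1, 3)
```

The key point in both APIs is the same: the empty frame must carry the same schema before the union/concat.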

How to filter a dataframe based on multiple column matches in other dataframes in Spark Scala

Say I have three dataframes as follows:
val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")
I want to filter df2 to only the rows where sport1 and sport2 combinations are valid rows of df1. For example, since in df1, sport1 -> Run, sport2 -> Run is a valid row, it would return that as one of the rows from df2. It would not return sport1 -> Bike, sport2 -> Bike from df2 though. And it would not factor in what the 'name' column value is at all.
The expected result I'm looking for is the dataframe with the following data:
+-------+------+------+
|name |sport1|sport2|
+-------+------+------+
|kevin |Run |Run |
|anthony|Fish |Fish |
+-------+------+------+
Thanks and have a great day!
Try this:
val res = df3.intersect(df1).union(df3.intersect(df2))
+------+------+
|sport1|sport2|
+------+------+
| Run| Run|
| Fish| Fish|
| Swim| Fish|
+------+------+
To filter a dataframe based on multiple column matches in other dataframes, you can use join:
df2.join(df1.select("sport1", "sport2"), Seq("sport1", "sport2"))
Since join defaults to an inner join, you keep only the rows where "sport1" and "sport2" match in both dataframes. And because the join condition is given as a list of columns, Seq("sport1", "sport2"), those columns are not duplicated in the result.
With your example's input data:
val df1 = Seq(("steve","Run","Run"),("mike","Swim","Swim"),("bob","Fish","Fish")).toDF("name","sport1","sport2")
val df2 = Seq(("chris","Bike","Bike"),("dave","Bike","Fish"),("kevin","Run","Run"),("anthony","Fish","Fish"),("liz","Swim","Fish")).toDF("name","sport1","sport2")
You get:
+------+------+-------+
|sport1|sport2|name |
+------+------+-------+
|Run |Run |kevin |
|Fish |Fish |anthony|
+------+------+-------+
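The same inner-join filtering idea can be sketched in pandas (illustrative only, not Spark; column names taken from the question):

```python
import pandas as pd

df1 = pd.DataFrame(
    [("steve", "Run", "Run"), ("mike", "Swim", "Swim"), ("bob", "Fish", "Fish")],
    columns=["name", "sport1", "sport2"])
df2 = pd.DataFrame(
    [("chris", "Bike", "Bike"), ("dave", "Bike", "Fish"), ("kevin", "Run", "Run"),
     ("anthony", "Fish", "Fish"), ("liz", "Swim", "Fish")],
    columns=["name", "sport1", "sport2"])

# keep only df2 rows whose (sport1, sport2) pair also appears in df1;
# drop_duplicates guards against row multiplication if df1 repeats a pair
valid = df1[["sport1", "sport2"]].drop_duplicates()
result = df2.merge(valid, on=["sport1", "sport2"])
print(result["name"].tolist())  # ['kevin', 'anthony']
```

De-duplicating the key columns before the join matters in both pandas and Spark: without it, a repeated pair in df1 would duplicate matching df2 rows.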

spark: merge two dataframes; if an ID is duplicated in the two dataframes, the row in df1 overwrites the row in df2

There are two dataframes, df1 and df2, with the same schema. ID is the primary key.
I need to merge df1 and df2. This could be done by union, except for one special requirement: if there are duplicate rows with the same ID in df1 and df2, I need to keep the one from df1.
df1:
ID col1 col2
1 AA 2019
2 B 2018
df2:
ID col1 col2
1 A 2019
3 C 2017
I need the following output:
df1:
ID col1 col2
1 AA 2019
2 B 2018
3 C 2017
How to do this? Thanks. I think it is possible to register two tmp tables, do a full join, and use coalesce, but I would rather not, because in fact there are about 40 columns instead of the 3 in the above example.
Given that the two DataFrames have the same schema, you could simply union df1 with the left_anti join of df2 & df1:
df1.union(df2.join(df1, Seq("ID"), "left_anti")).show
// +---+----+----+
// | ID|col1|col2|
// +---+----+----+
// |  1|  AA|2019|
// |  2|   B|2018|
// |  3|   C|2017|
// +---+----+----+
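The union-plus-anti-join idea can be checked outside Spark with a small pandas sketch (illustrative only, using the question's toy data):

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2], "col1": ["AA", "B"], "col2": [2019, 2018]})
df2 = pd.DataFrame({"ID": [1, 3], "col1": ["A", "C"], "col2": [2019, 2017]})

# anti-join: rows of df2 whose ID does not appear in df1 ...
only_in_df2 = df2[~df2["ID"].isin(df1["ID"])]
# ... then stack df1 on top, so df1 wins on duplicate IDs
result = pd.concat([df1, only_in_df2], ignore_index=True)
print(result["col1"].tolist())  # ['AA', 'B', 'C']
```

This works regardless of how many columns the frames have, since only the ID column participates in the anti-join.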
One way to do this is to union the dataframes with an identifier column that records which dataframe each row came from, and then use it to prioritize rows from df1 with a function like row_number.
A PySpark solution is shown here.
from pyspark.sql.functions import lit, row_number, when
from pyspark.sql import Window
df1_with_identifier = df1.withColumn('identifier', lit('df1'))
df2_with_identifier = df2.withColumn('identifier', lit('df2'))
merged_df = df1_with_identifier.union(df2_with_identifier)
# Define the Window with the desired ordering: df1 rows first within each ID
w = Window.partitionBy(merged_df.ID).orderBy(when(merged_df.identifier == 'df1', 1).otherwise(2))
result = merged_df.withColumn('rownum', row_number().over(w))
# Keep only the top-ranked row per ID and drop the helper columns
result.filter(result.rownum == 1).drop('identifier', 'rownum').show()
A solution with a left join on df1 could be a lot simpler, except that you have to write multiple coalesces.
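The prioritize-and-keep-first idea behind the window function can also be sketched in pandas (illustrative only, same toy data as above):

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2], "col1": ["AA", "B"], "col2": [2019, 2018]})
df2 = pd.DataFrame({"ID": [1, 3], "col1": ["A", "C"], "col2": [2019, 2017]})

# tag each frame with a priority (df1 wins), union, then keep the
# highest-priority row per ID -- the same idea as row_number over a window
df1["priority"] = 1
df2["priority"] = 2
merged = pd.concat([df1, df2], ignore_index=True)
result = (merged.sort_values(["ID", "priority"])
                .groupby("ID", as_index=False)
                .first()
                .drop(columns="priority"))
print(result["col1"].tolist())  # ['AA', 'B', 'C']
```

Sorting by (ID, priority) and taking the first row per group is exactly what partitionBy(ID).orderBy(priority) plus rownum == 1 does in Spark.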

How to combine several Dataframes together in scala?

I have several dataframes, each containing a single column. Let's say I have 4 such dataframes, all with one column. How can I form a single dataframe by combining all of them?
val df = xmldf.select(col("UserData.UserValue._valueRef"))
val df2 = xmldf.select(col("UserData.UserValue._title"))
val df3 = xmldf.select(col("author"))
val df4 = xmldf.select(col("price"))
To combine, I am trying this, but it doesn't work:
var newdf = df
newdf = newdf.withColumn("col1",df1.col("UserData.UserValue._title"))
newdf.show()
It errors out saying that the field of one column is not present in the other. I am not sure how I can combine these 4 dataframes together. They don't have any common column.
df2 looks like this:
+---------------+
| _title|
+---------------+
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
+---------------+
and df looks like this:
+-----------+
|_valuegiven|
+-----------+
| qwe|
| dfdfrt|
| dfdf|
+-----------+
df3 and df4 are also in same format. I want like below dataframe:
+-----------+---------------+
|_valuegiven| _title|
+-----------+---------------+
| qwe|_CONFIG_CONTEXT|
| dfdfrt|_CONFIG_CONTEXT|
| dfdf|_CONFIG_CONTEXT|
+-----------+---------------+
I used this:
val newdf = xmldf.select(col("UserData.UserValue._valuegiven"),col("UserData.UserValue._title") )
newdf.show()
But I am getting the column names on the fly, and so I would need to append them on the fly; since I don't know exactly how many columns I will get, I cannot use the above command.
Your goal is a little unclear. You ask about joining these dataframes, but perhaps you just want to select those 4 columns:
val newdf = xmldf.select($"UserData.UserValue._valueRef", $"UserData.UserValue._title", $"author", $"price")
newdf.show
If you really want to join all these dataframes, you'll need to join them all and select the appropriate fields.
If the goal is to get 4 columns from xmldf into a new dataframe you shouldn't be splitting it into 4 dataframes in the first place.
You can select multiple columns from a dataframe by providing additional column names in the select function.
val newdf = xmldf.select(
col("UserData.UserValue._valueRef"),
col("UserData.UserValue._title"),
col("author"),
col("price"))
newdf.show()
So I looked at various ways and finally Ram Ghadiyaram's answer in Solution 2 does what I wanted to do. Using this approach, you can combine any number of columns on the go. Basically, you need to create indexes by which you can join the dataframes together and after joining, drop the index column altogether.

How to merge two columns into a new DataFrame?

I have two DataFrames (Spark 2.2.0 and Scala 2.11.8). The first DataFrame, df1, has one column called col1, and the second one, df2, also has a single column, called col2. The number of rows is equal in both DataFrames.
How can I merge these two columns into a new DataFrame?
I tried join, but I think that there should be some other way to do it.
Also, I tried to apply withColumn, but it does not compile:
val result = df1.withColumn(col("col2"), df2.col1)
UPDATE:
For example:
df1 =
col1
1
2
3
df2 =
col2
4
5
6
result =
col1 col2
1 4
2 5
3 6
If there's no actual relationship between these two columns, it sounds like you need the union operator, which will return, well, just the union of these two dataframes:
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.union(df2).show
+---+
|one|
+---+
| a |
| b |
| c |
| d |
| e |
| f |
+---+
[edit]
Now that you've made clear that you just want the two columns side by side, with DataFrames you can use the trick of adding a row index with the function monotonically_increasing_id() and joining on that index value. (Be aware that the generated ids are partition-dependent, so this pairing is only reliable when both DataFrames have the same number of partitions and the same row distribution.)
import org.apache.spark.sql.functions.monotonically_increasing_id
var df1 = Seq("a", "b", "c").toDF("one")
var df2 = Seq("d", "e", "f").toDF("two")
df1.withColumn("id", monotonically_increasing_id())
.join(df2.withColumn("id", monotonically_increasing_id()), Seq("id"))
.drop("id")
.show
+---+---+
|one|two|
+---+---+
| a | d |
| b | e |
| c | f |
+---+---+
As far as I know, the only way to do what you want with DataFrames is by adding an index column using RDD.zipWithIndex to each and then doing a join on the index column. Code for doing zipWithIndex on a DataFrame can be found in this SO answer.
But, if the DataFrames are small, it would be much simpler to collect the two DFs in the driver, zip them together, and make the result into a new DataFrame.
[Update with example of in-driver collect/zip]
val df3 = spark.createDataFrame(df1.collect() zip df2.collect()).withColumnRenamed("_1", "col1").withColumnRenamed("_2", "col2")
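Stripped of Spark, the collect-and-zip idea is just pairing two equally long sequences in the driver (plain-Python sketch):

```python
# imagine these are the values collected from df1 and df2 in the driver
col1 = [1, 2, 3]
col2 = [4, 5, 6]

# zip pairs the i-th element of each list, giving the rows of the result
rows = list(zip(col1, col2))
print(rows)  # [(1, 4), (2, 5), (3, 6)]
```

This only makes sense when both DataFrames fit comfortably in driver memory.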
It depends on what you want to do.
If you want to merge two DataFrames, you should use join. The same join types exist as in relational algebra (or any DBMS).
You say that your DataFrames have just one column each.
In that case you might want to do a cross join (cartesian product), which gives you a two-column table of all possible combinations of col1 and col2, or you might want the union (as referred to by @Chondrops), which gives you a one-column table with all the elements.
I think all the other join types' use cases can be handled with specialized operations in Spark (in this case, two DataFrames with one column each).