inner join in pyspark

I have a pyspark data frame (df1) which consists of 10K rows and looks like this -
id mobile_no value
1 1111111111 .43
2 2222222222 .54
3 3333333333 .03
4 4444444444 .22
Another pyspark data frame (df2) consists of 100k records and looks like this -
mobile_no gender
912222222222 M
914444444444 M
919999999999 F
915555555555 M
918888888888 F
I want an inner join using pyspark where the final data frame looks like this -
mobile_no value gender
2222222222 .54 M
4444444444 .22 M
The length of mobile_no in df2 is 12 but in df1 it is 10. I can join them, but it's a costly operation.
Any help using pyspark?
common_cust = spark.sql("SELECT mobile_number, age \
FROM df1 \
WHERE mobile_number IN (SELECT DISTINCT mobile_number FROM df2)")

One way could be to use the substring function on df2 to keep only the last 10 digits, so that it has the same length as in df1:
import pyspark.sql.functions as F
df2.select(F.substring('mobile_no', 3, 10).alias('mobile_no'), 'gender').show()
+----------+------+
| mobile_no|gender|
+----------+------+
|2222222222| M|
|4444444444| M|
|9999999999| F|
|5555555555| M|
|8888888888| F|
+----------+------+
Then you just need to do an inner join to get your expected output:
common_cust = df1.select('mobile_no', 'value')\
    .join(df2.select(F.substring('mobile_no', 3, 10).alias('mobile_no'), 'gender'),
          on=['mobile_no'], how='inner')
common_cust.show()
+----------+-----+------+
| mobile_no|value|gender|
+----------+-----+------+
|2222222222| 0.54| M|
|4444444444| 0.22| M|
+----------+-----+------+
If you want to use spark.sql, I guess you can do it like this:
common_cust = spark.sql("""select df1.mobile_no, df1.value, df2.gender
from df1
inner join df2
on df1.mobile_no = substring(df2.mobile_no, 3, 10)""")
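Note that spark.sql can only reference df1 and df2 if they have been registered as temporary views first; a minimal sketch (the view names here are an assumption, use whatever names you register):
# Register the dataframes so the SQL query above can refer to them by name.
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")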

Related

Databricks: how to convert Spark dataframe under %r to dataframe under %python

I found some tips about converting a pyspark dataframe to R, but I need to perform the opposite task: convert an R dataframe to pyspark.
Does anyone know how to do it?
You can use the same approach as for other languages - use the createOrReplaceTempView function to register your dataframe, and then use spark.sql from the other language to access its content.
For example, if the R side looks like the following:
%r
library(SparkR)
id <- c(rep(1, 3), rep(2, 3), 3)
desc <- c('New', 'New', 'Good', 'New', 'Good', 'Good', 'New')
df <- data.frame(id, desc)
df <- createDataFrame(df)
createOrReplaceTempView(df, "test_df")
head(df)
id desc
1 1 New
2 1 New
3 1 Good
4 2 New
5 2 Good
6 2 Good
then you can access these data from Python:
df = spark.sql("select * from test_df")
df.show()
+---+----+
| id|desc|
+---+----+
|1.0| New|
|1.0| New|
|1.0|Good|
|2.0| New|
|2.0|Good|
|2.0|Good|
|3.0| New|
+---+----+

Spark: how to group rows into a fixed size array?

I have a dataset that looks like this:
+---+
|col|
+---+
| a|
| b|
| c|
| d|
| e|
| f|
| g|
+---+
I want to reformat this dataset so that the rows are aggregated into arrays of fixed length, like so:
+------+
| col|
+------+
|[a, b]|
|[c, d]|
|[e, f]|
| [g]|
+------+
I tried this:
spark.sql("select collect_list(col) from (select col, row_number() over (order by col) row_number from dataset) group by floor(row_number/2)")
But the problem with this is that my actual dataset is too large to process in a single partition for row_number()
As you wish to distribute this, there are a couple of steps necessary.
In case you wish to run the code, I am starting from this:
var df = List(
"a", "b", "c", "d", "e", "f", "g"
).toDF("col")
val desiredArrayLength = 2
First, split your dataframe into a small one which you can process on a single node, and a larger one whose number of rows is a multiple of the desired array length (in your example, this is 2).
import org.apache.spark.sql.functions._

val nRowsPrune = 1 // number of rows to prune so that the remaining dataframe has a
                   // number of rows that is a multiple of the desired array length
val dfPrune = df.sort(desc("col")).limit(nRowsPrune)
df = df.join(dfPrune, Seq("col"), "left_anti") // separate the small dataframe from the large one
By construction, you can apply the original code on the small dataframe,
val groupedPruneDf = dfPrune
  // .withColumn("g", floor((lit(-1) + row_number().over(w)) / lit(desiredArrayLength))) // not needed here:
  // .groupBy("g")                                                                       // the pruned frame fits in a single array
  .agg(collect_list("col").alias("col"))
  .select("col")
Now we need to figure out a way to deal with the remaining large dataframe. However, we have made sure that df has a number of rows which is a multiple of the array size.
This is where we use a great trick: repartitioning with repartitionByRange. Basically, this partitioning preserves the sorting, and since we choose the number of partitions accordingly, each partition will have the same size.
You can now collect each array within each partition:
import org.apache.spark.sql.expressions.Window

val nRows = df.count()
val maxNRowsPartition = desiredArrayLength // rows per partition; keep it a multiple of the desired array length
val nPartitions = math.max(1, math.floor(nRows / maxNRowsPartition)).toInt

df = df.repartitionByRange(nPartitions, $"col".desc)
  .withColumn("partitionId", spark_partition_id())

val w = Window.partitionBy($"partitionId").orderBy("col")
val groupedDf = df
  .withColumn("g", floor((lit(-1) + row_number().over(w)) / lit(desiredArrayLength))) // -1 because row_number starts from 1
  .groupBy("partitionId", "g")
  .agg(collect_list("col").alias("col"))
  .select("col")
Finally, combining the two results yields what you are looking for:
val result = groupedDf.union(groupedPruneDf)
result.show(truncate=false)
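For completeness, a rough pyspark sketch of the same repartition-and-group step (assuming, as above, a dataframe df with a single column col whose row count is already a multiple of the desired array length; the variable names simply mirror the Scala version and are not from the original answer):
from pyspark.sql import functions as F, Window

desired_array_length = 2
n_rows = df.count()
n_partitions = max(1, int(n_rows // desired_array_length))

# Range partitioning keeps the sort order, so each partition holds a contiguous slice of rows.
df = (df.repartitionByRange(n_partitions, F.col("col").desc())
        .withColumn("partitionId", F.spark_partition_id()))

# Number the rows within each partition and group every desired_array_length of them into one array.
w = Window.partitionBy("partitionId").orderBy("col")
grouped = (df.withColumn("g", F.floor((F.row_number().over(w) - 1) / desired_array_length))
             .groupBy("partitionId", "g")
             .agg(F.collect_list("col").alias("col"))
             .select("col"))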

pyspark, get rows where first column value equals id and second column value is between two values, do this for each row in a dataframe

So I have one pyspark dataframe like so, let's call it dataframe a:
+-------------------+---------------+----------------+
| reg| val1| val2 |
+-------------------+---------------+----------------+
| N110WA| 1590030660| 1590038340000|
| N876LF| 1590037200| 1590038880000|
| N135MH| 1590039060| 1590040080000|
And another like this, let's call it dataframe b:
+-----+-------------+-----+-----+---------+----------+---+----+
| reg| postime| alt| galt| lat| long|spd| vsi|
+-----+-------------+-----+-----+---------+----------+---+----+
|XY679|1590070078549| 50| 130|18.567169|-69.986343|132|1152|
|HI949|1590070091707| 375| 455| 18.5594|-69.987804|148|1344|
|JX784|1590070110666| 825| 905|18.544968|-69.990414|170|1216|
Is there some way to create a numpy array or pyspark dataframe where, for each row in dataframe a, all the rows in dataframe b with the same reg and a postime between val1 and val2 are included?
You can try the solution below -- and let us know if it works or if anything else is expected.
I have modified the inputs a little in order to showcase the working solution --
Input here
from pyspark.sql import functions as F
df_a = spark.createDataFrame([('N110WA',1590030660,1590038340000), ('N110WA',1590070078549,1590070078559)],[ "reg","val1","val2"])
df_b = spark.createDataFrame([('N110WA',1590070078549)],[ "reg","postime"])
df_a.show()
df_a
+------+-------------+-------------+
| reg| val1| val2|
+------+-------------+-------------+
|N110WA| 1590030660|1590038340000|
|N110WA|1590070078549|1590070078559|
+------+-------------+-------------+
df_b
+------+-------------+
| reg| postime|
+------+-------------+
|N110WA|1590070078549|
+------+-------------+
Solution here
from pyspark.sql import types as T
from pyspark.sql import functions as F
df_a = df_a.join(df_b, 'reg', 'left')  # bring postime from df_b onto df_a
df_a = df_a.withColumn('condition_col', F.when(((F.col('postime') >= F.col('val1')) & (F.col('postime') <= F.col('val2'))), '1').otherwise('0'))
df_a = df_a.filter(F.col('condition_col') == 1).drop('condition_col')
df_a.show()
Final Output
+------+-------------+-------------+-------------+
| reg| val1| val2| postime|
+------+-------------+-------------+-------------+
|N110WA|1590070078549|1590070078559|1590070078549|
+------+-------------+-------------+-------------+
Yes, assuming df_a and df_b are both pyspark dataframes, you can use an inner join in pyspark:
delta = 0  # optional tolerance around the [val1, val2] window
df = df_a.join(df_b, [
    df_a.reg == df_b.reg,
    df_b.postime >= df_a.val1 - delta,
    df_b.postime <= df_a.val2 + delta
], "inner")
This will keep only the rows of df_b whose postime falls inside the window given by the matching row of df_a.

How to filter a dataframe by multiple columns?

I have a problem as below:
I have two dataframes
Dataframe DF1:
ID, Name, age
1 name1 18
2 name2 20
DataFrame DF2:
ID, Name, age
1 name1 18
3 name3 19
I am attempting to filter DF2 to exclude records contained in DF1 (matched by ID and Name), so that I can get a new DF2 like
ID, Name, age
3 name3 19
and then union these two dataframes to get final result:
ID, Name, age
1 name1 18
2 name2 20
3 name3 19
To do this in T-SQL, I can write a statement like
INSERT INTO DF1
SELECT ID, Name, age FROM DF2 WHERE NOT EXISTS
(SELECT 1 FROM DF1 WHERE DF1.ID = DF2.ID AND DF1.Name = DF2.Name)
But I find that "insert" is not supported for dataframes in Spark SQL.
So my questions are:
How can I filter a dataframe based on multiple columns?
How can I union two dataframes together?
I would appreciate any solution.
UNION followed by DISTINCT
Assuming that the records are unique the simplest way to achieve what you want is to take UNION and follow it by DISTINCT:
val df1 = Seq((1, "name1", 18), (2, "name2", 20)).toDF("ID", "Name", "age")
val df2 = Seq((1, "name1", 18), (3, "name3", 19)).toDF("ID", "Name", "age")
df1.unionAll(df2).distinct.show
// +---+-----+---+
// | ID| Name|age|
// +---+-----+---+
// | 1|name1| 18|
// | 2|name2| 20|
// | 3|name3| 19|
// +---+-----+---+
Characteristics:
has to access df1 only once
shuffles both df1 and df2 independent of the size
EXCEPT followed by UNION
Another approach is to use EXCEPT followed by UNION:
df1.unionAll(df2.except(df1)).show // df2.distinct.except to drop duplicates
// +---+-----+---+
// | ID| Name|age|
// +---+-----+---+
// | 1|name1| 18|
// | 2|name2| 20|
// | 3|name3| 19|
// +---+-----+---+
Properties:
has to access df1 twice
shuffles both frames independent of the size (?)
can be used with three frames (df3.unionAll(df2.except(df1)))
LEFT OUTER JOIN followed by SELECT with filter followed by UNION
Finally, if you want only a partial match, a LEFT OUTER JOIN with a filter followed by UNION should do the trick:
df2.as("df2")
  .join(
    df1.select("id", "name").as("df1"),
    // join on id and name
    $"df1.id" === $"df2.id" && $"df1.name" === $"df2.name",
    "leftouter")
  // This could be replaced by .na.drop(...)
  .where($"df1.id".isNull && $"df1.Name".isNull)
  .select($"df2.id", $"df2.name", $"df2.age")
  .unionAll(df1)
  .show
// +---+-----+---+
// | ID| Name|Age|
// +---+-----+---+
// | 3|name3| 19|
// | 1|name1| 18|
// | 2|name2| 20|
// +---+-----+---+
Properties:
has to access df1 twice
if one of the data frames is small enough to be broadcast, it may not require a shuffle
can be used with three data frames
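In pyspark, a left anti join is another way to get the same NOT EXISTS behaviour; a hedged sketch, assuming pyspark dataframes df1 and df2 with the question's columns (ID, Name, age):
# Rows of df2 with no (ID, Name) match in df1, then append df1.
new_rows = df2.join(df1.select("ID", "Name"), on=["ID", "Name"], how="left_anti")
result = df1.union(new_rows)   # use unionAll on Spark 1.x
result.show()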

How to avoid duplicate columns after join?

I have two dataframes with the following columns:
df1.columns
// Array(ts, id, X1, X2)
and
df2.columns
// Array(ts, id, Y1, Y2)
After I do
val df_combined = df1.join(df2, Seq(ts,id))
I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I would expect the common columns to be dropped. Is there something additional that needs to be done?
The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.
Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.
Here is the left dataframe:
val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))
val left = llist.toDF("firstname","lastname","date","duration")
left.show()
/*
+---------+--------+----------+--------+
|firstname|lastname| date|duration|
+---------+--------+----------+--------+
| bob| b|2015-01-13| 4|
| alice| a|2015-04-23| 10|
+---------+--------+----------+--------+
*/
Here is the right dataframe:
val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")
right.show()
/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
| alice| a| 100|
| bob| b| 23|
+---------+--------+------+
*/
Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").
The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:
left.join(right, left("firstname")===right("firstname") &&
left("lastname")===right("lastname")).show
/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname| date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
| bob| b|2015-01-13| 4| bob| b| 23|
| alice| a|2015-04-23| 10| alice| a| 100|
+---------+--------+----------+--------+---------+--------+------+
*/
The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:
left.join(right, Seq("firstname", "lastname")).show
/*
+---------+--------+----------+--------+------+
|firstname|lastname| date|duration|upload|
+---------+--------+----------+--------+------+
| bob| b|2015-01-13| 4| 23|
| alice| a|2015-04-23| 10| 100|
+---------+--------+----------+--------+------+
*/
This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this:
SELECT * FROM a JOIN b ON joinExprs
If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate, you can access these columns using the parent DataFrames:
val a: DataFrame = ???
val b: DataFrame = ???
val joinExprs: Column = ???
a.join(b, joinExprs).select(a("id"), b("foo"))
// drop equivalent
a.alias("a").join(b.alias("b"), joinExprs).drop(b("id")).drop(a("foo"))
or use aliases:
// As for now aliases don't work with drop
a.alias("a").join(b.alias("b"), joinExprs).select($"a.id", $"b.foo")
For equi-joins there exists a special shortcut syntax which takes either a sequence of strings:
val usingColumns: Seq[String] = ???
a.join(b, usingColumns)
or a single string:
val usingColumn: String = ???
a.join(b, usingColumn)
which keeps only one copy of the columns used in the join condition.
I have been stuck with this for a while, and only recently I came up with a solution that is quite easy.
Say a is
scala> val a = Seq(("a", 1), ("b", 2)).toDF("key", "vala")
a: org.apache.spark.sql.DataFrame = [key: string, vala: int]
scala> a.show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
and
scala> val b = Seq(("a", 1)).toDF("key", "valb")
b: org.apache.spark.sql.DataFrame = [key: string, valb: int]
scala> b.show
+---+----+
|key|valb|
+---+----+
| a| 1|
+---+----+
and I can do this to select only the value in dataframe a:
scala> a.join(b, a("key") === b("key"), "left").select(a.columns.map(a(_)) : _*).show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
You can simply use this
df1.join(df2, Seq("ts","id"),"TYPE-OF-JOIN")
Here TYPE-OF-JOIN can be
left
right
inner
fullouter
For example, I have two dataframes like this:
// df1
word count1
w1 10
w2 15
w3 20
// df2
word count2
w1 100
w2 150
w5 200
If you do fullouter join then the result looks like this
df1.join(df2, Seq("word"),"fullouter").show()
word count1 count2
w1 10 100
w2 15 150
w3 20 null
w5 null 200
try this,
val df_combined = df1.join(df2, df1("ts") === df2("ts") && df1("id") === df2("id")).drop(df2("ts")).drop(df2("id"))
This is normal behavior coming from SQL. What I do for this:
Drop or Rename source columns
Do the join
Drop renamed column if any
Here I am replacing "fullname" column:
Some code in Java:
this
    .sqlContext
    .read()
    .parquet(String.format("hdfs:///user/blablacar/data/year=%d/month=%d/day=%d", year, month, day))
    .drop("fullname")
    .registerTempTable("data_original");

this
    .sqlContext
    .read()
    .parquet(String.format("hdfs:///user/blablacar/data_v2/year=%d/month=%d/day=%d", year, month, day))
    .registerTempTable("data_v2");

this
    .sqlContext
    .sql(etlQuery)
    .repartition(1)
    .write()
    .mode(SaveMode.Overwrite)
    .parquet(outputPath);
Where the query is:
SELECT
d.*,
concat_ws('_', product_name, product_module, name) AS fullname
FROM
{table_source} d
LEFT OUTER JOIN
{table_updates} u ON u.id = d.id
This (dropping a column from a list) is something you can do only with Spark, I believe, and it is very helpful!
Inner join is the default join in Spark. Below is the simple syntax for it.
leftDF.join(rightDF, "Common Col Name")
For other joins you can follow the syntax below
leftDF.join(rightDF, Seq("common", "column", "names"), "joinType")
If the column names are not common, then
leftDF.join(rightDF, leftDF.col("x") === rightDF.col("y"), "joinType")
Best practice is to make the column names different in both DFs before joining them, and drop accordingly.
df1.columns = [id, age, income]
df2.columns = [id, age_group]
df1.join(df2, on=df1.id == df2.id, how='inner').write.saveAsTable('table_name')
will return an error for the duplicate columns.
Try this instead:
df2_id_renamed = df2.withColumnRenamed('id','id_2')
df1.join(df2_id_renamed, on=df1.id== df2_id_renamed.id_2,how='inner').drop('id_2')
If anyone is using Spark SQL and wants to achieve the same thing, you can use the USING clause in the join query.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df1 = List((1, 4, 3), (5, 2, 4), (7, 4, 5)).toDF("c1", "c2", "C3")
val df2 = List((1, 4, 3), (5, 2, 4), (7, 4, 10)).toDF("c1", "c2", "C4")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 using (c1, c2)").show(false)
/*
+---+---+---+---+
|c1 |c2 |C3 |C4 |
+---+---+---+---+
|1 |4 |3 |3 |
|5 |2 |4 |4 |
|7 |4 |5 |10 |
+---+---+---+---+
*/
After I've joined multiple tables together, I run them through a simple function to rename columns in the DF if it encounters duplicates. Alternatively, you could drop these duplicate columns too.
Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = deDupeDfCols(NamesAndDates, '_')
NamesAndDates.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where deDupeDfCols is defined as:
def deDupeDfCols(df, separator=''):
    newcols = []
    for col in df.columns:
        if col not in newcols:
            newcols.append(col)
        else:
            for i in range(2, 1000):
                if (col + separator + str(i)) not in newcols:
                    newcols.append(col + separator + str(i))
                    break
    return df.toDF(*newcols)
The resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Id2', 'Date', 'Description2'].
Apologies this answer is in Python - I'm not familiar with Scala, but this was the question that came up when I Googled this problem and I'm sure Scala code isn't too different.