How to join two dataframes in PySpark with a special condition? - pyspark

I have 2 dataframes that I want to join with respect to 3 columns. The first column must be an exact match; the second column must be an exact match only if the value in df1 is not NA; the third column is not an exact match but an inclusion test:
join a row of df1 with a row of df2 if col1 in df1 and col1 in df2 have the same value, the col3 value of df1 is contained in the col3 value of df2, and col2 in df1, when it is not NA, has the same value as col2 in df2 (if col2 is NA but the other 2 conditions hold, match them anyway).
Example
df1:
col1  col2       col3
us    NA         amazon
ca    Vancouver  Facebook
IN    Ottawa     IBM
df2:
col1  col2       col3
ca    Vancouver  /n Facebook us
IN    Boston     IBM
us    new york   amazon IN
output:
col1  col2       col3
ca    Vancouver  Facebook
us    new york   amazon

from pyspark.sql import functions as F

df1 = spark.createDataFrame(
    [
        ('us', None, 'amazon'),
        ('ca', 'Vancouver', 'Facebook'),
        ('IN', 'Ottawa', 'IBM'),
    ],
    ['col1', 'col2', 'col3']
)
df2 = spark.createDataFrame(
    [
        ('ca', 'Vancouver', '/n Facebook us'),
        ('IN', 'Boston', 'IBM'),
        ('us', 'new york', 'amazon IN'),
    ],
    ['col1', 'col2', 'col3']
)
df1.show()
df2.show()
# +----+---------+--------+
# |col1| col2| col3|
# +----+---------+--------+
# | us| null| amazon|
# | ca|Vancouver|Facebook|
# | IN| Ottawa| IBM|
# +----+---------+--------+
# +----+---------+--------------+
# |col1| col2| col3|
# +----+---------+--------------+
# | ca|Vancouver|/n Facebook us|
# | IN| Boston| IBM|
# | us| new york| amazon IN|
# +----+---------+--------------+
df1\
    .join(df2, 'col1')\
    .withColumn('flag',
                # col3 of df1 must be contained in col3 of df2, and col2
                # must either be null in df1 or equal in both frames
                F.when(df2.col3.contains(df1.col3) &
                       (df1.col2.isNull() | (df1.col2 == df2.col2)), 1)
                 .otherwise(0))\
    .filter(F.col('flag') == 1)\
    .select(df1.col1, df2.col2, df1.col3)\
    .show()
# +----+---------+--------+
# |col1| col2| col3|
# +----+---------+--------+
# | ca|Vancouver|Facebook|
# | us| new york| amazon|
# +----+---------+--------+
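A variant of the same logic (my sketch, not part of the original answer) folds all three conditions into the join condition itself, so no flag column or post-filter is needed:

cond = (
    (df1.col1 == df2.col1)
    & df2.col3.contains(df1.col3)                   # col3 inclusion
    & (df1.col2.isNull() | (df1.col2 == df2.col2))  # col2 match or NA
)
df1.join(df2, cond).select(df1.col1, df2.col2, df1.col3).show()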

Related

filter records in a dataframe based on a list of values

I have the below scenario.
li = ['g1','g2','g3']
df1:
id  name   goal
1   raj    g1
2   harsh  g3/g1
3   ramu   g1
Above you can see the dataframe df1 and the list li.
I want to filter the records in df1 based on the list values in li, but as you can see, the goal column first needs to be split on the / delimiter, and I am getting an error. I tried
df1 = df1.filter(~df1.goal.isin(li))
but this is not returning any records...
Is there any way to get the records?
Using this example:
from pyspark.sql import functions as F
from pyspark.sql.types import *
li = ['g1','g2','g3']
df1 = spark.createDataFrame(
    [
        ('1', 'raj', 'g1'),
        ('2', 'harsh', 'g3/g1'),
        ('3', 'ramu', 'g1'),
        ('4', 'luiz', 'g2/g4')
    ],
    ["id", "name", "goal"]
)
df1.show()
# +---+-----+-----+
# | id| name| goal|
# +---+-----+-----+
# | 1| raj| g1|
# | 2|harsh|g3/g1|
# | 3| ramu| g1|
# | 4| luiz|g2/g4|
# +---+-----+-----+
You can use split to split the goal column and then array_except to find which records are not in your list:
result = df1\
    .withColumn('goal_split', F.split(F.col('goal'), '/'))\
    .withColumn('li', F.array([F.lit(x) for x in li]))\
    .withColumn('test', F.array_except('goal_split', 'li'))\
    .filter(F.size(F.col('test')) == 0)  # keep rows with no goal outside li
result.show()
# +---+-----+-----+----------+------------+----+
# | id| name| goal|goal_split| li|test|
# +---+-----+-----+----------+------------+----+
# | 1| raj| g1| [g1]|[g1, g2, g3]| []|
# | 2|harsh|g3/g1| [g3, g1]|[g1, g2, g3]| []|
# | 3| ramu| g1| [g1]|[g1, g2, g3]| []|
# +---+-----+-----+----------+------------+----+
Then, select the columns you want for the result:
result.select('id', 'name', 'goal').show()
# +---+-----+-----+
# | id| name| goal|
# +---+-----+-----+
# | 1| raj| g1|
# | 2|harsh|g3/g1|
# | 3| ramu| g1|
# +---+-----+-----+
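If you don't need the intermediate columns, the same idea fits in a single filter (a compact sketch of the answer above) by combining split, array_except and size:

from pyspark.sql import functions as F

li_arr = F.array([F.lit(x) for x in li])
result = df1.filter(F.size(F.array_except(F.split(F.col('goal'), '/'), li_arr)) == 0)
result.show()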

merge rows in a dataframe by id trying to avoid null values in columns (Spark scala)

I am developing in Spark scala, and I would like to merge some rows in a dataframe...
My dataframe is the following:
+-------------------------+-------------------+---------------+------------------------------+
|name |col1 |col2 |col3 |
+-------------------------+-------------------+---------------+------------------------------+
| a | null| null| 0.000000|
| a | 0.000000| null| null|
| b | null| null| 0.000000|
| b | 300.000000| null| null|
+-------------------------+-------------------+---------------+------------------------------+
And I want to turn it into the following dataframe:
+-------------------------+-------------------+---------------+------------------------------+
|name |col1 |col2 |col3 |
+-------------------------+-------------------+---------------+------------------------------+
| a | 0.000000| null| 0.000000|
| b | 300.000000| null| 0.000000|
+-------------------------+-------------------+---------------+------------------------------+
Take into account:
- Some columns can have all values null.
- There can be many columns in the dataframe.
As far as I know, I have to use groupBy with agg(), but I am unable to get the correct expression:
df.groupBy("name").agg()
If "merge" means sum, column list can be received from dataframe schema and included into "agg":
val df = Seq(
  ("a", Option.empty[Double], Option.empty[Double], Some(0.000000)),
  ("a", Some(0.000000), Option.empty[Double], Option.empty[Double]),
  ("b", Option.empty[Double], Option.empty[Double], Some(0.000000)),
  ("b", Some(300.000000), Option.empty[Double], Option.empty[Double])
).toDF(
  "name", "col1", "col2", "col3"
)

// Build one sum(...) aggregate per column except the grouping key
val columnsToMerge = df
  .columns
  .filterNot(_ == "name")
  .map(c => sum(c).alias(c))

df.groupBy("name")
  .agg(columnsToMerge.head, columnsToMerge.tail: _*)
Result:
+----+-----+----+----+
|name|col1 |col2|col3|
+----+-----+----+----+
|a |0.0 |null|0.0 |
|b |300.0|null|0.0 |
+----+-----+----+----+
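If a name can carry several different non-null values and summing them is not what you mean by "merge", first with ignoreNulls is a closer fit. A sketch of that variant in PySpark (Scala's first(col, ignoreNulls = true) is analogous):

from pyspark.sql import functions as F

# one first(..., ignorenulls=True) aggregate per non-key column
merge_cols = [F.first(c, ignorenulls=True).alias(c)
              for c in df.columns if c != 'name']
df.groupBy('name').agg(*merge_cols).show()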
You can use groupby('name') as you suggest, and then ffill() + bfill() (note that this answer uses pandas, not Spark):
df = df.groupby('name').ffill().bfill().drop_duplicates(keep='first')
If you want to keep the name column you can use pandas update():
df.update(df.groupby('name').ffill().bfill())
df.drop_duplicates(keep='first', inplace=True)
Result df:
name  col1  col2  col3
a     0     NaN   0
b     300   NaN   0

Merge many dataframes into one in Pyspark [non pandas df]

I will be getting dataframes generated one by one through a process. I have to merge them into one.
+--------+----------+
| Name|Age |
+--------+----------+
|Alex | 30|
+--------+----------+
+--------+----------+
| Name|Age |
+--------+----------+
|Earl | 32|
+--------+----------+
+--------+----------+
| Name|Age |
+--------+----------+
|Jane | 15|
+--------+----------+
Finally:
+--------+----------+
| Name|Age |
+--------+----------+
|Alex | 30|
+--------+----------+
|Earl | 32|
+--------+----------+
|Jane | 15|
+--------+----------+
I tried many options like concat, merge, and append, but I guess those are all pandas functions. I am not using pandas. I am using Python 2.7 and Spark 2.2.
Edited to cover final scenario with foreachpartition:
l = [('Alex', 30)]
k = [('Earl', 32)]
ldf = spark.createDataFrame(l, ('Name', 'Age'))
ldf = spark.createDataFrame(k, ('Name', 'Age'))
# option 1:
union_df(ldf).show()
#option 2:
uxdf = union_df(ldf)
uxdf.show()
output in both cases:
+-------+---+
| Name|Age|
+-------+---+
|Earl | 32|
+-------+---+
You can use unionAll() for dataframes:
from functools import reduce  # For Python 3.x
from pyspark.sql import DataFrame

def unionAll(*dfs):
    return reduce(DataFrame.union, dfs)
df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))
unionAll(df1, df2, df3).show()
## +---+----+
## | k| v|
## +---+----+
## | 1|foo1|
## | 2|bar1|
## | 3|foo2|
## | 4|bar2|
## | 5|foo3|
## | 6|bar3|
## +---+----+
EDIT:
You can create an empty dataframe and keep unioning new dataframes to it:
# Create first dataframe
ldf = spark.createDataFrame(l, ["Name", "Age"])
ldf.show()
# Save its schema
schema = ldf.schema
# Create an empty DF with the same schema (a schema is required to create an empty dataframe)
empty_df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
empty_df.show()
# Union the first DF with the empty df
empty_df = empty_df.union(ldf)
empty_df.show()
# New dataframe after some operations
ldf = spark.createDataFrame(k, schema)
# Union with the empty_df again
empty_df = empty_df.union(ldf)
empty_df.show()
# First DF ldf
+----+---+
|Name|Age|
+----+---+
|Alex| 30|
+----+---+
# Empty dataframe empty_df
+----+---+
|Name|Age|
+----+---+
+----+---+
# After first union empty_df.union(ldf)
+----+---+
|Name|Age|
+----+---+
|Alex| 30|
+----+---+
# After second union with new ldf
+----+---+
|Name|Age|
+----+---+
|Alex| 30|
|Earl| 32|
+----+---+
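For the "generated one by one" scenario it can be simpler to collect each dataframe in a list and union once at the end; a sketch reusing l and k from above (the loop source is a stand-in for your real process):

from functools import reduce
from pyspark.sql import DataFrame

dfs = []
for rows in [l, k]:  # stand-in for dataframes arriving one by one
    dfs.append(spark.createDataFrame(rows, ('Name', 'Age')))
result = reduce(DataFrame.union, dfs)
result.show()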

Spark (scala) dataframes - Check whether strings in column exist in a column of another dataframe

I have a Spark dataframe, and I wish to check whether each string in a particular column exists in a pre-defined column of another dataframe.
I found a similar problem in Spark (scala) dataframes - Check whether strings in column contain any items from a set,
but I want to check whether the strings in a column exist in a column of another dataframe, not in a List or a Set as in that question. Can anyone help me? I don't know how to convert a column to a Set or a List, and I don't know of an "exists" method on a dataframe.
My data is similar to this:
df1:
+---+-----------------+
| id| url |
+---+-----------------+
| 1|google.com |
| 2|facebook.com |
| 3|github.com |
| 4|stackoverflow.com|
+---+-----------------+
df2:
+-----+------------+
| id | urldetail |
+-----+------------+
| 11 |google.com |
| 12 |yahoo.com |
| 13 |facebook.com|
| 14 |twitter.com |
| 15 |youtube.com |
+-----+------------+
Now, I am trying to create a third column with the result of a comparison, to see whether each string in the $"urldetail" column exists in $"url":
+---+------------+-------------+
| id| urldetail | check |
+---+------------+-------------+
| 11|google.com | 1 |
| 12|yahoo.com | 0 |
| 13|facebook.com| 1 |
| 14|twitter.com | 0 |
| 15|youtube.com | 0 |
+---+------------+-------------+
I want to use a UDF, but I don't know how to check whether a string exists in a column of a dataframe! Please help me!
I have a spark dataframe, and I wish to check whether each string in a
particular column contains any number of words from a pre-defined
column of another dataframe.
Here is the way, using = or like:
package examples

import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, _}

object CompareColumns extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local").getOrCreate()

  import spark.implicits._

  val df1 = Seq(
    (1, "google.com"),
    (2, "facebook.com"),
    (3, "github.com"),
    (4, "stackoverflow.com")).toDF("id", "url").as("first")
  df1.show

  val df2 = Seq(
    (11, "google.com"),
    (12, "yahoo.com"),
    (13, "facebook.com"),
    (14, "twitter.com")).toDF("id", "url").as("second")
  df2.show

  val df3 = df2.join(df1, expr("first.url like second.url"), "full_outer").select(
    col("first.url"),
    col("first.url").contains(col("second.url")).as("check")).filter("url is not null")

  df3.na.fill(Map("check" -> false))
    .show
}
Result :
+---+-----------------+
| id| url|
+---+-----------------+
| 1| google.com|
| 2| facebook.com|
| 3| github.com|
| 4|stackoverflow.com|
+---+-----------------+
+---+------------+
| id| url|
+---+------------+
| 11| google.com|
| 12| yahoo.com|
| 13|facebook.com|
| 14| twitter.com|
+---+------------+
+-----------------+-----+
| url|check|
+-----------------+-----+
| google.com| true|
| facebook.com| true|
| github.com|false|
|stackoverflow.com|false|
+-----------------+-----+
With a full outer join we can achieve this.
For more details on all the joins, see my article in my LinkedIn post.
Note: instead of 0 for false and 1 for true I have used boolean
conditions here; you can translate them into whatever you want.
UPDATE: if rows keep being added to the second dataframe, you can use this; it won't miss any rows from the second:
val df3 = df2.join(df1, expr("first.url like second.url"), "full")
  .select(
    col("second.*"),
    col("first.url").contains(col("second.url")).as("check"))
  .filter("url is not null")
df3.na.fill(Map("check" -> false))
  .show
Also, you can try regexp_extract as shown in this post:
https://stackoverflow.com/a/53880542/647053
Read in your data and use the trim operation, just to be conservative when joining on the strings, to remove the whitespace:
val df= Seq((1,"google.com"), (2,"facebook.com"), ( 3,"github.com "), (4,"stackoverflow.com")).toDF("id", "url").select($"id", trim($"url").as("url"))
val df2 =Seq(( 11 ,"google.com"), (12 ,"yahoo.com"), (13 ,"facebook.com"),(14 ,"twitter.com"),(15,"youtube.com")).toDF( "id" ,"urldetail").select($"id", trim($"urldetail").as("urldetail"))
df.join(df2.withColumn("flag", lit(1)).drop("id"), (df("url")===df2("urldetail")), "left_outer").withColumn("contains_bool",
when($"flag"===1, true) otherwise(false)).drop("flag","urldetail").show
+---+-----------------+-------------+
| id| url|contains_bool|
+---+-----------------+-------------+
| 1| google.com| true|
| 2| facebook.com| true|
| 3| github.com| false|
| 4|stackoverflow.com| false|
+---+-----------------+-------------+
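If you want the output in the orientation the question asked for (df2 plus a 0/1 check column), here is a PySpark sketch of the same left-join-and-flag trick (my adaptation, not from the answers above):

from pyspark.sql import functions as F

urls = df1.select(F.col('url').alias('u')).distinct().withColumn('flag', F.lit(1))
result = (df2
          .join(urls, F.col('urldetail') == F.col('u'), 'left')
          .withColumn('check', F.coalesce(F.col('flag'), F.lit(0)))  # null flag -> 0
          .drop('u', 'flag'))
result.show()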

Get distinct values of specific column with max of different columns

I have the following DataFrame
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 6|null|null|
| B|null| 5|null|
| C|null|null| 7|
| B|null|null| 4|
| B|null| 2|null|
| B|null| 1|null|
| A| 4|null|null|
+----+----+----+----+
What I would like to do in Spark is to return all entries of col1 whose row holds the maximum value of one of the columns col2, col3 or col4.
This snippet won't do what I want:
df.groupBy("col1").max("col2","col3","col4").show()
And this one gives the max for only one column (1):
df.groupBy("col1").max("col2").show()
I even tried to merge the single outputs by this:
// merge rows
val rows = test1.rdd.zip(test2.rdd).map {
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)
}
// merge schemas
val schema = StructType(test1.schema.fields ++ test2.schema.fields)
// create new df
val test3: DataFrame = sqlContext.createDataFrame(rows, schema)
where test1 and test2 are DataFrames obtained with queries like (1).
So how do I achieve this nicely?
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 6|null|null|
| B|null| 5|null|
| C|null|null| 7|
+----+----+----+----+
Or even only the distinct values:
+----+
|col1|
+----+
| A|
| B|
| C|
+----+
Thanks in advance! Best
You can use something like the below:
sqlcontext.sql("""
    select x.* from table_name x,
        (select max(col2) as a, max(col3) as b, max(col4) as c from table_name) temp
    where a = x.col2 or b = x.col3 or c = x.col4
""")
This will give the desired result.
It can be solved like this:
df.registerTempTable("temp")
spark.sql("SELECT max(col2) AS max2, max(col3) AS max3, max(col4) AS max4 FROM temp").registerTempTable("max_temp")
spark.sql("SELECT col1 FROM temp, max_temp WHERE col2 = max2 OR col3 = max3 OR col4 = max4").show