I have two Spark DataFrames, and I want to add a new column named "seg" to DataFrame df2 based on the condition below:
if the df2.colx value is present in df1.colx.
I tried the operation below in PySpark, but it throws an exception.
cc002 = df2.withColumn('seg',F.when(df2.colx == df1.colx,"True").otherwise("FALSE"))
df1:
id colx coly
1 678 56789
2 900 67890
3 789 67854
df2:
Name colx
seema 900
yash 678
deep 800
harsh 900
My expected output is:
Name colx seg
seema 900 True
harsh 900 True
yash 678 True
deep 800 False
Please help me correct the given PySpark code or suggest a better way of doing it.
If I understand your question correctly, what you want to do is the following. Your withColumn call fails because you cannot reference a column of a different DataFrame (df1.colx) inside a transformation on df2; you need to join the two DataFrames instead:
res = df2.join(
df1,
on="colx",
how = "left"
).select(
"Name",
"colx"
).withColumn(
"seg",
F.when(F.col(colx).isNull(),F.lit(True)).otherwise(F.lit(False))
)
Let me know if this is the solution you want.
My bad, I wrote incorrect code in a hurry; below is the corrected version:
import pyspark.sql.functions as F
df1 = sqlContext.createDataFrame([[1,678,56789],[2,900,67890],[3,789,67854]],['id', 'colx', 'coly'])
df2 = sqlContext.createDataFrame([["seema",900],["yash",678],["deep",800],["harsh",900]],['Name', 'colx'])
res = df2.join(
    df1.withColumn(
        "check",   # marker column: stays non-null after the join only for matching colx values
        F.lit(1)
    ),
    on="colx",
    how="left"
).withColumn(
    "seg",
    F.when(F.col("check").isNotNull(), F.lit(True)).otherwise(F.lit(False))
).select(
    "Name",
    "colx",
    "seg"
)
res.show()
+-----+----+-----+
| Name|colx| seg|
+-----+----+-----+
| yash| 678| true|
|seema| 900| true|
|harsh| 900| true|
| deep| 800|false|
+-----+----+-----+
You can join on colx and fill null values with False:
result = (df2.join(df1.select(df1['colx'], F.lit(True).alias('seg')),
on='colx',
how='left')
.fillna(False, subset='seg'))
result.show()
Output:
+----+-----+-----+
|colx| Name| seg|
+----+-----+-----+
| 900|seema| true|
| 900|harsh| true|
| 800| deep|false|
| 678| yash| true|
+----+-----+-----+
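If df1 is small, another option (a sketch, not from the answers above) is to collect the lookup values on the driver and use isin, which avoids the join entirely; this assumes df1.colx fits comfortably in driver memory:
import pyspark.sql.functions as F
# Collect the distinct lookup values from df1 (only sensible when df1 is small)
colx_values = [row["colx"] for row in df1.select("colx").distinct().collect()]
df2.withColumn("seg", F.col("colx").isin(colx_values)).show()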
My DataFrame consists of the columns [rn, rn1, rn2].
The condition is: if rn is null, generate a random number between 0 and 1000 and assign that value to rn1 and rn2. Any suggestions, please?
I have tried all the possible options but could not figure it out, since I'm new to Azure. Please help.
from pyspark.sql.functions import rand, col, when, floor
data = [(None, 200, 1000), (2,300,400), (None, 300,500)]
df = spark.createDataFrame(data).toDF("rn","rn1","rn2")
>>> df.select("*").show()
+----+---+----+
| rn|rn1| rn2|
+----+---+----+
|null|200|1000|
| 2|300| 400|
|null|300| 500|
+----+---+----+
df.select(df.rn,
          when(
              df.rn.isNull(),       # condition
              floor(rand() * 1000)  # true value
          ).otherwise(
              df.rn1                # false value
          ).alias("rn1")).show()
+----+---+
| rn|rn1|
+----+---+
|null|545|
| 2|300|
|null|494|
+----+---+
Rinse and repeat for rn2.
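If the intent is that rn1 and rn2 should receive the same random value whenever rn is null, you can compute the random number once in a helper column and reuse it for both; a sketch (the helper column name rand_val is my own):
from pyspark.sql.functions import col, floor, rand, when
df_filled = (df.withColumn("rand_val", floor(rand() * 1000))  # one random value per row
               .withColumn("rn1", when(col("rn").isNull(), col("rand_val")).otherwise(col("rn1")))
               .withColumn("rn2", when(col("rn").isNull(), col("rand_val")).otherwise(col("rn2")))
               .drop("rand_val"))
df_filled.show()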
How can I use two dataframes and select elements of df2 if a value in df2's column is contained in a column of df1, and NA otherwise?
df2:
name
summer
winter
water
play
df1:
col1
play ground
winter cold
something
work
output:
col1 name
play ground play
winter cold winter
something NA
work NA
from pyspark.sql.functions import explode, split

# Create a match column by splitting col1 into individual words
df1 = df1.alias('df1').withColumn('col_new', explode(split('col1', r'\s')))

new = (df1.join(df2, how='left', on=df1.col_new == df2.name)  # merge on the match column
       .drop('col_new')                      # drop the match column introduced
       .orderBy([df2.name.desc(), 'name'])   # order so matched names come first
       .drop_duplicates(['col1'])            # keep one row per col1
      ).show()
+-----------+------+
| col1| name|
+-----------+------+
|play ground| play|
| something| null|
|winter cold|winter|
| work| null|
+-----------+------+
You can also use a contains condition directly in the join:
df = df1.join(df2, on=[df1.col1.contains(df2.name)], how='left')
df.show(truncate=False)
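Since the expected output uses the literal string NA for non-matches, you could follow the contains join with fillna; a sketch, assuming name is a string column:
df = (df1.join(df2, on=[df1.col1.contains(df2.name)], how='left')
         .fillna('NA', subset='name'))
df.show(truncate=False)
Note that a contains join is a non-equi join, so Spark typically executes it as a broadcast nested loop join; that is fine for small dataframes but can be expensive at scale.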
from pyspark.sql.functions import col, explode, split

df1 = spark.createDataFrame([("play ground",),("winter cold",),("something",),("work",)], ['col1'])
df2 = spark.createDataFrame([("summer",),("winter",),("play bc",),("play",)], ['name'])

# Split & explode column 'col1' of df1 into one word per row.
df1 = df1.withColumn('common_word', explode(split(col('col1'), r'\s')))
# Also split & explode column 'name' of df2.
df2 = df2.withColumn('common_word', explode(split(col('name'), r'\s')))
(
df1
.join(df2, ['common_word'], "left")
.sort('col1')
.fillna("NA")
.show()
)
+-----------+-----------+-------+
|common_word| col1| name|
+-----------+-----------+-------+
| ground|play ground| NA|
| play|play ground|play bc|
| play|play ground| play|
| something| something| NA|
| cold|winter cold| NA|
| winter|winter cold| winter|
| work| work| NA|
+-----------+-----------+-------+
I have 2 dataframes in PySpark,
df1 = spark.createDataFrame([
("s1", "artist1"),
("s2", "artist2"),
("s3", "artist3"),
],
['song_id', 'artist'])
df1.show()
df2 = spark.createDataFrame([
("s1", "2"),
("s1", "3"),
("s4", "4"),
("s4", "5")
],
['song_id', 'duration'])
df2.show()
Output:
+-------+-------+
|song_id| artist|
+-------+-------+
| s1|artist1|
| s2|artist2|
| s3|artist3|
+-------+-------+
+-------+--------+
|song_id|duration|
+-------+--------+
|     s1|       2|
|     s1|       3|
|     s4|       4|
|     s4|       5|
+-------+--------+
I apply a right_outer join and a left join on these two dataframes, and they both seem to give me the same result:
df1.join(df2, on="song_id", how="right_outer").show()
df2.join(df1, on="song_id", how="left").show()
Output:
+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
| s1|artist1| 2|
| s1|artist1| 3|
| s4| null| 4|
| s4| null| 5|
+-------+-------+--------+
+-------+--------+-------+
|song_id|duration| artist|
+-------+--------+-------+
| s1| 2|artist1|
| s1| 3|artist1|
| s4| 4| null|
| s4| 5| null|
+-------+--------+-------+
I am not sure how to use these 2 joins effectively.
What is the difference between these 2 joins?
The left and right joins give results based on the order of the tables relative to the join keyword.
left/leftouter/left_outer are all the same join type: they keep the whole left table plus the matching records of the right table.
right/rightouter/right_outer are likewise all the same: they keep the whole right table plus the matching records of the left table.
In the code
df1.join(df2, on="song_id", how="right_outer").show()
df1 is the left table (DataFrame), df2 is the right table, and the join type is right_outer, hence it shows all the rows of df2 and the matching rows of df1.
Similarly, in
df2.join(df1, on="song_id", how="left").show()
df2 is the left table, df1 is the right table, and the join type is left, so it shows all records of df2 and the matching records of df1.
Hence both pieces of code show the same result.
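As a quick check that the two queries return exactly the same rows (only the column order differs), you can compare them with exceptAll; a sketch reusing df1 and df2 from above:
right = df1.join(df2, on="song_id", how="right_outer").select("song_id", "artist", "duration")
left = df2.join(df1, on="song_id", how="left").select("song_id", "artist", "duration")
# exceptAll returns the rows of one result that are missing from the other;
# both directions are empty when the two results match exactly
print(right.exceptAll(left).count(), left.exceptAll(right).count())  # 0 0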
df1.join(df2, on="song_id", how="right_outer").show()
df1.join(df2, on="song_id", how="left").show()
In the above code, I have placed df1 as left table in both queries.
And here are the results:
+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
|     s1|artist1|       2|
|     s1|artist1|       3|
|     s4|   null|       4|
|     s4|   null|       5|
+-------+-------+--------+
+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
|     s1|artist1|       2|
|     s1|artist1|       3|
|     s2|artist2|    null|
|     s3|artist3|    null|
+-------+-------+--------+
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join
You can use this for reference.
I have a PySpark DataFrame (pyspark.sql.DataFrame) df:
id col1
1 abc
2 bcd
3 lal
4 bac
I want to add one more column, flag, to the df such that if id is an odd number, flag should be 'odd', and if even, 'even'.
The final output should be:
id col1 flag
1 abc odd
2 bcd even
3 lal odd
4 bac even
I tried:
def myfunc(num):
    if num % 2 == 0:
        flag = 'EVEN'
    else:
        flag = 'ODD'
    return flag
df['new_col'] = df['id'].map(lambda x: myfunc(x))
df['new_col'] = df['id'].apply(lambda x: myfunc(x))
It gave me this error: TypeError: 'Column' object is not callable
How do I use .apply (as I would on a pandas DataFrame) in PySpark?
PySpark doesn't provide apply; the alternative is the withColumn function:
from pyspark.sql import functions as F
df = sqlContext.createDataFrame([
[1,"abc"],
[2,"bcd"],
[3,"lal"],
[4,"bac"]
],
["id","col1"]
)
df.show()
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| bcd|
| 3| lal|
| 4| bac|
+---+----+
df.withColumn(
    "flag",
    F.when(F.col("id") % 2 == 0, F.lit("even")).otherwise(F.lit("odd"))
).show()
+---+----+----+
| id|col1|flag|
+---+----+----+
|  1| abc| odd|
|  2| bcd|even|
|  3| lal| odd|
|  4| bac|even|
+---+----+----+
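If you specifically want to reuse a plain Python function like myfunc (the closest thing to pandas' .apply), you can wrap it in a udf, although the built-in when/otherwise above is generally faster; a minimal sketch:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
# Wrap the existing Python function as a Spark UDF (row-at-a-time, so slower than built-ins)
myfunc_udf = F.udf(myfunc, StringType())
df.withColumn("flag", myfunc_udf(F.col("id"))).show()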
I have two dataframes in Scala:
df1 =
ID Field1
1 AAA
2 BBB
4 CCC
and
df2 =
PK start_date_time
1 2016-10-11 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.
I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.
The result should be this one:
df1 =
ID Field1 check
1 AAA 1
2 BBB 0
4 CCC 0
In my previous question I had two dataframes and it was suggested to use joining and filtering; however, in this case that won't work. My initial idea was to use a udf, but I'm not sure how to make it work for this case.
You can combine join and withColumn for this case, i.e. first join with df2 on the ID column and then use when/otherwise syntax to set the check column:
import org.apache.spark.sql.functions.{lit, to_date, when}
// the $"col" syntax also needs: import spark.implicits._ (already in scope in spark-shell)

val df2_date = df2.withColumn("date", to_date(df2("start_date_time"))).withColumn("check", lit(1)).select($"PK".as("ID"), $"date", $"check")

df1.join(df2_date, Seq("ID"), "left").withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0)).drop("date").show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
Or, as another option, first filter df2 and then join it back with df1 on the ID column:
val df2_date = (df2.withColumn("date", to_date(df2("start_date_time"))).
filter($"date" === "2016-10-11").
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
In case you have a date like 2016-OCT-11, you can convert it to a java.sql.Date for comparison as follows:
val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11