PySpark left join matches rows on incorrect joining key values - pyspark

I have two Spark DataFrames:
df1 with columns customerid, salary
df2 with columns customerid2, education
Example:
df1
| customerid | salary |
|------------|--------|
| C1 | 120 |
| C2 | 90 |
| C3 | 90 |
| C4 | 100 |
df2
| customerid2 | education |
|-------------|-----------|
| C1 | BA |
| C2 | BS |
| C5 | PhD |
| C4 | BS Physics|
I want a new DataFrame named df_new that left joins df1 with df2 on the keys customerid and customerid2, using the following code:
df_new = df1.join(df2, on=[df1.customerid == df2.customerid2], how='left')
Expected Output:
df_new
| customerid | salary | customerid2 | education |
|------------|--------|-------------|-----------|
| C1 | 120 | C1 | BA |
| C2 | 90 | C2 | BS |
| C3 | 90 | NULL | NULL |
| C4 | 100 | C4 | BS Physics|
Current Output:
df_new
| customerid | salary | customerid2 | education |
|------------|--------|-------------|-----------|
| C1 | 120 | C1 | BA |
| C2 | 90 | C5 | PhD | <-- Issue in this line
| C3 | 90 | NULL | NULL |
| C4 | 100 | C4 | BS Physics|
The issue is that for some of the records, the join matches rows even though the customer ID values are different.
I'd appreciate a response from this great community on this very strange issue.

Taking your data as an example, the join generates the expected output you posted:
>>> columns2 = ["customerid2","education"]
>>> data2=[("c1","BA"),("c2","BS"),("c5","phD"),("c4","BS Physics")]
>>> rdd2=sc.parallelize(data2)
>>> df2=rdd2.toDF(columns2)
>>> columns = ["customerid","salary"]
>>> data=[("c1","120"),("c2","90"),("c3","90"),("c4","100")]
>>> rdd=sc.parallelize(data)
>>> df1=rdd.toDF(columns)
>>> df_new = df1.join(df2,df1.customerid == df2.customerid2,"leftouter")
>>> df_new.show()
+----------+------+-----------+----------+
|customerid|salary|customerid2| education|
+----------+------+-----------+----------+
| c1| 120| c1| BA|
| c4| 100| c4|BS Physics|
| c3| 90| null| null|
| c2| 90| c2| BS|
+----------+------+-----------+----------+
Can you check whether any of the data contains leading or trailing spaces?
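If whitespace does turn out to be the culprit, a minimal sketch (assuming the column names from your example) is to trim the join keys before joining:
from pyspark.sql import functions as F

# Strip leading/trailing whitespace from the join keys
df1_clean = df1.withColumn("customerid", F.trim(F.col("customerid")))
df2_clean = df2.withColumn("customerid2", F.trim(F.col("customerid2")))

df_new = df1_clean.join(df2_clean, df1_clean.customerid == df2_clean.customerid2, "left")
df_new.show()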

Related

How to replace null values in a dataframe based on values in another dataframe?

Here's a dataframe, df1, I have
+---------+-------+---------+
| C1 | C2 | C3 |
+---------+-------+---------+
| xr | 1 | ixfg |
| we | 5 | jsfd |
| r5 | 7 | hfga |
| by | 8 | srjs |
| v4 | 4 | qwks |
| c0 | 0 | khfd |
| ba | 2 | gdbu |
| mi | 1 | pdlo |
| lp | 7 | ztpq |
+---------+-------+---------+
Here's another, df2, that I have
+----------+-------+---------+
| V1 | V2 | V3 |
+----------+-------+---------+
| Null | 6 | ixfg |
| Null | 2 | jsfd |
| Null | 2 | hfga |
| Null | 7 | qwks |
| Null | 1 | khfd |
| Null | 9 | gdbu |
+----------+-------+---------+
What I would like to have is another dataframe that
Ignores values in V2 and takes values in C2 wherever V3 and C3 match, and
Replaces V1 with values in C1 wherever V3 and C3 match.
The result should look like the following:
+----------+-------+---------+
| M1 | M2 | M3 |
+----------+-------+---------+
| xr | 1 | ixfg |
| we | 5 | jsfd |
| r5 | 7 | hfga |
| v4 | 4 | qwks |
| c0 | 0 | khfd |
| ba | 2 | gdbu |
+----------+-------+---------+
You can join and use coalesce to take the value with the higher priority.
Note: coalesce takes any number of columns (highest priority to lowest, in argument order) and returns the first non-null value, so if you want the result to be null whenever the lower-priority column is null, you cannot use this function.
from pyspark.sql import functions as F

df = (df1.join(df2, on=(df1.C3 == df2.V3))
      .select(F.coalesce(df1.C1, df2.V1).alias('M1'),
              F.coalesce(df1.C2, df2.V2).alias('M2'),
              F.col('C3').alias('M3')))
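As a small illustration of the priority ordering (a toy sketch with made-up values, assuming a SparkSession named spark):
from pyspark.sql import functions as F

demo = spark.createDataFrame([(None, "fallback"), ("primary", "fallback")], ["a", "b"])
# coalesce returns the first non-null argument, left to right
demo.select(F.coalesce("a", "b").alias("result")).show()
# +--------+
# |  result|
# +--------+
# |fallback|
# | primary|
# +--------+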

how to use update query in psql

I have a table (table name: test_table):
+------+
| Col1 |
+------+
| a1 |
| b1 |
| b1 |
| c1 |
| c1 |
| c1 |
+------+
I want to delete the duplicate rows, keeping one row from each set of duplicates, like this:
+------+
| Col1 |
+------+
| a1 |
| b1 |
| c1 |
+------+
So, my plan was:
make a row number
delete duplicates by row number
But I failed to generate the row numbers with this query:
ALTER TABLE test_table ADD COLUMN row_num INTEGER;
UPDATE test_table SET row_num = subquery.row_num
FROM (SELECT ROW_NUMBER() OVER () AS row_num
FROM test_table) AS subquery;
The result is below:
+------+---------+
| Col1 | row_num |
+------+---------+
| a1 | 1 |
| b1 | 1 |
| b1 | 1 |
| c1 | 1 |
| c1 | 1 |
| c1 | 1 |
+------+---------+
What do I need to change to get a result like this?
+------+---------+
| Col1 | row_num |
+------+---------+
| a1 | 1 |
| b1 | 2 |
| b1 | 3 |
| c1 | 4 |
| c1 | 5 |
| c1 | 6 |
+------+---------+

how to update a cell of a spark data frame

I have the following DataFrame on which I'm trying to update a cell depending on some conditions (like SQL UPDATE ... WHERE).
For example, let's say I have the following DataFrame:
+-------+-------+
|datas |isExist|
+-------+-------+
| AA | x |
| BB | x |
| CC | O |
| CC | O |
| DD | O |
| AA | x |
| AA | x |
| AA | O |
| AA | O |
+-------+-------+
How could I update the value to X when datas = AA and isExist is O? Here is the expected output:
+-------+-------+
|IPCOPE2|IPROPE2|
+-------+-------+
| AA | x |
| BB | x |
| CC | O |
| CC | O |
| DD | O |
| AA | x |
| AA | x |
| AA | X |
| AA | X |
+-------+-------+
I could do a filter and then a union, but I don't think that's the best solution. I could also use when, but in that case I would have to create a new row containing the same values except for the isExist column; in this example that's an acceptable solution, but what if I have 20 columns!
You can create a new column using withColumn (putting either the original or the updated value) and then drop the isExist column.
I am not sure why you do not want to use when, as it seems to be exactly what you need. The withColumn method, when used with an existing column name, will simply replace that column with the new value:
df.withColumn("isExist",
when('datas === "AA" && 'isExist === "O", "X").otherwise('isExist))
.show()
+-----+-------+
|datas|isExist|
+-----+-------+
| AA| x|
| BB| x|
| CC| O|
| CC| O|
| DD| O|
| AA| x|
| AA| x|
| AA| X|
| AA| X|
+-----+-------+
Then you can use withColumnRenamed to change the names of your columns. (e.g. df.withColumnRenamed("datas", "IPCOPE2"))
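If you are working in PySpark rather than Scala, an equivalent sketch (assuming the same column names as above) would be:
from pyspark.sql import functions as F

# Rebuild isExist: X when datas is AA and isExist is O, otherwise keep the existing value
df_updated = df.withColumn(
    "isExist",
    F.when((F.col("datas") == "AA") & (F.col("isExist") == "O"), "X").otherwise(F.col("isExist"))
)
# Then rename the columns to match the expected output
df_updated = df_updated.withColumnRenamed("datas", "IPCOPE2").withColumnRenamed("isExist", "IPROPE2")
df_updated.show()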

Create dummy variables frame pyspark

I have a spark data frame like:
|---------------------|------------------------------|
| Brand | Model |
|---------------------|------------------------------|
| Hyundai | Elentra,Creta |
|---------------------|------------------------------|
| Hyundai | Creta,Grand i10,Verna |
|---------------------|------------------------------|
| Maruti | Eritga,S-cross,Vitara Brezza|
|---------------------|------------------------------|
| Maruti | Celerio,Eritga,Ciaz |
|---------------------|------------------------------|
I want a data frame like this:
|---------------------|---------|--------|--------------|--------|---------|
| Brand | Model0 | Model1 | Model2 | Model3 | Model4 |
|---------------------|---------|--------|--------------|--------|---------|
| Hyundai | Elentra | Creta | Grand i10 | Verna | null |
|---------------------|---------|--------|--------------|--------|---------|
| Maruti | Ertiga | S-Cross| Vitara Brezza| Celerio| Ciaz |
|---------------------|---------|--------|--------------|--------|---------|
I have used this code:
from pyspark.sql.types import StructType, StructField, StringType
import pyspark.sql.functions as f

schema = StructType([
    StructField("Brand", StringType()), StructField("Model", StringType())])
tempCSV = spark.read.csv("PATH\\Cars.csv", sep='|', schema=schema)
tempDF = tempCSV.select(
        "Brand",
        f.split("Model", ",").alias("Model"),
        f.posexplode(f.split("Model", ",")).alias("pos", "val")
    )\
    .drop("val")\
    .select(
        "Brand",
        f.concat(f.lit("Model"), f.col("pos").cast("string")).alias("name"),
        f.expr("Model[pos]").alias("val")
    )\
    .groupBy("Brand").pivot("name").agg(f.first("val")).toPandas()
But I'm not getting the desired result. Instead of the second table above, it's giving:
|---------------------|---------|--------|--------------|
| Brand | Model0 | Model1 | Model2 |
|---------------------|---------|--------|--------------|
| Hyundai | Elentra | Creta | Grand i10 |
|---------------------|---------|--------|--------------|
| Maruti | Ertiga | S-Cross| Vitara Brezza|
|---------------------|---------|--------|--------------|
Thanks in advance.
This is happening because you are pivoting the data on pos, which has repeated values within the same Brand group.
You can use row_number() and then pivot your data to generate the desired result.
Here is some sample code on top of the data you have provided:
import pyspark.sql.functions as f

df = sqlContext.createDataFrame([("Hyundai", "Elentra,Creta"), ("Hyundai", "Creta,Grand i10,Verna"), ("Maruti", "Eritga,S-cross,Vitara Brezza"), ("Maruti", "Celerio,Eritga,Ciaz")], ("Brand", "Model"))
tmpDf = df.select("Brand", f.split("Model", ",").alias("Model"), f.posexplode(f.split("Model", ",")).alias("pos", "val"))
tmpDf.createOrReplaceTempView("tbl")
seqDf = sqlContext.sql("select Brand, Model, pos, val, row_number() over(partition by Brand order by pos) as rnk from tbl")
seqDf.groupBy('Brand').pivot('rnk').agg(f.first('val')).show()
This will generate the following result:
+-------+-------+-------+-------+---------+-------------+----+
| Brand| 1| 2| 3| 4| 5| 6|
+-------+-------+-------+-------+---------+-------------+----+
| Maruti| Eritga|Celerio|S-cross| Eritga|Vitara Brezza|Ciaz|
|Hyundai|Elentra| Creta| Creta|Grand i10| Verna|null|
+-------+-------+-------+-------+---------+-------------+----+
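If you also want the pivoted columns to be named Model0, Model1, ... as in your desired output, one possible follow-up on top of seqDf above (a sketch; deduplicating repeated models within a brand is a separate step) is:
pivotDf = seqDf.groupBy('Brand').pivot('rnk').agg(f.first('val'))
# Rename the numeric pivot columns 1..N to Model0..Model(N-1)
renamed = pivotDf.select(
    'Brand',
    *[pivotDf[c].alias('Model{}'.format(int(c) - 1)) for c in pivotDf.columns if c != 'Brand']
)
renamed.show()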

In Spark scala, how to check between adjacent rows in a dataframe

How can I compare the dates of adjacent rows (preceding and next) in a DataFrame? This should happen at a key level.
I have the following data after sorting on key and dates:
source_Df.show()
+-----+--------+------------+------------+
| key | code | begin_dt | end_dt |
+-----+--------+------------+------------+
| 10 | ABC | 2018-01-01 | 2018-01-08 |
| 10 | BAC | 2018-01-03 | 2018-01-15 |
| 10 | CAS | 2018-01-03 | 2018-01-21 |
| 20 | AAA | 2017-11-12 | 2018-01-03 |
| 20 | DAS | 2018-01-01 | 2018-01-12 |
| 20 | EDS | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
When the dates of these rows overlap (i.e. the current row's begin_dt falls between the begin and end dates of the previous row), I need all such rows to get the lowest begin date and the highest end date.
Here is the output I need:
final_Df.show()
+-----+--------+------------+------------+
| key | code | begin_dt | end_dt |
+-----+--------+------------+------------+
| 10 | ABC | 2018-01-01 | 2018-01-21 |
| 10 | BAC | 2018-01-01 | 2018-01-21 |
| 10 | CAS | 2018-01-01 | 2018-01-21 |
| 20 | AAA | 2017-11-12 | 2018-01-12 |
| 20 | DAS | 2017-11-12 | 2018-01-12 |
| 20 | EDS | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
Appreciate any ideas to achieve this. Thanks in advance!
Here's one approach:
Create new column group_id with null value if begin_dt is within date range from the previous row; otherwise a unique integer
Backfill nulls in group_id with the last non-null value
Compute min(begin_dt) and max(end_dt) within each (key, group_id) partition
Example below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // for the $ column syntax and toDF

val df = Seq(
  (10, "ABC", "2018-01-01", "2018-01-08"),
  (10, "BAC", "2018-01-03", "2018-01-15"),
  (10, "CAS", "2018-01-03", "2018-01-21"),
  (20, "AAA", "2017-11-12", "2018-01-03"),
  (20, "DAS", "2018-01-01", "2018-01-12"),
  (20, "EDS", "2018-02-01", "2018-02-16")
).toDF("key", "code", "begin_dt", "end_dt")

val win1 = Window.partitionBy($"key").orderBy($"begin_dt", $"end_dt")
val win2 = Window.partitionBy($"key", $"group_id")

df.
  // 1. group_id: null if begin_dt falls within the previous row's range, otherwise a unique id
  withColumn("group_id", when(
      $"begin_dt".between(lag($"begin_dt", 1).over(win1), lag($"end_dt", 1).over(win1)), null
    ).otherwise(monotonically_increasing_id())
  ).
  // 2. backfill nulls with the last non-null group_id
  withColumn("group_id", last($"group_id", ignoreNulls = true).
    over(win1.rowsBetween(Window.unboundedPreceding, 0))
  ).
  // 3. min/max dates within each (key, group_id) partition
  withColumn("begin_dt2", min($"begin_dt").over(win2)).
  withColumn("end_dt2", max($"end_dt").over(win2)).
  orderBy("key", "begin_dt", "end_dt").
  show
// +---+----+----------+----------+-------------+----------+----------+
// |key|code| begin_dt| end_dt| group_id| begin_dt2| end_dt2|
// +---+----+----------+----------+-------------+----------+----------+
// | 10| ABC|2018-01-01|2018-01-08|1047972020224|2018-01-01|2018-01-21|
// | 10| BAC|2018-01-03|2018-01-15|1047972020224|2018-01-01|2018-01-21|
// | 10| CAS|2018-01-03|2018-01-21|1047972020224|2018-01-01|2018-01-21|
// | 20| AAA|2017-11-12|2018-01-03| 455266533376|2017-11-12|2018-01-12|
// | 20| DAS|2018-01-01|2018-01-12| 455266533376|2017-11-12|2018-01-12|
// | 20| EDS|2018-02-01|2018-02-16| 455266533377|2018-02-01|2018-02-16|
// +---+----+----------+----------+-------------+----------+----------+