I have the two Spark dataframes shown below.
df1---->
ID col1 col2
---------------
001 abd xyz
002 eny opl
001 uyh ikl
003 ewr uji
002 opl rtn
001 jnu wbg
df2------>
ID col3 col4
-------------
001 acc1 jbo
002 acc1 unk
003 acc2 plo
004 acc3 edf
005 acc2 tgn
006 acc1 jhu
expected output--->
ID col1 col2 col3
---------------
001 abd xyz acc1
002 eny opl acc1
001 uyh ikl acc1
003 ewr uji acc2
002 opl rtn acc1
001 jnu wbg acc1
Can someone suggest a solution to obtain the expected output using PySpark?
Left join on ID:
df1.join(df2, ['ID'], 'left').drop('col4').show()
+---+----+----+----+
| ID|col1|col2|col3|
+---+----+----+----+
|001| abd| xyz|acc1|
|002| eny| opl|acc1|
|001| uyh| ikl|acc1|
|003| ewr| uji|acc2|
|002| opl| rtn|acc1|
|001| jnu| wbg|acc1|
+---+----+----+----+
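For reference, a minimal runnable sketch of the same join (assuming an active SparkSession; building df1 and df2 inline here is only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [("001", "abd", "xyz"), ("002", "eny", "opl"), ("001", "uyh", "ikl"),
     ("003", "ewr", "uji"), ("002", "opl", "rtn"), ("001", "jnu", "wbg")],
    ["ID", "col1", "col2"])
df2 = spark.createDataFrame(
    [("001", "acc1", "jbo"), ("002", "acc1", "unk"), ("003", "acc2", "plo"),
     ("004", "acc3", "edf"), ("005", "acc2", "tgn"), ("006", "acc1", "jhu")],
    ["ID", "col3", "col4"])

# left join keeps every row of df1; col4 is dropped because only col3 is wanted
df1.join(df2, on="ID", how="left").drop("col4").show()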
I have a table as:
COL1 COL2 COL3
---------------------
COMP 0005 2008-08-04
COMP 0009 2002-01-01
COMP 01.0 2002-01-01
COMP 0005 2008-01-01
COMP 0005 2001-10-20
CTEC 0009 2001-10-20
COMP 0005 2009-10-01
COMP 01.0 2003-07-01
COMP 02.0 2004-01-01
CTEC 0009 2021-09-24
First I want to partition the table on COL1, then partition again on COL2, and then sort COL3 in descending order. Then I'm trying to add a row number.
I write:
windowSpec = Window.partitionBy(col("COL1")).partitionBy(col("COl2")).orderBy(desc("COL3"))
TBL = TBL.withColumn(f"RANK", F.row_number().over(windowSpec))
My expected output is this:
COL1 COL2 COL3       RANK
--------------------------
COMP 0005 2009-10-01 1
COMP 0005 2008-08-04 2
COMP 0005 2008-01-01 3
COMP 0005 2001-10-20 4
COMP 0009 2002-01-01 1
COMP 01.0 2003-07-01 1
COMP 01.0 2002-01-01 2
COMP 02.0 2004-01-01 1
CTEC 0009 2021-09-24 1
CTEC 0009 2001-10-20 2
But the output I'm getting is like this:
COL1 COL2 COL3       RANK
--------------------------
COMP 0005 2009-10-01 1
COMP 0005 2008-08-04 2
COMP 0005 2008-01-01 3
COMP 0005 2001-10-20 4
COMP 0009 2002-01-01 2
COMP 01.0 2003-07-01 1
COMP 01.0 2002-01-01 2
COMP 02.0 2004-01-01 1
CTEC 0009 2021-09-24 1
CTEC 0009 2001-10-20 3
Can anyone please help me figure out where I'm making the mistake?
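For what it's worth, a minimal sketch of the likely fix (assuming the same TBL dataframe): chaining partitionBy() on a WindowSpec replaces the earlier partitioning rather than adding to it, so both columns should be passed in a single call.

import pyspark.sql.functions as F
from pyspark.sql import Window

# partition by both columns at once; a second partitionBy() call would
# overwrite this partitioning instead of extending it
windowSpec = Window.partitionBy("COL1", "COL2").orderBy(F.desc("COL3"))
TBL = TBL.withColumn("RANK", F.row_number().over(windowSpec))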
I have a dataset as indicated below. The first column is a date column, the second column is an id that can apply to any user (within the same department), and a given user can have multiple of these ids. The third column is specific to a particular user, but every so often this id can be changed in case a user forgets it. The last column comprises names, and individuals can have similar names. I want to harmonize the id columns (Id2 and id3) prior to analysis, as indicated below.
Input Dataset
Date       Id2 id3 Name
------------------------
2018-01-01 001 AA  Name1
2018-01-02 001 AB  Name1
2018-01-01 001 AC  Name1
2018-01-04 003 AA  Name1
2018-01-01 004 AA  Name1
2018-01-01 001 AD  Name2
2018-01-04 002 AE  Name3
2018-01-04 005 AG  Name1
Desired output:
Date       Id2 id3 Name
------------------------
2018-01-01 001 AA  Name1
2018-01-02 001 AA  Name1
2018-01-01 001 AA  Name1
2018-01-04 001 AA  Name1
2018-01-01 001 AA  Name1
2018-01-01 001 AD  Name2
2018-01-04 002 AE  Name3
2018-01-04 005 AG  Name1
So we have the table below, and you are trying to map each id2 to just one id3, since you said they can be mapped to multiple ids. This would be my approach.
+---+---+
|id2|id3|
+---+---+
|001| AA|
|001| AC|
|003| BB|
|003| DE|
+---+---+
First, I would create a flag column that marks which id3 to keep for each id2 that has multiple id3 values, using this code:
import pyspark.sql.functions as F
from pyspark.sql import Window

flag_df = dataframe.withColumn(
    "dedup_flag",
    F.dense_rank().over(
        Window.partitionBy('id2')
              .orderBy(F.col('id3').asc())  # ascending keeps the lowest id3; use .desc() for the highest
    )
)
which results in
+---+---+----------+
|id2|id3|dedup_flag|
+---+---+----------+
|001| AA|         1|
|001| AC|         2|
|003| BB|         1|
|003| DE|         2|
+---+---+----------+
Then all you need to do is filter where dedup_flag = 1 and do a left join between your original dataframe and the flag dataframe on id2, to pull in the retained id3 for each id2, with the code below.
final_df = dataframe.drop('id3').join(
    flag_df.filter(F.col('dedup_flag') == 1).select('id2', 'id3'),
    on='id2',
    how='left'
)
which results in
+---+---+
|id2|id3|
+---+---+
|001| AA|
|001| AA|
|003| BB|
|003| BB|
+---+---+
I hope this helped, acknowledge if this worked for you :) -Paula
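A simpler equivalent, if you only ever want the lowest id3 per id2 (a sketch reusing the same dataframe variable; canonical_df is just an illustrative name):

# take the minimum id3 per id2 and join it back onto the original rows
canonical_df = dataframe.groupBy('id2').agg(F.min('id3').alias('id3'))
final_df = dataframe.drop('id3').join(canonical_df, on='id2', how='left')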
Before I can help, I need to understand the question: so for each unique ID in column 2, generate a unique ID column?
I want to fill null values on a Spark df based on the values of the id column.
Pyspark df:
index id  animal name
---------------------
1     001 cat    doug
2     002 dog    null
3     001 cat    null
4     003 null   null
5     001 null   doug
6     002 null   bob
7     003 bird   larry
Expected result:
index id  animal name
---------------------
1     001 cat    doug
2     002 dog    bob
3     001 cat    doug
4     003 bird   larry
5     001 cat    doug
6     002 dog    bob
7     003 bird   larry
You can use last (or first) with a window function.
from pyspark.sql import Window
from pyspark.sql import functions as F
w = Window.partitionBy('id')
df = (df.withColumn('animal', F.last('animal', ignorenulls=True).over(w))
        .withColumn('name', F.last('name', ignorenulls=True).over(w)))
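The first() variant mentioned above behaves the same way here, because with no orderBy the window frame spans the whole id partition (a sketch on the same df):

# first() with ignorenulls also returns the single non-null value per id,
# since the unordered window covers the entire partition
df = (df.withColumn('animal', F.first('animal', ignorenulls=True).over(w))
        .withColumn('name', F.first('name', ignorenulls=True).over(w)))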
I have a table with the records:
ID | Value
1 | 999
1 | 005
1 | 001
2 | 003
2 | 999
3 | 999
3 | 999
I need to get the max value if it is different from 999, otherwise 999. Example of the result:
ID | Value
1 | 005
2 | 003
3 | 999
Based on your requirement, create a table:
Create Table dbo.tblMaxVal(id int, value int)
Insert your data into the table. Here is an example:
Select m.id, min(m.value) as value
from (
    Select t.id, Max(t.value) as Value
    from dbo.tblMaxVal t
    where t.value < (
        Select v.val as value
        from (
            Select id, Max(value) as val
            from dbo.tblMaxVal
            group by id
        ) as v
        where v.id = t.id
    )
    group by t.id
    union
    Select id, Max(value)
    from dbo.tblMaxVal
    group by id
) m
group by m.id
I know this is not an ideal query, but maybe it is helpful for you to understand what you need.
Result -
id | value
1 | 005
2 | 003
3 | 999
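For comparison, a PySpark sketch of the same logic (assuming the records are loaded into a dataframe named df with columns ID and Value) uses a conditional aggregate:

from pyspark.sql import functions as F

# max of the non-999 values per ID, falling back to 999 when every value is 999
result = df.groupBy('ID').agg(
    F.coalesce(F.max(F.when(F.col('Value') != 999, F.col('Value'))),
               F.lit(999)).alias('Value'))
result.show()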
Let's assume that I have a table in kdb+ named "Automotive" with the following data:
Manufacturer Country Sales Id
Mercedes United States 002
Mercedes Canada 002
Mercedes Germany 003
Mercedes Switzerland 003
Mercedes Japan 004
BMW United States 002
BMW Canada 002
BMW Germany 003
BMW Switzerland 003
BMW Japan 004
How would I structure a query in Q such that I can fetch the records matching United States and Canada without using an OR clause?
In SQL, it would look something like:
SELECT Manufacturer, Country from Automotive WHERE Country IN ('United States', 'Canada')
Thanks in advance for helping this Q beginner!
It's basically the same in kdb. How you write your query depends on the data type. See below an example where manufacturer is a symbol and country is a string.
q)tbl:([]manufacturer:`Merc`Merc`BMW`BMW`BMW;country:("United States";"Canada";"United States";"Germany";"Japan");ID:til 5)
q)
q)tbl
manufacturer country ID
-------------------------------
Merc "United States" 0
Merc "Canada" 1
BMW "United States" 2
BMW "Germany" 3
BMW "Japan" 4
q)meta tbl
c | t f a
------------| -----
manufacturer| s
country | C
ID | j
q)select from tbl where manufacturer in `Merc`Ford
manufacturer country ID
-------------------------------
Merc "United States" 0
Merc "Canada" 1
q)
q)select from tbl where country in ("United States";"Canada")
manufacturer country ID
-------------------------------
Merc "United States" 0
Merc "Canada" 1
BMW "United States" 2
Check out how to use Q-sql here: https://code.kx.com/q4m3/9_Queries_q-sql/