PySpark: How to fill null values based on values in another column

I want to fill null values in a Spark DataFrame based on the values of the id column.
Pyspark df:
index id  animal name
1     001 cat    doug
2     002 dog    null
3     001 cat    null
4     003 null   null
5     001 null   doug
6     002 null   bob
7     003 bird   larry
Expected result:
index id  animal name
1     001 cat    doug
2     002 dog    bob
3     001 cat    doug
4     003 bird   larry
5     001 cat    doug
6     002 dog    bob
7     003 bird   larry

You can use last (or first) with ignorenulls=True over a window partitioned by id.
from pyspark.sql import Window
from pyspark.sql import functions as F

# With no ordering on the window, the frame spans the whole partition, so
# last(..., ignorenulls=True) returns the non-null value within each id group.
w = Window.partitionBy('id')
df = (df.withColumn('animal', F.last('animal', ignorenulls=True).over(w))
      .withColumn('name', F.last('name', ignorenulls=True).over(w)))
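If more columns need the same treatment, a loop over the column names keeps this compact; a minimal sketch of the same approach (the column list here is illustrative):

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('id')
# Fill each listed column from the non-null value found within its id group.
for c in ['animal', 'name']:
    df = df.withColumn(c, F.last(c, ignorenulls=True).over(w))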

Related

How to append a string to a DataFrame column which is a list in PySpark

I have a dataframe with one column, "value", that contains a list of strings, for example:
id value
001 ["abc", "abd"]
002 ["xyz"]
003 []
I need to append another string to "value"; the result would be:
id value
001 ["abc", "abd", "new"]
002 ["xyz", "new"]
003 ["new"]
Anyone know how to achieve this in PySpark? Thanks.
You can concat the array with an array of one literal string.
import pyspark.sql.functions as F
df.withColumn('value', F.concat(F.col('value'), F.array(F.lit('new'))))
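If you are on Spark 3.4 or later, array_append does the same in a single call (it is not available in earlier versions):

import pyspark.sql.functions as F
df.withColumn('value', F.array_append(F.col('value'), F.lit('new')))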

Lookup table and get column from another table using PySpark

I have the two Spark DataFrames below.
df1---->
ID col1 col2
---------------
001 abd xyz
002 eny opl
001 uyh ikl
003 ewr uji
002 opl rtn
001 jnu wbg
df2------>
ID col3 col4
-------------
001 acc1 jbo
002 acc1 unk
003 acc2 plo
004 acc3 edf
005 acc2 tgn
006 acc1 jhu
expected output--->
ID col1 col2 col3
---------------
001 abd xyz acc1
002 eny opl acc1
001 uyh ikl acc1
003 ewr uji acc2
002 opl rtn acc1
001 jnu wbg acc1
Can someone suggest a solution to obtain the expected output using PySpark?
Left join on ID:
df1.join(df2, ['ID'], 'left').drop('col4').show()
+---+----+----+----+
| ID|col1|col2|col3|
+---+----+----+----+
|001| abd| xyz|acc1|
|002| eny| opl|acc1|
|001| uyh| ikl|acc1|
|003| ewr| uji|acc2|
|002| opl| rtn|acc1|
|001| jnu| wbg|acc1|
+---+----+----+----+
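An equivalent alternative is to select only the needed column from df2 before joining, which avoids the drop and keeps the join payload small:

df1.join(df2.select('ID', 'col3'), ['ID'], 'left').show()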

How to query with "IN" in Q (kdb)?

Let's assume that I have a table in kdb+ named "Automotive" with the following data:
Manufacturer Country       Sales Id
Mercedes     United States 002
Mercedes     Canada        002
Mercedes     Germany       003
Mercedes     Switzerland   003
Mercedes     Japan         004
BMW          United States 002
BMW          Canada        002
BMW          Germany       003
BMW          Switzerland   003
BMW          Japan         004
How would I structure a query in Q such that I can fetch the records matching United States and Canada without using an OR clause?
In SQL, it would look something like:
SELECT Manufacturer, Country from Automotive WHERE Country IN ('United States', 'Canada')
Thanks in advance for helping this Q beginner!
It's basically the same in kdb. The way you write your query depends on the data type. See below an example where manufacturer is a symbol and country is a string.
q)tbl:([]manufacturer:`Merc`Merc`BMW`BMW`BMW;country:("United States";"Canada";"United States";"Germany";"Japan");ID:til 5)
q)tbl
manufacturer country         ID
--------------------------------
Merc         "United States" 0
Merc         "Canada"        1
BMW          "United States" 2
BMW          "Germany"       3
BMW          "Japan"         4
q)meta tbl
c           | t f a
------------| -----
manufacturer| s
country     | C
ID          | j
q)select from tbl where manufacturer in `Merc`Ford
manufacturer country         ID
--------------------------------
Merc         "United States" 0
Merc         "Canada"        1
q)select from tbl where country in ("United States";"Canada")
manufacturer country         ID
--------------------------------
Merc         "United States" 0
Merc         "Canada"        1
BMW          "United States" 2
Check out how to use Q-sql here: https://code.kx.com/q4m3/9_Queries_q-sql/

KDB - Text parsing and cataloging text data

I have data made of varying periodic strings that are effectively a time-value list with a periodicity flag contained within. Unfortunately, each string can have a different number of elements, but no more than 7.
Example below: "#" and "#/M" at the end of a string mean the values are monthly (starting at 8/2020 in the first row), while "#/Y" means annual numbers, so we divide by 12 to get a monthly value. "#" at the beginning simply means continue from the prior period.
copied from CSV
ID,seg,strField
AAA,1,8/2020 2333 2456 2544 2632 2678 #/M
AAA,2,# 3333 3456 3544 3632 3678 #
AAA,3,# 4333 4456 4544 4632 4678 #/M
AAA,4,11/2021 5333 5456 #/M
AAA,5,# 6333 6456 6544 6632 6678 #/Y
t:("SSS";enlist",") 0:`:./Data/src/strField.csv; // read in csv data above
t:update result:count[t]#enlist`float$() from t; // initiate empty result column
I would normally tokenize and then pass each of the 7 columns to a function, but the limit is 8 arguments and I would like to send other metadata in addition to these 7 arguments.
t:#[t;`tok1`tok2`tok3`tok4`tok5`tok6`tok7;:;flip .Q.fu[{" " vs'x}]t `strField];
t: ungroup t;
//Desired result
ID seg iDate result
AAA 1 8/31/2020 2333
AAA 1 9/30/2020 2456
AAA 1 10/31/2020 2544
AAA 1 11/30/2020 2632
AAA 1 12/31/2020 2678
AAA 2 1/31/2021 3333
AAA 2 2/28/2021 3456
AAA 2 3/31/2021 3544
AAA 2 4/30/2021 3632
AAA 2 5/31/2021 3678
AAA 3 6/30/2021 4333
AAA 3 7/31/2021 4456
AAA 3 8/31/2021 4544
AAA 3 9/30/2021 4632
AAA 3 10/31/2021 4678
AAA 4 11/30/2021 5333
AAA 4 12/31/2021 5456
AAA 5 1/31/2022 527.75 <-- 6333/12
AAA 5 2/28/2022 527.75
AAA 5 3/31/2022 527.75
AAA 5 4/30/2022 527.75
AAA 5 5/31/2022 527.75
AAA 5 6/30/2022 527.75
AAA 5 7/31/2022 527.75
AAA 5 8/31/2022 527.75
AAA 5 9/30/2022 527.75
AAA 5 10/31/2022 527.75
AAA 5 11/30/2022 527.75
AAA 5 12/31/2022 527.75
AAA 5 1/31/2023 538.00 <--6456/12
AAA 5 2/28/2023 538.00
AAA 5 3/31/2023 538.00
AAA 5 4/30/2023 538.00
AAA 5 5/31/2023 538.00
AAA 5 6/30/2023 538.00
AAA 5 7/31/2023 538.00
AAA 5 8/31/2023 538.00
AAA 5 9/30/2023 538.00
AAA 5 10/31/2023 538.00
AAA 5 11/30/2023 538.00
AAA 5 12/31/2023 538.00
AAA 5 1/31/2024 etc..
AAA 5 2/29/2024
AAA 5 3/31/2024
AAA 5 4/30/2024
AAA 5 5/31/2024
AAA 5 6/30/2024
AAA 5 7/31/2024
ddonelly is correct that a dictionary or list gets around the 8-parameter limit for functions, but I think it is not the right approach here. Below achieves the desired output:
t:("SSS";enlist",") 0:`:so.csv;
// This will process each distinct ID separately, as the date logic here would break if you had a BBB entry that starts the date over
{[t]
t:#[{[x;y] select from x where ID = y}[t;]';exec distinct ID from t];
raze {[t]
t:#[t;`strField;{" "vs string x}'];
t:ungroup update`$date from delete strField from #[t;`date`result`year;:;({first x}each t[`strField];"J"${-1_1_x}each t[`strField];
`Y =fills #[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];
delete year from ungroup update date:`$'string date from update result:?[year;result%12;result],
date:{x+til count x} each {max($[z;12#(x+12-x mod 12);1#x+1];y)}\[0;"M"$/:raze each reverse each
"/" vs/: string date;year] from t
} each t
}[t]
ID seg date result
AAA 1 2020.08 2333
AAA 1 2020.09 2456
AAA 1 2020.10 2544
AAA 1 2020.11 2632
AAA 1 2020.12 2678
AAA 2 2021.01 3333
AAA 2 2021.02 3456
AAA 2 2021.03 3544
AAA 2 2021.04 3632
AAA 2 2021.05 3678
AAA 3 2021.06 4333
AAA 3 2021.07 4456
AAA 3 2021.08 4544
AAA 3 2021.09 4632
AAA 3 2021.10 4678
AAA 4 2021.11 5333
AAA 4 2021.12 5456
AAA 5 2022.01 527.75
AAA 5 2022.02 527.75
AAA 5 2022.03 527.75
...
AAA 5 2023.01 538
AAA 5 2023.02 538
AAA 5 2023.03 538
AAA 5 2023.04 538
...
AAA 5 2024.01 545.3333
AAA 5 2024.02 545.3333
...
Below is a full breakdown of what's going on inside the nested function, should you need it for understanding.
// vs (vector from scalar) is useful for string manipulation to separate the strField column into a more manageable list of separate strings
t:#[t;`strField;{" "vs string x}'];
// split the strField out to more manageable columns
t:#[t;`date`result`year;:;
// date column from the first part of strField
({first x}each t[`strField];
// result for the actual value fields in the middle
"J"${-1_1_x}each t[`strField];
// year column which is a boolean to indicate special handling is needed.
// I also forward fill to account for rows which are continuation of
// the previous rows time period,
// e.g. if you had 2 or 3 lines in a row of continuous yearly data
`Y =fills #[("#/Y";"#/M";"#")!`Y`M`;last each t[`strField]])];
// ungroup to split each result into individual rows
t:ungroup update`$date from delete strField from t;
t:update
// divide yearly rows where necessary with a vector conditional
result:?[year;result%12;result],
// change year into a progressive month list
date:{x+til count x} each
// check if a month exists, if not take previous month + 1.
// If a year, previous month + 12 and convert to Jan
// create a list of Jans for the year which I convert to Jan->Dec above
{max($[z;12#(x+12-x mod 12);1#x+1];y)}\
// reformat date to kdb month to feed with year into the scan iterator above
[0;"M"$/:raze each reverse each "/" vs/: string date;year] from t;
// finally convert date to symbol again to ungroup year rows into individual rows
delete year from ungroup update date:`$'string date from t
Could you pass the columns into a dictionary and then pass the dictionary into the function? This would circumvent the issue of having a maximum of 8 arguments, since the dictionary can be as long as you require.
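A minimal sketch of that idea (the function body and keys are illustrative): pack the tokens and any metadata into a single dictionary, pass it as one argument, and index by key inside the function, so the 8-argument limit never applies.

q)f:{[d] d[`tok1] + d[`tok2]}   / one dictionary argument instead of many
q)f `tok1`tok2`meta!(1;2;`AAA)
3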

Generate an output string where rows depend on other rows, in Spark and Scala

I have a problem in which rows in a Spark Dataset are dependent on each other, and I need to generate an output string from the following Dataset:
DataType UniqueID NameId SurnameID In2  In1
Double   12345    5      4         QQQ  BBB
Double   12345    6      5         BBB  RSA
Double   12345    4      3         RRR  QQQ
Double   12345    2      1         AAA  FFF
Double   12345    6      5         FRD  FG
Double   12345    7      6         FG   EXIT
Double   12345    1      0         NuLL AAA
Double   12345    3      2         FFF  RRR
Output String: AAA, FFF, RRR, QQQ, BBB, RSA
Logic to generate the output string:
1. Group the data on column UniqueID.
2. Look for NuLL and AAA in columns In2 and In1 respectively; together they are the entry point.
3. Now look for AAA in the In2 column and check that row's SurnameID against the NameId of AAA. If they match, append that row's In1 value to the resultant string.
For example: In1 is AAA and its NameId is 1. Looking for AAA in the In2 column, that row's In1 value is FFF and its SurnameID equals the NameId of AAA, i.e. 1, therefore append FFF to the resultant string.
4. Repeat step 3 until no In1 value is found in the In2 column.
5. If no value is found, terminate and print the output string.
Thanks in Advance.
What you have is an edge table for a graph. Spark supports that in the GraphX component https://spark.apache.org/docs/latest/graphx-programming-guide.html
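If pulling in GraphX is more than you need and each UniqueID group is small enough to collect to the driver, the chain can also be followed directly. A minimal sketch of the traversal logic in Python, using the edges from the question (it follows only the In2 -> In1 links and omits the NameId/SurnameID check for brevity; the same idea ports to Scala):

# (In2, In1) pairs for UniqueID 12345, taken from the question.
edges = [('QQQ', 'BBB'), ('BBB', 'RSA'), ('RRR', 'QQQ'), ('AAA', 'FFF'),
         ('FRD', 'FG'), ('FG', 'EXIT'), ('NuLL', 'AAA'), ('FFF', 'RRR')]
nxt = dict(edges)            # In2 -> In1 lookup
out, cur = [], nxt['NuLL']   # entry point: the row whose In2 is NuLL
while cur is not None:
    out.append(cur)
    cur = nxt.get(cur)       # stop once the chain has no continuation
print(', '.join(out))        # AAA, FFF, RRR, QQQ, BBB, RSA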