Flag data when value from one column is in another column - pyspark

I'm trying to create a flag in my dataset based on two conditions. The first is simple: does CheckingCol = CheckingCol2?
The second is more complicated. I have a column called TranID and a column called RevID.
For any row, if RevID appears anywhere in the TranID column AND CheckingCol = CheckingCol2, then the flag should return "Yes". Otherwise the flag should return "No".
My data looks like this:
TranID RevID CheckingCol CheckingCol2
1 2 ABC ABC
2 1 ABC ABC
3 6 ABCDE ABCDE
4 3 ABCDE ABC
5 7 ABCDE ABC
The expected result would be:
TranID RevID CheckingCol CheckingCol2 Flag
1 2 ABC ABC Yes
2 1 ABC ABC Yes
3 6 ABCDE ABCDE No
4 3 ABCDE ABC No
5 7 ABCDE ABC No
I've tried using:
df.withColumn("TotalMatch", when((col("RevID").contains(col("TranID"))) & (col("CheckingColumn") == col("CheckingColumn2")), "Yes").otherwise("No"))
But it didn't work, and I've not been able to find anything online about how to do this.
Any help would be great!

Obtain the unique values of the TranID column as an array, then check whether RevID is in that array using the isin() function.
from pyspark.sql import functions as sf
unique_values = df1.agg(sf.collect_set("TranID").alias("uniqueIDs"))
unique_values.show()
+---------------+
| uniqueIDs|
+---------------+
|[3, 1, 2, 5, 4]|
+---------------+
required_array = unique_values.take(1)[0].uniqueIDs
['3', '1', '2', '5', '4']
df2 = df1.withColumn("Flag", sf.when((sf.col("RevID").isin(required_array)) & (sf.col("CheckingCol") == sf.col("CheckingCol2")), "Yes").otherwise("No"))
Note: check for nulls and NoneType values in both the RevID and TranID columns, since they will affect the results.
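If collecting all the TranID values to the driver is a concern for a large dataframe, a join-based variant of the same idea is possible. This is only a sketch (not part of the answer above), reusing the df1 column names from the question:
from pyspark.sql import functions as sf

# Distinct TranIDs, renamed to RevID so a left join lines RevID up against them
tran_ids = df1.select(sf.col("TranID").alias("RevID")).distinct().withColumn("rev_in_tran", sf.lit(True))

df2 = (df1.join(tran_ids, on="RevID", how="left")
          .withColumn("Flag", sf.when(sf.col("rev_in_tran") & (sf.col("CheckingCol") == sf.col("CheckingCol2")), "Yes")
                                .otherwise("No"))
          .drop("rev_in_tran"))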

Summary and crosstabulation in Pyspark (DataBricks)

I have a PySpark DataFrame for which I want to calculate summary statistics (the count of all unique categories in each column) and a cross-tabulation with one fixed column, for all string columns.
For example, my df is like this:
col1  col2  col3
Cat1  XYZ   A
Cat1  XYZ   C
Cat1  ABC   B
Cat2  ABC   A
Cat2  XYZ   B
Cat2  MNO   A
I want something like this
VarNAME  Category  Count  A  B  C
col1     Cat1      3      1  1  1
col1     Cat2      3      2  0  1
col2     XYZ       3      1  1  1
col2     ABC       2      1  1  0
col2     MNO       1      1  0  0
col3     A         3      3  0  0
col3     B         2      0  2  0
col3     C         1      0  0  1
So, basically, I want a cross-tabulation of every individual column against col3, plus the total count. I could do this in plain Python with a loop, but looping works somewhat differently in PySpark.
Here are my 2 cents.
Created a sample dataframe
df = spark.createDataFrame(
    [("Cat1", "XYZ", "A"),
     ("Cat1", "XYZ", "C"),
     ("Cat1", "ABC", "B"),
     ("Cat2", "ABC", "A"),
     ("Cat2", "XYZ", "B"),
     ("Cat2", "MNO", "A")],
    schema=['col1', 'col2', 'col3'])
Used the crosstab function, which computes the count for each value of col3, then evaluated the total row count, created a new constant column holding the source column name, and renamed the crosstab key column.
Then performed a union of all these dataframes:
from pyspark.sql.functions import *
import pyspark.sql.functions as fx
df_union = (
    df.crosstab('col1', 'col3').withColumn('count', fx.expr("A+B+C")).withColumn('VarName', lit('col1')).withColumnRenamed('col1_col3', 'Category')
    .union(df.crosstab('col2', 'col3').withColumn('count', fx.expr("A+B+C")).withColumn('VarName', lit('col2')).withColumnRenamed('col2_col3', 'Category'))
    .union(df.crosstab('col3', 'col3').withColumn('count', fx.expr("A+B+C")).withColumn('VarName', lit('col3')).withColumnRenamed('col3_col3', 'Category'))
)
Printing the dataframe with the columns in the desired order:
df_union.select('VarName','Category','count','A','B','C').show()
Please check the sample output for reference.
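Based on the sample dataframe above, the result should come out roughly like this (row order may vary between runs):
+-------+--------+-----+---+---+---+
|VarName|Category|count|  A|  B|  C|
+-------+--------+-----+---+---+---+
|   col1|    Cat1|    3|  1|  1|  1|
|   col1|    Cat2|    3|  2|  1|  0|
|   col2|     XYZ|    3|  1|  1|  1|
|   col2|     ABC|    2|  1|  1|  0|
|   col2|     MNO|    1|  1|  0|  0|
|   col3|       A|    3|  3|  0|  0|
|   col3|       B|    2|  0|  2|  0|
|   col3|       C|    1|  0|  0|  1|
+-------+--------+-----+---+---+---+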

Get all the rows which have mismatch between values in columns in pyspark dataframe

I have the following PySpark dataframe df1:
SL No  category 1  category 2
1      Apples      Oranges
2      Apples      APPLE FRUIT
3      Grapes      Grape
4      Bananas     Oranges
5      Orange      Grape
I want to get the rows of the dataframe where the values in category 1 and category 2 do not match, treating case-insensitive partial string matches as a match (category 1 contains only the strings Apples/Bananas/Orange/Grapes, and likewise category 2 contains only the distinct strings shown above):
SL No  category 1  category 2
1      Apples      Oranges
4      Bananas     Oranges
5      Orange      Grape
First, please avoid column names with spaces.
My df
df = spark.createDataFrame([(1, 'Apples', 'Oranges'),
                            (2, 'Apples', 'APPLE FRUIT'),
                            (3, 'Grapes', 'Grape'),
                            (4, 'Bananas', 'Oranges'),
                            (5, 'Orange', 'Grape')],
                           ('SL No', 'category1', 'category2'))
df.show()
from pyspark.sql import functions as F
from pyspark.sql.functions import split, initcap, expr

new = (
    # Normalize case and split each category column into an array of words
    df.select('*', *[split(initcap(F.col(c)), r'\s').alias(c + '_1') for c in df.drop('SL No').columns])
    # Keep only rows where no word of category1 matches the corresponding word of category2 (higher-order filter)
    .filter(expr("filter(category1_1, (x,i) -> rlike(x, category2_1[i]))")[0].isNull())
    # Drop the helper array columns
    .drop('category1_1', 'category2_1')
).show()
+-----+---------+---------+
|SL No|category1|category2|
+-----+---------+---------+
| 1| Apples| Oranges|
| 4| Bananas| Oranges|
| 5| Orange| Grape|
+-----+---------+---------+
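A simpler variant (just a sketch, not the answer above) that works for this particular sample is to compare only the first word of each column after normalising case; the column names category1/category2 follow the dataframe defined above:
from pyspark.sql import functions as F

# First word of each category, case-normalised
norm1 = F.initcap(F.split(F.col("category1"), r"\s").getItem(0))
norm2 = F.initcap(F.split(F.col("category2"), r"\s").getItem(0))

# Keep rows where neither normalised word is a prefix of the other
df.filter(~(norm1.startswith(norm2) | norm2.startswith(norm1))).show()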

KDB: why am I getting a type error when upserting?

I specified the columns to be of type String. Why am I getting the following error:
q)test: ([key1:"s"$()] col1:"s"$();col2:"s"$();col3:"s"$())
q)`test upsert(`key1`col1`col2`col3)!(string "999"; string "693"; string "943"; string "249")
'type
  [0]  `test upsert(`key1`col1`col2`col3)!(string "999"; string "693"; string "943"; string "249")
To do exactly this, you can remove the types of the list you defined in test:
q)test: ([key1:()] col1:();col2:();col3:())
q)test upsert (`key1`col1`col2`col3)!("999";"693";"943";"249")
key1 | col1 col2 col3
-----| -----------------
"999"| "693" "943" "249"
The reason you are getting a type error is that "s" corresponds to a list of symbols, not a list of characters. You can check this by using .Q.ty:
q).Q.ty `symbol$()
"s"
q).Q.ty `char$()
"c"
It is (generally) not a great idea to set the keys as a nested list of chars; you might find it better to set them as integers ("i") or longs ("j"), as in:
test: ([key1:"j"$()] col1:"j"$();col2:"j"$();col3:"j"$())
Having the keys as integers/longs will make the upsert function behave nicely. Also note that a table is a list of dictionaries, so each dictionary can be upserted individually, as well as a table being upserted:
q)`test upsert (`key1`col1`col2`col3)!(9;4;6;2)
`test
q)test
key1| col1 col2 col3
----| --------------
9 | 4 6 2
q)`test upsert (`key1`col1`col2`col3)!(8;6;2;3)
`test
q)test
key1| col1 col2 col3
----| --------------
9 | 4 6 2
8 | 6 2 3
q)`test upsert (`key1`col1`col2`col3)!(9;1;7;4)
`test
q)test
key1| col1 col2 col3
----| --------------
9 | 1 7 4
8 | 6 2 3
q)`test upsert ([key1: 8 7] col1:2 4; col2:9 3; col3:1 9)
`test
q)test
key1| col1 col2 col3
----| --------------
9 | 1 7 4
8 | 2 9 1
7 | 4 3 9
You have a few issues:
an array of chars in quotes is already a string, so there is no need to write string "abc"
string "aaa" will split the string into a list of single-character strings
your initially defined column types are symbols ("s"), not strings
This will allow you to insert as symbols:
q)test: ([key1:"s"$()] col1:"s"$();col2:"s"$();col3:"s"$())
q)`test upsert(`key1`col1`col2`col3)!`$("999"; "693"; "943"; "249")
`test
This will keep them as strings:
q)test: ([key1:()] col1:();col2:();col3:())
q)`test upsert(`key1`col1`col2`col3)!("999"; "693"; "943"; "249")
`test
Have a look at the difference between the metas of the two.
HTH,
Sean

Tag unique rows?

I have some data from different systems which can be joined only in a certain case because of different granularity between the data sets.
Given three columns:
call_date, login_id, customer_id
How can I efficiently 'flag' each row which has a unique value across those three columns? I didn't want to SELECT DISTINCT, because I would not know which of the rows actually matches up with the other. I want to know which records (combinations of columns) exist only once on a single date.
For example, if a customer called in 5 times on a single date and ordered a product, I do not know which of those specific call records ties back to the product order (lack of timestamps in the raw data). However, if a customer only called in once on a specific date and had a product order, I know for sure that the order ties back to that call record. (This is just an example - I am doing something similar across about 7 different tables from different source data).
timestamp            customer_id  login_name  score  unique
01/24/2017 18:58:11  441987       abc123      .25    TRUE
03/31/2017 15:01:20  783356       abc123      1      FALSE
03/31/2017 16:51:32  783356       abc123      0      FALSE

call_date   customer_id  login_name  order  unique
01/24/2017  441987       abc123      0      TRUE
03/31/2017  783356       abc123      1      TRUE
In the above example, I would only want to join rows where the 'uniqueness' is True for both tables. So on 1/24, I know that there was no order for the call which had a score of 0.25.
To find whether the row (or some set of columns) is unique within the list of rows, you need to make use of PostgreSQL window functions.
SELECT *,
       (count(*) OVER (PARTITION BY b, c, d) = 1) AS unique_within_b_c_d_columns
FROM unnest(ARRAY[
       row(1, 2, 3, 1),
       row(2, 2, 3, 2),
       row(3, 2, 3, 2),
       row(4, 2, 3, 4)
     ]) AS t(a int, b int, c int, d int)
Output:
| a | b | c | d | unique_within_b_c_d_columns |
-----------------------------------------------
| 1 | 2 | 3 | 1 | true                        |
| 2 | 2 | 3 | 2 | false                       |
| 3 | 2 | 3 | 2 | false                       |
| 4 | 2 | 3 | 4 | true                        |
In the PARTITION BY clause you specify the list of columns you want to compare on. Note that in the example above the column a does not take part in the comparison.

How to join items from different Dataframes to one common DataFrame

Suppose we have a DataFrame 'A':
Id Name FavColor Address
1 John Black xyz
2 Mathew Orange www
3 Russel Red xxx
Now I have a case where different datasets come in to update values in some columns.
For example, let us have DataFrame 'B':
Id FavColor
1 Red
2 Black
and DataFrame 'C' :
Id Address
1 aaa
3 bbb
Now, in this case, the updates in 'B' and 'C' need to be merged into 'A'.
I tried merging 'B' and 'C' first and then merging the result into 'A', but when I merge 'B' and 'C' I get:
Id FavColor Address
1 Red aaa
2 Black null
3 null bbb
and if I merge this with 'A' it will be wrong, as the Address of Id=2 will become null and the FavColor of Id=3 will become null. How can I merge the incoming updated data with 'A'? The incoming data may also have a new attribute; in that case the result should show null for the items in 'A' that have no value for that attribute.
Try merging the data using a left join and taking the updated value where one exists. The code below merges A and B; you can then merge the result with C in the same way.
scala> A.join(B, A("Id") === B("Id"), "left").
| withColumn("merged", when(B("FavColor").isNotNull, B("FavColor")).otherwise(A("FavColor"))).
| drop(B("FavColor")).drop(A("FavColor")).drop(B("Id")).
| withColumnRenamed("merged", "FavColor").show()
+---+------+-------+--------+
| Id| Name|Address|FavColor|
+---+------+-------+--------+
| 1| John| xyz| Red|
| 2|Mathew| www| Black|
| 3|Russel| xxx| Red|
+---+------+-------+--------+
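For completeness, here is a rough PySpark sketch of the same left-join idea, applying both updates in turn; the helper name apply_update is only illustrative, and coalesce() stands in for the when/otherwise used above:
from pyspark.sql import functions as F

def apply_update(base, update, key, col):
    # Take the updated value when the update dataframe provides one, otherwise keep the original
    upd = update.withColumnRenamed(col, col + "_new")
    return (base.join(upd, on=key, how="left")
                .withColumn(col, F.coalesce(F.col(col + "_new"), F.col(col)))
                .drop(col + "_new"))

A = apply_update(A, B, "Id", "FavColor")   # apply FavColor updates from B
A = apply_update(A, C, "Id", "Address")    # apply Address updates from C
A.show()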