How to join items from different Dataframes to one common DataFrame - scala

Suppose we have a DataFrame 'A':
Id Name FavColor Address
1 John Black xyz
2 Mathew Orange www
3 Russel Red xxx
Now I have a case where different datasets arrive with updated values for some columns.
For example, let us have DataFrame 'B':
Id FavColor
1 Red
2 Black
and DataFrame 'C' :
Id Address
1 aaa
3 bbb
In this case the updates in 'B' and 'C' need to be merged into 'A'.
I tried merging 'B' and 'C' first and then merging the result into 'A', but when I merge 'B' and 'C' I get:
Id FavColor Address
1 Red aaa
2 Black null
3 null bbb
If I merge this with 'A', the result will be wrong: the Address of Id=2 and the FavColor of Id=3 will become null. How can I merge the incoming updates into 'A'? The incoming data may also contain a new attribute; in that case the result should show null for the rows in 'A' that have no value for that attribute.

Try merging the data with a left join and taking the updated value only where one exists. The code below merges A and B; you can then merge the result with C in the same way.
scala> A.join(B, A("Id") === B("Id"), "left").
| withColumn("merged", when(B("FavColor").isNotNull, B("FavColor")).otherwise(A("FavColor"))).
| drop(B("FavColor")).drop(A("FavColor")).drop(B("Id")).
| withColumnRenamed("merged", "FavColor").show()
+---+------+-------+--------+
| Id| Name|Address|FavColor|
+---+------+-------+--------+
| 1| John| xyz| Red|
| 2|Mathew| www| Black|
| 3|Russel| xxx| Red|
+---+------+-------+--------+
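The rule used above (take the update when it is non-null, otherwise keep the existing value) can be applied to B and then to C. Here is a minimal plain-Python sketch of that row-level logic, not Spark code. The helper name apply_updates and the dict-based rows are assumptions made purely for illustration; None stands in for null.

```python
# Plain-Python illustration of the "use the update if present, else keep A" rule.
# Rows are dicts keyed by column name; None stands in for null.

def apply_updates(base_rows, update_rows, key="Id"):
    """Return copies of base_rows with non-null values from update_rows applied by key."""
    updates = {row[key]: row for row in update_rows}
    # Columns that appear in the update; any that are new to the base start out as None.
    update_cols = [c for row in update_rows for c in row if c != key]
    merged = []
    for row in base_rows:
        out = dict(row)
        for col in update_cols:
            out.setdefault(col, None)
        for col, value in updates.get(row[key], {}).items():
            if col != key and value is not None:
                out[col] = value
        merged.append(out)
    return merged

A = [
    {"Id": 1, "Name": "John", "FavColor": "Black", "Address": "xyz"},
    {"Id": 2, "Name": "Mathew", "FavColor": "Orange", "Address": "www"},
    {"Id": 3, "Name": "Russel", "FavColor": "Red", "Address": "xxx"},
]
B = [{"Id": 1, "FavColor": "Red"}, {"Id": 2, "FavColor": "Black"}]
C = [{"Id": 1, "Address": "aaa"}, {"Id": 3, "Address": "bbb"}]

# Apply B first, then C, exactly as one would chain two left joins.
result = apply_updates(apply_updates(A, B), C)
for row in result:
    print(row)
```

Applying the updates one at a time avoids the problem from the question: the nulls produced by merging B and C with each other never get a chance to overwrite values in A.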

Related

Sum values of specific rows if fields are the same

Hi, I'm trying to sum the values of one column for all rows in a DataFrame that share the same 'ID'.
For example:
ID Gender value
1 Male 5
1 Male 6
2 Female 3
3 Female 0
3 Female 9
4 Male 10
How do I get the following table
ID Gender value
1 Male 11
2 Female 3
3 Female 9
4 Male 10
In the example above, the ID with value 1 is now shown just once and its values have been summed (the same goes for the ID with value 3).
Thanks
I'm new to PySpark and still learning. I've tried count(), select and groupby(), but nothing has produced the result I'm after.
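Try this:
from pyspark.sql import functions as f, Window
df = (
    df
    # sum 'value' over all rows that share the same ID (every row is kept)
    .withColumn('value', f.sum(f.col('value')).over(Window.partitionBy(f.col('ID'))))
    # collapse the now-identical rows so that each ID appears once
    .dropDuplicates(['ID'])
)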
Link to documentation about Window operation https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.window.html
You can use a simple groupBy, with the sum function:
from pyspark.sql import functions as F
(
df
.groupby("ID", 'Gender') # sum rows with same ID and Gender
# .groupby("ID") # use this line instead if you want to sum rows with the same ID, even if they have different Gender
.agg(F.sum('value').alias('value'))
)
The result is:
+---+------+-----+
| ID|Gender|value|
+---+------+-----+
| 1| Male| 11|
| 2|Female| 3|
| 3|Female| 9|
| 4| Male| 10|
+---+------+-----+
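The two answers differ in one respect that is easy to miss: a window sum keeps every input row and only annotates it with the group total, while groupBy returns one row per group. Here is a small plain-Python sketch of that difference, using the question's data; it is not Spark code and the variable names are illustrative only.

```python
from collections import defaultdict

rows = [
    (1, "Male", 5), (1, "Male", 6), (2, "Female", 3),
    (3, "Female", 0), (3, "Female", 9), (4, "Male", 10),
]

# Total of 'value' per ID, which both approaches rely on.
totals = defaultdict(int)
for id_, _gender, value in rows:
    totals[id_] += value

# Window-style: every input row survives, annotated with its group's total.
windowed = [(id_, gender, totals[id_]) for id_, gender, _value in rows]

# groupBy-style: one output row per (ID, Gender) group.
grouped = defaultdict(int)
for id_, gender, value in rows:
    grouped[(id_, gender)] += value
grouped_rows = [(id_, gender, total) for (id_, gender), total in grouped.items()]

print(len(windowed), len(grouped_rows))
```

This is why the window version needs a de-duplication step to produce the table from the question, whereas the groupBy version produces it directly.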

Separating string and numeric values in PySpark

I have the following DataFrame. There are several IDs, each of which has either a numeric or a string value. If an ID has a string value, like "B", its numeric_value is "NULL" as a string, and vice versa for IDs with a numeric value, e.g. "D".
ID string_value numeric_value timestamp
0 B On NULL 1632733508
1 B Off NULL 1632733508
2 A Inactive NULL 1632733511
3 A Active NULL 1632733512
4 D NULL 450 1632733513
5 D NULL 431 1632733515
6 C NULL 20 1632733518
7 C NULL 30 1632733521
Now I want to separate the DataFrame into a new one for each ID, using an existing list that contains all the unique IDs. Afterwards, each new DataFrame, like "B" in this example, should drop the column with the "NULL" values. So if B has a string_value, the numeric_value column should be dropped.
ID string_value timestamp
0 B On 1632733508
1 B Off 1632733508
After that, the value column should be renamed to the ID, here "B", and the ID column should be dropped.
B timestamp
0 On 1632733508
1 Off 1632733508
As described, the same procedure should be applied to the numeric values, in this case ID "D":
ID numeric_value timestamp
0 D 450 1632733513
1 D 431 1632733515
D timestamp
0 450 1632733513
1 431 1632733515
It is important to preserve the original data types within the value column.
Assuming your DataFrame is called df and your list of IDs is ids, you can write a function that does what you need and call it for every id.
The function applies the required filter and then selects the needed columns, with the id as an alias.
from pyspark.sql import functions as f
ids = ["B", "A", "D", "C"]
def split_df(df, id):
    # keep only the rows for this id and expose its value under the id's name
    return df.filter(f.col("ID") == id).select(
        f.coalesce(f.col("string_value"), f.col("numeric_value")).alias(id),
        f.col("timestamp"),
    )
dfs = [split_df(df, id) for id in ids]
# use a separate loop variable so the original df is not overwritten
for part in dfs:
    part.show()
Output:
+---+----------+
| B| timestamp|
+---+----------+
| On|1632733508|
|Off|1632733508|
+---+----------+
+--------+----------+
| A| timestamp|
+--------+----------+
|Inactive|1632733511|
| Active|1632733512|
+--------+----------+
+---+----------+
| D| timestamp|
+---+----------+
|450|1632733513|
|431|1632733515|
+---+----------+
+---+----------+
| C| timestamp|
+---+----------+
| 20|1632733518|
| 30|1632733521|
+---+----------+
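The question mentions two details worth spelling out: the missing values are the literal string "NULL", and the value should keep its original type. Here is a plain-Python sketch of the per-ID selection with both in mind. It is not Spark code, and the helper name split_by_id and the dict-based rows are assumptions for illustration.

```python
def split_by_id(rows, ids):
    """Return {id: [(value, timestamp), ...]}, keeping each value's original type."""
    def clean(value):
        # Treat the literal string "NULL" as a missing value.
        return None if value == "NULL" else value

    result = {}
    for wanted in ids:
        selected = []
        for row in rows:
            if row["ID"] != wanted:
                continue
            text = clean(row["string_value"])
            number = clean(row["numeric_value"])
            # Same idea as coalesce: take the first non-missing value.
            selected.append((text if text is not None else number, row["timestamp"]))
        result[wanted] = selected
    return result

rows = [
    {"ID": "B", "string_value": "On", "numeric_value": "NULL", "timestamp": 1632733508},
    {"ID": "B", "string_value": "Off", "numeric_value": "NULL", "timestamp": 1632733508},
    {"ID": "D", "string_value": "NULL", "numeric_value": 450, "timestamp": 1632733513},
    {"ID": "D", "string_value": "NULL", "numeric_value": 431, "timestamp": 1632733515},
]
parts = split_by_id(rows, ["B", "D"])
print(parts["B"])
print(parts["D"])
```

In this sketch the values for "D" stay integers because nothing converts them; whether a Spark coalesce over columns of different types preserves the numeric type depends on the column types involved, so that is worth checking on the real data.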

PySpark join dataframes and merge contents of specific columns

My goal is to merge two dataframes on the column id, and perform a somewhat complex merge on another column that contains JSON we can call data.
Suppose I have the DataFrame df1 that looks like this:
id | data
---------------------------------
42 | {'a_list':['foo'],'count':1}
43 | {'a_list':['scrog'],'count':0}
And I'm interested in merging with a similar, but different DataFrame df2:
id | data
---------------------------------
42 | {'a_list':['bar'],'count':2}
44 | {'a_list':['baz'],'count':4}
And I would like the following DataFrame, joining and merging properties from the JSON data where id matches, but retaining rows where id does not match and keeping the data column as-is:
id | data
---------------------------------------
42 | {'a_list':['foo','bar'],'count':3} <-- where 'bar' is added to 'foo', and count is summed
43 | {'a_list':['scrog'],'count':0}
44 | {'a_list':['baz'],'count':4}
As can be seen where id is 42, there is some logic I will have to apply to how the JSON is merged.
My knee-jerk thought is that I'd like to provide a lambda / udf to merge the data column, but I'm not sure how to think about that during a join.
Alternatively, I could break the properties from the JSON into columns, something like this. Might that be a better approach?
df1:
id | a_list | count
----------------------
42 | ['foo'] | 1
43 | ['scrog'] | 0
df2:
id | a_list | count
---------------------
42 | ['bar'] | 2
44 | ['baz'] | 4
Resulting:
id | a_list | count
---------------------------
42 | ['foo', 'bar'] | 3
43 | ['scrog'] | 0
44 | ['baz'] | 4
If I went this route, I would then have to merge the columns a_list and count into JSON again under a single column data, but this I can wrap my head around as a relatively simple map function.
Update: Expanding on Question
More realistically, I will have n number of DataFrames in a list, e.g. df_list = [df1, df2, df3], all shaped the same. What is an efficient way to perform these same actions on n number of DataFrames?
Update to Update
I'm not sure how efficient this is, or whether there is a more Spark-esque way to do it, but incorporating the accepted answer, this appears to work for the question update:
for i in range(len(validations) - 1):
    # set dfs
    df1 = validations[i]['df']
    df2 = validations[i + 1]['df']
    # joins here...
    # update new_df
    new_df = df2
Here's one way to accomplish your second approach:
Explode the list column and then unionAll the two DataFrames. Next groupBy the "id" column and use pyspark.sql.functions.collect_list() and pyspark.sql.functions.sum():
import pyspark.sql.functions as f
new_df = df1.select("id", f.explode("a_list").alias("a_values"), "count")\
.unionAll(df2.select("id", f.explode("a_list").alias("a_values"), "count"))\
.groupBy("id")\
.agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))
new_df.show(truncate=False)
#+---+----------+-----+
#|id |a_list |count|
#+---+----------+-----+
#|43 |[scrog] |0 |
#|44 |[baz] |4 |
#|42 |[foo, bar]|3 |
#+---+----------+-----+
Finally you can use pyspark.sql.functions.struct() and pyspark.sql.functions.to_json() to convert this intermediate DataFrame into your desired structure:
new_df = new_df.select("id", f.to_json(f.struct("a_list", "count")).alias("data"))
new_df.show()
#+---+----------------------------------+
#|id |data |
#+---+----------------------------------+
#|43 |{"a_list":["scrog"],"count":0} |
#|44 |{"a_list":["baz"],"count":4} |
#|42 |{"a_list":["foo","bar"],"count":3}|
#+---+----------------------------------+
Update
If you had a list of dataframes in df_list, you could do the following:
from functools import reduce # for python3
df_list = [df1, df2]
new_df = reduce(lambda a, b: a.unionAll(b), df_list)\
.select("id", f.explode("a_list").alias("a_values"), "count")\
.groupBy("id")\
.agg(f.collect_list("a_values").alias("a_list"), f.sum("count").alias("count"))\
.select("id", f.to_json(f.struct("a_list", "count")).alias("data"))
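The reduce-based pipeline above amounts to "stack all rows, then per id concatenate the lists and sum the counts". Here is a plain-Python sketch of that fold over n inputs. It is not Spark code; the function name merge_records and the tuple layout are assumptions for illustration.

```python
import json
from functools import reduce

def merge_records(acc, rows):
    """Fold one list of (id, a_list, count) rows into the accumulator dict."""
    for id_, a_list, count in rows:
        prev_list, prev_count = acc.get(id_, ([], 0))
        acc[id_] = (prev_list + a_list, prev_count + count)
    return acc

df1 = [(42, ["foo"], 1), (43, ["scrog"], 0)]
df2 = [(42, ["bar"], 2), (44, ["baz"], 4)]
df_list = [df1, df2]  # any number of same-shaped inputs works

merged = reduce(merge_records, df_list, {})

# Re-pack each merged record as a compact JSON string, like the 'data' column.
data = {
    id_: json.dumps({"a_list": a_list, "count": count}, separators=(",", ":"))
    for id_, (a_list, count) in merged.items()
}
for id_, payload in data.items():
    print(id_, payload)
```

Because the merge is a simple fold, adding a third or fourth input only means appending it to df_list.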

Except operation on two DataFrames that have a map column

I have two DataFrames, dfA and dfB. I want to remove all occurrences of dfB's rows from dfA. The problem, however, is that they have a column of datatype map, and the except operation doesn't work well with that.
+----+------------+----------------------+
| id | fee_amount | optional             |
+----+------------+----------------------+
| 1  | 10.00      | { 1 -> abc, 2-> def  |
| 2  | 20.0       | { 3 -> pqr, 5-> stu  |
I was thinking I could drop the column somehow and add it back but it won't work because I wouldn't know which rows got removed from dfA. Options?

Tag unique rows?

I have some data from different systems which can be joined only in a certain case because of different granularity between the data sets.
Given three columns:
call_date, login_id, customer_id
How can I efficiently 'flag' each row which has a unique value across those three values? I didn't want to SELECT DISTINCT because I do not know which of the rows actually matches up with the other. I want to know which records (combination of columns) exist only once in a single date.
For example, if a customer called in 5 times on a single date and ordered a product, I do not know which of those specific call records ties back to the product order (lack of timestamps in the raw data). However, if a customer only called in once on a specific date and had a product order, I know for sure that the order ties back to that call record. (This is just an example - I am doing something similar across about 7 different tables from different source data).
timestamp customer_id login_name score unique
01/24/2017 18:58:11 441987 abc123 .25 TRUE
03/31/2017 15:01:20 783356 abc123 1 FALSE
03/31/2017 16:51:32 783356 abc123 0 FALSE
call_date customer_id login_name order unique
01/24/2017 441987 abc123 0 TRUE
03/31/2017 783356 abc123 1 TRUE
In the above example, I would only want to join rows where the 'uniqueness' is True for both tables. So on 1/24, I know that there was no order for the call which had a score of 0.25.
To find whether the row (or some set of columns) is unique within the list of rows, you need to make use of PostgreSQL window functions.
SELECT *,
(count(*) OVER(PARTITION BY b, c, d) = 1) as unique_within_b_c_d_columns
FROM unnest(ARRAY[
row(1, 2, 3, 1),
row(2, 2, 3, 2),
row(3, 2, 3, 2),
row(4, 2, 3, 4)
]) as t(a int, b int, c int, d int)
Output:
| a | b | c | d | unique_within_b_c_d_columns |
-----------------------------------------------
| 1 | 2 | 3 | 1 | true |
| 2 | 2 | 3 | 2 | false |
| 3 | 2 | 3 | 2 | false |
| 4 | 2 | 3 | 4 | true |
In the PARTITION BY clause, specify the list of columns you want to compare on. Note that in the example above, column a does not take part in the comparison.
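The same check, count the rows that share a combination and flag those where the count is 1, can be sketched in plain Python for readers who want to see the logic outside SQL. The data mirrors the example above; the variable names are illustrative only.

```python
from collections import Counter

# Columns are (a, b, c, d), as in the SQL example.
rows = [(1, 2, 3, 1), (2, 2, 3, 2), (3, 2, 3, 2), (4, 2, 3, 4)]

# Count how often each (b, c, d) combination occurs; column a is ignored.
counts = Counter(row[1:] for row in rows)

# Append the flag: True when the combination appears exactly once.
flagged = [row + (counts[row[1:]] == 1,) for row in rows]

for row in flagged:
    print(row)
```

Only rows whose flag is True are safe to join one-to-one with the matching flagged rows from the other table.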