PySpark subtract very large dataframes

I need to extract data from two hive tables, which are very large. They are in two different schemas, but have same definition.
I need to compare the two tables and identify the following in PySpark:
rows that are present in table1, but missing in table2
rows that are present in both tables, but with a mismatch in the values of any of the non-key columns
rows that are present in table2, but missing in table1
e.g.
Let's say the table has following cols
ProductId - BigInteger - PK
ProductVersion - int - PK
ProductName - char
ProductPrice - decimal
ProductDesc - varchar
Let's say the data is as follows
Table1 in Schema1
[1, 1, "T-Shirt", 10.50, "Soft-Washed Slub-Knit V-Neck"] -> Matches with Table2
[1, 2, "T-Shirt", 10.50, "Soft-Washed Striped Crew-Neck "] -> Price is different in Table1
[2, 1, "Short Sleeve Shirt", 10.50, "Everyday Printed Short-Sleeve Shirt"] -> Missing in Table2
[3, 1, "T-Shirt", 10.50, "Breathe ON Camo Tee"] -> Prod Desc is different in Table2
Table2 in Schema2
[1, 1, "T-Shirt", 10.50, "Soft-Washed Slub-Knit V-Neck"] -> Matches with Table1
[2, 1, "Short Sleeve Shirt", 12.50, "Everyday Printed Short-Sleeve Shirt"] -> Price is different
[3, 1, "T-Shirt", 10.50, "Breathe ON Camo"] -> Prod Desc is different in Table2
[3, 2, "T-Shirt", 20, "Breathe ON ColorBlock Tee"] -> Missing in Table1
The expected result will be three separate data frames
dfOut1 - will contain the rows that are present in table1 , but missing in table2 based on the primary key
["Missing in Table2", [1, 2, "T-Shirt", 10.50, "Soft-Washed Striped Crew-Neck "]]
The first column will indicate the difference type. If the difference type is "Missing in Table1" or "Missing in Table2", the entire row from the source table will be available in the output.
dfdiff -
["Difference", "ProductPrice", 2, 1, 10.50, 12.50]
["Difference", "ProductDesc", 3,1, "Breathe ON Camo Tee", "Breathe ON Camo"]
dfout2 -
["Missing in Table1", [3, 2, "T-Shirt", 20, "Breathe ON ColorBlock Tee"]]
I am thinking of the following approach:
1. Create df1 from table1 using query "select * from schema1.table1"
2. Create df2 from table2 using query "select * from schema2.table2"
3. Use df1.subtract(df2)
I referred to the documentation, but I am not sure if this approach will work.
Will df1.subtract(df2) compare all the fields, or just the key columns?
Also, I am not sure how to separate the output further.

You are basically trying to find the inserts, updates and deletes (the deltas) between two datasets. Note that df1.subtract(df2) compares entire rows, like SQL EXCEPT, so on its own it cannot tell a changed row apart from a missing one. Here is one generic solution for such deltas:
from pyspark.sql.functions import sha2, concat_ws
# Comma-separated key columns; the keys from the question's example are assumed here
keys = "ProductId,ProductVersion"
# Getting the comma-separated keys into a list
key_column_list = keys.split(',')
key_column_list = [x.strip().lower() for x in key_column_list]
# The column name of the change indicator column to be added
changeindicator = "chg_id"
df_compare_curr_df = spark.sql("select * from schema1.table1")
df_compare_prev_df = spark.sql("select * from schema2.table2")
# Getting the column lists of both dataframes
currentcolumns = df_compare_curr_df.columns
previouscolumns = df_compare_prev_df.columns
# Creating hash values so that this can be used generically for any kind of delta comparison:
# one hash over all columns (the row fingerprint) and one over just the key columns
df_compare_curr_df = df_compare_curr_df.withColumn("all_hash_val", sha2(concat_ws("||", *currentcolumns), 256))
df_compare_curr_df = df_compare_curr_df.withColumn("key_val", sha2(concat_ws("||", *key_column_list), 256))
df_compare_prev_df = df_compare_prev_df.withColumn("all_hash_val", sha2(concat_ws("||", *previouscolumns), 256))
df_compare_prev_df = df_compare_prev_df.withColumn("key_val", sha2(concat_ws("||", *key_column_list), 256))
df_compare_curr_df.createOrReplaceTempView("NewTable")
df_compare_prev_df.createOrReplaceTempView("OldTable")
# Creating the SQL for the deltas: basically left outer and inner joins
insert_sql = "select 'I' as " + changeindicator + ",A.* from NewTable A left outer join OldTable B on A.key_val = B.key_val where B.key_val is NULL"
update_sql = "select 'U' as " + changeindicator + ",A.* from NewTable A inner join OldTable B on A.key_val = B.key_val where A.all_hash_val != B.all_hash_val"
delete_sql = "select 'D' as " + changeindicator + ",A.* from OldTable A left outer join NewTable B on A.key_val = B.key_val where B.key_val is NULL"
nochange_sql = "select 'N' as " + changeindicator + ",A.* from OldTable A inner join NewTable B on A.key_val = B.key_val where A.all_hash_val = B.all_hash_val"
upsert_sql = insert_sql + " union " + update_sql
all_changes_sql = insert_sql + " union " + update_sql + " union " + delete_sql
df_compare_updates = spark.sql(update_sql)
df_compare_inserts = spark.sql(insert_sql)
df_compare_deletes = spark.sql(delete_sql)
df_compare_upserts = spark.sql(upsert_sql)
df_compare_changes = spark.sql(all_changes_sql)
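The queries above flag whole rows as I/U/D. To produce something like the asker's dfdiff (one row per mismatching non-key column), one option is to join the two dataframes on the key columns and compare each remaining column. A minimal sketch, assuming the key and non-key column names from the question's example:

from pyspark.sql import functions as F

key_cols = ["ProductId", "ProductVersion"]                     # assumed keys
non_key_cols = ["ProductName", "ProductPrice", "ProductDesc"]  # assumed non-keys

df1 = spark.table("schema1.table1").alias("t1")
df2 = spark.table("schema2.table2").alias("t2")
joined = df1.join(df2, on=key_cols, how="inner")

# Build one dataframe of mismatches per non-key column, then union them.
# Rows where one side is NULL drop out of the != filter; use eqNullSafe
# if NULL-vs-value differences must be reported too.
diffs = None
for c in non_key_cols:
    d = (joined
         .where(F.col("t1." + c) != F.col("t2." + c))
         .select(F.lit("Difference").alias("DiffType"),
                 F.lit(c).alias("ColumnName"),
                 *key_cols,
                 F.col("t1." + c).cast("string").alias("Table1Value"),
                 F.col("t2." + c).cast("string").alias("Table2Value")))
    diffs = d if diffs is None else diffs.unionByName(d)

Casting both sides to string lets price and description differences share the same two value columns, matching the mixed rows in the expected dfdiff.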

Related

How to display Text Unit only one time if it is repeated for the same Feature when using STUFF?

I work with SQL Server 2012 and face an issue: I can't display a Text Unit only once where it is repeated for a feature when using STUFF.
What I need is that when a Text Unit is repeated for the same feature, it is not repeated in the output; it should only be displayed once.
In my case, I get "Voltage | Voltage | Voltage" where only one "Voltage" should be displayed.
CREATE TABLE #FinalTable
(
PartID INT,
DKFeatureName NVARCHAR(100),
TextUnit NVARCHAR(100),
StatusId INT
)
INSERT INTO #FinalTable (PartID, DKFeatureName, TextUnit, StatusId)
VALUES
(1211, 'PowerSupply', 'Voltage', 3),
(1211, 'PowerSupply', 'Voltage', 3),
(1211, 'PowerSupply', 'Voltage', 3)
SELECT
PartID, DKFeatureName,
COUNT(PartID) AS CountParts,
TextUnit = STUFF ((SELECT ' | ' + TextUnit
FROM #FinalTable b
WHERE b.PartID = a.PartID
AND a.DKFeatureName = b.DKFeatureName
AND StatusId = 3
FOR XML PATH('')), 1, 2, ' ')
INTO
#getUnitsSticky
FROM
#FinalTable a
GROUP BY
PartID, DKFeatureName
HAVING
(COUNT(PartID) > 1)
SELECT *
FROM #getUnitsSticky
The expected result is:
Voltage
The incorrect result, which I don't need, is:
Voltage|Voltage|Voltage
TomC's answer is basically correct. However, when using this method with SQL Server, it is usually more efficient to get the rows in a subquery and then use stuff() in the outer query. That way, the values in each row are processed only once.
So:
SELECT PartID, DKFeatureName, CountParts,
       STUFF( (SELECT DISTINCT ' | ' + TextUnit
               FROM #FinalTable b
               WHERE b.PartID = a.PartID AND
                     b.DKFeatureName = a.DKFeatureName AND
                     StatusId = 3
               FOR XML PATH('')
              ), 1, 3, '') as TextUnit
INTO #getUnitsSticky
FROM (SELECT PartID, DKFeatureName, COUNT(*) as CountParts
      FROM #FinalTable
      GROUP BY PartID, DKFeatureName
      HAVING COUNT(*) > 1
     ) a;
This also removes the leading space from the concatenated result.
To put this into a complete answer, this should be your SQL (shortened slightly, with the last temp table removed):
SELECT
PartID, DKFeatureName,
COUNT(PartID) AS CountParts,
TextUnit = STUFF ((SELECT distinct ' | ' + TextUnit
FROM #FinalTable b
WHERE b.PartID = a.PartID
AND a.DKFeatureName = b.DKFeatureName
AND StatusId = 3
FOR XML PATH('')), 1, 3, '')
FROM #FinalTable a
GROUP BY PartID, DKFeatureName
HAVING (COUNT(PartID) > 1)
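For reference, on SQL Server 2017 and later the same result can be written without the FOR XML PATH trick by using STRING_AGG. A sketch against the same temp table (the CTE names units and counts are just illustrative):

WITH units AS (
    SELECT DISTINCT PartID, DKFeatureName, TextUnit
    FROM #FinalTable
    WHERE StatusId = 3
),
counts AS (
    SELECT PartID, DKFeatureName, COUNT(*) AS CountParts
    FROM #FinalTable
    GROUP BY PartID, DKFeatureName
    HAVING COUNT(*) > 1
)
SELECT c.PartID, c.DKFeatureName, c.CountParts,
       STRING_AGG(u.TextUnit, ' | ') AS TextUnit
FROM counts c
JOIN units u ON u.PartID = c.PartID
            AND u.DKFeatureName = c.DKFeatureName
GROUP BY c.PartID, c.DKFeatureName, c.CountParts;

The question targets SQL Server 2012, where STRING_AGG is not available, so the STUFF/FOR XML approach above is the right one there.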

Convert jsonb in PostgreSQL to rows without cycle

I have a json array stored in my postgres database. The first table "Orders" looks like this:
order_id, basket_items_id
1, {1,2}
2, {3}
3, {1,2,3,1}
Second table "Items" looks like this:
item_id, price
1,5
2,3
3,20
I have already tried loading the data with multiple SQL statements and selecting the individual jsonb records, but that is not a silver bullet. My current attempt:
SELECT
sum(price)
FROM orders
INNER JOIN items on
orders.basket_items_id = items.item_id
WHERE order_id = 3;
I want to get this as output:
order_id, basket_items_id, price
1, 1, 5
1, 2, 3
2, 3, 20
3, 1, 5
3, 2, 3
3, 3, 20
3, 1, 5
or this:
order_id, sum(price)
1, 8
2, 20
3, 33
demo:db<>fiddle
SELECT
o.order_id,
elems.value::int as basket_items_id,
i.price
FROM
orders o, jsonb_array_elements_text(basket_items_id) as elems
LEFT JOIN items i
ON i.item_id = elems.value::int
ORDER BY 1,2,3
jsonb_array_elements_text expands the jsonb array into one row per element. With this you are able to join against your second table directly.
Since the expanded array gives you text elements, you have to cast them to integers using ::int.
Of course you can GROUP and SUM aggregate this as well:
SELECT
o.order_id,
SUM(i.price)
FROM
orders o, jsonb_array_elements_text(basket_items_id) as elems
LEFT JOIN items i
ON i.item_id = elems.value::int
GROUP BY o.order_id
ORDER BY 1
Is your orders.basket_items_id column of type jsonb or int[]?
If the type is jsonb you can use jsonb_array_elements_text to expand the column:
SELECT
o.order_id,
o.basket_item_id,
items.price
FROM
(
SELECT
order_id,
jsonb_array_elements_text(basket_items_id)::int basket_item_id
FROM
orders
) o
JOIN
items ON o.basket_item_id = items.item_id
ORDER BY
1, 2, 3;
See this DB-Fiddle.
If the type is int[] (array of integers), you can run a similar query with the unnest function:
SELECT
o.order_id,
o.basket_item_id,
items.price
FROM
(
SELECT
order_id,
unnest(basket_items_id) basket_item_id
FROM
orders
) o
JOIN
items ON o.basket_item_id = items.item_id
ORDER BY
1, 2, 3;
See this DB-fiddle

How to merge JSONB field in a tree structure?

I have a table in Postgres which stores a tree structure. Each node has a jsonb field: params_diff:
CREATE TABLE tree (id INT, parent_id INT, params_diff JSONB);
INSERT INTO tree VALUES
(1, NULL, '{ "some_key": "some value" }'::jsonb)
, (2, 1, '{ "some_key": "other value", "other_key": "smth" }'::jsonb)
, (3, 2, '{ "other_key": "smth else" }'::jsonb);
The thing I need is to select a node by id with additional generated params field which contains the result of merging all params_diff from the whole parents chain:
SELECT tree.*, /* some magic here */ AS params FROM tree WHERE id = 3;
id | parent_id | params_diff | params
----+-----------+----------------------------+-------------------------------------------------------
3 | 2 | {"other_key": "smth else"} | {"some_key": "other value", "other_key": "smth else"}
Generally, a recursive CTE can do the job. Example:
Use table alias in another query to traverse a tree
We just need some more magic to decompose, process and re-assemble the JSON result. I am assuming from your example that you want each key only once, with the first value along the search path (bottom-up):
WITH RECURSIVE cte AS (
SELECT id, parent_id, params_diff, 1 AS lvl
FROM tree
WHERE id = 3
UNION ALL
SELECT t.id, t.parent_id, t.params_diff, c.lvl + 1
FROM cte c
JOIN tree t ON t.id = c.parent_id
)
SELECT id, parent_id, params_diff
, (SELECT json_object(array_agg(key ORDER BY lvl)
                    , array_agg(value ORDER BY lvl))::jsonb
   FROM (
      SELECT DISTINCT ON (key)
             p.key, p.value, c.lvl
      FROM cte c, jsonb_each_text(c.params_diff) p
      ORDER BY p.key, c.lvl
      ) sub
) AS params
FROM cte
WHERE id = 3;
How?
Walk the tree with a classic recursive CTE.
Create a derived table with all keys and values with jsonb_each_text() in a LATERAL JOIN, remember the level in the search path (lvl).
Use DISTINCT ON to get the "first" (lowest lvl) value for each key. Details:
Select first row in each GROUP BY group?
Sort and aggregate resulting keys and values and feed the arrays to json_object() to build the final params value.
SQL Fiddle (only as far as pg 9.3 can go with json instead of jsonb).
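On PostgreSQL 9.5 or later, where the || operator merges two jsonb objects (keys from the right operand win), the merge can also be done while walking the tree, without decomposing the JSON at all. A sketch along those lines, not part of the original answer:

WITH RECURSIVE cte AS (
    SELECT id, parent_id, params_diff
         , params_diff AS params     -- start with the node's own diff
         , parent_id AS next_parent  -- pointer used to climb the tree
    FROM tree
    WHERE id = 3
    UNION ALL
    SELECT c.id, c.parent_id, c.params_diff
         , t.params_diff || c.params -- parent merged underneath, child keys win
         , t.parent_id
    FROM cte c
    JOIN tree t ON t.id = c.next_parent
)
SELECT id, parent_id, params_diff, params
FROM cte
WHERE next_parent IS NULL;           -- the row that reached the root holds the full merge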

Concatenate two columns into one in SQL

I have two columns, one of them is a foreign ID. How can I concatenate them into one column?
example:
StateID = 1
Area = "Bronx"
To become:
New York - Bronx
Edit:
Table1 = [Address] has two columns, (ID, Name)
Table2 = [Requests] has many columns including (Area, StateID)
Use + to concatenate columns:
SELECT a.Name + ' - ' + r.Area As StateAndArea
FROM dbo.Requests r INNER JOIN dbo.Address a
ON r.StateID = a.ID
ORDER BY StateAndArea -- an alias can be used in ORDER BY, but not in WHERE
+ (String Concatenation)
SELECT CAST(StateID AS varchar(10)) + ' - ' + Area AS StateArea
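If this is SQL Server 2012 or later, CONCAT() is a convenient alternative: it implicitly converts non-string arguments such as the integer StateID and treats NULLs as empty strings. A sketch using the same tables as the first answer:

SELECT CONCAT(a.Name, ' - ', r.Area) AS StateAndArea
FROM dbo.Requests r
INNER JOIN dbo.Address a
    ON r.StateID = a.ID;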

Postgres sql query difficult join

Here are the tables I have:
Table A which has entries with "item" and "grade" fields
Table B which has entries with A.id
Tuple table B-C
I want all the A entries that have item= "x" and grade = "y"
And all the C entries that are associated with a B entry that is associated with an A entry that has item = "x" and grade = "y"
For example
A table:
A.item = "x", A.Grade = "y", A.id = 1
A.item = "x", A.Grade = "y", A.id = 2
A.item = "x", A.Grade = "y", A.id = 3
A.item = "r", A.Grade = "z", A.id = 4
B Table
B.AID = 1, B.id = 10
B.AID = 1, B.id = 11
B.AID = 2, B.id = 13
B.AID = 3, B.id = 14
B.AID = 4, B.id = 15
B-C Tuple Table
BID = 10, CID = 20
BID = 11, CID = 20
BID = 13, CID = 20
BID = 15, CID = 21
The query should return all the entries in the A table and the entry 20 but not 21 in the C table because C.id = 21 is only tupled with a B that is associated with an A that does not meet the item and grade requirements.
The associations, while sounding complicated in written form, are just a simple join among three tables: a joins to b joins to c.
You identify how the columns need to be joined: "a B entry that is associated with an A entry", and looking at the columns sounds like you want to join on b.aid = a.id. Similarly for b and c.
SELECT ...
FROM
a
JOIN b ON b.aid = a.id
JOIN b_c ON b_c.bid = b.id
WHERE
...
This constructs the original dataset before it was split into the three normalised tables.
The next step is to filter by the given conditions. You only want rows where item = 'x' and grade = 'y', so add those to the WHERE clause, prefixing them with the table name (which is optional in this case):
WHERE
a.item = 'x'
AND a.grade = 'y'
Finally, you can pick which columns you really need in the SELECT clause. I'm guessing SELECT b_c.cid would do. Though if you also have a c table you might want to join on that table, too, and select columns from it. Putting the pieces together:
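A complete sketch using the example column names (hypothetical, since the real table definitions aren't shown). LEFT JOINs keep the A rows that have no associated C entry, such as A.id = 3, whose only B row (B.id = 14) has no tuple; the WHERE filter still keeps cid = 21 out, because its chain starts at A.id = 4:

SELECT DISTINCT
    a.id,
    a.item,
    a.grade,
    b_c.cid
FROM a
LEFT JOIN b   ON b.aid  = a.id
LEFT JOIN b_c ON b_c.bid = b.id
WHERE a.item = 'x'
  AND a.grade = 'y';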