I need to extract data from two Hive tables, which are very large. They are in two different schemas, but have the same definition.
I need to compare the two tables and identify the following in PySpark:
rows that are present in table1, but missing in table2
rows that are present in both tables, but with a mismatch in values in any of the non-key columns
rows that are present in table2, but missing in table1
For example, let's say the tables have the following columns:
ProductId - BigInteger - PK
ProductVersion - int - PK
ProductName - char
ProductPrice - decimal
ProductDesc - varchar
Let's say the data is as follows
Table1 in Schema1
[1, 1, "T-Shirt", 10.50, "Soft-Washed Slub-Knit V-Neck"] -> Matches with Table2
[1, 2, "T-Shirt", 10.50, "Soft-Washed Striped Crew-Neck "] -> Price is different in Table1
[2, 1, "Short Sleeve Shirt", 10.50, "Everyday Printed Short-Sleeve Shirt"] -> Missing in Table2
[3, 1, "T-Shirt", 10.50, "Breathe ON Camo Tee"] -> Prod Desc is different in Table2
Table2 in Schema2
[1, 1, "T-Shirt", 10.50, "Soft-Washed Slub-Knit V-Neck"] -> Matches with Table1
[2, 1, "Short Sleeve Shirt", 12.50, "Everyday Printed Short-Sleeve Shirt"] -> Price is different
[3, 1, "T-Shirt", 10.50, "Breathe ON Camo"] -> Prod Desc is different in Table2
[3, 2, "T-Shirt", 20, "Breathe ON ColorBlock Tee"] -> Missing in Table1
The expected result will be three separate data frames.
dfOut1 - will contain the rows that are present in table1, but missing in table2, based on the primary key
["Missing in Table2", [1, 2, "T-Shirt", 10.50, "Soft-Washed Striped Crew-Neck "]]
The first column will indicate the difference type.
If the difference type is "Missing in Table1" or "Missing in Table2", the entire row from the source table will be available in the output.
dfdiff - will contain the rows that are present in both tables but have a mismatch in a non-key column: the difference type, the column name, the primary key, and the values from both tables
["Difference", "ProductPrice", 2, 1, 10.50, 12.50]
["Difference", "ProductDesc", 3,1, "Breathe ON Camo Tee", "Breathe ON Camo"]
dfOut2 - will contain the rows that are present in table2, but missing in table1, based on the primary key
["Missing in Table1", [3, 2, "T-Shirt", 20, "Breathe ON ColorBlock Tee"]]
I am thinking of the following approach:
1. Create df1 from table1 using the query "select * from schema1.table1"
2. Create df2 from table2 using the query "select * from schema2.table2"
3. Use df1.except(df2)
I referred to the documentation, but I am not sure if this approach will work.
Will df1.except(df2) compare all the fields, or just the key columns?
Also, I am not sure how to separate the output further.
You are basically trying to find the inserts, updates and deletes (the deltas) between two datasets. To answer the question above first: df1.except(df2) compares entire rows, i.e. all columns, not just the key columns, so on its own it returns a mix of "missing" and "changed" rows and cannot tell you which case you are looking at or which column differs. Here is one generic solution for such deltas:
from pyspark.sql.functions import sha2, concat_ws

# Comma-separated key columns; here the composite primary key from the question
keys = "ProductId,ProductVersion"
key_column_list = [x.strip().lower() for x in keys.split(',')]
# Name of the change-indicator column to be added
changeindicator = "chg_id"
df_compare_curr_df = spark.sql("select * from schema1.table1")
df_compare_prev_df = spark.sql("select * from schema2.table2")
# Column lists of both sides (the tables share the same definition)
currentcolumns = df_compare_curr_df.columns
previouscolumns = df_compare_prev_df.columns
# Hash the key columns and the full row so the comparison stays generic for any table
df_compare_curr_df = df_compare_curr_df.withColumn("all_hash_val", sha2(concat_ws("||", *currentcolumns), 256))
df_compare_curr_df = df_compare_curr_df.withColumn("key_val", sha2(concat_ws("||", *key_column_list), 256))
df_compare_prev_df = df_compare_prev_df.withColumn("all_hash_val", sha2(concat_ws("||", *previouscolumns), 256))
df_compare_prev_df = df_compare_prev_df.withColumn("key_val", sha2(concat_ws("||", *key_column_list), 256))
df_compare_curr_df.createOrReplaceTempView("NewTable")
df_compare_prev_df.createOrReplaceTempView("OldTable")
# Build the delta SQL: left outer joins find missing rows, inner joins find changed/unchanged rows
insert_sql = "select 'I' as " + changeindicator + ", A.* from NewTable A left outer join OldTable B on A.key_val = B.key_val where B.key_val is NULL"
update_sql = "select 'U' as " + changeindicator + ", A.* from NewTable A inner join OldTable B on A.key_val = B.key_val where A.all_hash_val != B.all_hash_val"
delete_sql = "select 'D' as " + changeindicator + ", A.* from OldTable A left outer join NewTable B on A.key_val = B.key_val where B.key_val is NULL"
nochange_sql = "select 'N' as " + changeindicator + ", A.* from OldTable A inner join NewTable B on A.key_val = B.key_val where A.all_hash_val = B.all_hash_val"
upsert_sql = insert_sql + " union " + update_sql
all_changes_sql = insert_sql + " union " + update_sql + " union " + delete_sql
df_compare_inserts = spark.sql(insert_sql)   # present in table1, missing in table2 -> dfOut1
df_compare_updates = spark.sql(update_sql)   # present in both, but some non-key value differs
df_compare_deletes = spark.sql(delete_sql)   # present in table2, missing in table1 -> dfOut2
df_compare_upserts = spark.sql(upsert_sql)
df_compare_changes = spark.sql(all_changes_sql)
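The update set above only flags that a row changed; to build the per-column dfdiff output from the question, here is a minimal sketch layered on top of the same session. It assumes the key_column_list defined above and that ProductName, ProductPrice and ProductDesc are the non-key columns to compare:
from pyspark.sql import functions as F

non_key_columns = ["ProductName", "ProductPrice", "ProductDesc"]  # assumption: columns to diff

new_df = spark.table("NewTable").alias("n")
old_df = spark.table("OldTable").alias("o")
joined = new_df.join(old_df, on=key_column_list, how="inner")

# One output row per changed column: difference type, column name, key columns, and both values.
# Values are cast to string so the per-column frames can be unioned together.
df_diff = None
for c in non_key_columns:
    changed = (joined
               .where(~F.col("n." + c).eqNullSafe(F.col("o." + c)))   # null-safe inequality
               .select(F.lit("Difference").alias("diff_type"),
                       F.lit(c).alias("column_name"),
                       *key_column_list,
                       F.col("n." + c).cast("string").alias("table1_value"),
                       F.col("o." + c).cast("string").alias("table2_value")))
    df_diff = changed if df_diff is None else df_diff.unionByName(changed)
With that, df_compare_inserts and df_compare_deletes correspond to dfOut1 and dfOut2, and df_diff to the dfdiff layout described in the question.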
I would like to replace a set of running and non-running numbers with commas and hyphens where appropriate.
Using STUFF & FOR XML PATH I was able to accomplish some of what I want, getting something like 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 15, 19, 20, 21, 22, 24.
WITH CTE AS (
SELECT DISTINCT t1.ORDERNo, t1.Part, t2.LineNum
FROM [DBName].[DBA].Table1 t1
JOIN Table2 t2 ON t2.Part = t1.Part
WHERE t1.ORDERNo = 'AB12345')
SELECT c1.ORDERNo, c1.Part, STUFF((SELECT ', ' + CAST(LineNum AS VARCHAR(5))
FROM CTE c2
WHERE c2.ORDERNo= c1.ORDERNo
FOR XML PATH('')), 1, 2, '') AS [LineNums]
FROM CTE c1
GROUP BY c1.ORDERNo, c1.Part
Here is some sample output:
ORDERNo Part LineNums
ON5650 PT01-0181 5, 6, 7, 8, 12
ON5652 PT01-0181 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 15, 19, 20, 21, 22, 24
ON5654 PT01-0181 1, 4
ON5656 PT01-0181 1, 2, 4
ON5730 PT01-0181 1, 2
ON5253 PT16-3934 1, 2, 3, 4, 5
ON1723 PT02-0585 1, 2, 3, 6, 8, 9, 10
Would like to have:
OrderNo Part LineNums
ON5650 PT01-0181 5-8, 12
ON5652 PT01-0181 1-10, 13, 15, 19-22, 24
ON5654 PT01-0181 1, 4
ON5656 PT01-0181 1-2, 4
ON5730 PT01-0181 1-2
ON5253 PT16-3934 1-5
ON1723 PT02-0585 1-3, 6, 8-10
This is a classic gaps-and-islands problem.
(a good read on the subject is Itzik Ben-Gan's Gaps and islands from SQL Server MVP Deep Dives)
The idea is that you first need to identify the groups of consecutive numbers. Once you've done that, the rest is easy.
First, create and populate a sample table (please save us this step in your future questions):
DECLARE @T AS TABLE
(
N int
);
INSERT INTO @T VALUES
(1), (2), (3), (4),
(6),
(8),
(10), (11),
(13), (14), (15),
(17),
(19), (20), (21),
(25);
Then, use a common table expression to identify the groups.
With Grouped AS
(
SELECT N,
N - ROW_NUMBER() OVER(ORDER BY N) As Grp
FROM @T
)
The result of this CTE is this:
N Grp
1 0
2 0
3 0
4 0
6 1
8 2
10 3
11 3
13 4
14 4
15 4
17 5
19 6
20 6
21 6
25 9
As you can see, while the numbers are consecutive, the grp value stays the same.
When a row has a number that isn't consecutive with the previous number, the grp value changes.
Then you select from that CTE in the same statement (the CTE above and the query below form one WITH ... SELECT), using a CASE expression to either select a single number (if it's the only one in its group) or the start and end of the group, separated by a dash:
SELECT STUFF(
(
SELECT ', ' +
CASE WHEN MIN(N) = MAX(N) THEN CAST(MIN(N) as varchar(11))
ELSE CAST(MIN(N) as varchar(11)) +'-' + CAST(MAX(N) as varchar(11))
END
FROM Grouped
GROUP BY grp
FOR XML PATH('')
), 1, 2, '') As GapsAndIslands
The result:
GapsAndIslands
1-4, 6, 8, 10-11, 13-15, 17, 19-21, 25
For fun I put together another way using window aggregates (e.g. SUM() OVER ...). I also use some newer T-SQL functionality such as CONCAT (2012+) and STRING_AGG (2017+). This uses Zohar's sample data.
DECLARE @T AS TABLE(N INT PRIMARY KEY CLUSTERED);
INSERT INTO @T VALUES (1),(2),(3),(4),(6),(8),(10),(11),(13),(14),(15),(17),(19),(20),(21),(25);
WITH
a AS (
SELECT t.N,isNewGroup = SIGN(t.N-LAG(t.N,1,t.N-1) OVER (ORDER BY t.N)-1)
FROM @T AS t),
b AS (
SELECT a.N, GroupNbr = SUM(a.isNewGroup) OVER (ORDER BY a.N)
FROM a),
c AS (
SELECT b.GroupNbr,
txt = CONCAT(MIN(b.N), REPLICATE(CONCAT('-',MAX(b.N)), SIGN(MAX(b.N)-MIN(b.N))))
FROM b
GROUP BY b.GroupNbr)
SELECT STRING_AGG(c.txt,', ') WITHIN GROUP (ORDER BY c.GroupNbr) AS Islands
FROM c;
Returns:
Islands
1-4, 6, 8, 10-11, 13-15, 17, 19-21, 25
And here an approach using a recursive CTE.
DECLARE @T AS TABLE(N INT PRIMARY KEY CLUSTERED);
INSERT INTO @T VALUES (1),(2),(3),(4),(6),(8),(10),(11),(13),(14),(15),(17),(19),(20),(21),(25);
WITH Numbered AS
(
SELECT N, ROW_NUMBER() OVER(ORDER BY N) AS RowIndex FROM @T
)
,recCTE AS
(
SELECT N
,RowIndex
,CAST(N AS VARCHAR(MAX)) AS OutputString
,(SELECT MAX(n2.RowIndex) FROM Numbered n2) AS MaxRowIndex
FROM Numbered WHERE RowIndex=1
UNION ALL
SELECT n.N
,n.RowIndex
,CASE WHEN A.TheEnd =1 THEN CONCAT(r.OutputString,CASE WHEN IsIsland=1 THEN '-' ELSE ',' END, n.N)
WHEN A.IsIsland=1 AND A.IsWithin=0 THEN CONCAT(r.OutputString,'-')
WHEN A.IsIsland=1 AND A.IsWithin=1 THEN r.OutputString
WHEN A.IsIsland=0 AND A.IsWithin=1 THEN CONCAT(r.OutputString,r.N,',',n.N)
ELSE CONCAT(r.OutputString,',',n.N)
END
,r.MaxRowIndex
FROM Numbered n
INNER JOIN recCTE r ON n.RowIndex=r.RowIndex+1
CROSS APPLY(SELECT CASE WHEN n.N-r.N=1 THEN 1 ELSE 0 END AS IsIsland
,CASE WHEN RIGHT(r.OutputString,1)='-' THEN 1 ELSE 0 END AS IsWithin
,CASE WHEN n.RowIndex=r.MaxRowIndex THEN 1 ELSE 0 END AS TheEnd) A
)
SELECT TOP 1 OutputString FROM recCTE ORDER BY RowIndex DESC;
The idea in short:
First we create a numbered set.
The recursive CTE will use the row's index to pick the next row, thus iterating through the set row-by-row
The APPLY determines three BIT values:
If the distance to the previous value is 1, we are on an island; otherwise we are not.
If the last character of the growing output string is a hyphen, we are waiting for the end of an island; otherwise we are not.
...and whether we've reached the end.
The CASE deals with this four-field matrix:
First we deal with the end, to avoid a trailing hyphen.
Reaching an island we add a hyphen.
Staying on the island we just continue.
Reaching the end of an island we add the last number, a comma, and start a new island.
Any other case will just add a comma and start a new island.
Hint: You can read island as group or section, while the commas mark the gaps.
Combining what I already had and using Zohar Peled's code I was finally able to figure out a solution:
WITH cteLineNums AS (
    SELECT TOP 100 PERCENT t1.OrderNo, t1.Part, t2.LineNum
        , (t2.LineNum - ROW_NUMBER() OVER(PARTITION BY t1.OrderNo, t1.Part ORDER BY t1.OrderNo, t1.Part, t2.LineNum)) AS RowSeq
    FROM [DBName].[DBA].Table1 t1
    JOIN Table2 t2 ON t2.Part = t1.Part
    WHERE t1.OrderNo = 'AB12345'
    GROUP BY t1.OrderNo, t1.Part, t2.LineNum
    ORDER BY t1.OrderNo, t1.Part, t2.LineNum)
SELECT OrderNo, Part
    , STUFF((SELECT ', ' +
        CASE WHEN MIN(c1.LineNum) = MAX(c1.LineNum) THEN CAST(MIN(c1.LineNum) AS VARCHAR(3))
             WHEN MIN(c1.LineNum) = (MAX(c1.LineNum)-1) THEN CAST(MIN(c1.LineNum) AS VARCHAR(3)) + ', ' + CAST(MAX(c1.LineNum) AS VARCHAR(3))
             ELSE CAST(MIN(c1.LineNum) AS VARCHAR(3)) + '-' + CAST(MAX(c1.LineNum) AS VARCHAR(3))
        END
        FROM cteLineNums c1
        WHERE c1.OrderNo = c2.OrderNo
            AND c1.Part = c2.Part
        GROUP BY c1.OrderNo, c1.Part, c1.RowSeq
        ORDER BY c1.OrderNo, c1.Part, MIN(c1.LineNum)
        FOR XML PATH('')), 1, 2, '') AS [LineNums]
FROM cteLineNums c2
GROUP BY OrderNo, Part
I used ROW_NUMBER() OVER(PARTITION BY ...) since I returned multiple records with different order numbers and part numbers. All this led to me still having to do the self-join in the second part in order to get the correct LineNums to show for each record.
The second WHEN in the CASE expression is there because the code otherwise defaults to displaying something like 2, 5, 8-9, 14 when it should be 2, 5, 8, 9, 14.
I have a table that has an Identity column called ID and another column called DateID that references another table.
The date column is used in joins, but the ID column has much higher cardinality.
Distinct count for ID column : 657167
Distinct count for DateID column: 350
Can anyone please provide any insights as to which column would be a better choice for distribution key?
Also, regarding another question (merged into this old question as they are related):
I have a dilemma in selecting sort and dist keys for my table.
Sort keys:
Should I consider cardinality when selecting a sort key?
A column that joins with other tables would be a candidate for a sort key; is my assumption correct?
If I use a compound sort key with two columns, does the order of the columns matter?
If I define the column DateID as the dist key, should I put DateID in front of CustomerId when defining the compound sort key?
P.S. I read some articles regarding choosing a dist key; they say I should use a column that is joined with other tables and has greater cardinality.
SELECT SP.*,
CP.*,
TV.*
FROM
(
SELECT * --> there are about 20 aggregation statements in the select statement
FROM FactCustomer f -- contains about 600K records
JOIN DimDate d -- contains about 700 records
ON f.DateID = d.DateID
JOIN DimTime t -- contains 24 records
ON f.TimeID = t.HourID
JOIN DimSalesBranch s -- contains about 64K records
ON f.BranchID = s.BranchID
WHERE s.BranchID IN ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 )
AND d.DateTimeInfo >= (CASE
WHEN s.OpeningDate > '2018-01-01' THEN
s.OpeningDate
ELSE
'2018-01-01'
END
)
AND d.DateTimeInfo <= '2018-12-31'
AND StartHour >= 9
AND starthour > 0
AND (EndHour <= 22)
) SP
LEFT JOIN
(
SELECT * --> there are about 20 aggregation statements in the select statement
FROM FactCustomer f
JOIN DimDate d
ON f.DateID = d.DateID
JOIN DimTime t
ON f.TimeID = t.HourID
JOIN DimSalesBranch s
ON f.BranchID = s.BranchID
WHERE s.BranchID IN ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 )
AND d.DateTimeInfo >= (CASE
WHEN s.OpeningDate > '2018-01-01' THEN
s.OpeningDate
ELSE
'2018-01-01'
END
)
AND d.DateTimeInfo <= '2018-09-16'
AND StartHour >= 9
AND (EndHour <= 22)
) CP
ON SP.StartDate = CP.StartDate_CP
AND SP.EndDate = CP.EndDate_CP
LEFT JOIN
(
SELECT * --> there are about 6 aggregation statements in the select statement
FROM FactSalesTargetBranch f
JOIN DimDate d
ON f.DateID = d.DateID
JOIN DimSalesBranch s
ON f.BranchID = s.BranchID
WHERE s.BranchID IN ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 )
AND d.DateTimeInfo >= (CASE
WHEN s.OpeningDate > '2018-01-01' THEN
s.OpeningDate
ELSE
'2018-01-01'
END
)
AND d.DateTimeInfo <= '2018-09-16'
) TV
ON SP.StartDate = TV.StartDate_TV
AND SP.EndDate = TV.EndDate_TV;
Any insights much appreciated.
Regards.
In this case:
Use "even" distribution for your main table; this will allow good parallelism. (DateID would be a bad candidate: with only ~350 distinct values the rows cannot spread evenly across the slices.)
Use "all" distribution for your DateID table (the smaller table that you join with), so joins to it need no data redistribution.
Generally, "even" distribution is a good choice and will give you the best results unless you need to join large tables together.
see https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
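A minimal DDL sketch of that recommendation, reusing the table and column names from the question's query (the sort key choice of DateID is an assumption, based on the date range filters):
-- Fact table: EVEN distribution spreads rows round-robin across slices for parallelism.
CREATE TABLE factcustomer (
    id       BIGINT IDENTITY(0, 1),
    dateid   INTEGER,
    timeid   INTEGER,
    branchid INTEGER
    -- ... measure columns
)
DISTSTYLE EVEN
SORTKEY (dateid);          -- assumption: sort on the date used in range filters

-- Small date dimension: ALL distribution keeps a full copy on every node,
-- so joins to it require no data redistribution.
CREATE TABLE dimdate (
    dateid       INTEGER,
    datetimeinfo DATE
    -- ... other attributes
)
DISTSTYLE ALL
SORTKEY (dateid);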
How can I select from a typed set in Oracle 10g?
I.E. SELECT * FROM (3,5,20,68,54,13,56,899,1)
Additionally, how would I filter it?
I.E. SELECT * FROM (3,5,20,68,54,13,56,899,1) WHERE > 5
Where is the data coming from and what are you planning to do with it?
If the data is being read from a file, you would normally create an external table to read from the file, or use SQL*Loader or some other ETL tool to load the data into a staging table or a PL/SQL collection that you could then query. For example, with a schema-level collection type:
SQL> create type num_tbl is table of number;
2 /
Type created.
SQL> ed
Wrote file afiedt.buf
1 declare
2 l_nums num_tbl := num_tbl( 3, 5, 20, 68, 54 );
3 begin
4 for x in (select * from table(l_nums))
5 loop
6 dbms_output.put_line( x.column_value );
7 end loop;
8* end;
SQL> /
3
5
20
68
54
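If the numbers really are coming from a flat file, an external table is the other option mentioned above; here is a minimal sketch, where the directory object, file name, and single-column layout are all assumptions:
CREATE OR REPLACE DIRECTORY ext_data_dir AS '/data/incoming';  -- assumption: path readable by the database server

CREATE TABLE numbers_ext
(
  num NUMBER
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_data_dir
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('numbers.csv')   -- assumption: the file holding the values
)
REJECT LIMIT UNLIMITED;

SELECT num
  FROM numbers_ext
 WHERE num > 5;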
If you're doing some sort of manual process, you would normally be looking for data from another table, i.e.
SELECT *
FROM some_other_table
WHERE some_key IN (3, 5, 20, 68, 54, 13, 56, 889, 1 );
If you're really trying to generate a data set full of arbitrary data pulled from a file that you don't want Oracle to read, you can always do a series of SELECT statements from DUAL that are all UNION ALL'd together, but this obviously gets rather cumbersome.
WITH sample_data
AS (SELECT 3 num FROM dual UNION ALL
SELECT 5 FROM dual UNION ALL
SELECT 20 FROM dual UNION ALL
SELECT 68 FROM dual UNION ALL
SELECT 54 FROM dual UNION ALL
...
SELECT 1 FROM dual)
SELECT *
FROM sample_data
WHERE num > 5;
Additionally, using the WITH clause and a CSV string, we can parse the string as a table.
Example:
VARIABLE liste VARCHAR2(100)
EXECUTE :liste := '5, 25, 41, 52';
WITH liste AS (
SELECT SUBSTR(:liste, INSTR(','||:liste||',', ',', 1, rn),
INSTR(','||:liste||',', ',', 1, rn+1) -
INSTR(','||:liste||',', ',', 1, rn)-1) valeur
FROM (
SELECT ROWNUM rn FROM DUAL
CONNECT BY LEVEL<=LENGTH(:liste) - LENGTH(REPLACE(:liste,',',''))+1))
SELECT TRIM(valeur)
FROM liste;
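To apply the filter from the question (WHERE > 5) on top of the parsed list, convert the trimmed values to numbers; a small sketch varying the same query:
WITH liste AS (
    SELECT TO_NUMBER(TRIM(SUBSTR(:liste, INSTR(','||:liste||',', ',', 1, rn),
                          INSTR(','||:liste||',', ',', 1, rn+1) -
                          INSTR(','||:liste||',', ',', 1, rn)-1))) valeur
    FROM (
        SELECT ROWNUM rn FROM DUAL
        CONNECT BY LEVEL<=LENGTH(:liste) - LENGTH(REPLACE(:liste,',',''))+1))
SELECT valeur
FROM liste
WHERE valeur > 5;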