Deduplication problems in Pyspark - pyspark

I have one dataframe with many rows of id, date and other information. It contains 2,856,134 records. A count distinct of ID results in 1,552,184 records.
Using this:
DF2 = sorted(DF.groupBy(DF.id).max('date').alias('date').collect())
Gives me the max date per ID, and results in 1,552,184 records, which matches the above. So far so good.
I try to join DF2 back to DF where id = id and max_date = date:
df3 = DF2.join(DF,(DF2.id==DF.id)&(DF2.Max_date==DF.date),"left")
This results in 2,358,316 records - which is different than the original amount.
I changed the code to:
df3 = DF2.join(DF,(DF2.id==DF.id)&(DF2.Max_date==DF.date),"left").dropDuplicates()
This results in 1,552,508 records (which is odd, since it should return 1,552,184 from the de-duplicated DF2 above.
Any idea what's happening here? I presume it's something to do with my join function.
Thanks!

its because your table 2 has duplicate entries for example:
Table1 Table2
_______ _________
1 2
2 2
3 5
4 6
SELECT Table1.Id, Table2.Id FROM Table1 LEFT OUTER JOIN Table2 ON Table1.Id=Table2.Id
Results:
1,null
2,2
2,2
3,null
4,null
I hope this will help you in solving your problem

Related

How to include and exclude ids in once query postgresql

I use PostgreSQL 13.3
I'm trying to think how I can make include/exclude in query at the same time
I have include_system_ids [1,5] and exclude_system_ids [3]
There's one big table - records
system_records table
record
system_id
1
1
1
5
1
3
2
1
2
5
If a record contains an exclusive identifier, then it should not be included in the final selection. I had some several tries, but I didn't get a necessary result
Awaiting result: record with id 2
Fact result: 1, 2
My variants
select r.id from records r
left join (select record_id from system_records
where system_id in (1,5)
) include_ids on r.id = include_ids
left join (select record_id from system_records
where system_id not in (3)
) exclude_ids on r.id = exclude_ids.id
Honestly, I don't understand how I can do it((
Is there anyone who can help me
Maybe this query could be a solution (result here)
with x as (select record,string_agg(system_id::varchar,',') as sys_id from records group by record)
select records.*
from records,x
where records.record = x.record
and x.sys_id = '1,5'

Snowflake "Exploding Join" issue while doing left join for multiple tables

I am trying to do some left joins on multiple tables and facing the following issue.
Row Counts of tables
Table 1: 1.6M
Table 2: 1.7M
Table 3: 1.5M
When I am doing left Join using Table 1 and 2 and following query, I get data count as 1.8 M (acceptable):
SELECT Table1.ID1, Table1.ID2, Table2.Name, Table2.City
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
;
Similarly when I am doing left Join using Table 1 and 3 and following query, I get data count as 1.9 M (acceptable):
SELECT Table1.ID1, Table1.ID2, Table3.Name, Table3.City
FROM Table1
LEFT JOIN Table3
ON Table1.ID1 = Table3.ID1
AND Table1.ID2 = Table3.ID2
AND Table1.Source_System = Table3.Source_System
;
But when I am doing left Join using Table 1, 2 and 3 and following query, I get data count as 11.9 G (ISSUE):
SELECT
Table1.ID1, Table1.ID2,
Table2.Name, Table2.City,
Table3.Name as Name1, Table3.City as City1
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
LEFT JOIN Table3
ON Table1.ID1 = Table3.ID1
AND Table1.ID2 = Table3.ID2
AND Table1.Source_System = Table3.Source_System
;
So it seems you have assumed the data in table1 and table2 join in a 1:1 ratio, and also assumed the table1 and table3 are also a 1:1 ratio, so assumed when those three tables joined, that ration should be in the order again of 1:1
But if half you entries in table1 are not in table2 to get the 1.8M result, the the common rows would have to be duplicated > 2.0 times that increase. If we change that from half not matching to a tenth not matching there would need to be > 10.0 duplicates. Thus to get the 4 magnitude growth you have, it seems like you have only 100th match, but greater than 100.0 duplicates, which when cross joined give the 10,000 growth in rows.
this could be seen via:
SELECT Table1.ID1, Table1.ID2, Table1.Source_System, counnt(*) as counts
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
GROUP BY 1,2,3
ORDER BY counts DESC
;
this will show the total distinct pairs, and which are the worst contributors to the combination explosion
When your left join is producing more records than the referenced table it should not be acceptable! that should signal warning in your join condition and data. Either you investigate those records in the table to avoid it in the first place or you would need to keep tweaking your SQL to satisfy clean join that produces exact reference table row count. otherwise, it is very common that left joining to another table with a small duplicate records will produce exponential row count as you are facing here.
Try reading these questions here to help here and here
Just to add about investigating and finding those rows, use following SQL to find in each table what rows that have same ID1, ID2 and Source_System columns
i.e. :-
Select ID1, ID2 ,Source_System, COUNT(*) AS NUM_RECORDS_DUPS
FROM TABLE1
GROUP BY ID1, ID2 , Source_System
HAVING COUNT(*)>1 -- Filtering on duplicate rows that has more than a row satisfying the join condition
Use the same for each of the tables to find those records and either add another unique condition/ aggregate the table on the joining keys or ask for data cleansing ! for those records
Have you tried adding a DISTINCT clause?
SELECT DISTINCT columns, of, choice
FROM Table1
LEFT JOIN Table2 on ...
LEFT JOIN Table3 on ...
I think what's happening is you have dups that left join on another giant set of dups.
Use the proper keys to join the two tables, it solves the issue.

T-SQL select all IDs that have value A and B

I'm trying to find all IDs in TableA that are mentioned by a set of records in TableB and that set if defined in Table C. I've come so far to the point where a set of INNER JOIN provide me with the following result:
TableA.ID | TableB.Code
-----------------------
1 | A
1 | B
2 | A
3 | B
I want to select only the ID where in this case there is an entry for both A and B, but where the values A and B are based on another Query.
I figured this should be possible with a GROUP BY TableA.ID and HAVING = ALL(Subquery on table C).
But that is returning no values.
Since you did not post your original query, I will assume it is inside a CTE. Assuming this, the query you want is something along these lines:
SELECT ID
FROM cte
WHERE Code IN ('A', 'B')
GROUP BY ID
HAVING COUNT(DISTINCT Code) = 2;
It's an extremely poor question, but you you probably need to compare distinct counts against table C
SELECT a.ID
FROM TableA a
GROUP BY a.ID
HAVING COUNT(DISTINCT a.Code) = (SELECT COUNT(*) FROM TableC)
We're guessing though.

Retrieving the rows using join query

I have two tables like this
A B
---- -------
col1 col2 col1 col2
---------- -----------
A table contains 300k rows
B table contains 400k rows
I need to count the col1 for table A if it is matching col1 for table B
I have written a query like this:
select count(distinct ab.col1) from A ab join B bc on(ab.col1=bc.col1)
but this takes too much time
could try a group by...
Also ensure that the col1 is indexed in both tables
SELECT COUNT (col1 )
FROM
(
SELECT aa.col1
FROM A aa JOIN B bb on aa.col1 = bb.col1
GROUP BY (aa.col1)
)
It's difficult to answer without you positing more details: did you analyze the tables? Do you have an index on col1 on each table? How many rows are you counting?
That being said, there aren'y so many potential query plans for your query. You likely have two seq scans that are hash joined together, which is about the best you can do... If you've a material numbers of rows, you'll be counting a gazillion rows, and this takes time.
Perhaps you could rewrite the query differently? If every B.col1 is in A.col1, you could get the same result without the join:
select count(distinct col1) from B
If A has low cardinality, it might be faster to rely on exists():
with vals as (
select distinct A.col1 as val from A
)
select count(*) from vals
where exists(select 1 from B where B.col1 = vals.val)
Or, if you know every possible value from A.col1 and it's reasonably small, you could unnest an array without querying A at all:
select count(*) from unnest(Array[val1, val2, ...]) as vals (val)
where exists(select 1 from B where B.col1 = vals.val)
Or vice-versa, in each of the above, if every B holds the reference values.

How do I get the max date value from each of my partitions

I have created a partition table which is partitioned based on years of a certain field
e.g. partition 1 - Year 2011 |
partition 2 - Year 2012 |
partition 3 - Year 2013
How can i show the max date for each partition
e.g. partition 1 - 2011/12/15 |
partition 2 - 2012/12/25 |
partition 3 - 2013/12/16
Just group the records by the year in the field you used to partition the table. More especifically, you can use this:
select year(my_date_field), max(my_date_field)
from my_partitioned_table
group by year(my_date_field)
order by year(my_date_field)
The order by above is not required, but gives you a nicely ordered result set.
Use max() function (Example).
Select max(y1) y1, max(y2) y2, max(y3) y3
From t
Let me try again. Now I think this query below will do the trick:
SELECT year(cast(rv.value as date)) _year,
p.partition_number,
max_date
FROM sys.partitions p
JOIN sys.indexes i
ON (p.object_id = i.object_id AND p.index_id = i.index_id)
JOIN sys.partition_schemes ps
ON (ps.data_space_id = i.data_space_id)
JOIN sys.partition_functions f
ON (f.function_id = ps.function_id)
LEFT JOIN sys.partition_range_values rv
ON (f.function_id = rv.function_id AND p.partition_number = rv.boundary_id)
JOIN sys.destination_data_spaces dds
ON (dds.partition_scheme_id = ps.data_space_id
AND dds.destination_id = p.partition_number)
JOIN sys.filegroups fg
ON (dds.data_space_id = fg.data_space_id)
INNER JOIN (
select year([partitioncol]) year,
max([partitioncol]) max_date
from TABLE1 group by year([partitioncol])
) table1
ON (table1.year=year(cast(rv.value as date)))
WHERE i.index_id < 2
AND i.object_id = Object_Id('TABLE1')
With my test data (which I will explain how to set up below), I got these results:
_year partition_number max_date
2011 1 2011-12-31 08:00:16.920
2012 2 2012-12-31 08:00:13.397
2013 3 2013-10-02 08:00:10.660
The key system table here is sys.partition_range_values. It gives us the range values for each partition that we join with the original agreggation query that returns the max dates per year.
To reproduce my results, follow the instructions in the link below (it's a post about creating partitioned tables). That will create and populate TABLE1.
http://www.mssqltips.com/sqlservertip/2888/how-to-partition-an-existing-sql-server-table/
Then run the query at the beggining of my post. I adapted it from a query presented by the link below:
http://www.sqlsuperfast.com/post/2011/02/22/T-SQL-Get-Partition-Details.aspx