NOT IN query performance issue with large data - postgresql

I was trying to get the id and the number from a table, for the rows where number does not appear among the id values.
select id,number from tmp_t where number not in (select id from tmp_t)
I tried the query and it takes far too long: after almost 40 minutes I got disconnected from the server.
So what should I do? The table has around 500K rows.
What I want to show is: the id and the number for every row whose number doesn't exist as an id.
I tried to insert the number, but number is a FK that depends on id, so I want to find the offending id and number pairs. That's why I'm using NOT IN.
Maybe someone knows? By the way, I'm using PostgreSQL 13.

You can write it with NOT EXISTS instead. The two queries give different results if any value of id is NULL; in that case NOT IN probably doesn't yield the answer you want, so NOT EXISTS is better from that perspective as well.
select id,number from tmp_t where not exists
(select 1 from tmp_t a where a.id=tmp_t.number);
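But your formulation is also efficient as long as work_mem is large enough for PostgreSQL to hash the subquery result (EXPLAIN then shows a hashed SubPlan). If it is too small, the subquery result is scanned again for every outer row.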
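To make the NULL difference concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made so that the example is self-contained; both follow the same three-valued logic for NOT IN), and the three sample rows are made up.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp_t (id INTEGER, number INTEGER)")
conn.executemany(
    "INSERT INTO tmp_t (id, number) VALUES (?, ?)",
    [(1, 2), (2, 99), (None, 1)],  # made-up rows; one id is NULL
)

# 99 NOT IN (1, 2, NULL) evaluates to NULL, not TRUE, so the row is dropped.
not_in_rows = conn.execute(
    "SELECT id, number FROM tmp_t "
    "WHERE number NOT IN (SELECT id FROM tmp_t)"
).fetchall()

# NOT EXISTS only asks whether a matching row exists, so the NULL id is harmless.
not_exists_rows = conn.execute(
    "SELECT id, number FROM tmp_t "
    "WHERE NOT EXISTS (SELECT 1 FROM tmp_t a WHERE a.id = tmp_t.number)"
).fetchall()

print(not_in_rows)      # []
print(not_exists_rows)  # [(2, 99)]
```

If id is declared NOT NULL (for example because it is the primary key), the two forms return the same rows and only the execution plan differs.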

Typically NOT EXISTS is faster (and doesn't suffer from surprises if NULL values are involved):
select t1.id, t1.number
from tmp_t t1
where not exists (select *
from tmp_t t2
where t2.id = t1.number)

PostgreSQL how to GROUP BY single field from returned table

I have a complicated query; simplified, it looks like this:
SELECT
t.*,
SUM(a.hours) AS spent_hours
FROM (
SELECT
person.id,
person.name,
person.age,
SUM(contacts.id) AS contact_count
FROM
person
JOIN contacts ON contacts.person_id = person.id
) AS t
JOIN activities AS a ON a.person_id = t.id
GROUP BY t.id
A query like this works fine in MySQL, but Postgres needs to know that the GROUP BY field is unique. Even though it actually is, in this case I would need to GROUP BY every field returned from the derived table t.
I can do that, but I don't believe it will work efficiently with big data.
I can't JOIN with activities directly in the first query, because a person can have several contacts, which would make the query count the hours of each activity once for every joined contact.
Is there a Postgres way to make this query work? Maybe a way to force Postgres to treat t.id as unique, or some other solution that does the same thing the Postgres way?
This query is not valid standard SQL on either database system: there is an aggregate function in the inner query, but you are not grouping it (unless you use window functions). MySQL is a special case, because it accepts the query when ONLY_FULL_GROUP_BY is removed from sql_mode. So MySQL allows this usage only because of that server setting, and you cannot do the same in PostgreSQL.
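One common way to get one row per person without the double counting described in the question is to aggregate each child table in its own grouped subquery and only then join the results. The sketch below is an illustration, not the asker's real schema: it runs on SQLite through Python's sqlite3 module, the sample rows are made up, and it uses COUNT(*) for contact_count on the assumption that a count was intended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE contacts (id INTEGER PRIMARY KEY, person_id INTEGER);
CREATE TABLE activities (id INTEGER PRIMARY KEY, person_id INTEGER, hours INTEGER);
-- Made-up sample data: Smith has two contacts and two activities.
INSERT INTO person VALUES (1, 'Smith', 40), (2, 'Jones', 35);
INSERT INTO contacts VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO activities VALUES (100, 1, 20), (101, 1, 100), (102, 2, 30);
""")

rows = conn.execute("""
SELECT p.id, p.name, p.age, c.contact_count, a.spent_hours
FROM person AS p
-- Each child table is collapsed to one row per person before joining,
-- so contacts cannot multiply the activity rows.
JOIN (SELECT person_id, COUNT(*) AS contact_count
      FROM contacts
      GROUP BY person_id) AS c ON c.person_id = p.id
JOIN (SELECT person_id, SUM(hours) AS spent_hours
      FROM activities
      GROUP BY person_id) AS a ON a.person_id = p.id
ORDER BY p.id
""").fetchall()

for row in rows:
    print(row)
# (1, 'Smith', 40, 2, 120)
# (2, 'Jones', 35, 1, 30)
```

Because every subquery groups only by its own person_id, the outer query needs no GROUP BY at all. Use LEFT JOIN instead of JOIN if people without contacts or activities should still appear.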
I knew MySQL allowed indeterminate grouping, but I honestly never knew how it implemented it... it always seemed imprecise to me, conceptually.
So depending on what that means (I'm too lazy to look it up), you might need one of two possible solutions, or maybe a third.
If your intent is to see all rows (perform the aggregate function but not consolidate/group rows), then you want a window function, invoked with PARTITION BY. Here is a really dumbed-down version of your query:
SELECT
t.*,
SUM (a.hours) over (partition by t.id) AS spent_hours
FROM t
JOIN activities AS a ON a.person_id = t.id
This means you get all records in table t, not one record per t.id, but each row also contains the sum of the hours over all rows with that value of id.
For example the sum column would look like this:
Name   Hours  Sum Hours
-----  -----  ---------
Smith     20        120
Jones     30         30
Smith    100        120
Whereas a group by would have had Smith once and could not have displayed the hours column in detail.
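The small table above can be reproduced with a short sketch. It assumes SQLite 3.25 or newer (the first version with window functions) accessed through Python's sqlite3 module; the table layout and the extra id column are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activities (id INTEGER PRIMARY KEY, name TEXT, hours INTEGER);
INSERT INTO activities VALUES (1, 'Smith', 20), (2, 'Jones', 30), (3, 'Smith', 100);
""")

# The window function adds a per-name total to every row without collapsing rows.
rows = conn.execute("""
SELECT id, name, hours,
       SUM(hours) OVER (PARTITION BY name) AS sum_hours
FROM activities
ORDER BY id
""").fetchall()

for row in rows:
    print(row)
# (1, 'Smith', 20, 120)
# (2, 'Jones', 30, 30)
# (3, 'Smith', 100, 120)
```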
If you really did only want one row per t.id, then Postgres will require you to tell it how to determine which row. In the example above for Smith, do you want to see the 20 or the 100?
There is another possibility, but I think I'll let you reply first. My gut tells me option 1 is what you're after and you want the analytic function.

redshift nulling columns when joining another table

table_1 has 35 columns, table_2 has 20 columns
query is:
select table1.*,
table2.f1,
...
table2.f20
FROM public.table_1 as table1
left join public.table_2 as table2
on table1.id = table2.id
and table1.arrival_time::date <= table2.end_date::date
and table2.activity_date < table2.end_date
;
This works: I expect 469 rows to be returned, and that's what I get. However, several fields from table_1 are displayed as null instead of the values in the table.
These fields are NOT part of the join.
Due to IP concerns I can't provide the full details of the tables. Every field in table_1 and table_2 is a varchar (don't ask me why a timestamp is stored as a varchar; it's a long story that I have no control over).
This query WORKS in RDS PostgreSQL!
Any ideas why it has a problem in redshift?
Well I'll be very confused.
table_1 is data from two sources joined together - I didn't even think to look at the sources. Turns out the linked source had no data for one value.
Just goes to show that when looking at just a piece of the data you need to look HARD at all the data.
Now I'm off to find a better source for the missing data.
Thanks for your time!
James

Postgres subquery has access to column in a higher level table. Is this a bug? or a feature I don't understand?

I don't understand why the following doesn't fail. How does the subquery have access to a column from a different table at the higher level?
drop table if exists temp_a;
create temp table temp_a as
(
select 1 as col_a
);
drop table if exists temp_b;
create temp table temp_b as
(
select 2 as col_b
);
select col_a from temp_a where col_a in (select col_a from temp_b);
/*why doesn't this fail?*/
The following fail, as I would expect them to.
select col_a from temp_b;
/*ERROR: column "col_a" does not exist*/
select * from temp_a cross join (select col_a from temp_b) as sq;
/*ERROR: column "col_a" does not exist
*HINT: There is a column named "col_a" in table "temp_a", but it cannot be referenced from this part of the query.*/
I know about the LATERAL keyword, but I'm not using LATERAL here. Also, this query succeeds even in versions of Postgres before 9.3, which is when the LATERAL keyword was introduced.
Here's a sqlfiddle: http://sqlfiddle.com/#!10/09f62/5/0
Thank you for any insights.
Although this feature might be confusing, without it, several types of queries would be more difficult, slower, or impossible to write in sql. This feature is called a "correlated subquery" and the correlation can serve a similar function as a join.
For example: Consider this statement
select first_name, last_name from users u
where exists (select * from orders o where o.user_id=u.user_id)
Now this query will get the names of all the users who have ever placed an order. Now, I know, you can get that info using a join to the orders table, but you'd also have to use a "distinct", which would internally require a sort and would likely perform a tad worse than this query. You could also produce a similar query with a group by.
Here's a better example that's pretty practical, and not just for performance reasons. Suppose you want to delete all users who have no orders and no tickets.
delete from users u where
not exists (select * from orders o where o.user_id = u.user_id)
and not exists (select * from tickets t where t.user_id = u.user_id)
One very important thing to note is that you should fully qualify or alias your table names when doing this or you might wind up with a typo that completely messes up the query and silently "just works" while returning bad data.
The following is an example of what NOT to do.
select * from users
where exists (select * from product where last_updated_by=user_id)
This looks just fine until you look at the tables and realize that "product" has no "last_updated_by" field and the users table does. The unqualified name silently resolves to the outer users table instead of raising an error, so the query returns the wrong data. Add the alias and the query will fail, because no "last_updated_by" column exists in product.
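The effect of qualifying the column can be checked with the temp_a / temp_b tables from the question. This is a minimal sketch that uses Python's sqlite3 module as a stand-in for Postgres (an assumption; SQLite resolves unqualified names outward in the same way, but its error message is worded differently).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE temp_a (col_a INTEGER);
CREATE TABLE temp_b (col_b INTEGER);
INSERT INTO temp_a VALUES (1);
INSERT INTO temp_b VALUES (2);
""")

# Unqualified: col_a is not found in temp_b, so it resolves to the outer temp_a.col_a.
# The subquery returns the outer value once per row of temp_b, so the test is always true.
unqualified = conn.execute(
    "SELECT col_a FROM temp_a WHERE col_a IN (SELECT col_a FROM temp_b)"
).fetchall()

# Qualified: b.col_a must come from temp_b, which has no such column, so this fails.
try:
    conn.execute(
        "SELECT a.col_a FROM temp_a AS a "
        "WHERE a.col_a IN (SELECT b.col_a FROM temp_b AS b)"
    )
    qualified_error = None
except sqlite3.OperationalError as exc:
    qualified_error = str(exc)

print(unqualified)      # [(1,)]
print(qualified_error)  # error text naming the missing column b.col_a
```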
I hope this has given you some examples that show you how to use this feature. I use them all the time in update and delete statements (as well as in selects-- but I find an absolute need for them in updates and deletes often)

Why is performance of CTE worse than temporary table in this example

I recently asked a question regarding CTE's and using data with no true root records (i.e Instead of the root record having a NULL parent_Id it is parented to itself)
The question link is here; Creating a recursive CTE with no rootrecord
The answer has been provided to that question and I now have the data I require however I am interested in the difference between the two approaches that I THINK are available to me.
The approach that yielded the data I required was to create a temp table with cleaned-up parenting data and then run a recursive CTE against it. It looked like this:
Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
INTO #Parties
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1;
WITH linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM #Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM #Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
I also attempted to retrieve the same data by defining two CTE's. One to emulate the creation of the temp table above and the other to do the same recursive work but referencing the initial CTE rather than a temp table;
WITH Parties
AS
(Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1),
linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
Now these two scripts are run on the same server however the temp table approach yields the results in approximately 15 seconds.
The multiple CTE approach takes upwards of 5 minutes (so long in fact that I have never waited for the results to return).
Is there a reason why the temp table approach would be so much quicker?
For what it is worth I believe it is to do with the record counts. The base table has 200k records in it and from memory CTE performance is severely degraded when dealing with large data sets but I cannot seem to prove that so thought I'd check with the experts.
Many Thanks
Well as there appears to be no clear answer for this some further research into the generics of the subject threw up a number of other threads with similar problems.
This one seems to cover many of the variations between temp table and CTEs so is most useful for people looking to read around their issues;
Which are more performant, CTE or temporary tables?
In my case it would appear that the large amount of data in my CTEs causes the issue: the CTE result is not cached anywhere, so recreating it each time it is referenced later has a large impact.
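The "clean up once, then recurse" pattern from the question can be sketched on a toy hierarchy. The example runs on SQLite through Python's sqlite3 module, so it only illustrates the shape of the approach and says nothing about SQL Server timings. The table is cut down to three columns, the CURRENT_RECORD filter is omitted, and the four sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimension_parties (party_id INTEGER, parent_id INTEGER, party_name TEXT);
-- Made-up hierarchy; the root row is parented to itself, as in the question.
INSERT INTO dimension_parties VALUES
    (1, 1, 'Root'),
    (2, 1, 'Child A'),
    (3, 1, 'Child B'),
    (4, 2, 'Grandchild');

-- Materialize the cleaned-up parent links once, so the recursion reads stored rows.
CREATE TEMP TABLE parties AS
SELECT CASE WHEN parent_id = party_id THEN NULL ELSE parent_id END AS act_parent_id,
       party_id,
       party_name
FROM dimension_parties;
""")

rows = conn.execute("""
WITH RECURSIVE linked_parties(act_parent_id, party_id, party_name, level) AS (
    -- Anchor member: the rows whose cleaned-up parent is NULL.
    SELECT act_parent_id, party_id, party_name, 0
    FROM parties
    WHERE act_parent_id IS NULL
    UNION ALL
    -- Recursive member: children of the rows found so far.
    SELECT p.act_parent_id, p.party_id, p.party_name, t.level + 1
    FROM parties AS p
    JOIN linked_parties AS t ON p.act_parent_id = t.party_id
)
SELECT party_id, party_name, level
FROM linked_parties
ORDER BY level, party_id
""").fetchall()

for row in rows:
    print(row)
# (1, 'Root', 0)
# (2, 'Child A', 1)
# (3, 'Child B', 1)
# (4, 'Grandchild', 2)
```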
This might not be exactly the same issue you experienced, but I just came across a few days ago a similar one and the queries did not even process that many records (a few thousands of records).
And yesterday my colleague had a similar problem.
Just to be clear we are using SQL Server 2008 R2.
The pattern that I identified and seems to throw the sql server optimizer off the rails is using temporary tables in CTEs that are joined with other temporary tables in the main select statement.
In my case I ended up creating an extra temporary table.
Here is a sample.
I ended up doing this:
SELECT DISTINCT st.field1, st.field2
into #Temp1
FROM SomeTable st
WHERE st.field3 <> 0
select x.field1, x.field2
FROM #Temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I tried the following query but it was a lot slower, if you can believe it.
with temp1 as (
SELECT DISTINCT st.field1, st.field2
FROM SomeTable st
WHERE st.field3 <> 0
)
select x.field1, x.field2
FROM temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I also tried to inline the first query in the second one and the performance was the same, i.e. VERY BAD.
SQL Server never ceases to amaze me. Once in a while I come across issues like this one that remind me it is a Microsoft product after all, but in the end you can say that other database systems have their own quirks.

TSQL Keyword Previous or Last or something similar

This question is geared for those who have more SQL experience than me.
I am writing a query (that will eventually be a stored procedure, but this should be irrelevant) where I want to count the rows for which the most recent entry's value is the same as that of the entry just before it, and I want to keep going until it hits an entry that has a different value. (Poorly explained, so I will show an example.)
In my table I have a column 'Product_Id'. When this query is run, I want it to take the most recent product_id and compare it to the previously entered product_id; if it's the same I want to add one, and I want it to keep checking the previously entered product_id until it runs into a different product_id.
I'm hoping it sounds more complicated than it is, and the query would look something like
Select count(Product_ID)
FROM dbo.myTable
Where Product_Id = previous(Product_Id)
Now, I know that previous isn't a keyword in T-SQL, and neither is Last, but I'm hoping someone knows a keyword that does what I am asking.
Edit for Sam
USE DbName;
GO
WITH OrderedCount as
(
select ROW_NUMBER() OVER (Order by dbo.Line_Production.Run_Date DESC) as RowNumber,
Line_Production.Product_ID
From dbo.Line_Production
)
Select RowNumber, COUNT(OrderedCount.Product_ID) as PalletCount
From OrderedCount
WHERE OrderedCount.RowNumber + 1 = RowNumber
and Product_ID = Product_ID
Group by RowNumber
The OrderedCount portion works, and it returns the data how I want it. I'm now having trouble comparing the Product_IDs for different RowNumbers; my WHERE clause is wrong.
There's no keyword. That would be a nice magic solution, but it doesn't exist, at least in part because there is no guaranteed ordering (okay, you could have the keyword only if there is an ORDER BY...). I can write you a query, but that'll take time, so for now I'll give you a few steps and I'll come back and see if you still need help in a bit.
1. Figure out an ORDER BY, otherwise no order is guaranteed. If there is a time-entered field, that's a good choice; an index works too.
2. Learn to use Row_Number.
3. Compare the table (with Row_Number) to itself where instance1.row - 1 = instance2.row.
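The steps above can be sketched as follows. This is an illustration rather than a finished T-SQL answer: it runs on SQLite 3.25 or newer through Python's sqlite3 module, it borrows the Line_Production, Run_Date and Product_ID names from the question, the sample rows are made up, and it assumes Run_Date values are unique so that the ordering is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE line_production (run_date TEXT, product_id TEXT);
-- Made-up sample data; the three most recent rows share product 'A'.
INSERT INTO line_production VALUES
    ('2024-01-01', 'A'),
    ('2024-01-02', 'B'),
    ('2024-01-03', 'A'),
    ('2024-01-04', 'A'),
    ('2024-01-05', 'A');
""")

STREAK_SQL = """
WITH ordered AS (
    -- Steps 1 and 2: number the rows, most recent first.
    SELECT ROW_NUMBER() OVER (ORDER BY run_date DESC) AS rn, product_id
    FROM line_production
)
SELECT COALESCE(
    -- Step 3: join each row to the row just before it and find the first change.
    (SELECT MIN(cur.rn) - 1
     FROM ordered AS cur
     JOIN ordered AS prev ON cur.rn - 1 = prev.rn
     WHERE cur.product_id <> prev.product_id),
    -- No change anywhere: every row belongs to the streak.
    (SELECT COUNT(*) FROM ordered)
)
"""

streak = conn.execute(STREAK_SQL).fetchone()[0]
print(streak)  # 3
```

The first row number at which product_id differs from the row before it marks the end of the run, so the run length is that row number minus one.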
If product_id is an identity column, couldn't you just do product_id - 1? In other words, if it's sequential, it's the same as using ROW_NUMBER mentioned in the previous comment.