Degraded SQL Query Speed By Nesting a Single Query inline vs temp table - postgresql

I have a query of the following, basic form.
SELECT DISTINCT
a.field1,
b.field2,
c.agg_values
FROM a
INNER JOIN b ON a.something = b.something
LEFT JOIN (
SELECT
array_to_string(array_agg(label), ';;') AS agg_values,
some_table.some_field
FROM some_table
WHERE some_table.some_field = 'some-fixed-value'
GROUP BY some_field
) AS c ON a.some_field = c.some_field
WHERE a.some_other_field = 'some-other-fixed-value'
There's nothing too wild about this query. Pretty run of the mill!
This query runs pretty slow in my Postgres 9.4.5 (~4 minutes), where I have maybe 15k records returned total. some_table has probably ~10k records.
If I move the content of that LEFT JOIN sub-query to a temp table and left join from the temp table, my performance increases substantially. My query may take only 15s now, vs 240s. To be more explicit, if I remove SELECT array_to_string ... GROUP BY some_field query, and put that query into a temp table, then left join on that temp table, BAM, fast.
CREATE TEMP TABLE temp_table_c ( ... );
INSERT INTO temp_table_c SELECT ... same query nested in LEFT JOIN from before ...;
SELECT DISTINCT
a.field1,
b.field2,
c.agg_values
FROM a
INNER JOIN ON a.something = b.something
LEFT JOIN temp_table_c AS c ON a.some_field = c.some_field
WHERE a.some_other_field = 'some-other-fixed-value'
I would appreciate it if someone could explain why the TEMP TABLE version of the query is so much more performant.
Thanks!

Related

How to make postgres (cursor?) start at particular row

I have created the following query:
select t.id, t.row_id, t.content, t.location, t.retweet_count, t.favorite_count, t.happened_at,
a.id, a.screen_name, a.name, a.description, a.followers_count, a.friends_count, a.statuses_count,
c.id, c.code, c.name,
t.parent_id
from tweets t
join accounts a on a.id = t.author_id
left outer join countries c on c.id = t.country_id
where t.row_id > %s
-- order by t.row_id
limit 100
Where %s is a number that starts at 0 and is incremented by 100 after each such query is conducted. I want to fetch all records from the database using this method, where I just increase the %s in the where condition. I found this approach on https://ivopereira.net/efficient-pagination-dont-use-offset-limit. I also included a column in my table which is corresponding to row number (I named it row_id). Now the problem is when I run this query the first time, it returns rows which have an row_id of 3 million. I would like the cursor (not sure if my terminology is correct) to start from rows with row_id 1 through 100 and so on. The table contains 7 million rows. Am I missing something obvious with which I could achieve my goal?

Snowflake "Exploding Join" issue while doing left join for multiple tables

I am trying to do some left joins on multiple tables and facing the following issue.
Row Counts of tables
Table 1: 1.6M
Table 2: 1.7M
Table 3: 1.5M
When I am doing left Join using Table 1 and 2 and following query, I get data count as 1.8 M (acceptable):
SELECT Table1.ID1, Table1.ID2, Table2.Name, Table2.City
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
;
Similarly when I am doing left Join using Table 1 and 3 and following query, I get data count as 1.9 M (acceptable):
SELECT Table1.ID1, Table1.ID2, Table3.Name, Table3.City
FROM Table1
LEFT JOIN Table3
ON Table1.ID1 = Table3.ID1
AND Table1.ID2 = Table3.ID2
AND Table1.Source_System = Table3.Source_System
;
But when I am doing left Join using Table 1, 2 and 3 and following query, I get data count as 11.9 G (ISSUE):
SELECT
Table1.ID1, Table1.ID2,
Table2.Name, Table2.City,
Table3.Name as Name1, Table3.City as City1
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
LEFT JOIN Table3
ON Table1.ID1 = Table3.ID1
AND Table1.ID2 = Table3.ID2
AND Table1.Source_System = Table3.Source_System
;
So it seems you have assumed the data in table1 and table2 join in a 1:1 ratio, and also assumed the table1 and table3 are also a 1:1 ratio, so assumed when those three tables joined, that ration should be in the order again of 1:1
But if half you entries in table1 are not in table2 to get the 1.8M result, the the common rows would have to be duplicated > 2.0 times that increase. If we change that from half not matching to a tenth not matching there would need to be > 10.0 duplicates. Thus to get the 4 magnitude growth you have, it seems like you have only 100th match, but greater than 100.0 duplicates, which when cross joined give the 10,000 growth in rows.
this could be seen via:
SELECT Table1.ID1, Table1.ID2, Table1.Source_System, counnt(*) as counts
FROM Table1
LEFT JOIN Table2
ON Table1.ID1 = Table2.ID1
AND Table1.ID2 = Table2.ID2
AND Table1.Source_System = Table2.Source_System
GROUP BY 1,2,3
ORDER BY counts DESC
;
this will show the total distinct pairs, and which are the worst contributors to the combination explosion
When your left join is producing more records than the referenced table it should not be acceptable! that should signal warning in your join condition and data. Either you investigate those records in the table to avoid it in the first place or you would need to keep tweaking your SQL to satisfy clean join that produces exact reference table row count. otherwise, it is very common that left joining to another table with a small duplicate records will produce exponential row count as you are facing here.
Try reading these questions here to help here and here
Just to add about investigating and finding those rows, use following SQL to find in each table what rows that have same ID1, ID2 and Source_System columns
i.e. :-
Select ID1, ID2 ,Source_System, COUNT(*) AS NUM_RECORDS_DUPS
FROM TABLE1
GROUP BY ID1, ID2 , Source_System
HAVING COUNT(*)>1 -- Filtering on duplicate rows that has more than a row satisfying the join condition
Use the same for each of the tables to find those records and either add another unique condition/ aggregate the table on the joining keys or ask for data cleansing ! for those records
Have you tried adding a DISTINCT clause?
SELECT DISTINCT columns, of, choice
FROM Table1
LEFT JOIN Table2 on ...
LEFT JOIN Table3 on ...
I think what's happening is you have dups that left join on another giant set of dups.
Use the proper keys to join the two tables, it solves the issue.

MYSQL- query too slow to load

My query is working but it takes time to display the data. Can you help me to make it quick.
$sql="SELECT allinvty3.*, stock_transfer_tb.* from stock_transfer_tb
INNER JOIN allinvty3 on stock_transfer_tb.in_code = allinvty3.in_code
where stock_transfer_tb.in_code NOT IN (SELECT barcode.itemcode from barcode where stock_transfer_tb.refnumber = barcode.refitem)";
I would recommend using the following query:
SELECT
a.*,
s.*
FROM stock_transfer_tb s
INNER JOIN allinvty3 a
ON s.in_code = a.in_code
WHERE
NOT EXISTS (SELECT 1 FROM barcode b
WHERE s.refnumber = b.refitem AND s.in_code = b.itemcode);
If this still doesn't give you the performance you want, then you should look into adding indices on all columns involved in the join and where clause.

Postgres join not respecting outer where clause

In SQL Server, I know for sure that the following query;
SELECT things.*
FROM things
LEFT OUTER JOIN (
SELECT thingreadings.thingid, reading
FROM thingreadings
INNER JOIN things on thingreadings.thingid = things.id
ORDER BY reading DESC LIMIT 1) AS readings
ON things.id = readings.thingid
WHERE things.id = '1'
Would join against thingreadings only once the WHERE id = 1 had restricted the record set down. It left joins against just one row. However in order for performance to be acceptable in postgres, I have to add the WHERE id= 1 to the INNER JOIN things on thingreadings.thingid = things.id line too.
This isn't ideal; is it possible to force postgres to know that what I am joining against is only one row without explicitly adding the WHERE clauses everywhere?
An example of this problem can be seen here;
I am trying to recreate the following query in a more efficient way;
SELECT things.id, things.name,
(SELECT thingreadings.id FROM thingreadings WHERE thingid = things.id ORDER BY id DESC LIMIT 1),
(SELECT thingreadings.reading FROM thingreadings WHERE thingid = things.id ORDER BY id DESC LIMIT 1)
FROM things
WHERE id IN (1,2)
http://sqlfiddle.com/#!15/a172c/2
Not really sure why you did all that work. Isn't the inner query enough?
SELECT t.*
FROM thingreadings tr
INNER JOIN things t on tr.thingid = t.id AND t.id = '1'
ORDER BY tr.reading DESC
LIMIT 1;
sqlfiddle demo
When you want to select the latest value for each thingID, you can do:
SELECT t.*,a.reading
FROM things t
INNER JOIN (
SELECT t1.*
FROM thingreadings t1
LEFT JOIN thingreadings t2
ON (t1.thingid = t2.thingid AND t1.reading < t2.reading)
WHERE t2.thingid IS NULL
) a ON a.thingid = t.id
sqlfiddle demo
The derived table gets you the record with the most recent reading, then the JOIN gets you the information from things table for that record.
The where clause in SQL applies to the result set you're requesting, NOT to the join.
What your code is NOT saying: "do this join only for the ID of 1"...
What your code IS saying: "do this join, then pull records out of it where the ID is 1"...
This is why you need the inner where clause. Incidentally, I also think Filipe is right about the unnecessary code.

T-SQL query one table, get presence or absence of other table value

I'm not sure what this type of query is called so I've been unable to search for it properly. I've got two tables, Table A has about 10,000 rows. Table B has a variable amount of rows.
I want to write a query that gets all of Table A's results but with an added column, the value of that column is a boolean that says whether the result also appears in Table B.
I've written this query which works but is slow, it doesn't use a boolean but rather a count that will be either zero or one. Any suggested improvements are gratefully accepted:
SELECT u.number,u.name,u.deliveryaddress,
(SELECT COUNT(productUserid)
FROM ProductUser
WHERE number = u.number and productid = #ProductId)
AS IsInPromo
FROM Users u
UPDATE
I've run the query with actual execution plan enabled, I'm not sure how to show the results but various costs are:
Nested Loops (left semi join): 29%]
Clustered Index scan (User Table): 41%
Clustered Index Scan (ProductUser table): 29%
NUMBERS
There are 7366 users in the users table and currently 18 rows in the productUser table (although this will change and could be in the thousands)
You can use EXISTS to short circuit after the first row is found rather than COUNT-ing all matching rows.
SQL Server does not have a boolean datatype. The closest equivalent is BIT
SELECT u.number,
u.name,
u.deliveryaddress,
CASE
WHEN EXISTS (SELECT *
FROM ProductUser
WHERE number = u.number
AND productid = #ProductId) THEN CAST(1 AS BIT)
ELSE CAST(0 AS BIT)
END AS IsInPromo
FROM Users u
RE: "I'm not sure what this type of query is called". This will give a plan with a semi join. See Subqueries in CASE Expressions for more about this.
Which management system are you using?
Try this:
SELECT u.number,u.name,u.deliveryaddress,
case when COUNT(p.productUserid) > 0 then 1 else 0 end
FROM Users u
left join ProductUser p on p.number = u.number and productid = #ProductId
group by u.number,u.name,u.deliveryaddress
UPD: this could be faster using mssql
;with fff as
(
select distinct p.number from ProductUser p where p.productid = #ProductId
)
select u.number,u.name,u.deliveryaddress,
case when isnull(f.number, 0) = 0 then 0 else 1 end
from Users u left join fff f on f.number = u.number
Since you seem concerned about performance, this query can perform faster as this will cause index seek on both tables versus an index scan:
SELECT u.number,
u.name,
u.deliveryaddress,
ISNULL(p.number, 0) IsInPromo
FROM Users u
LEFT JOIN ProductUser p ON p.number = u.number
WHERE p.productid = #ProductId