Does SQL Server optimize repeated aggregate calculation in this example query? - sql-server-2008-r2

If I execute the following query in SQL Server 2008 R2, will the count(*) aggregate be determined only once for OUTER SELECT query or it will repeat for every record in OUTER SELECT?
I was guessing that SQL Server would be intelligent to see that the same calculation is being repeated and so it would do this calculation only once for optimization purpose. The value of TotalCount in query below is going to be the same for all rows in outer query.
SELECT
p.ProductId, p.ProductName,
(select count(*) from Products p1) as TotalCount
FROM Products p

No, you're expecting too much from SQL Server. Also: the query processor really cannot be sure that this value won't be changing over time - so it cannot really "optimize" this for you.
For every single row, this subquery will be executed once.
So if your SELECT statement will return 10 million rows, this count will be determined 10 million times.
If you don't want that, you can always run the select count(*).. once before the query and store the value into a SQL variable, and select that variable in your query:
DECLARE #TableCount INT
SELECT #TableCount = COUNT(*) FROM Products
SELECT
p.ProductId, p.ProductName, #TableCount
FROM
Products p

Related

When is it better to use CTE or temp table postgres

I am doing a query on a very large data set and i am using WITH (CTE) syntax.. this seems to take a while and i was reading online that temp tables could be faster to use in these cases can someone advise me in which direction to go. In the CTE we join to a lot of tables then we filter on the CTE result..
Only interesting in postgres answers
What version of PostgreSQL are you using? CTEs perform differently in PostgreSQL versions 11 and older than versions 12 and above.
In PostgreSQL 11 and older, CTEs are optimization fences (outer query restrictions are not passed on to CTEs) and the database evaluates the query inside the CTE and caches the results (i.e., materialized results) and outer WHERE clauses are applied later when the outer query is processed, which means either a full table scan or a full index seek is performed and results in horrible performance for large tables. To avoid this, apply as much filters in the WHERE clause inside the CTE:
WITH UserRecord AS (SELECT * FROM Users WHERE Id = 100)
SELECT * FROM UserRecord;
PostgreSQL 12 addresses this problem by introducing query optimizer hints to enable us to control if the CTE should be materialized or not: MATERIALIZED, NOT MATERIALIZED.
WITH AllUsers AS NOT MATERIALIZED (SELECT * FROM Users)
SELECT * FROM AllUsers WHERE Id = 100;
Note: Text and code examples are taken from my book Migrating your SQL Server Workloads to PostgreSQL
Summary:
PostgreSQL 11 and older: Use Subquery
PostgreSQL 12 and above: Use CTE with NOT MATERIALIZED clause
My follow up comment is more than I can fit in a comment... so understand this may not be an answer to the OP per se.
Take the following query, which uses a CTE:
with sales as (
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item
),
inventory as (
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item
)
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item
There are times where I cannot explain why that the query runs slower than I would expect. Some times, simply materializing the CTEs makes it run better, as expected. Other times it does not, but when I do this:
drop table if exists sales;
drop table if exists inventory;
create temporary table sales as
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item;
create temporary table inventory as
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item;
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item;
Suddenly all is right in the world.
Temp tables may persist across sessions, but to my knowledge the data in them will be session-based. I'm honestly not even sure if the structures persist, which is why to be safe I always drop:
drop table if exists sales;
And use "if exists" to avoid any errors about the object not existing.
I rarely use these in common queries for the simple reason that they are not as portable as a simple SQL statement (you can't give the final query to another user without having the temp tables). My most common use case is when I am processing within a procedure/function:
create procedure sales_and_inventory()
language plpgsql
as
$BODY$
BEGIN
create temp table sales...
insert into sales_inventory
select ...
drop table sales;
END;
$BODY$
Hopefully this helps.
Also, to answer your question on indexes... typically I don't, but nothing says that's always the right answer. If I put data into a temp table, I assume I'm going to use all or most of it. That said, if you plan to query it multiple times with conditions where an index makes sense, then by all means do it.

select count(*) in WHERE Clause needing a comparison operator

Kinda new to SQL so I was reading up on some queries and chanced upon this (https://iggyfernandez.wordpress.com/2011/12/04/day-4-the-twelve-days-of-sql-there-way-you-write-your-query-matters/)
The part that got me curious is the aggregate query in the WHERE Clause. This is probably my misunderstanding but how does the author's code (shown below) run? I presumed that Count(*) - or rather aggregate functions cannot be used in the WHERE clause and you need a HAVING for that ?
SELECT per.empid, per.lname
FROM personnel per
WHERE (SELECT count(*) FROM payroll pay WHERE pay.empid = per.empid AND pay.salary = 199170) > 0;
My second question would be why the comparison operator (>0) is needed ? I was playing around and noticed that it would not run in PostgreSQL without the >0; also reformatting it to have a HAVING by Clause massively improves the query execution time
SELECT per.empid, per.lname
FROM personnel per
WHERE EXISTS (SELECT per.empid FROM payroll pay WHERE pay.empid = per.empid AND pay.salary = 199170)
GROUP BY per.empid, per.lname
HAVING COUNT(*) > 0;
Omit the GROUP BY and HAVING clauses in your version, then your query will be a more efficient query that is equivalent to the original.
In the original query, count(*) appears in the SELECT list of a subquery. You can use a parenthesized subquery almost anywhere in an SQL statement.

Full outer join with different WHERE clauses in Knex.js for PostgreSQL

I try to get a single row with two columns showing aggregation results: one column should show the total sum based on one WHERE-clause while the other column should show the total sum based on a different WHERE clause.
Desired output:
amount_vic amount_qld
100 70
In raw PostgreSQL I could write something like that:
select
sum(a.amount) as amount_vic,
sum(b.amount) as amount_qld
from mytable a
full outer join mytable b on 1=1
where a.state='vic' and b.state= 'qld'
Question: How do I write this or a similar query that returns the desired outcome in knex.js? For example: the 'on 1=1' probably needs knex.raw() and I think the table and column aliases do not work for me and it always returns some errors.
One of my not-working-attempts in knex.js:
knex
.sum({ amount_vic: 'a.amount' })
.sum({ amount_qld: 'b.amount' })
.from('mytable')
.as('a')
.raw('full outer join mytable on 1=1')
.as('b')
.where({
a.state: 'vic',
b.state: 'qld'
})
Thank you for your help.
Disclaimer: this does not answer the Knex part of the question - but it is too long for a comment.
Although your current query does what you want, the way it is phrased seems suboptimal. There is not need to generate a self-cartesian product here - which is what full join ... on 1 = 1 does. You can just use conditional aggregation.
In Postgres, you would phrase this as:
select
sum(amount) filter(where state = 'vic') amount_vic,
sum(amount) filter(where state = 'qld') amount_qld
from mytable
where state in ('vic', 'qld')
I don't know Knex so I cannot tell how to translate the query to it. Maybe this query is easier for you to translate.

Subquery count() for postgres query

I have two tables with web traffic that I am joining and visualizing on a map.
I am trying to write a counter that creates a query result with a count of the number of times a particular IP address shows up in the logs. I think this will take the form of a subquery that returns the count of rows of the specific row the main query is selecting.
When I run the following query, the error I get is more than one row returned by a subquery used as an expression.
select
squarespace_ip_addresses.ip,
squarespace_ip_addresses.latitude,
squarespace_ip_addresses.longitude,
st_SetSrid(ST_MAKEPOINT(squarespace_ip_addresses.longitude, squarespace_ip_addresses.latitude), 4326) as geom,
(select count(hostname) from squarespace_logs group by hostname) as counter,
squarespace_logs.referrer
from
squarespace_ip_addresses
left outer join
squarespace_logs
on
squarespace_ip_addresses.ip = squarespace_logs.hostname
Something that has been suggested to me is a subquery in a select that filters by the main query output runs that query for every row.
Does anyone have any insights?
Aggregate the data in a derived table (a subquery in FROM clause):
select
a.ip,
a.latitude,
a.longitude,
st_SetSrid(ST_MAKEPOINT(a.longitude, a.latitude), 4326) as geom,
c.count,
l.referrer
from squarespace_ip_addresses a
left join squarespace_logs l on a.ip = l.hostname
left join (
select hostname, count(*)
from squarespace_logs
group by hostname
) c on a.ip = c.hostname

way to reduce the cost in db2 for count(*)

Hi I had a DB2 Query as below
select count(*) as count from
table_a,
table_b,
table c
where
b.xxx=234 AND
b.yyy=c.wedf
Result SEt:
Count
618543562
For the above query i even tried with Count(1) but when i took the access plan, cost is same.
select count(1) as count from
table_a,
table_b,
table c
where
b.xxx=234 AND
b.yyy=c.wedf
Result SEt:
Count
618543562
Is there any other way to reduce the cost.
PS: b.xxx,b.yyy, c.wedf is indexed..
Thanks in advance.
I think one of the problem are statistics on the table. Did you execute Runstats? Probably, the data distribution or the quantity of rows that has to be read is a lot, and DB2 concludes that is better to read the whole table, instead of process an index, and then fetch the rows from the table.
It seems that both queries are taking the same access plan, and I think they are doing table scans.
Are the three columns part of the same index? or they are indexed separately? If they are part of different indexes, is there any ANDing between indexes in the access plan? If there is not ANDing with different indexes, the columns has to be read from the table in order to process the predicates.
The reason count(1) and count(*) are giving the same cost, is because both has to do a TableScan.
Please, take a look at the access plan, not only the results in timerons, but also the steps. Is the access plan taking the indexes? how many sorts is executing?
Try to change the optimization level, and you will see that the access plans change. I think you are executing with the default one (5)
If you want to force the query to take in account an index, you can create an optimization profile
What is the relation between (B,C) tables and A table. In your query you just use CROSS JOIN between A and (B,C). So it is the MAIN performance issue.
If you really need this count just multiply counts for A and (B,C):
select
(select count(*) from a)
*
(select count(*) from b, c where b.xxx=234 AND b.yyy=c.wedf )
for DB2 use this:
select a1.cnt*
(select count(*) as cnt2 from b, c where b.xxx=234 AND b.yyy=c.wedf )
from
(select count(*) as cnt1 from a) a1