How can I express the following subquery in Slick without a HAVING clause?
select team, min_count
from (
select team, count(min) as min_count
from table
group by team
) t
where min_count > 500
Following the documented approach, groupBy followed by filter and map generates a HAVING clause:
select team, count(min) as min_count
from table
group by team
HAVING count(min) > 500
The reason for preferring a subquery is to avoid repeating the aggregate outside the select.
I am running a query on a very large data set using WITH (CTE) syntax. It seems to take a while, and I have read that temporary tables can be faster in these cases. Can someone advise me which direction to go? In the CTE we join to a lot of tables, then we filter on the CTE result.
Only interested in Postgres answers.
What version of PostgreSQL are you using? CTEs behave differently in PostgreSQL 11 and older than in 12 and above.
In PostgreSQL 11 and older, CTEs are optimization fences: restrictions from the outer query are not pushed into the CTE. The database evaluates the query inside the CTE and caches (materializes) its results, and the outer WHERE clauses are only applied later, when the outer query is processed. That means a full table or index scan of the CTE's source, which results in horrible performance for large tables. To avoid this, apply as many filters as possible in the WHERE clause inside the CTE:
WITH UserRecord AS (SELECT * FROM Users WHERE Id = 100)
SELECT * FROM UserRecord;
PostgreSQL 12 addresses this problem by introducing query optimizer hints that let us control whether the CTE should be materialized or not: MATERIALIZED and NOT MATERIALIZED.
WITH AllUsers AS NOT MATERIALIZED (SELECT * FROM Users)
SELECT * FROM AllUsers WHERE Id = 100;
Note: Text and code examples are taken from my book Migrating your SQL Server Workloads to PostgreSQL
Summary:
PostgreSQL 11 and older: Use Subquery
PostgreSQL 12 and above: Use CTE with NOT MATERIALIZED clause
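For the PostgreSQL 11 and older case, the subquery equivalent of the NOT MATERIALIZED example above would look like this (a sketch using the same hypothetical Users table; subqueries are not optimization fences there, so the planner can push the outer filter down):
SELECT *
FROM (SELECT * FROM Users) AS AllUsers
WHERE Id = 100;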
My follow-up is more than I can fit in a comment, so understand this may not be an answer to the OP per se.
Take the following query, which uses a CTE:
with sales as (
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item
),
inventory as (
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item
)
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item
There are times when, for reasons I cannot explain, the query runs slower than I would expect. Sometimes simply materializing the CTEs makes it run better, as expected. Other times it does not, but when I do this:
drop table if exists sales;
drop table if exists inventory;
create temporary table sales as
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item;
create temporary table inventory as
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item;
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item;
Suddenly all is right in the world.
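(By "materializing the CTEs" above I mean adding the MATERIALIZED hint available in PostgreSQL 12 and above; a sketch of the same query with that hint applied:)
with sales as materialized (
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item
),
inventory as materialized (
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item
)
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item;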
Temporary tables in PostgreSQL are local to the session: their data is only visible to the session that created them, and both the table and its data are dropped automatically when the session ends. Even so, to be safe (for example when re-running in the same session) I always drop first:
drop table if exists sales;
And use "if exists" to avoid any errors about the object not existing.
I rarely use these in common queries, for the simple reason that they are not as portable as a single SQL statement (you can't hand the final query to another user without the temp tables existing). My most common use case is when I am processing within a procedure/function:
create procedure sales_and_inventory()
language plpgsql
as
$BODY$
BEGIN
create temp table sales...
insert into sales_inventory
select ...
drop table sales;
END;
$BODY$;
Hopefully this helps.
Also, to answer your question on indexes: typically I don't add them, but nothing says that's always the right answer. If I put data into a temp table, I assume I'm going to use all or most of it. That said, if you plan to query it multiple times with conditions where an index makes sense, then by all means add one.
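For example, if you're going to hit the sales temp table repeatedly by item, something like this might be worth it (a sketch; autovacuum does not analyze temp tables, so an explicit ANALYZE helps the planner):
create index on sales (item);
analyze sales;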
This is a current postgres query I have:
sql = """
SELECT
vms.campaign_id,
avg(vms.open_rate_uplift) as open_rate_average,
avg(vms.click_rate_uplift) as click_rate_average,
avg(vms.conversion_rate_uplift) as conversion_rate_average,
avg(cms.incremental_opens),
avg(cms.incremental_clicks),
avg(cms.incremental_conversions)
FROM
experiments.variant_metric_snapshot vms
INNER JOIN experiments.campaign_metric_snapshot cms ON vms.campaign_id = cms.campaign_id
WHERE
vms.campaign_id IN %(campaign_ids)s
GROUP BY
vms.campaign_id
"""
whereby I get the average incremental_opens, incremental_clicks, and incremental_conversions per campaign group from the cms table. However, instead of the average, I want the most recent values for those 3 fields: the values from the record with the greatest (i.e. most recent) event_id for a given group, rather than an average over all its records (see the cms table screenshot below).
How can I do this? Thanks
It sounds like you want a lateral join.
FROM
experiments.variant_metric_snapshot vms
CROSS JOIN LATERAL (
SELECT *
FROM experiments.campaign_metric_snapshot cms
WHERE vms.campaign_id = cms.campaign_id
ORDER BY event_id DESC
LIMIT 1
) cms
WHERE...
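Plugged into the original query, that might look something like this (a sketch; since the lateral subquery returns at most one cms row per campaign, max() is used only to collapse the identical values within each group, and min() or avg() would give the same result):
SELECT
vms.campaign_id,
avg(vms.open_rate_uplift) as open_rate_average,
avg(vms.click_rate_uplift) as click_rate_average,
avg(vms.conversion_rate_uplift) as conversion_rate_average,
max(cms.incremental_opens) as incremental_opens,
max(cms.incremental_clicks) as incremental_clicks,
max(cms.incremental_conversions) as incremental_conversions
FROM
experiments.variant_metric_snapshot vms
CROSS JOIN LATERAL (
SELECT *
FROM experiments.campaign_metric_snapshot c
WHERE c.campaign_id = vms.campaign_id
ORDER BY c.event_id DESC
LIMIT 1
) cms
WHERE
vms.campaign_id IN %(campaign_ids)s
GROUP BY
vms.campaign_id;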
If you are after a quick and dirty solution, you can use the array_agg function with minimal change to your query.
SELECT
vms.campaign_id,
avg(vms.open_rate_uplift) as open_rate_average,
avg(vms.click_rate_uplift) as click_rate_average,
avg(vms.conversion_rate_uplift) as conversion_rate_average,
(array_agg(cms.incremental_opens ORDER BY cms.event_id DESC))[1] AS incremental_opens,
..
FROM
experiments.variant_metric_snapshot vms
INNER JOIN experiments.campaign_metric_snapshot cms ON vms.campaign_id = cms.campaign_id
WHERE
vms.campaign_id IN %(campaign_ids)s
GROUP BY
vms.campaign_id;
Kinda new to SQL, so I was reading up on some queries and chanced upon this (https://iggyfernandez.wordpress.com/2011/12/04/day-4-the-twelve-days-of-sql-there-way-you-write-your-query-matters/).
The part that got me curious is the aggregate query in the WHERE clause. This is probably my misunderstanding, but how does the author's code (shown below) run? I presumed that COUNT(*), or rather aggregate functions in general, cannot be used in the WHERE clause and that you need HAVING for that?
SELECT per.empid, per.lname
FROM personnel per
WHERE (SELECT count(*) FROM payroll pay WHERE pay.empid = per.empid AND pay.salary = 199170) > 0;
My second question would be why the comparison operator (> 0) is needed. I was playing around and noticed that it would not run in PostgreSQL without the > 0; also, reformatting it to use a HAVING clause massively improves the query execution time:
SELECT per.empid, per.lname
FROM personnel per
WHERE EXISTS (SELECT per.empid FROM payroll pay WHERE pay.empid = per.empid AND pay.salary = 199170)
GROUP BY per.empid, per.lname
HAVING COUNT(*) > 0;
Omit the GROUP BY and HAVING clauses from your version and you get a more efficient query that is equivalent to the original.
In the original query, count(*) appears in the SELECT list of a subquery, not directly in the outer WHERE clause; you can use a parenthesized scalar subquery almost anywhere in an SQL statement, which is why it is allowed there.
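For clarity, the simplified version would look like this (a sketch based on the EXISTS query above, with GROUP BY and HAVING removed):
SELECT per.empid, per.lname
FROM personnel per
WHERE EXISTS (SELECT 1 FROM payroll pay WHERE pay.empid = per.empid AND pay.salary = 199170);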
Can anyone please help me in writing a single query joining these two queries.
I am using IBM DB2.
(SELECT
TABLE1.COLS, TABLE2.COLS, TABLE3.COLS
FROM
TABLE1, TABLE2, TABLE3, TABLE_PROB
WHERE
TABLE_PROB.COL = TABLE1.COL AND OTHER_CLAUSE)
UNION
(SELECT
TABLE1.COLS, TABLE2.COLS, TABLE3.COLS
FROM
TABLE1, TABLE2, TABLE3, TABLE_PROB1
WHERE TABLE_PROB1.COL = TABLE1.COL AND OTHER_CLAUSE)
The two queries before and after the UNION are the same except that "TABLE_PROB" is replaced by "TABLE_PROB1". No columns are selected from either of those two tables; they are only used to filter in the WHERE clause.
Can anyone tell me how to combine both into a single query?
Consider the following scenario for this query:
There are a few employee-detail tables that contain the details of all employees.
"TABLE_PROB" contains the list of contract employees and "TABLE_PROB1" contains the list of permanent employees. I need to get the details of both contract and permanent employees based on a few criteria.
Since the query has a big WHERE clause and SELECT clause, firing two queries via UNION increases the cost of the query, so I need to merge them into a single query.
Thanks for the help in advance.
You cannot avoid the UNION because you still have to access both TABLE_PROB and TABLE_PROB1. Depending on your DB2 version, platform, and the system configuration this might perform a bit better:
SELECT
TABLE1.COLS, TABLE2.COLS, TABLE3.COLS
FROM
TABLE1,TABLE2,TABLE3
WHERE
OTHER_CLAUSE
AND
EXISTS (
SELECT 1
FROM TABLE_PROB
WHERE COL=TABLE1.COL
UNION
SELECT 1
FROM TABLE_PROB1
WHERE COL=TABLE1.COL
)
Depending on the contents of TABLE_PROB.COL and TABLE_PROB1.COL, UNION ALL instead of UNION might also prove beneficial.
I'm rewriting MySQL queries for PostgreSQL. I have a table with articles and another table with categories. I need to select all categories that have at least one article:
SELECT c.*,(
SELECT COUNT(*)
FROM articles a
WHERE a."active"=TRUE AND a."category_id"=c."id") "count_articles"
FROM articles_categories c
HAVING (
SELECT COUNT(*)
FROM articles a
WHERE a."active"=TRUE AND a."category_id"=c."id" ) > 0
I don't know why, but this query is causing an error:
ERROR: column "c.id" must appear in the GROUP BY clause or be used in an aggregate function at character 8
The HAVING clause is a bit tricky to understand, and I'm not sure how MySQL interprets it. But the Postgres documentation can be found here:
http://www.postgresql.org/docs/9.0/static/sql-select.html#SQL-HAVING
It essentially says:
The presence of HAVING turns a query into a grouped query even if there is no GROUP BY clause. This is the same as what happens when the query contains aggregate functions but no GROUP BY clause. All the selected rows are considered to form a single group, and the SELECT list and HAVING clause can only reference table columns from within aggregate functions. Such a query will emit a single row if the HAVING condition is true, zero rows if it is not true.
The same is also explained in this blog post, which shows how HAVING without GROUP BY implies a SQL:1999 standard "grand total", i.e. a GROUP BY () clause (which isn't supported in PostgreSQL).
Since you don't seem to want a single row, the HAVING clause might not be the best choice.
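To see that single-group behaviour in isolation, here is a minimal sketch against your articles table:
SELECT count(*)
FROM articles a
WHERE a.active = TRUE
HAVING count(*) > 0;
This emits exactly one row if there is at least one active article and zero rows otherwise; the same single-group rule is what makes the bare c.* reference in your query illegal, hence the error about c.id and GROUP BY.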
Considering your actual query and your requirement, just rewrite the whole thing and JOIN articles_categories to articles:
SELECT DISTINCT c.*
FROM articles_categories c
JOIN articles a
ON a.active = TRUE
AND a.category_id = c.id
alternative:
SELECT *
FROM articles_categories c
WHERE EXISTS (SELECT 1
FROM articles a
WHERE a.active = TRUE
AND a.category_id = c.id)
SELECT * FROM categories c
WHERE
EXISTS (SELECT 1 FROM article a WHERE c.id = a.category_id);
should be fine... perhaps simpler ;)