PostgreSQL Aggregate groups with limit performance

I'm a newbie in PostgreSQL. Is there a way to improve execution time of the following query:
SELECT s.id, s.name, s.url,
       (SELECT array_agg(p.url)
        FROM (
            SELECT url
            FROM pages
            WHERE site_id = s.id
            ORDER BY created DESC
            LIMIT 5
        ) AS p
       ) AS last_pages
FROM sites s
I haven't found a way to put a LIMIT clause inside the aggregate call the way an ORDER BY can be.
There are indexes on created (timestamp) and site_id (integer) in the pages table, but the foreign key from pages.site_id to sites.id is unfortunately absent. The query is intended to return a list of sites, each with a sublist of its 5 most recently created pages.
PostgreSQL version is 9.1.5.

You need to start by thinking like the database management system. You also need to think very carefully about what you are asking from the database.
Your fundamental problem is that you likely have a very large number of separate index scans happening when a single sequential scan may be quite a bit faster. Your current query gives the planner very little flexibility because the subqueries must be correlated.
A much better way to do this would be with a view (inline or not) and a window function:
SELECT s.id, s.name, s.url, array_agg(p.url ORDER BY p.num) AS last_pages
FROM sites s
JOIN (SELECT site_id, url,
             row_number() OVER (PARTITION BY site_id ORDER BY created DESC) AS num
      FROM pages) p ON s.id = p.site_id
WHERE p.num <= 5
GROUP BY s.id, s.name, s.url;
This will likely change a very large number of index scans to a single large sequential scan.
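If you want to confirm what the planner actually does, compare the plans of the two versions with EXPLAIN ANALYZE; a minimal sketch, reusing the table and column names from the question:

-- Look for a single Seq Scan on pages feeding a WindowAgg node,
-- instead of one index scan on pages per row of sites.
EXPLAIN ANALYZE
SELECT s.id, s.name, s.url, array_agg(p.url ORDER BY p.num) AS last_pages
FROM sites s
JOIN (SELECT site_id, url,
             row_number() OVER (PARTITION BY site_id ORDER BY created DESC) AS num
      FROM pages) p ON s.id = p.site_id
WHERE p.num <= 5
GROUP BY s.id, s.name, s.url;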

Related

When is it better to use CTE or temp table postgres

I am running a query on a very large data set using the WITH (CTE) syntax. It seems to take a while, and I read online that temp tables could be faster in these cases. Can someone advise me which direction to go? In the CTE we join to a lot of tables and then filter on the CTE result.
I'm only interested in Postgres answers.
What version of PostgreSQL are you using? CTEs perform differently in PostgreSQL versions 11 and older than versions 12 and above.
In PostgreSQL 11 and older, CTEs are optimization fences: restrictions from the outer query are not pushed down into the CTE. The database evaluates the query inside the CTE and materializes (caches) its results; outer WHERE clauses are applied later, when the outer query is processed. That means a full table scan or full index scan inside the CTE, which results in horrible performance for large tables. To avoid this, apply as many filters as possible in the WHERE clause inside the CTE:
WITH UserRecord AS (SELECT * FROM Users WHERE Id = 100)
SELECT * FROM UserRecord;
PostgreSQL 12 addresses this problem: non-recursive CTEs referenced only once are now inlined by default, and the new MATERIALIZED / NOT MATERIALIZED keywords let us control explicitly whether a CTE should be materialized:
WITH AllUsers AS NOT MATERIALIZED (SELECT * FROM Users)
SELECT * FROM AllUsers WHERE Id = 100;
Note: Text and code examples are taken from my book Migrating your SQL Server Workloads to PostgreSQL
Summary:
PostgreSQL 11 and older: Use a subquery (see the sketch below)
PostgreSQL 12 and above: Use CTE with NOT MATERIALIZED clause
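For PostgreSQL 11 and older, the "use a subquery" advice just means writing the same thing inline so the planner can push the filter down; a minimal sketch, using the Users example from above:

-- Pre-12 equivalent of the NOT MATERIALIZED form: an inline subquery,
-- so the planner can push Id = 100 down and use an index on Users.Id.
SELECT *
FROM (SELECT * FROM Users) AS AllUsers
WHERE Id = 100;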
My follow-up comment is more than I can fit in a comment, so understand this may not be an answer to the OP per se.
Take the following query, which uses a CTE:
with sales as (
    select item, sum(qty) as sales_qty, sum(revenue) as sales_revenue
    from sales_data
    where country = 'USA'
    group by item
),
inventory as (
    select item, sum(on_hand_qty) as inventory_qty
    from inventory_data
    where country = 'USA' and on_hand_qty != 0
    group by item
)
select
    a.item, a.description, s.sales_qty, s.sales_revenue,
    i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
    all_items a
    left join sales s on a.item = s.item
    left join inventory i on a.item = i.item
There are times where I cannot explain why the query runs slower than I would expect. Sometimes simply materializing the CTEs makes it run better, as expected. Other times it does not, but when I do this:
drop table if exists sales;
drop table if exists inventory;

create temporary table sales as
select item, sum(qty) as sales_qty, sum(revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item;

create temporary table inventory as
select item, sum(on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item;

select
    a.item, a.description, s.sales_qty, s.sales_revenue,
    i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
    all_items a
    left join sales s on a.item = s.item
    left join inventory i on a.item = i.item;
Suddenly all is right in the world.
Temp tables in PostgreSQL are session-local: both the structure and the data are gone when the session ends (or even at commit, if created with ON COMMIT DROP). Within a session, though, an earlier run may have left the table behind, which is why to be safe I always drop first:
drop table if exists sales;
And use "if exists" to avoid any errors about the object not existing.
I rarely use these in common queries for the simple reason that they are not as portable as a simple SQL statement (you can't give the final query to another user without having the temp tables). My most common use case is when I am processing within a procedure/function:
create procedure sales_and_inventory()
language plpgsql
as
$BODY$
BEGIN
    create temp table sales...

    insert into sales_inventory
    select ...

    drop table sales;
END;
$BODY$;
Hopefully this helps.
Also, to answer your question on indexes: typically I don't index temp tables, but nothing says that's always the right answer. If I put data into a temp table, I assume I'm going to use all or most of it. That said, if you plan to query it multiple times with conditions where an index makes sense, then by all means do it.
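If you do go that route, it's just an ordinary CREATE INDEX against the temp table; a minimal sketch based on the sales temp table above (the index name and lookup value are only illustrations), plus an ANALYZE, since autovacuum does not process temporary tables:

-- Index the temp table if you'll filter it repeatedly, then refresh its statistics.
create index sales_item_idx on sales (item);
analyze sales;

-- Hypothetical repeated lookup that benefits from the index:
select * from sales where item = 'ABC-123';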

Refactoring query using DISTINCT and JOINing table with a lot of records

I am using PostgreSQL v 11.6. I've read a lot of questions asking how to optimize queries that use DISTINCT. Mine is not that different, but unlike the other questions, where people usually want to keep the rest of the query and just somehow make DISTINCT ON faster, I am willing to rewrite the query with the sole purpose of making it as performant as possible. The current query is this:
SELECT DISTINCT s.name
FROM app.source AS s
INNER JOIN app.index_value iv ON iv.source_id = s.id
INNER JOIN app.index i ON i.id = iv.index_id
INNER JOIN app.namespace AS ns ON i.namespace_id = ns.id
WHERE (SELECT TRUE FROM UNNEST(ARRAY['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
ORDER BY s.name;
The app.source table contains about 800 records. The other tables are under 5,000 records tops, but app.index_value contains 35,420,354 rows (about 35 million), which I guess causes the overall slow execution of the query.
(The EXPLAIN ANALYZE output was attached as an image and is not reproduced here.)
I think that all relevant indexes are in place (maybe some small optimization could still be made there), but in order to get a significant improvement in execution time I think I need better logic for the query.
The current execution time on a decent machine is 35~38 seconds.
Your query is not using DISTINCT ON. It is merely using DISTINCT, which is quite a different thing.
SELECT DISTINCT is indeed often an indicator of a poorly written query, because DISTINCT is used to remove duplicates and it is often the case that the query creates those duplicates itself. The same is true for your query. You simply want all names for which certain entries exist. So, use EXISTS (or IN, for that matter).
EXISTS
SELECT s.name
FROM app.source AS s
WHERE EXISTS
(
    SELECT NULL
    FROM app.index_value iv
    JOIN app.index i ON i.id = iv.index_id
    JOIN app.namespace AS ns ON i.namespace_id = ns.id
    WHERE iv.source_id = s.id
    AND (SELECT TRUE FROM UNNEST(ARRAY['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
)
ORDER BY s.name;
IN
SELECT s.name
FROM app.source AS s
WHERE s.id IN
(
    SELECT iv.source_id
    FROM app.index_value iv
    JOIN app.index i ON i.id = iv.index_id
    JOIN app.namespace AS ns ON i.namespace_id = ns.id
    WHERE (SELECT TRUE FROM UNNEST(ARRAY['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
)
ORDER BY s.name;
Thus we avoid creating an unnecessarily large intermediate result.
Update 1
From the database side we can support queries with appropriate indexes. The only criterion in your query that limits the selected rows is the array lookup, though. This is probably slow, because as far as I know the DBMS cannot use an index for it. And depending on the array content we can end up with zero app.namespace rows, a few rows, many rows or even all rows; the DBMS cannot even make proper assumptions about how many. From there we retrieve the related index and index_value rows. Again, these can be all or none. The DBMS could use indexes here or not. If it used indexes, this would be very fast on small sets of rows and extremely slow on large data sets. And if it used full table scans joined via hash joins, for instance, this would be the fastest approach for many rows and rather slow for few rows.
You can create indexes and see whether they get used or not. I suggest:
create index idx1 on app.index (namespace_id, id);
create index idx2 on app.index_value (index_id, source_id);
create index idx3 on app.source (id, name);
Update 2
I am not well versed with arrays. But it looks like you want to check whether a matching entry exists. So again, EXISTS might be a tad more appropriate:
WHERE EXISTS
(
    SELECT NULL
    FROM UNNEST(ARRAY['Default']::CITEXT[]) AS nss
    WHERE ns.name ILIKE nss
)
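Plugged into the EXISTS query from above, that fragment would replace the scalar subquery; a sketch of the combined statement:

-- The EXISTS variant with the array check rewritten as EXISTS as well.
SELECT s.name
FROM app.source AS s
WHERE EXISTS
(
    SELECT NULL
    FROM app.index_value iv
    JOIN app.index i ON i.id = iv.index_id
    JOIN app.namespace AS ns ON i.namespace_id = ns.id
    WHERE iv.source_id = s.id
    AND EXISTS
    (
        SELECT NULL
        FROM UNNEST(ARRAY['Default']::CITEXT[]) AS nss
        WHERE ns.name ILIKE nss
    )
)
ORDER BY s.name;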
Update 3
One more idea (I feel stupid now for having missed it): for each source we just look up whether there is at least one match. So maybe the DBMS should start with the source table and work from there to the next table. For this we'd use the following indexes:
create index idx4 on index_value (source_id, index_id);
create index idx5 on index (id, namespace_id);
create index idx6 on namespace (id, name);
Just add them to your database and see what happens. You can always drop indexes again when you see the DBMS doesn't use them.
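Checking whether they get used is just a matter of re-running the query under EXPLAIN and looking for the index names in the plan; anything the planner never touches can be removed again, for example:

-- Drop indexes that never show up in EXPLAIN (ANALYZE) output.
DROP INDEX IF EXISTS idx4;
DROP INDEX IF EXISTS idx5;
DROP INDEX IF EXISTS idx6;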

How can I query for the previous claim for each claim? Window Function PostgreSQL

I have a table of claims and I want to attach each patient's previous claim to it. I've been able to do it with
a select statement, but my dataset is 50+ million records and I'm hoping there is a more efficient way to do this. From my understanding, this query will need to scan the full table for each record. Would a window function be better? Could sorting the large table help at all?
http://www.sqlfiddle.com/#!17/09a53/6/0
select
    (select b."fill_date"
     from t1 b
     where b.user_id = a.user_id and b.fill_date < a.fill_date
     order by b.fill_date desc
     limit 1) as prior_fill_date,
    a.*
from t2 a
Thanks for the help
Please give this a try:
select *,
lag(fill_date)
over (partition by user_id order by fill_date)
as prior_fill_date
from "sql_notebook_results_T42E95sESnn0"
order by user_id, fill_date;
This sorts only once. If performance is still not good enough, then you will need to look at adding an index on (user_id, fill_date).
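Adding that index is a one-liner; a minimal sketch using the column names from the query (the fiddle's table name stands in for your real claims table):

-- Lets the window function read rows already ordered by (user_id, fill_date)
-- instead of sorting 50+ million rows on every run.
create index claims_user_fill_date_idx
    on "sql_notebook_results_T42E95sESnn0" (user_id, fill_date);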

Most efficient way to retrieve rows of related data: subquery, or separate query with GROUP BY?

I have a very simple PostgreSQL query to retrieve the latest 50 news articles:
SELECT id, headline, author_name, body
FROM news
ORDER BY publish_date DESC
LIMIT 50
Now I also want to retrieve the latest 10 comments for each article as well. I can think of two ways to accomplish retrieving them and I'm not sure which one is best in the context of PostgreSQL:
Option 1:
Do a subquery directly for the comments in the original query and cast the result to an array:
SELECT headline, author_name, body,
       ARRAY(
           SELECT id, message, author_name
           FROM news_comments
           WHERE news_id = n.id
           ORDER BY date DESC
           LIMIT 10
       ) AS comments
FROM news n
ORDER BY publish_date DESC
LIMIT 50
Obviously, in this case, application logic would need to be aware of which index in the array is which column, that's no problem.
The one problem I see with the method is not knowing how the query planner would execute it. Would this effectively turn into 51 queries?
Option 2:
Use the original very simple query:
SELECT id, headline, author_name, body
FROM news
ORDER BY publish_date DESC
LIMIT 50
Then, via application logic, gather all of the news ids and use them in a separate query; row_number() would have to be used here in order to limit the number of results per news article:
SELECT *
FROM (
    SELECT *,
           row_number() OVER (
               PARTITION BY news_id
               ORDER BY date DESC
           ) AS rn
    FROM (
        SELECT *
        FROM news_comments
        WHERE news_id IN (123, 456, 789)
    ) s
) s
WHERE rn <= 10
This approach is obviously more complicated, and I'm not sure whether it would have to retrieve all comments for the scoped news articles first and then chop off the ones whose row number is greater than 10.
Which option is best? Or is there an even better solution I have overlooked?
For context, this is a news aggregator site I've developed myself, I currently have about 40,000 news articles across several categories, with about 500,000 comments, so I'm looking for the best solution to help me keep growing.
You should investigate the execution plan of your statements using at least EXPLAIN ANALYZE. This shows you the plan chosen by the optimizer, executes the statement itself, and gives you back actual run times and other statistics as well.
Another solution would be to use a LATERAL subquery to retrieve the 10 comments for each news article as separate rows, but then again, you need to investigate and compare plans to choose the approach that works best for you:
SELECT
    n.id, n.headline, n.author_name, n.body,
    c.id, c.message, c.author_name
FROM news n
LEFT JOIN LATERAL (
    SELECT id, message, author_name
    FROM news_comments nc
    WHERE n.id = nc.news_id
    ORDER BY nc.date DESC
    LIMIT 10
) c ON TRUE
ORDER BY publish_date DESC
LIMIT 50
Because the query contains a LATERAL cross-reference, the subquery is evaluated once for each row retrieved from news, using the correlation in its WHERE clause. In effect it is executed repeatedly, and the information it retrieves is joined to each row of your source table news.
This approach would save your application logic the time needed to deal with the arrays coming out of option 1, while also not having to issue a separate query for the comments like in option 2, saving you (in this case) the time needed to open separate transactions, establish connections, retrieve rows, etc.
It would be good to look for performance improvements by creating indexes and by looking into the planner cost constants and planner method configuration parameters, which you can experiment with to understand the choices the planner has made. More on the subject here.
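As a rough illustration of that kind of experiment (the index name, column order and toggled parameter are only examples, not part of the original answer):

-- An index matching the LATERAL subquery's filter and sort order,
-- so each news row can fetch its latest 10 comments directly:
CREATE INDEX news_comments_news_id_date_idx
    ON news_comments (news_id, date DESC);

-- Planner method parameters can be toggled for the current session
-- to see how the chosen plan (and its cost) changes, then restored:
SET enable_seqscan = off;
RESET enable_seqscan;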

Way to reduce the cost in DB2 for count(*)

Hi, I have a DB2 query as below:
select count(*) as count
from
    table_a a,
    table_b b,
    table_c c
where
    b.xxx = 234 AND
    b.yyy = c.wedf
Result set:
Count
618543562
For the above query I even tried count(1), but when I looked at the access plan, the cost is the same.
select count(1) as count
from
    table_a a,
    table_b b,
    table_c c
where
    b.xxx = 234 AND
    b.yyy = c.wedf
Result set:
Count
618543562
Is there any other way to reduce the cost?
PS: b.xxx, b.yyy and c.wedf are indexed.
Thanks in advance.
I think one of the problems is statistics on the tables. Did you execute RUNSTATS? Probably the data distribution, or the quantity of rows that has to be read, is such that DB2 concludes it is better to read the whole table instead of processing an index and then fetching the rows from the table.
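A minimal sketch of refreshing the statistics (the schema name MYSCHEMA is only a placeholder; adjust to your own tables):

-- Refresh table and index statistics so the optimizer can cost the access plans properly.
RUNSTATS ON TABLE MYSCHEMA.TABLE_B WITH DISTRIBUTION AND DETAILED INDEXES ALL;
RUNSTATS ON TABLE MYSCHEMA.TABLE_C WITH DISTRIBUTION AND DETAILED INDEXES ALL;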
It seems that both queries are taking the same access plan, and I think they are doing table scans.
Are the three columns part of the same index, or are they indexed separately? If they are part of different indexes, is there any index ANDing between indexes in the access plan? If there is no ANDing of different indexes, the columns have to be read from the table in order to process the predicates.
The reason count(1) and count(*) give the same cost is that both have to do a table scan.
Please take a look at the access plan, not only the cost in timerons, but also the steps. Is the access plan using the indexes? How many sorts is it executing?
Try changing the optimization level, and you will see that the access plans change. I think you are executing with the default one (5).
If you want to force the query to take an index into account, you can create an optimization profile.
What is the relation between tables (B, C) and table A? In your query you just have a CROSS JOIN between A and (B, C), and that is the MAIN performance issue.
If you really need this count, just multiply the counts for A and (B, C):
select
(select count(*) from a)
*
(select count(*) from b, c where b.xxx=234 AND b.yyy=c.wedf )
For DB2 use this:
select a1.cnt1 *
       (select count(*) as cnt2 from b, c where b.xxx = 234 AND b.yyy = c.wedf)
from
    (select count(*) as cnt1 from a) a1