Reform a PostgreSQL query using one (or more) index - Example - postgresql

I am a beginner in PostgreSQL and, after understanding the very basics, I want to find out how I can get better performance on a query by using an index (one or more). I have read some documentation, but I would like a specific example so as to "catch" it.
MY EXAMPLE: Let's say I have just a table (MyTable) with three columns (Customer(text), Time(timestamp), Consumption(integer)) and I want to find the customer(s) with the maximum consumption on '2014-07-01 01:00:00'. MY SOLUTION (without index usage):
SELECT Customer FROM MyTable WHERE Time='2013-07-01 02:00:00'
AND Consumption=(SELECT MAX(consumption) FROM MyTable);
----> What would be the exact full code, using at least one index, for the query example above?

The correct query (using a correlated subquery) would be:
SELECT Customer
FROM MyTable
WHERE Time = '2013-07-01 02:00:00' AND
Consumption = (SELECT MAX(t2.consumption) FROM MyTable t2 WHERE t2.Time = '2013-07-01 02:00:00');
The above is very reasonable. An alternative approach if you want exactly one row returned is:
SELECT Customer
FROM MyTable
WHERE Time = '2013-07-01 02:00:00'
ORDER BY Consumption DESC
LIMIT 1;
And the best index is MyTable(Time, Consumption, Customer).
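A minimal sketch of creating that index (the index name is illustrative):
CREATE INDEX mytable_time_consumption_customer_idx
    ON MyTable (Time, Consumption, Customer);
With this index in place, both the WHERE Time = ... filter and the MAX(consumption) lookup in the correlated subquery can typically be answered from the index alone.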

Related

PostgREST using limit and offset in subqueries or CTE

we are using PostgREST in our project for some quite complex database views.
From some point on, when we use limit and offset (x-range headers or query parameters) with sub-selects, we get very high response times.
From what we have read, it seems like this is a known issue where PostgreSQL executes the sub-selects even for the records which are not requested. The suggested workaround is to move the offset and limit into a subselect or a CTE.
Is there an internal GUC value or something similar that we can use in the database views in order to optimize the response times? Does anybody have a hint on how to achieve this?
EDIT: as suggested, here are some more details. Let's say we have a relationship between products and parts. I want to know the parts count per product (this is a simplified version of the database views that we are exposing).
There are two ways of doing this
A. Subselect:
SELECT products.id,
       (SELECT count(part_id) AS total
        FROM parts
        WHERE product_id = products.id)
FROM products
LIMIT 1000 OFFSET 99000
B. CTE:
WITH parts_count AS (
    SELECT product_id,
           count(part_id) AS total
    FROM parts
    GROUP BY product_id
    ORDER BY product_id
)
SELECT products.id,
       parts_count.total
FROM products
LEFT JOIN parts_count ON parts_count.product_id = products.id
LIMIT 1000
OFFSET 99000
The problem with A is that the sub-select is performed for every row, so even though I read only 1,000 records there are 100,000 sub-selects.
The problem with B is that the join with the parts_count table takes very long since there are 100,000 records in it (although the WITH query itself takes only 200 ms for 2,000 records!). Ideally I would like to limit the parts_count table with the same LIMIT and OFFSET as the main query, but I can't do this in PostgREST since it just appends the LIMIT and OFFSET at the end; I don't have access to those parameters inside the WITH query.
It is unavoidable that high OFFSET leads to bad performance.
There is no other way to compute OFFSET but to scan and discard all the rows until you reach the offset, and no database in the world will be fast if OFFSET is high.
That's a conceptual problem, and the only way to avoid it is to avoid OFFSET.
If your goal is pagination, then usually keyset pagination is a better solution:
You add an ORDER BY clause that matches your requirements, make sure there is a unique key in the ORDER BY clause and remember the last value you found. To fetch the next page, add a WHERE condition with those values. With proper index support, this can be very fast.
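A minimal sketch of keyset pagination for the products example (assuming products.id is the unique, indexed ordering key; id_x is a placeholder for the last id seen on the previous page):
SELECT id
FROM products
WHERE id > id_x        -- last id from the previous page
ORDER BY id
LIMIT 1000;
The parts count can then be joined to that page of ids, as in the rewritten query below.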
For your query, a more efficient version is probably:
SELECT p.id,
       count(parts.part_id) AS total
FROM (SELECT id
      FROM products
      LIMIT 1000 OFFSET 99000) p
LEFT JOIN parts ON parts.product_id = p.id
GROUP BY p.id;
It is rather odd that you have LIMIT and OFFSET but no ORDER BY; without an ORDER BY the row order is not deterministic, so the pagination doesn't make much sense.

Most efficient way to retrieve rows of related data: subquery, or separate query with GROUP BY?

I have a very simple PostgreSQL query to retrieve the latest 50 news articles:
SELECT id, headline, author_name, body
FROM news
ORDER BY publish_date DESC
LIMIT 50
Now I also want to retrieve the latest 10 comments for each article as well. I can think of two ways to accomplish retrieving them and I'm not sure which one is best in the context of PostgreSQL:
Option 1:
Do a subquery directly for the comments in the original query and cast the result to an array:
SELECT headline, author_name, body,
       ARRAY(
           SELECT id, message, author_name
           FROM news_comments
           WHERE news_id = n.id
           ORDER BY date DESC
           LIMIT 10
       ) AS comments
FROM news n
ORDER BY publish_date DESC
LIMIT 50
Obviously, in this case, application logic would need to be aware of which index in the array is which column; that's no problem.
The one problem I see with the method is not knowing how the query planner would execute it. Would this effectively turn into 51 queries?
Option 2:
Use the original very simple query:
SELECT id, headline, author_name, body
FROM news
ORDER BY publish_date DESC
LIMIT 50
Then, via application logic, gather all of the news ids and use those in a separate query. row_number() would have to be used here in order to limit the number of results per news article:
SELECT *
FROM (
    SELECT *,
           row_number() OVER (
               PARTITION BY news_id
               ORDER BY date DESC
           ) AS rn
    FROM (
        SELECT *
        FROM news_comments
        WHERE news_id IN (123, 456, 789)
    ) s
) s
WHERE rn <= 10
This approach is obviously more complicated, and I'm not sure if it would have to retrieve all comments for the scoped news articles first, then chop off the ones where the row number is greater than 10.
Which option is best? Or is there an even better solution I have overlooked?
For context, this is a news aggregator site I've developed myself, I currently have about 40,000 news articles across several categories, with about 500,000 comments, so I'm looking for the best solution to help me keep growing.
You should investigate the execution plans for your statements using at least EXPLAIN ANALYZE. This executes the statement and shows you the plan chosen by the optimizer, along with actual run times and other statistics.
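For example, a minimal sketch (prefix the statement you want to profile; the BUFFERS option is optional but often useful):
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, headline, author_name, body
FROM news
ORDER BY publish_date DESC
LIMIT 50;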
Another solution would be to use a LATERAL subquery to retrieve the 10 comments for each news article in separate rows, but then again, you need to investigate and compare plans to choose the best approach that works for you:
SELECT
n.id, n.headline, n.author_name, n.body,
c.id, c.message, c.author_name
FROM news n
LEFT JOIN LATERAL (
SELECT id, message, author_name
FROM news_comments nc
WHERE n.id = nc.news_id
ORDER BY nc.date DESC
LIMIT 10
) c ON TRUE
ORDER BY publish_date DESC
LIMIT 50
When your query contains a LATERAL join, the subquery is evaluated once for each row retrieved from news, using the cross-reference in its WHERE clause. That makes it a repeated execution whose results are joined to each row from your source table news.
This approach saves the time your application logic would need to spend dealing with the arrays from option 1, while also avoiding the extra round trips of issuing separate queries as in option 2, saving you (in this case) the time needed to open transactions, establish connections, retrieve rows, etc.
It would also be good to look for performance improvements by creating indexes and by looking into the planner cost constants and planner method configuration parameters, which you can experiment with to understand the choices the planner has made. More on the subject here.
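As a sketch, these are the kinds of indexes worth testing for the LATERAL version above (names are illustrative and assume the columns used in that query):
CREATE INDEX news_comments_news_id_date_idx ON news_comments (news_id, date DESC);
CREATE INDEX news_publish_date_idx ON news (publish_date DESC);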

Postgresql 9.4 slow [duplicate]

I have a table:
create table big_table (
id serial primary key,
-- other columns here
vote int
);
This table is very big, approximately 70 million rows, and I need to run queries like:
SELECT * FROM big_table
ORDER BY vote [ASC|DESC], id [ASC|DESC]
OFFSET x LIMIT n -- I need this for pagination
As you may know, when x is a large number, queries like this are very slow.
For performance optimization I added indexes:
create index vote_order_asc on big_table (vote asc, id asc);
and
create index vote_order_desc on big_table (vote desc, id desc);
EXPLAIN shows that the above SELECT query uses these indexes, but it's very slow anyway with a large offset.
What can I do to optimize queries with OFFSET in big tables? Maybe PostgreSQL 9.5 or even newer versions have some features? I've searched but didn't find anything.
A large OFFSET is always going to be slow. Postgres has to order all rows and count the visible ones up to your offset. To skip all previous rows directly you could add an indexed row_number to the table (or create a MATERIALIZED VIEW including said row_number) and work with WHERE row_number > x instead of OFFSET x.
However, this approach is only sensible for read-only (or mostly) data. Implementing the same for table data that can change concurrently is more challenging. You need to start by defining desired behavior exactly.
I suggest a different approach for pagination:
SELECT *
FROM big_table
WHERE (vote, id) > (vote_x, id_x) -- ROW values
ORDER BY vote, id -- needs to be deterministic
LIMIT n;
Where vote_x and id_x are from the last row of the previous page (for both DESC and ASC). Or from the first if navigating backwards.
Comparing row values is supported by the index you already have - a feature that complies with the ISO SQL standard, but not every RDBMS supports it.
CREATE INDEX vote_order_asc ON big_table (vote, id);
Or for descending order:
SELECT *
FROM big_table
WHERE (vote, id) < (vote_x, id_x) -- ROW values
ORDER BY vote DESC, id DESC
LIMIT n;
Can use the same index.
I suggest you declare your columns NOT NULL or acquaint yourself with the NULLS FIRST|LAST construct:
PostgreSQL sort by datetime asc, null first?
Note two things in particular:
The ROW values in the WHERE clause cannot be replaced with separate member fields: WHERE (vote, id) > (vote_x, id_x) cannot be replaced with:
WHERE vote >= vote_x
AND id > id_x
That would rule out all rows with id <= id_x, while we only want to do that for the same vote and not for the next. The correct translation would be:
WHERE (vote = vote_x AND id > id_x) OR vote > vote_x
... which doesn't play along with indexes as nicely, and gets increasingly complicated for more columns.
Would be simple for a single column, obviously. That's the special case I mentioned at the outset.
The technique does not work for mixed directions in ORDER BY like:
ORDER BY vote ASC, id DESC
At least I can't think of a generic way to implement this as efficiently. If at least one of the two columns is a numeric type, you could use an expression (functional) index with an inverted value on (vote, (id * -1)) and use the same expression in ORDER BY:
ORDER BY vote ASC, (id * -1) ASC
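A sketch of such an expression index, assuming id is numeric (the expression in the index must match the one in ORDER BY; vote_x, id_x and n are placeholders as above):
CREATE INDEX big_table_vote_inv_id ON big_table (vote, (id * -1));

SELECT *
FROM big_table
WHERE (vote, (id * -1)) > (vote_x, id_x * -1)  -- values from the last row of the previous page
ORDER BY vote ASC, (id * -1) ASC
LIMIT n;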
Related:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Improve performance for order by with columns from many tables
Note in particular the presentation by Markus Winand I linked to:
"Pagination done the PostgreSQL way"
Have you tried partitioning the table?
Ease of management, improved scalability and availability, and a reduction in blocking are common reasons to partition tables. Improving query performance is not a reason to employ partitioning, though it can be a beneficial side-effect in some cases. In terms of performance, it is important to ensure that your implementation plan includes a review of query performance. Confirm that your indexes continue to appropriately support your queries after the table is partitioned, and verify that queries using the clustered and nonclustered indexes benefit from partition elimination where applicable.
http://sqlperformance.com/2013/09/sql-indexes/partitioning-benefits

PostgreSQL Aggregate groups with limit performance

I'm a newbie in PostgreSQL. Is there a way to improve execution time of the following query:
SELECT s.id, s.name, s.url,
       (SELECT array_agg(p.url)
        FROM (
            SELECT url
            FROM pages
            WHERE site_id = s.id
            ORDER BY created DESC
            LIMIT 5
        ) AS p
       ) AS last_pages
FROM sites s
I haven't found a way to put a LIMIT clause into the aggregate call, the way I can with ordering.
There are indexes on created (timestamp) and site_id (integer) in the pages table, but unfortunately there is no foreign key from pages.site_id to sites.id. The query is intended to return a list of sites with sublists of the 5 most recently created pages.
PostgreSQL version is 9.1.5.
You need to start by thinking like the database management system. You also need to think very carefully about what you are asking from the database.
Your fundamental problem here is that you likely have a very large number of separate index scans happening when a sequential scan may be quite a bit faster. Your current query gives the planner very little flexibility because the subqueries must be correlated.
A much better way to do this would be with a view (inline or not) and a window function:
SELECT s.id, s.name, s.url, array_agg(p.url)
FROM sites s
JOIN (SELECT site_id, url,
             row_number() OVER (PARTITION BY site_id ORDER BY created DESC) AS num
      FROM pages) p ON s.id = p.site_id
WHERE num <= 5
GROUP BY s.id, s.name, s.url;
This will likely change a very large number of index scans to a single large sequential scan.

Equivalent of LIMIT for DB2

How do you do LIMIT in DB2 for iSeries?
I have a table with more than 50,000 records and I want to return records 0 to 10,000, and records 10,000 to 20,000.
I know in MySQL you write LIMIT 0,10000 at the end of the query for 0 to 10,000, and LIMIT 10000,10000 at the end of the query for 10,000 to 20,000.
So, how is this done in DB2? What's the code and syntax?
(full query example is appreciated)
Using FETCH FIRST [n] ROWS ONLY:
http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.db29.doc.perf/db2z_fetchfirstnrows.htm
SELECT LASTNAME, FIRSTNAME, EMPNO, SALARY
FROM EMP
ORDER BY SALARY DESC
FETCH FIRST 20 ROWS ONLY;
To get ranges, you'd have to use ROW_NUMBER() (since v5r4) and use that within the WHERE clause: (stolen from here: http://www.justskins.com/forums/db2-select-how-to-123209.html)
SELECT code, name, address
FROM (
SELECT row_number() OVER ( ORDER BY code ) AS rid, code, name, address
FROM contacts
WHERE name LIKE '%Bob%'
) AS t
WHERE t.rid BETWEEN 20 AND 25;
I developed this method:
You NEED a table that has a unique value that can be ordered.
If you want rows 10,000 to 25,000 and your Table has 40,000 rows, first you need to get the starting point and total rows:
int start = 40000 - 10000;
int total = 25000 - 10000;
And then pass these by code to the query:
SELECT * FROM
(SELECT * FROM schema.mytable
ORDER BY userId DESC fetch first {start} rows only ) AS mini
ORDER BY mini.userId ASC fetch first {total} rows only
Support for OFFSET and LIMIT was recently added to DB2 for i 7.1 and 7.2. You need the following DB PTF group levels to get this support:
SF99702 level 9 for IBM i 7.2
SF99701 level 38 for IBM i 7.1
See here for more information: OFFSET and LIMIT documentation, DB2 for i Enhancement Wiki
Here's the solution I came up with:
select FIELD from TABLE where FIELD > LASTVAL order by FIELD fetch first N rows only;
By initializing LASTVAL to 0 (or '' for a text field), then setting it to the last value in the most recent set of records, this will step through the table in chunks of N records.
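A minimal sketch of that stepping pattern, assuming FIELD is numeric, N = 1000, and 4321 is just an example of the last value returned by the previous chunk:
-- first chunk: LASTVAL initialized to 0
SELECT FIELD FROM TABLE WHERE FIELD > 0 ORDER BY FIELD FETCH FIRST 1000 ROWS ONLY;
-- next chunk: LASTVAL set to the last FIELD value of the previous result, e.g. 4321
SELECT FIELD FROM TABLE WHERE FIELD > 4321 ORDER BY FIELD FETCH FIRST 1000 ROWS ONLY;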
#elcool's solution is a smart idea, but you need to know the total number of rows (which can even change while you are executing the query!). So I propose a modified version, which unfortunately needs 3 subqueries instead of 2:
select * from (
select * from (
select * from MYLIB.MYTABLE
order by MYID asc
fetch first {last} rows only
) I
order by MYID desc
fetch first {length} rows only
) II
order by MYID asc
where {last} should be replaced with the row number of the last record I need and {length} should be replaced with the number of rows I need, calculated as last row - first row + 1.
E.g. if I want rows from 10 to 25 (16 rows in total), {last} will be 25 and {length} will be 25-10+1=16.
Try this
SELECT * FROM
(
SELECT T.*, ROW_NUMBER() OVER() R FROM TABLE T
)
WHERE R BETWEEN 10000 AND 20000
The LIMIT clause allows you to limit the number of rows returned by the query. The LIMIT clause is an extension of the SELECT statement that has the following syntax:
SELECT select_list
FROM table_name
ORDER BY sort_expression
LIMIT n [OFFSET m];
In this syntax:
n is the number of rows to be returned.
m is the number of rows to skip before returning the n rows.
Another shorter version of LIMIT clause is as follows:
LIMIT m, n;
This syntax means skipping m rows and returning the next n rows from the result set.
A table may store rows in an unspecified order. If you don’t use the ORDER BY clause with the LIMIT clause, the returned rows are also unspecified. Therefore, it is a good practice to always use the ORDER BY clause with the LIMIT clause.
See Db2 LIMIT for more details.
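For example, a sketch using the EMP table from the earlier answer (assuming a Db2 version that supports LIMIT/OFFSET, see the PTF note above), skipping 20 rows and returning the next 10:
SELECT LASTNAME, FIRSTNAME, EMPNO, SALARY
FROM EMP
ORDER BY SALARY DESC
LIMIT 10 OFFSET 20;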
You should also consider the OPTIMIZE FOR n ROWS clause. More details on all of this in the DB2 LUW documentation in the Guidelines for restricting SELECT statements topic:
The OPTIMIZE FOR clause declares the intent to retrieve only a subset of the result or to give priority to retrieving only the first few rows. The optimizer can then choose access plans that minimize the response time for retrieving the first few rows.
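A sketch combining it with the earlier FETCH FIRST example (the clause tells the optimizer that only about 20 rows will actually be consumed):
SELECT LASTNAME, FIRSTNAME, EMPNO, SALARY
FROM EMP
ORDER BY SALARY DESC
FETCH FIRST 20 ROWS ONLY
OPTIMIZE FOR 20 ROWS;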
There are 2 solutions to paginate efficiently on a DB2 table :
1 - the technique using the row_number() function and the OVER clause, which has been presented in another post ("SELECT row_number() OVER ( ORDER BY ... )"). On some big tables, I sometimes noticed a degradation in performance.
2 - the technique using a scrollable cursor. The implementation depends on the language used. That technique seems more robust on big tables.
I presented the two techniques, implemented in PHP, during a seminar. The slides are available at this link:
http://gregphplab.com/serendipity/uploads/slides/DB2_PHP_Best_practices.pdf
Sorry, but this document is only in French.
There are these available options:
DB2 has several strategies to cope with this problem.
You can use the "scrollable cursor" in feature.
In this case you can open a cursor and, instead of re-issuing a query you can FETCH forward and backward.
This works great if your application can hold state since it doesn't require DB2 to rerun the query every time.
You can use the ROW_NUMBER() OLAP function to number rows and then return the subset you want.
This is ANSI SQL
You can use the ROWNUM pseudo column, which does the same as ROW_NUMBER() but is suitable if you have Oracle skills.
You can use LIMIT and OFFSET if you are leaning more toward a MySQL or PostgreSQL dialect.