When are the OFFSET and LIMIT keywords executed in a PostgreSQL query? - postgresql

I would like to understand when the offset and limit statements execute in a Postgresql query. Given a query with a format such as
select
    a.*,
    ( -- some subquery here
    ) as sub_query_result
from some_table a
where -- some condition
offset :offset
limit :limit
My understanding is that the table will first be filtered by the WHERE clause, and then the remaining rows will be projected into the form defined by the SELECT list.
Do OFFSET and LIMIT execute after all expressions in the SELECT list have been evaluated? Or are the WHERE, OFFSET, and LIMIT clauses applied first, and the SELECT part of the query evaluated afterwards?
I am hoping it applies the WHERE, OFFSET, and LIMIT clauses first, so that if I had a result set of, say, 10,000 rows and only wanted the 2nd page of 1,000, it would only execute the subquery 1,000 times.

A query with LIMIT but without ORDER BY makes little sense. From the documentation:
When using LIMIT, it is important to use an ORDER BY clause that constrains the result rows into a unique order. Otherwise you will get an unpredictable subset of the query's rows.
When the ORDER BY clause is present, expressions in the select list (including subqueries or functions) must be evaluated for as many rows as needed to determine the proper order. In the best case the number of computed rows may be limited to LIMIT + OFFSET, if that sum is less than the number of filtered rows. This means that (simplifying somewhat) the greater the OFFSET, the longer the query runs:
The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient.
In some cases the planner may recognize an expression as immutable and optimize it away, but generally you should expect a subquery to be executed at least LIMIT + OFFSET times. In Postgres 9.5 or earlier, the number of computed rows may be even larger if the ordering is not based on an index.
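You can check this with EXPLAIN ANALYZE: the "loops" figure on the SubPlan node shows how often the subquery was actually executed. A rough sketch (the table and column names here are invented for illustration):
EXPLAIN ANALYZE
SELECT a.id,
       (SELECT count(*)
        FROM child_table c
        WHERE c.parent_id = a.id) AS sub_query_result
FROM parent_table a
ORDER BY a.id   -- ideally backed by an index on parent_table (id)
OFFSET 1000
LIMIT 1000;
-- With an index supplying the order, the SubPlan should report loops
-- close to 2000 (LIMIT + OFFSET) rather than the full table row count.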

Related

PostgREST using limit and offset in subqueries or CTE

We are using PostgREST in our project for some quite complex database views.
From some point on, when we use limit and offset (x-range headers or query parameters) with sub-selects, we get very high response times.
From what we have read, this seems to be a known issue where PostgreSQL executes the sub-selects even for the records which are not requested. The suggested workaround is to juggle the offset and limit a little, putting them into a subselect or a CTE.
Is there an internal GUC value or something similar that we can use in the database views in order to optimize the response times? Does anybody have a hint on how to achieve this?
EDIT: as suggested, here are some more details. Let's say we have a relationship between products and parts. I want to know the parts count per product (this is a simplified version of the database views that we are exposing).
There are two ways of doing this:
A. Subselect:
SELECT products.id
    ,(
        SELECT count(part_id) AS total
        FROM parts
        WHERE product_id = products.id
    )
FROM products
LIMIT 1000 OFFSET 99000
B. CTE:
WITH parts_count AS (
    SELECT product_id
        ,count(part_id) AS total
    FROM parts
    GROUP BY product_id
    ORDER BY product_id
)
SELECT products.id
    ,parts_count.total
FROM products
LEFT JOIN parts_count ON parts_count.product_id = products.id
LIMIT 1000
OFFSET 99000
The problem with A is that the sub-select is performed for every row, so even though I read only 1,000 records, there are 100,000 subselects.
The problem with B is that the join with the parts_count table takes very long, since there are about 100,000 records in it (even though the WITH query itself takes only 200 ms for 2,000 records!). Ideally I would like to limit the parts_count table with the same limit and offset as the main query, but I can't do this in PostgREST since it just appends the limit and offset at the end; I don't have access to those parameters inside the WITH query.
It is unavoidable that high OFFSET leads to bad performance.
There is no other way to compute OFFSET but to scan and discard all the rows until you reach the offset, and no database in the world will be fast if OFFSET is high.
That's a conceptual problem, and the only way to avoid it is to avoid OFFSET.
If your goal is pagination, then usually keyset pagination is a better solution:
You add an ORDER BY clause that matches your requirements, make sure there is a unique key in the ORDER BY clause, and remember the last values you found. To fetch the next page, add a WHERE condition with those values. With proper index support, this can be very fast.
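A minimal sketch, assuming a products table with a unique, indexed id column (the name column is made up):
-- First page:
SELECT id, name
FROM products
ORDER BY id
LIMIT 1000;

-- Remember the last id of that page (say it was 1000),
-- then fetch the next page from there:
SELECT id, name
FROM products
WHERE id > 1000   -- last value seen on the previous page
ORDER BY id
LIMIT 1000;
With an index on id, each page reads only the rows it returns, no matter how deep the page is.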
For your query, a more efficient version is probably:
SELECT p.id,
       count(parts.part_id) AS total
FROM (SELECT id FROM products
      LIMIT 1000 OFFSET 99000) p
LEFT JOIN parts ON parts.product_id = p.id
GROUP BY p.id;
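This way LIMIT and OFFSET are applied to products before the join, so the parts are counted only for the 1,000 product rows actually returned (the cost of skipping the first 99,000 product ids remains, but the expensive per-product work does not).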
It is rather weird that you have no ORDER BY, but LIMIT and OFFSET. That doesn't make much sense.

Which row is returned using DISTINCT ON in Postgres

When I use DISTINCT ON in PostgreSQL (distinct() in Django), which row is retrieved from a group of rows with the same fields?
The documentation says:
A set of rows for which all the expressions are equal are considered
duplicates, and only the first row of the set is kept in the output.
Note that the "first row" of a set is unpredictable unless the query
is sorted on enough columns to guarantee a unique ordering of the
rows arriving at the DISTINCT filter.
So if you add an ORDER BY clause, the first row in that order is kept.
Without an ORDER BY clause, there is no way of telling which row will be kept.
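For example, a sketch with a hypothetical events table, keeping the most recent row per user:
SELECT DISTINCT ON (user_id)
       user_id, created_at, payload
FROM events
ORDER BY user_id, created_at DESC;
-- ORDER BY must start with the DISTINCT ON expression(s);
-- created_at DESC then determines which row is "first" and therefore kept.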

T-SQL: UNION changes result order, UNION ALL doesn't

I know UNION removes duplicates but it changes result order even when there are no duplicates.
I have two select statements, no order by statement anywhere
I want to union them, with or without ALL, i.e.
SELECT A
UNION (all)
SELECT B
"Select B" actually contains nothing, no entry will be returned
if I use "Select A union Select B", the order of the result is different from just "Select A"
if I use:
SELECT A
UNION ALL
SELECT B
the order of the result is the same as "SELECT A" itself, and there are no duplicates in "SELECT A" at all.
Why is this? It seems unpredictable.
The only way to get a particular order of results from an SQL query is to use an ORDER BY clause. Anything else is just relying on coincidence and the particular (transitory) state of the server at the time you issue your query.
So if you want/need a particular order, use an ORDER BY.
As to why it changes the ordering of results - first, UNION (without ALL) guarantees to remove all duplicates from the result - not just duplicates arising from the different queries - so if the first query returns duplicate rows and the second query returns no rows, UNION still has to eliminate them.
One common, easy way to determine whether you have duplicates in a bag of results is to sort those results (in whatever sort order is most convenient to the system) - in this way, duplicates end up next to each other and so you can then just iterate over these sorted results and if(results[index] == results[index-1]) skip;.
So, you'll commonly find that the results of a UNION (without ALL) query have been sorted - in some arbitrary order. But, to re-emphasise the original point, what ordering was applied is not defined, and certainly shouldn't be relied upon - any patches to the software, changes in indexes or statistics may result in the system choosing a different sort order the next time the query is executed - unless there's an ORDER BY clause.
One of the most important points to understand about SQL is that a table has no guaranteed order, because a table is supposed to represent a set (or multiset if it has duplicates), and a set has no order. This means that when you query a table without specifying an ORDER BY clause, the query returns a table result, and SQL Server is free to return the rows in the output in any order. If the results happen to be ordered, it may be due to optimization reasons. The point I'm trying to make is that any order of the rows in the output is considered valid, and no specific order is guaranteed. The only way for you to guarantee that the rows in the result are sorted is to explicitly specify an ORDER BY clause.
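To make that concrete, a sketch with hypothetical tables t1 and t2; if you need a guaranteed order, attach the ORDER BY to the whole set operation:
-- Row order here is unspecified; UNION may sort or hash to deduplicate:
SELECT a FROM t1
UNION
SELECT a FROM t2;

-- Guaranteed order: the ORDER BY applies to the combined result:
SELECT a FROM t1
UNION ALL
SELECT a FROM t2
ORDER BY a;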

Is PostgreSQL IN() statement still fast with up to 1000 arguments?

I'm querying to return all rows from a table except those that are in some list of values that is constant at query time, e.g. SELECT * FROM table WHERE id IN (%), where % is guaranteed to be a list of values, not a subquery. However, this list of values may be up to 1000 elements long in some cases. Should I limit this to a smaller sublist (50-100 elements is as low as I can go in this case), or will the performance gain be negligible?
I assume it's a large table, otherwise it wouldn't matter much.
Depending on the table size and the number of keys, this may turn into a sequential scan. If there are many IN keys, Postgres often chooses not to use an index for them. The more keys, the bigger the chance of a sequential scan.
If you use another indexed column in WHERE, like:
select * from table where id in (%) and my_date > '2010-01-01';
It's likely to fetch all rows matching the indexed column (my_date) and then perform an in-memory scan on them with the IN condition.
Using a JOIN against a persistent or temporary table may help, but does not have to. It will still need to locate all the rows, either with a nested loop (unlikely for large data sets) or with a hash/merge join.
I would say the solution is:
Use as few IN keys as possible.
Use other criteria for indexing and querying whenever possible. If IN requires an in-memory scan of all rows, at least there will be fewer of them thanks to additional criteria.
Use a temporary table to JOIN against; it gives better performance and has no limits. An IN() with 1000 arguments will give you problems in any database. A sketch follows.
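A sketch of the temporary-table variant (all names are hypothetical): load the keys once, let the planner see their statistics, then join.
CREATE TEMP TABLE wanted_ids (id integer PRIMARY KEY);
INSERT INTO wanted_ids (id) VALUES (1), (2), (3);   -- ... up to 1000 keys
ANALYZE wanted_ids;   -- give the planner accurate row estimates

SELECT t.*
FROM big_table t
JOIN wanted_ids w ON w.id = t.id;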

Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094

SELECT DISTINCT tblJobReq.JobReqId
, tblJobReq.JobStatusId
, tblJobClass.JobClassId
, tblJobClass.Title
, tblJobReq.JobClassSubTitle
, tblJobAnnouncement.JobClassDesc
, tblJobAnnouncement.EndDate
, tblJobAnnouncement.AgencyMktgVerbage
, tblJobAnnouncement.SpecInfo
, tblJobAnnouncement.Benefits
, tblSalary.MinRateSal
, tblSalary.MaxRateSal
, tblSalary.MinRateHour
, tblSalary.MaxRateHour
, tblJobClass.StatementEval
, tblJobReq.ApprovalDate
, tblJobReq.RecruiterId
, tblJobReq.AgencyId
FROM ((tblJobReq
LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE (tblJobReq.JobClassId in (SELECT JobClassId
from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
When I try to execute the query, it results in the following error:
Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094
I checked and didn't find any solution. The only workaround I see is to truncate (with substring()) the tblJobAnnouncement.JobClassDesc column in the query, which has a size of around 8000.
Is there any workaround so that I need not truncate the values? Or can this query be optimised? Is there any setting in SQL Server 2000?
The [non-obvious] reason why SQL needs to SORT is the DISTINCT keyword.
Depending on the data and underlying table structures, you may be able to do away with this DISTINCT, and hence not trigger this error.
You readily found the alternative solution which is to truncate some of the fields in the SELECT list.
Edit: Answering "Can you please explain how DISTINCT would be the reason here?"
Generally, the fashion in which the DISTINCT requirement is satisfied varies with
the data context (expected number of rows, presence/absence of index, size of row...)
the version/make of the SQL implementation (the query optimizer in particular receives new or modified heuristics with each new version, sometimes resulting in alternate query plans for various constructs in various contexts)
Yet, all the possible plans associated with a "DISTINCT query" involve *some form* of sorting of the qualifying records. In its simplest form, the plan first produces the list of qualifying rows (the rows which satisfy the WHERE/JOIN/etc. parts of the query) and then sorts this list (which possibly includes some duplicates), retaining only the first occurrence of each distinct row. In other cases, for example when only a few columns are selected and when some index(es) covering these columns is (are) available, no explicit sorting step is used in the query plan, but the reliance on an index implicitly implies the "sortability" of the underlying columns. In yet other cases, steps involving various forms of merging or hashing are selected by the query optimizer, and these too, eventually, imply the ability to compare two rows.
Bottom line: DISTINCT implies some sorting.
In the specific case of the question, the error reported by SQL Server preventing the completion of the query is that "sorting is not possible on rows bigger than...", and the DISTINCT keyword is the only apparent reason for the query to require any sorting (BTW, many other SQL constructs imply sorting: for example UNION), hence the idea of removing the DISTINCT (if it is logically possible).
In fact you should remove it, for test purposes, to confirm that without DISTINCT the query completes OK (even if it includes some duplicates). Once this is confirmed, and if the query can indeed produce duplicate rows, look into ways of producing a duplicate-free result without the DISTINCT keyword; constructs involving subqueries can sometimes be used for this purpose, as in the sketch below.
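As an illustrative sketch only (orders/order_items are invented, not the asker's schema): when a join exists purely to filter rows, rewriting it as EXISTS avoids the row multiplication that made DISTINCT necessary, and with it the sort.
-- The join multiplies rows, so DISTINCT (and a sort) is needed:
SELECT DISTINCT o.id, o.big_text_column
FROM orders o
JOIN order_items i ON i.order_id = o.id;

-- EXISTS filters without duplicating rows: no DISTINCT, no sort:
SELECT o.id, o.big_text_column
FROM orders o
WHERE EXISTS (SELECT 1 FROM order_items i WHERE i.order_id = o.id);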
An unrelated hint is to use table aliases, i.e. short strings, to avoid repeating the long table names. For example (I only did a few tables, but you get the idea...):
SELECT DISTINCT JR.JobReqId, JR.JobStatusId,
tblJobClass.JobClassId, tblJobClass.Title,
JR.JobClassSubTitle, JA.JobClassDesc, JA.EndDate, JA.AgencyMktgVerbage,
JA.SpecInfo, JA.Benefits,
S.MinRateSal, S.MaxRateSal, S.MinRateHour, S.MaxRateHour,
tblJobClass.StatementEval,
JR.ApprovalDate, JR.RecruiterId, JR.AgencyId
FROM (
(tblJobReq AS JR
LEFT JOIN tblJobAnnouncement AS JA ON JR.JobReqId = JA.JobReqId)
INNER JOIN tblJobClass ON JR.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary AS S ON tblJobClass.SalaryCode = S.SalaryCode
WHERE (JR.JobClassId in
(SELECT JobClassId from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
FYI, running this SQL command on your DB can fix the problem if it is caused by space that needs to be reclaimed after dropping variable length columns:
DBCC CLEANTABLE (0, 'dbo.TableName')
See: http://msdn.microsoft.com/en-us/library/ms174418.aspx
This is a limitation of SQL Server 2000. You can:
Split it into two queries and combine elsewhere
SELECT ID, ColumnA, ColumnB FROM TableA JOIN TableB ON ...
SELECT ID, ColumnC, ColumnD FROM TableA JOIN TableB ON ...
Truncate the columns appropriately
SELECT LEFT(LongColumn,2000)...
Remove any redundant columns from the SELECT
SELECT ColumnA, ColumnB --, IDColumnNotUsedInOutput
FROM TableA
Migrate off of SQL Server 2000