Which row is returned using DISTINCT ON in Postgres - postgresql

When I use DISTINCT ON in PostgreSQL (distinct in django) which rows are retrieved in the group of rows with same fields?

The documentation says:
A set of rows for which all the expressions are equal are considered
duplicates, and only the first row of the set is kept in the output.
Note that the "first row" of a set is unpredictable unless the query
is sorted on enough columns to guarantee a unique ordering of the
rows arriving at the DISTINCT filter.
So if you add an ORDER BY clause, the first row in that order is kept.
Without an ORDER BY clause, there is no way of telling which row will be kept.

Related

How PostgreSQL orders the data while using a DISTINCT * option

When i use SELECT * FROM table, PostgreSQL is returning the data ordered by id. But when i use SELECT DISTINCT * FROM table, PostgreSQL is returning the same dataset as there are no duplicates but the order has been changed which is beyond my understanding.
How does PostgreSQL sort the data while using DISTINCT * and without specifying any ORDER BY clause.
If you put DISTINCT into a query, PostgreSQL sorts the result set by all result columns in order to eliminate duplicates. The sort order is “implementation defined” unless you add an explicit ORDER BY clause.
Two remarks:
without the DISTINCT, the table is returned in id order because you inserted it that way and performed no updates or deletes, and because there are no concurrent sequential scans on the table. You can never rely on an order in the result set unless you use ORDER BY.
DISTINCT can be very expensive on large result sets. Use it only if you are certain you need it.

Why does Postgres choose different data solely based on columns selected?

I'm running two different queries with two unions each inside a subquery:
So the structure is:
SELECT *
FROM (subquery_1
UNION SELECT subquery_2)
Now, if I perform the query on the left, I get this result:
However, the query on the right returns this result:
How are the results differing even though the conditions have not changed in either query, and the only difference was one of the selected columns in a subquery?
This is very counter-intuitive.
The operator UNION removes duplicate rows from the returned resultset.
Removing a column from the SELECT statement may produce duplicate rows that would not exist if the removed column was there.
Try UNION ALL instead, which will return in any case all the rows of the unioned queries.
See a simplified demo.

When are the offset and limit keywords executed in a Postgresql query?

I would like to understand when the offset and limit statements execute in a Postgresql query. Given a query with a format such as
select
a.*,
(-- some subquery here) as sub_query_result
from some_table a
where -- some condition
offset :offset
limit :limit
My understanding is the table will first be filtered using the where statement, and then the remaining rows will be projected into the form as defined by select statement.
Do the offset and limit statements execute after all operations have occurred in the select statement? Or does it apply the where, offset, and limit statements first and then the select part of the query?
I am hoping that it applies the where, offset, and limit statements first that if I had a result set say of 10,000 rows, and I only want the 2nd page of 1000, it would only execute the subquery 1000 times, for example.
A query with LIMIT but without ORDER BY makes a little sense. From the documentation:
When using LIMIT, it is important to use an ORDER BY clause that constrains the result rows into a unique order. Otherwise you will get an unpredictable subset of the query's rows.
When the ORDER BY clause is present, expressions in the select list (including subqueries or functions) must be evaluated for as many rows as needed to determine the proper order. In best scenarios the number of computed rows may be limited to the sum LIMIT + OFFSET, if the sum is less than the number of filtered rows. This means that (in some simplification) the greater OFFSET the longer the query is run:
The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient.
In some cases there may be optimizations when the planner will recognize an expression as immutable, but generally you should expect that a subquery will be executed at least LIMIT + OFFSET times. In Postgres 9.5 or earlier the number of computed rows may be even larger if the ordering is not based on an index.

Tsql, union changes result order, union all doesn't

I know UNION removes duplicates but it changes result order even when there are no duplicates.
I have two select statements, no order by statement anywhere
I want union them with or without (all)
i.e.
SELECT A
UNION (all)
SELECT B
"Select B" actually contains nothing, no entry will be returned
if I use "Select A union Select B", the order of the result is different from just "Select A"
if I use:
SELECT A
UNION ALL
SELECT B
the order of the result is the same as "Select A" itself and there are no duplicates in "Select A" at all.
Why is this? it is unpredictable.
The only way to get a particular order of results from an SQL query is to use an ORDER BY clause. Anything else is just relying on coincidence and the particular (transitory) state of the server at the time you issue your query.
So if you want/need a particular order, use an ORDER BY.
As to why it changes the ordering of results - first, UNION (without ALL) guarantees to remove all duplicates from the result - not just duplicates arising from the different queries - so if the first query returns duplicate rows and the second query returns no rows, UNION still has to eliminate them.
One common, easy way to determine whether you have duplicates in a bag of results is to sort those results (in whatever sort order is most convenient to the system) - in this way, duplicates end up next to each other and so you can then just iterate over these sorted results and if(results[index] == results[index-1]) skip;.
So, you'll commonly find that the results of a UNION (without ALL) query have been sorted - in some arbitrary order. But, to re-emphasise the original point, what ordering was applied is not defined, and certainly shouldn't be relied upon - any patches to the software, changes in indexes or statistics may result in the system choosing a different sort order the next time the query is executed - unless there's an ORDER BY clause.
One of the most important points to understand about SQL is that a table has no guaranteed order, because a table is supposed to represent a set (or multiset if it has duplicates), and a set has no order. This means that when you query a table without specifying an ORDER BY clause, the query returns a table result, and SQL Server is free to return the rows in the output in any order. If the results happen to be ordered, it may be due to optimization reasons. The point I'm trying to make is that any order of the rows in the output is considered valid, and no specific order is guaranteed. The only way for you to guarantee that the rows in the result are sorted is to explicitly specify an ORDER BY clause.

PostgreSQL changing returned rows order

I have a table named categories, which contains ID(long), Name(varchar(50)), parentID(long), and shownByDefault(boolean) columns.
This table contains 554 records. All the shownByDefaultValues are 'false'.
When I execute 'select id, name from categories', pg returns me all the categories,
orderer by its id.
Then I update some of the rows of the table('update categories set shownByDefault where parentId = 1'), update OK.
Then, when I try to execute the first query, which returns all the categories, they are
returner with a very weird order.
I do not have problem to add 'order by', but since I am using JPA to get this values, anyone knows what the problem is or if there is a way to fix this?
That's not a problem. The order of rows returned by a SQL SELECT is undefined unless it has an ORDER BY. The order you get them is usually influenced by the order they are stored in the table and/or the indices that are used by the statement.
So depending on that order without using ORDER BY is a very, very bad idea.
If you need them in some order, simply specify that.
It is important that a table is a set of rows and not a sequence of rows.
From the docs:
If the ORDER BY clause is specified, the returned rows are sorted in the specified order. If ORDER BY is not given, the rows are returned in whatever order the system finds fastest to produce.
The rows are returned in whatever their physical order on disk is; you can reorder them physically using the CLUSTER SQL command, but due to the way Postgres works they'll become unordered as soon as you start modifying rows.
For what you're doing an ORDER BY is the right answer.