pg 9.4 vs 10 differences on order by random() - postgresql

The following query is not randomizing the array in Postgres 10. Is this expected behaviour?
select array(select generate_series(1,10) order by random());
v9.4.15
array
------------------------
{7,1,10,6,2,8,9,4,5,3}
v10.4
array
------------------------
{1,2,3,4,5,6,7,8,9,10}

This is a consequence of commit 69f4b9c85f168ae006929eec44fc44d569e846b9, which changed how set-returning functions in the SELECT list are handled.
Tim's answer and your comment show how to deal with the problem.
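For reference, the usual fix is to move the set-returning function into the FROM clause, so that random() is evaluated once per row again. A sketch (standard syntax, which should behave the same on 9.4 and 10):
select array(select x
             from generate_series(1, 10) as t(x)  -- SRF moved out of the SELECT list
             order by random());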

I think the issue here is that the newer version of Postgres has a smarter optimizer, which caches away the value of random() after a single call to that function.
One workaround is to force a new random value to be calculated for each record. We can add a dummy WHERE clause to force this:
WITH cte AS (
    SELECT generate_series(1, 10) AS col
)
SELECT col
FROM cte
WHERE col IS NOT NULL
ORDER BY random();
Demo
You may observe in the demo that the order is in fact random. However, if you run your original query in the same demo, the order won't be random.
Edit:
The reason this trick works is that the WHERE clause convinces the optimizer that you really care about the values being used in each record. Therefore, it calls the function in the ORDER BY once for each record rather than caching it.

Related

PostgreSQL: what's wrong with first_value(unique_column) OVER ()?

Pursuant to PostgreSQL: detecting the first/last rows of result set, I've been given reason to suspect that such a clause is dangerous or otherwise inappropriate, and want to understand that better. Take:
SELECT last_value(unique_column) OVER (), * FROM mytable;
unique_column is unique and not null. So what's wrong with using OVER () in this way? Is it dangerous/unreliable? Suboptimal? From what I can tell, this should return the value from the last row in the result set—at least, it has when I've tried it. I've been told that "last" doesn't make sense without sorting, but clearly there is a last row that is returned. I've also been told that OVER () means "anything goes", which suggests that the results are unreliable, but so far, every time I've run that kind of query, I've been consistently given the value from the end of the result set.
Now I have found a problem if I use ORDER BY:
SELECT last_value(unique_column) OVER (), * FROM mytable ORDER BY something_else;
But, my solution to that is to subquery:
SELECT last_value(unique_column) OVER (), * FROM (SELECT * FROM mytable ORDER BY something_else) sub;
It's as if OVER () means the analytic functions (like first_value() and last_value()) operate according to the order in which the engine happens to read the table/subquery. And, from what I can tell, you have enough control over the order in which the engine happens to read the table/subquery (without having to do unnecessary sorting).
I'm running PostgreSQL 9.6 in a Debian 9.5 environment.
You should provide an ORDER BY inside the OVER clause:
SELECT *,
       last_value(unique_column)
           OVER (ORDER BY sth_else ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM mytable;
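The explicit frame clause is the important part here. With an ORDER BY but no frame clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so last_value() would just return the current row's value. A minimal comparison (same hypothetical column names as above):
-- default frame ends at the current row: last_value() = current row's value
SELECT last_value(unique_column) OVER (ORDER BY sth_else)
FROM mytable;

-- explicit frame covers the whole partition: last_value() = last row's value
SELECT last_value(unique_column)
       OVER (ORDER BY sth_else ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM mytable;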
In the last few months this solution has worked out rather well, and I've not been shown an alternative, so I'm going to continue using it. However, I should point out that it is finicky and can fail if you make certain changes without taking the analytics into consideration. (No doubt I'm misusing the feature, and it was not developed for this purpose.) So I'll use this space to record the gotchas as I find them.
If you order your results, you've got a problem, but I've already explained that in the question.
I tried to use it in an outer join. Since the join produced NULLs for fields in the result set (even though they are taken from table columns that cannot be null), this caused OVER () to return NULL. I have a few ideas about how to get around this, but they would make the query very ugly and possibly very inefficient.

Will Postgres' DISTINCT function always return null as the first element?

I'm selecting distinct values from tables through Java's JDBC connector, and it seems that the NULL value (if there's any) is always the first row in the ResultSet.
I need to remove this NULL from the List where I load this ResultSet. The logic looks only at the first element and if it's null then ignores it.
I'm not using any ORDER BY in the query, can I still trust that logic? I can't find any reference in Postgres' documentation about this.
You can add a check for NOT NULL, like:
select distinct columnName
from Tablename
where columnName IS NOT NULL
Also, if you do not provide an ORDER BY clause, the order in which you get the results is not guaranteed, so you cannot rely on it. It is better to provide an ORDER BY clause if you want your results in a particular order (i.e., ascending or descending).
If you are looking for a reference, the PostgreSQL documentation says:
If ORDER BY is not given, the rows are returned in whatever order the
system finds fastest to produce.
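If you actually want the NULL in a predictable position instead of filtering it out, PostgreSQL lets you say so explicitly (by default NULLs sort last in ascending order). Using the same names as above:
select distinct columnName
from Tablename
order by columnName nulls first;  -- or NULLS LAST; don't rely on the unordered default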
If it is not stated in the manual, I wouldn't trust it. However, just for fun, and to try to figure out what logic is being used: running the following query does bring the NULL (for no apparent reason) to the top, while all other values are in an apparently random order:
with t(n) as (values (1),(2),(1),(3),(null),(8),(0))
select distinct * from t
However, cross joining the table with a modified version of itself brings two NULLs to the top but leaves other NULLs dispersed throughout the result set. So there doesn't seem to be any clear-cut logic clumping all NULL values at the top.
with t(n) as (values (1),(2),(1),(3),(null),(8),(0))
select distinct * from t
cross join (select n+3 from t) t2

Calling a function on distinct values from table

I've got a SQL Server 2005 database. I need to get distinct values in addition to calling a function on those distinct values. I'm not sure how the distinct works when there is a function call involved. For example, I have this query:
SELECT DISTINCT a, b, c, fcn_DoSomething(a, b, c) AS z FROM users
I'm guessing that the function (fcn_DoSomething) is being called for all of the values in the table, not the distinct values. Am I correct? If so, how can I write the query to call the function only on distinct values of a,b,c? I know one option is to use a temporary table, but if anyone has better ideas that would be great.
Thanks
This got me curious, so I did a bit of basic testing. I created a small table with some distinct and some repeating values, a function that just does string concatenation, and then looked at the execution plans for:
--Clear the cache
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
GO
select distinct cola, colb, dbo.sillyfunc(cola, colb)
from distincttest
GO
--Clear the cache again before the second run
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
GO
select cola, colb, dbo.sillyfunc(cola, colb)
from (select distinct cola, colb from distincttest) as t
In this case, the execution plans showed clearly that the first one ran the concatenation function for every single row, but the second did the sort for distinct values first, then ran the function. But for a small number of rows, they had the same execution time, and when run together they showed each one using 50% of the total query resources.
So I added a few hundred thousand repeating rows and tried again. This changed the query plan so that it used a hash match to get distinctness rather than the former sort, and now the second version, which forced it to select for distinctness first, executed more than ten times faster.
Finally, I thought there was a chance that this might just be because SQL Server had my sillyfunc marked as nondeterministic (select OBJECTPROPERTYEX(object_id('dbo.sillyfunc'), 'isdeterministic') returned 0), so I switched to patindex, which is a built-in function and considered deterministic. This gave me the same results, with the function being called for every row in the first version and just for the few distinct ones in the second version.
So, it's possible that further testing would find situations that would coax the optimizer into something more sophisticated, but it appears that if you want the distinct applied before the function is called, you need to use something like a subquery, CTE, or temp table to limit what the function has access to.
This would ensure that the function only got called on distinct values.
select *, fcn_DoSomething(a, b, c)
from (select distinct a, b, c from users) v
However, I believe that the function call will be optimised, so it may not make a difference. Give it a try.
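If you prefer a CTE over the inline subquery, the same idea looks like this (a sketch using the question's names):
-- force the distinct to happen before the function call
with distinct_users as (
    select distinct a, b, c
    from users
)
select a, b, c, dbo.fcn_DoSomething(a, b, c) as z  -- scalar UDFs need the schema prefix in T-SQL
from distinct_users;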

PostgreSQL array_agg order

Table 'animals':
animal_name | animal_type
------------+-------------
Tom         | Cat
Jerry       | Mouse
Kermit      | Frog
Query:
SELECT
array_to_string(array_agg(animal_name),';') animal_names,
array_to_string(array_agg(animal_type),';') animal_types
FROM animals;
Expected result:
Tom;Jerry;Kermit, Cat;Mouse;Frog
OR
Tom;Kermit;Jerry, Cat;Frog;Mouse
Can I be sure that the order in the first aggregate function will always be the same as in the second?
I mean I wouldn't like to get:
Tom;Jerry;Kermit, Frog;Mouse;Cat
Use an ORDER BY, like this example from the manual:
SELECT array_agg(a ORDER BY b DESC) FROM table;
If you are on a PostgreSQL version < 9.0 then:
From: http://www.postgresql.org/docs/8.4/static/functions-aggregate.html
In the current implementation, the order of the input is in principle unspecified. Supplying the input values from a sorted subquery will usually work, however. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
So in your case you would write:
SELECT
array_to_string(array_agg(animal_name),';') animal_names,
array_to_string(array_agg(animal_type),';') animal_types
FROM (SELECT animal_name, animal_type FROM animals) AS x;
The input to the array_agg would then be unordered but it would be the same in both columns. And if you like you could add an ORDER BY clause to the subquery.
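For example, to fix the order of both arrays in one go, sort the subquery:
SELECT
    array_to_string(array_agg(animal_name), ';') animal_names,
    array_to_string(array_agg(animal_type), ';') animal_types
FROM (SELECT animal_name, animal_type FROM animals ORDER BY animal_name) AS x;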
According to Tom Lane:
... If I read it right, the OP wants to be sure that the two aggregate functions will see the data in the *same* unspecified order. I think that's a pretty safe assumption. The server would have to go way out of its way to do differently, and it doesn't.
... So it is documented behavior that an aggregate without its own ORDER BY will see the rows in whatever order the FROM clause supplies them.
So I think it's fine to assume that all the aggregates, none of which uses ORDER BY, in your query will see input data in the same order. The order itself is unspecified though (which depends on the order the FROM clause supplies rows).
Source: PostgreSQL mailing list
Do this:
SELECT
    array_to_string(array_agg(animal_name ORDER BY animal_name), ';') animal_names,
    array_to_string(array_agg(animal_type ORDER BY animal_type), ';') animal_types
FROM animals;
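One caveat with this: ordering each aggregate by a different column means the two lists are no longer row-aligned (Jerry would end up next to Cat). If the name/type pairs must correspond position by position, order both aggregates by the same expression:
SELECT
    array_to_string(array_agg(animal_name ORDER BY animal_name), ';') animal_names,
    array_to_string(array_agg(animal_type ORDER BY animal_name), ';') animal_types
FROM animals;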

Why does this Oracle 10g SQL run slow only when I query a subquery with a where clause?

I can't paste in the entire SQL for various reasons, so consider this example:
select *
from
(select nvl(get_quantity(1), 10) available_qty
from dual)
where available_qty > 30;
get_quantity is a function that makes a calculation based on the ID of a record that's passed into it. If it returns null, I use nvl() to force it to 10.
The query runs very slow when I use the WHERE clause in the parent query. When I comment out the WHERE clause, however, it runs very fast. What I don't get is why it can display the data very fast, but it can't query it just as fast. I am querying the results of a subquery, too. I was under the impression that subqueries return a "rendered" dataset. It's almost as if querying the available_qty identifier is causing it to reference something within the subquery.
This is why I don't think the contents of the get_quantity function are relevant here, so I didn't bother posting it. Instead, I think it's a misunderstanding on my part of how Oracle handles subqueries and whatnot.
Do any of you Oracle gurus have any idea what I am doing wrong?
Afterthought: as I was entering tags for this question, the tag "correlated subquery" came up. In doing some quick research, it seems that this type of subquery somewhat depends on the outer query. Could this be related to my problem?
Let's try an experiment. First we'll run the following query:
select lvl, rnd
from (select level as lvl from dual connect by level <= 5) a,
(select dbms_random.value() rnd from dual) b;
The "a" subquery will return 5 rows with values from 1 to 5. The "b" subquery will return one row with a random value. If the function is run before the two tables are join (by Cartesian), the same random value will be returned for each row. The actual results:
       LVL        RND
---------- ----------
         1 .417932089
         2 .963531718
         3 .617016889
         4 .128395638
         5 .069405568
5 rows selected.
Clearly the function was run for each of the joined rows, not once for the subquery before the join. This is a result of Oracle's optimizer deciding that the best path for the query is to do things in that order. To prevent this, we have to add something to the second subquery that will make Oracle run the subquery in its entirety before performing the join. We'll add rownum to the subquery, since Oracle knows rownum will change if it's evaluated after the join. The following query demonstrates this:
select lvl, rnd
from (select level as lvl from dual connect by level <= 5) a,
     (select dbms_random.value() rnd, rownum from dual) b;
As you can see from the results, the function was only run once in this case:
       LVL        RND
---------- ----------
         1 .028513902
         2 .028513902
         3 .028513902
         4 .028513902
         5 .028513902
5 rows selected.
In your case, it seems likely that the filter provided by the where clause is making the optimizer take a different path, where it's running the function repeatedly, rather than once. By making Oracle run the subquery as written, you should get more consistent run-times.
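Applied to the query from the question, that would look something like this (a sketch; untested against the real schema):
select *
from (select nvl(get_quantity(1), 10) available_qty,
             rownum rn  -- forces Oracle to evaluate the subquery before the outer filter
      from dual)
where available_qty > 30;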