CTE Execution Order - postgresql

In the following CTE statement:
CREATE TABLE test_table (Field1 INTEGER, Field2 INTEGER);
CREATE TABLE test_table2 (Field1 INTEGER, Field2 INTEGER);
WITH table_stage1(fld1, fld2) AS
(SELECT Field1, Field2 FROM test_table)
, table_stage2 AS
(SELECT fld1, fld2 FROM table_stage1 GROUP BY fld1, fld2)
, table_stage3 AS
(SELECT fld1 FROM table_stage1 GROUP BY fld1)
INSERT INTO test_table2(Field1, Field2)
SELECT t1.fld1, t1.fld2
FROM table_stage2 t1
JOIN table_stage3 t2
ON t1.fld1 = t2.fld1;
Can I assume the following order of query execution:
SELECT inside the WITH statement
Concurrent execution of SELECT inside table_stage2, and table_stage3 segments
INSERT INTO waits till execution of table_stage2, table_stage3 is finished
This question is not related to a particular (presented) statement.
I would like to know whether selecting from a named segment guarantees that the current statement will be executed after that named segment.
Maybe what is significant is having a number of SELECT statements followed by a writing CTE that joins results from previous segments.
In the PostgreSQL documentation we can read:
The sub-statements in WITH are executed concurrently with each other
and with the main query. Therefore, when using data-modifying
statements in WITH, the order in which the specified updates actually
happen is unpredictable.
I am working on PostgreSQL 9.6

The part of the documentation that you cite refers to the execution of more than one data-modifying statement ("when using data-modifying statements ... the order ... is unpredictable"). But if one statement uses the name of a previous statement, the current statement refers to all the tuples returned by the previous one.
So, in your example, the statements for table_stage2 and table_stage3 can be executed in parallel, but each uses all the tuples returned by table_stage1, while the final INSERT is executed using all the tuples returned by those two statements (and so, indirectly, all the tuples produced by the previous three statements).
Note that “B uses all the tuples returned by A” is not necessarily equivalent to “B is executed after A”: the optimizer can in fact transform B so that it is not necessary to execute A. It is only a definition of the semantics and is not tied to an actual implementation.
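As a minimal sketch of the point above, the chained statement can be run end to end and its result checked. This assumes SQLite (through Python's sqlite3 module) as a stand-in for PostgreSQL, with made-up sample rows, and it selects t1.fld2 because table_stage3 exposes only fld1. It illustrates only the semantics (the INSERT sees every tuple of the CTEs it references), not how PostgreSQL schedules the work.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_table  (Field1 INTEGER, Field2 INTEGER);
    CREATE TABLE test_table2 (Field1 INTEGER, Field2 INTEGER);
    INSERT INTO test_table VALUES (1, 10), (1, 10), (1, 20), (2, 30);
""")

# The INSERT reads table_stage2 and table_stage3, which both read table_stage1.
conn.execute("""
    WITH table_stage1(fld1, fld2) AS (
        SELECT Field1, Field2 FROM test_table
    ),
    table_stage2 AS (
        SELECT fld1, fld2 FROM table_stage1 GROUP BY fld1, fld2
    ),
    table_stage3 AS (
        SELECT fld1 FROM table_stage1 GROUP BY fld1
    )
    INSERT INTO test_table2 (Field1, Field2)
    SELECT t1.fld1, t1.fld2
    FROM table_stage2 t1
    JOIN table_stage3 t2 ON t1.fld1 = t2.fld1
""")

# Every distinct (fld1, fld2) pair from the source table ends up in the target.
inserted = sorted(conn.execute("SELECT Field1, Field2 FROM test_table2").fetchall())
print(inserted)
```

Whatever order the engine picks internally, the inserted rows are the ones defined by the complete results of the referenced segments.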

UNION ALL vs UNION for Update/Returning + Select?

I'm trying to turn a query idempotent by marking rows as updated. However, part of the query spec is to return the IDs of rows that matched the filter. I was thinking of doing something like the following:
WITH
prev as (
SELECT id
FROM books
WHERE id = any($1::uuid[])
AND updated
),
updated as (
UPDATE books
SET author = $2 || author, updated = true
WHERE id = any($1::uuid[])
AND not updated
RETURNING id
)
SELECT id FROM prev
UNION ALL
SELECT id FROM updated
I'm hoping to avoid the de-dupe step that comes from using UNION instead of UNION ALL, so I was wondering if the semantics of the operator guarantee that the 1st query does not see the results of the 2nd.
Related Qs:
Using CTEs for update + select:
Execution order for functions with side-effects: Does postgres union guarantee order of execution when invoking functions with side effects?
Manual de-duping: Update a table from a union select statement
The PostgreSQL WITH docs specify that the two CTEs will be executed concurrently and in the same snapshot, so the UNION ALL is safe to use.
The sub-statements in WITH are executed concurrently with each other and with the main query. Therefore, when using data-modifying statements in WITH, the order in which the specified updates actually happen is unpredictable. All the statements are executed with the same snapshot (see Chapter 13), so they cannot “see” one another's effects on the target tables. This alleviates the effects of the unpredictability of the actual order of row updates, and means that RETURNING data is the only way to communicate changes between different WITH sub-statements and the main query.

Primary key duplicate in a table-valued parameter in stored procedure

I am using the following code to insert data from a table-valued parameter in my SP. It works when there is one record in my TVP, but when there is more than one record it raises the following error:
Violation of PRIMARY KEY constraint 'PK_ReceivedCash'. Cannot insert duplicate key in object 'Banking.ReceivedCash'. The statement has been terminated.
insert into banking.receivedcash(ReceivedCashID,Date,Time)
select (select isnull(Max(ReceivedCashID),0)+1 from Banking.ReceivedCash),t.Date,t.Time from #TVPCash as t
Your query is indeed flawed if there is more than one row in #TVPCash. The subquery that retrieves the maximum ReceivedCashID returns a single constant value, which is then used for every row from #TVPCash that is inserted into Banking.ReceivedCash.
I strongly suggest finding alternatives rather than doing it this way. Multiple users might run this query and retrieve the same maximum. If you insist on keeping the query as it is, try running the following:
insert into banking.receivedcash(
ReceivedCashID,
Date,
Time
)
select
(select isnull(Max(ReceivedCashID),0) from Banking.ReceivedCash)+
ROW_NUMBER() OVER(ORDER BY t.Date,t.Time),
t.Date,
t.Time
from
#TVPCash as t
This uses ROW_NUMBER to count the row number in #TVPCash and adds this to the maximum ReceivedCashID of Banking.ReceivedCash.
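The difference between the two patterns can be checked with a small sketch. It assumes SQLite 3.25 or later (through Python's sqlite3 module) as a stand-in for SQL Server, so IFNULL replaces ISNULL, and received_cash and tvp_cash are hypothetical stand-ins for Banking.ReceivedCash and the table-valued parameter.

```python
import sqlite3

# Assumption: SQLite (3.25+ for window functions) stands in for SQL Server.
# Table names and sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE received_cash (ReceivedCashID INTEGER PRIMARY KEY, Date TEXT, Time TEXT);
    CREATE TABLE tvp_cash (Date TEXT, Time TEXT);
    INSERT INTO received_cash VALUES (7, '2020-01-01', '09:00');
    INSERT INTO tvp_cash VALUES ('2020-01-02', '10:00'), ('2020-01-02', '11:00');
""")

# Original pattern: the scalar subquery yields the same key for every row.
flawed_ids = [row[0] for row in conn.execute("""
    SELECT (SELECT IFNULL(MAX(ReceivedCashID), 0) + 1 FROM received_cash)
    FROM tvp_cash
""")]

# Suggested pattern: add a per-row ROW_NUMBER() to the current maximum.
fixed_ids = sorted(row[0] for row in conn.execute("""
    SELECT (SELECT IFNULL(MAX(ReceivedCashID), 0) FROM received_cash)
           + ROW_NUMBER() OVER (ORDER BY t.Date, t.Time)
    FROM tvp_cash AS t
"""))

print(flawed_ids)  # both rows get the same candidate key
print(fixed_ids)   # each row gets a distinct candidate key
```

The sketch only shows that the keys become distinct within one statement; it does nothing about two sessions reading the same maximum at the same time, which is the concurrency concern raised above.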

What is the execution order of a query with sub queries?

Consider this query
select *
from documents d
where exists (select 1 as [1]
from (
select *
from (
select *
from ProductMediaDocuments
where d.id = MediaDocuments_Id
) as [dummy1]
) as [s2]
where exists(
select *
from ProductSkus psk
where psk.Product_Id = s2.MediaProducts_Id
)
)
Could someone tell me how this is processed by SQL Server? When a statement appears in parentheses, it usually means it will be executed first. But does this also apply to the statement above? In this case I don't think so, because the subqueries need values from the outer queries. So, how does this work under the hood?
That's completely up to the database engine.
Since SQL is a declarative language, you specify WHAT you want, but the HOW is up to the DB engine, and it depends on many factors such as the presence, type, and fragmentation of indexes, row cardinality, and statistics.
That's just to mention a few; the list goes on.
Of course you can look at the execution plan, but the point is that you can't know HOW it will be executed just by reading the query.
The execution plan will tell you what the engine actually does. That is, the physical processing order. AFAIK, the query planner will rewrite your query if it finds a better way to express it to itself or the engine. If your question is, "Why is my query not working the way I think it should." then that is where you should start.
The doc says the logical processing order is:
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
It also has this note:
The [preceding] steps show the logical processing order, or binding order, for a SELECT statement. This order determines when the objects defined in one step are made available to the clauses in subsequent steps. For example, if the query processor can bind to (access) the tables or views defined in the FROM clause, these objects and their columns are made available to all subsequent steps. Conversely, because the SELECT clause is step 8, any column aliases or derived columns defined in that clause cannot be referenced by preceding clauses. However, they can be referenced by subsequent clauses such as the ORDER BY clause. Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list.
FROM would include inline views (subqueries) or CTE aliases. Each time it finds a subquery, it should start over from the beginning and evaluate that query.
I simplified your code a bit.
SELECT *
FROM documents d
WHERE EXISTS ( SELECT 1
FROM ProductMediaDocuments s2
WHERE d.id = MediaDocuments_Id
AND EXISTS (
SELECT *
FROM ProductSkus psk
WHERE psk.Product_Id = s2.MediaProducts_Id
)
)
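I think this code is clearer, don't you? Note that, unlike EXISTS, the joins return a document once per matching row, so a document can appear more than once.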
SELECT d.*
FROM documents d
JOIN ProductMediaDocuments s2 ON d.id = MediaDocuments_Id
JOIN ProductSkus psk ON psk.Product_Id = s2.MediaProducts_Id
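The EXISTS form and the JOIN form are not always interchangeable, which a small sketch can show. It assumes SQLite (through Python's sqlite3 module) as a stand-in for SQL Server and uses invented sample rows in which one document is attached to two products.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; the sample data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY);
    CREATE TABLE ProductMediaDocuments (MediaDocuments_Id INTEGER, MediaProducts_Id INTEGER);
    CREATE TABLE ProductSkus (Product_Id INTEGER);
    INSERT INTO documents VALUES (1), (2);
    -- Document 1 is attached to two products; document 2 to none.
    INSERT INTO ProductMediaDocuments VALUES (1, 100), (1, 200);
    INSERT INTO ProductSkus VALUES (100), (200);
""")

# EXISTS only asks whether at least one matching row exists.
exists_ids = [row[0] for row in conn.execute("""
    SELECT d.id
    FROM documents d
    WHERE EXISTS (SELECT 1
                  FROM ProductMediaDocuments s2
                  WHERE s2.MediaDocuments_Id = d.id
                    AND EXISTS (SELECT 1
                                FROM ProductSkus psk
                                WHERE psk.Product_Id = s2.MediaProducts_Id))
""")]

# JOIN produces one output row per matching combination.
join_ids = [row[0] for row in conn.execute("""
    SELECT d.id
    FROM documents d
    JOIN ProductMediaDocuments s2 ON d.id = s2.MediaDocuments_Id
    JOIN ProductSkus psk ON psk.Product_Id = s2.MediaProducts_Id
""")]

print(exists_ids)
print(join_ids)
```

If the JOIN form is preferred for readability, adding DISTINCT is one way to get back to one row per document, at the cost of an extra de-duplication step.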

nested SELECT statements interact in ways that I don't understand

I thought I understood how I can do a SELECT from the results of another SELECT statement, but there seems to be some sort of blurring of scope that I don't understand. I am using SQL Server 2008R2.
It is easiest to explain with an example.
Create a table with a single nvarchar column - load the table with a single text value and a couple of numbers:
CREATE TABLE #temptable( a nvarchar(30) );
INSERT INTO #temptable( a )
VALUES('apple');
INSERT INTO #temptable( a )
VALUES(1);
INSERT INTO #temptable( a )
VALUES(2);
select * from #temptable;
This will return: apple, 1, 2
Use IsNumeric to get only the rows of the table that can be cast to numeric - this will leave the text value apple behind. This works fine.
select cast(a as int) as NumA
from #temptable
where IsNumeric(a) = 1 ;
This returns: 1, 2
However, if I use that exact same query as an inner select, and try to do a numeric WHERE clause, it fails saying cannot convert nvarchar value 'apple' to data type int. How has it got the value 'apple' back??
select
x.NumA
from
(
select cast(a as int) as NumA
from #temptable
where IsNumeric(a) = 1
) x
where x.NumA > 1
;
Note that the failing query works just fine without the WHERE clause:
select
x.NumA
from
(
select cast(a as int) as NumA
from #temptable
where IsNumeric(a) = 1
) x
;
I find this very surprising. What am I not getting? TIA
If you take a look at the estimated execution plan you'll find that it has optimized the inner query into the outer and combined the WHERE clauses.
Using a CTE to isolate the operations works (in SQL Server 2008 R2):
declare @temptable as table ( a nvarchar(30) );
INSERT INTO @temptable( a )
VALUES ('apple'), ('1'), ('2');
with Numbers as (
select cast(a as int) as NumA
from @temptable
where IsNumeric(a) = 1
)
select * from Numbers
The reason you are getting this is fairly simple. When a query is executed, it goes through several steps: parse, algebrize, optimize, and compile.
The algebrize step resolves all the objects you need for this query. The optimize step uses these objects to create the best query plan, which is then compiled and executed.
So, when you look at the plan, you will see that it does a table scan on #temptable, and #temptable is defined the way you created it. That you do some computation on the column is a different matter: the column still has the nvarchar data type.
To know how this works, you have to know how a query is read. First all the objects are retrieved (FROM table, INNER JOIN table), then the predicates (WHERE, ON), then the grouping and such, then the selection of the columns (with the cast), and then the ORDER BY.
With that in mind, when you have a combination of selects, the optimizer will still process it that way. Since your select is subordinate to the FROM and JOIN parts of your query, the cast can be applied to rows you expected to be filtered out already, which is the reason for this error.
I hope that makes it a little clearer.
The optimizer is free to move expressions in the query plan in order to produce the most cost-efficient plan for retrieving the data (the evaluation order of the predicates is not guaranteed). I think using a CASE expression like the one below works because, in the absence of an ELSE clause, it produces NULL for non-numeric values and thus takes 'apple' out:
select a from #temptable where case when isnumeric(a) = 1 then a end > 1

How would I have a field updated sequentially based on a defined where clause order

I want to be able to update a field sequentially based on the order that it was put in the where in ('[john2].[john2]','[john3].[john3]','[john].[john]') clause but it does not appear to update based on its associated order (see SQL below).
How would I have a field updated sequentially based on a pre-defined where clause order?
John
drop sequence temp_seq;
create temp sequence temp_seq;
update gis_field_configuration_bycube
set seq_in_grid = nextval('temp_seq')
where cube = 'Instruments' and level_unique_name in ('[john2].[john2]','[john3].[john3]','[john].[john]');
The order is irrelevant in an IN() construct because it defines a set, not a list of values in a strict sense.
The VALUES clause is what should be used instead.
Also, assuming Postgres 8.4 or better, row_number() would be less cumbersome than creating a temporary sequence.
Here's what should work:
update gis_field_configuration_bycube
set seq_in_grid = rn
from
(select row_number() over () as rn , string from
(values ('[john2].[john2]'),('[john3].[john3]'),('[john].[john]')) as v(string)
) as subquery
where cube = 'Instruments'
and level_unique_name=subquery.string;
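A variant of the same idea writes the position next to each value instead of deriving it with row_number() over (), since a window without ORDER BY does not formally promise any row order. The sketch below is an illustration under assumptions, not the answer's exact statement: it runs on SQLite through Python's sqlite3 module rather than PostgreSQL, uses a correlated subquery instead of UPDATE ... FROM, and the CTE name wanted and the sample rows are hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gis_field_configuration_bycube (
        cube TEXT, level_unique_name TEXT, seq_in_grid INTEGER);
    INSERT INTO gis_field_configuration_bycube (cube, level_unique_name) VALUES
        ('Instruments', '[john].[john]'),
        ('Instruments', '[john2].[john2]'),
        ('Instruments', '[john3].[john3]'),
        ('Other',       '[john].[john]');
""")

# 'wanted' (hypothetical name) carries an explicit ordinal for each value,
# so the sequence does not depend on the order in which rows are scanned.
conn.execute("""
    WITH wanted(ord, name) AS (
        VALUES (1, '[john2].[john2]'), (2, '[john3].[john3]'), (3, '[john].[john]')
    )
    UPDATE gis_field_configuration_bycube
    SET seq_in_grid = (SELECT ord FROM wanted
                       WHERE wanted.name = gis_field_configuration_bycube.level_unique_name)
    WHERE cube = 'Instruments'
      AND level_unique_name IN (SELECT name FROM wanted)
""")

result = dict(conn.execute("""
    SELECT level_unique_name, seq_in_grid
    FROM gis_field_configuration_bycube
    WHERE cube = 'Instruments'
"""))
untouched = conn.execute("""
    SELECT seq_in_grid FROM gis_field_configuration_bycube WHERE cube = 'Other'
""").fetchone()[0]
print(result, untouched)
```

In PostgreSQL the same ordinals could be listed as a second column in the VALUES clause of the statement above, replacing the row_number() call.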