Does inner join effect order by? - postgresql

I have a function a() which gives result in a specific order.
I want to do:
select final.*,tablex.name
from a() as final
inner join tablex on (a.key=tablex.key2)
My question is, can I guarantee that the join won't effect the order of rows as a() set it?
a() is:
select ....
from....
joins...
order by x,y,z

The short version:
The order of rows returned by a SQL query is not guaranteed in any way unless you use an order by
Any order you see without an order by is pure coincidence and can not be relied upon.
So how did I always get the correct order so far? when I did Select * from a()
If your function is a SQL function, then the query inside the function is executed "as is" (it's essentially "inlined") so you only run a single query that does have an order by. If it's a PL/pgSQL function and the only thing it does is a RETURN QUERY ... then you again only have a single query that is executed which does have an order by.
Assuming you do use a SQL function, then running:
select final.*,tablex.name
from a() as final
join tablex on a.key=tablex.key2
is equivalent to:
select final.*,tablex.name
from (
-- this is your query inside the function
select ...
from ...
join ...
order by x,y,z
) as final
join tablex on a.key=tablex.key2;
In this case the order by inside the derived table doesn't make sense as it might be "overruled" by an overall order by statement. In fact some databases would outright reject this query (and I sometime wish Postgres would do as well).
Without an order by on the **overall* query, the database is free to choose any order of rows that it wants.
So to get back to the initial question:
can I guarantee that the join won't effect the order of rows as a() set it?
The answer to that is a clear: NO - the order of the rows for that query is in no way guaranteed. If you need an order that you can rely on, you have to specify an order by.
I would even go so far to remove the order by from the function - what if someone runs: select * from a() order by z,y,x - I don't think Postgres will be smart enough to remove the order by inside the function.

Related

What is the execution order of a query with sub queries?

Consider this query
select *
from documents d
where exists (select 1 as [1]
from (
select *
from (
select *
from ProductMediaDocuments
where d.id = MediaDocuments_Id
) as [dummy1]
) as [s2]
where exists(
select *
from ProductSkus psk
where psk.Product_Id = s2.MediaProducts_Id
)
)
Could someone tell me how this is being processed by SQL Server? When statements appears in parentheses, this means it will execute first. But does this also apply for the above statement? In this case I don't think so, because the sub queries needs values of outer queries. So, how does this works under the hood?
That's completely up to the database engine.
Since SQL is a declarative language, you specify WHAT you want, but the HOW part is up to the DB Engine and it really depends on many factors like indexes presence, type, fragmentation; row cardinality, statistics.
That's just to mention few, because the list can goes on.
Of course you can look to the execution plan but the point is that you can't know HOW it will be executed just reading the query.
The execution plan will tell you what the engine actually does. That is, the physical processing order. AFAIK, the query planner will rewrite your query if it finds a better way to express it to itself or the engine. If your question is, "Why is my query not working the way I think it should." then that is where you should start.
The doc says the logical processing order is:
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
It also has this note:
The [preceding] steps show the logical processing order, or binding order, for a SELECT statement. This order determines when the objects defined in one step are made available to the clauses in subsequent steps. For example, if the query processor can bind to (access) the tables or views defined in the FROM clause, these objects and their columns are made available to all subsequent steps. Conversely, because the SELECT clause is step 8, any column aliases or derived columns defined in that clause cannot be referenced by preceding clauses. However, they can be referenced by subsequent clauses such as the ORDER BY clause. Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list.
FROM would include inline views (subqueries) or CTE aliases. Each time it finds a subquery, it should start over from the beginning and evaluate that query.
I simplified your code a bit.
SELECT *
FROM documents d
WHERE EXISTS ( SELECT 1
FROM ProductMediaDocuments s2
WHERE d.id = MediaDocuments_Id
AND EXISTS (
SELECT *
FROM ProductSkus psk
WHERE psk.Product_Id = s2.MediaProducts_Id
)
)
I think this code is clearer don't you??
SELECT d.*
FROM documents d
JOIN ProductMediaDocuments s2 ON d.id = MediaDocuments_Id
JOIN ProductSkus psk ON psk.Product_Id = s2.MediaProducts_Id

How to specify two expressions in the select list when the subquery is not introduced with EXISTS

I have a query that uses a subquery and I am having a problem returning the expected results. The error I receive is..."Only one expression can be specified in the select list when the subquery is not introduced with EXISTS." How can I rewrite this to work?
SELECT
a.Part,
b.Location,
b.LeadTime
FROM
dbo.Parts a
LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
b.Location IN ('A','B','C')
AND
Date IN (SELECT Location, MAX(Date) FROM dbo.Vendor GROUP BY Location)
GROUP BY
a.Part,
b.Location,
b.LeadTime
ORDER BY
a.Part
I think something like this may be what you're looking for. You didn't say what version of SQL Server--this works in SQL 2005 and up:
SELECT
p.Part,
p.Location, -- from *p*, otherwise if no match we'll get a NULL
v.LeadTime
FROM
dbo.Parts p
OUTER APPLY (
SELECT TOP (1) * -- * here is okay because we specify columns outside
FROM dbo.Vendor v
WHERE p.Location = v.Location -- the correlation part
ORDER BY v.Date DESC
) v
WHERE
p.Location IN ('A','B','C')
ORDER BY
p.Part
;
Now, your query can be repaired as is by adding the "correlation" part to change your query into a correlated subquery as demonstrated in Kory's answer (you'd also remove the GROUP BY clause). However, that method still requires an additional and unnecessary join, hurting performance, plus it can only pull one column at a time. This method allows you to pull all the columns from the other table, and has no extra join.
Note: this gives logically the same results as Lamak's answer, however I prefer it for a few reasons:
When there is an index on the correlation columns (Location, here) this can be satisfied with seeks, but the Row_Number solution has to scan (I believe).
I prefer the way this expresses the intent of the query more directly and succinctly. In the Row_Number method, one must get out to the outer condition to see that we are only grabbing the rn = 1 values, then bop back into the CTE to see what that is.
Using CROSS APPLY or OUTER APPLY, all the other tables not involved in the single-inner-row-per-outer-row selection are outside where (to me) they belong. We aren't squishing concerns together. Using Row_Number feels a bit like throwing a DISTINCT on a query to fix duplication rather than dealing with the underlying issue. I guess this is basically the same issue as the previous point worded in a different way.
The moment you have TWO tables from which you wish to pull the most recent value, the Row_Number() solution blows up completely. With this syntax, you just easily add another APPLY clause, and it's crystal clear what you're doing. There is a way to use Row_Number for the multiple tables scenario by moving the other tables outside, but I still don't prefer that syntax.
Using this syntax allows you to perform additional joins based on whether the selected row exists or not (in the case that no matching row was found). In the Row_Number solution, you can only reasonably do that NOT NULL checking in the outer query--so you are forced to split up the query into multiple, separated parts (you don't want to be joining to values you will be discarding!).
P.S. I strongly encourage you to use aliases that hint at the table they represent. Please don't use a and b. I used p for Parts and v for Vendor--this helps you and others make sense of the query more quickly in the future.
If I understood you corrrectly, you want the rows with the max date for locations A, B and C. Now, assuming SQL Server 2005+, you can do this:
;WITH CTE AS
(
SELECT
a.Part,
b.Location,
b.LeadTime,
RN = ROW_NUMBER() OVER(PARTITION BY a.Part ORDER BY [Date] DESC)
FROM
dbo.Parts a
LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
b.Location IN ('A','B','C')
)
SELECT Part,
Location,
LeadTime
FROM CTE
WHERE RN = 1
ORDER BY Part
In your subquery you need to correlate the Location and Part to the outer query.
Example:
Date = (SELECT MAX(Date)
FROM dbo.Vender v
WHERE v.Location = b.Location
AND v.Part = b.Part
)
So this will bring back one date for each location and part

Calling a function on distinct values from table

I've got a SQL Server 2005 database. I need to get distinct values in addition to calling a function on those distinct values. I'm not sure how the distinct works when there is a function call involved. For example, I have this query:
SELECT DISTINCT a, b, c, fcn_DoSomething(a, b, c) AS z FROM users
I'm guessing that the function (fcn_DoSomething) is being called for all of the values in the table, not the distinct values. Am I correct? If so, how can I write the query to call the function only on distinct values of a,b,c? I know one option is to use a temporary table, but if anyone has better ideas that would be great.
Thanks
This got me curious, so I did a bit of basic testing. I created a small table with some distinct and some repeating values, a function that just does string concatenation, and then looked at the execution plans for:
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select distinct cola, colb, dbo.sillyfunc(cola, colb)
from distincttest
--Clear the cache
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select cola, colb, dbo.sillyfunc(cola, colb)
from (select distinct cola, colb from distincttest) as t
In this case, the execution plans showed clearly that the first one ran the concatenation function for every single row, but the second did the sort for distinct values first, then ran the function. But for a small number of rows, they had the same execution time, and when run together they showed each one using 50% of the total query resources.
So, I added a few hundred thousand repeating rows. and tried again. This changed the query plan so it was doing a hash match to get distinctness rather than the former sort, and now the second version which forced it to select for distinctness first executed more than ten times faster.
Finally, I thought there was a chance that this might just be because SQL Server had my sillyfunc marked as nondeterministic (select OBJECTPROPERTYEX(object_id('dbo.sillyfunc'), 'isdeterministic') returned 0), so I switched to patindex which was a builtin function and considered deterministic. This gave me the same results with the function being called for every row in the first version and just for the few distinct ones in the second version.
So, its possible that further testing would find situations that would coax the optimizer to do something more sophisticated, but it appears that if you want to apply the distinct before the function is called then you need to use something like a subquery, CTE, or temp table to limit what the function has access to.
This would ensure that the function only got called on distinct values.
select *, fcn_DoSomething(a, b, c)
from
(select distinct a,b,c FROM users) v
However, I believe that the function call will be optimised, so it may not make a difference. Give it a try.

PostgreSQL array_agg order

Table 'animals':
animal_name animal_type
Tom Cat
Jerry Mouse
Kermit Frog
Query:
SELECT
array_to_string(array_agg(animal_name),';') animal_names,
array_to_string(array_agg(animal_type),';') animal_types
FROM animals;
Expected result:
Tom;Jerry;Kerimt, Cat;Mouse;Frog
OR
Tom;Kerimt;Jerry, Cat;Frog;Mouse
Can I be sure that order in first aggregate function will always be the same as in second.
I mean I would't like to get:
Tom;Jerry;Kermit, Frog;Mouse,Cat
Use an ORDER BY, like this example from the manual:
SELECT array_agg(a ORDER BY b DESC) FROM table;
If you are on a PostgreSQL version < 9.0 then:
From: http://www.postgresql.org/docs/8.4/static/functions-aggregate.html
In the current implementation, the order of the input is in principle unspecified. Supplying the input values from a sorted subquery will usually work, however. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
So in your case you would write:
SELECT
array_to_string(array_agg(animal_name),';') animal_names,
array_to_string(array_agg(animal_type),';') animal_types
FROM (SELECT animal_name, animal_type FROM animals) AS x;
The input to the array_agg would then be unordered but it would be the same in both columns. And if you like you could add an ORDER BY clause to the subquery.
According to Tom Lane:
... If I read it right, the OP wants to be sure that the two aggregate functions will see the data in the *same* unspecified order. I think that's a pretty safe assumption. The server would have to go way out of its way to do differently, and it doesn't.
... So it is documented behavior that an aggregate without its own ORDER BY will see the rows in whatever order the FROM clause supplies them.
So I think it's fine to assume that all the aggregates, none of which uses ORDER BY, in your query will see input data in the same order. The order itself is unspecified though (which depends on the order the FROM clause supplies rows).
Source: PostgreSQL mailing list
Do this:
SELECT
array_to_string(array_agg(animal_name order by animal_name),';') animal_names,
array_to_string(array_agg(animal_type order by animal_type),';') animal_types
FROM
animals;

DB2 Query Structure Using User-Defined Function as a Table

I'm a little new to DB2, and am having trouble developing a query. I have created a user-defined function that returns a table of data which I want to then join and select from in larger select statement. I'm working on a sensitive db, so the query below isn't what I'm literally running, but it's almost exactly like it (without the other 10 joins I have to do lol).
select
A.customerId,
A.firstname,
A.lastname,
B.orderId,
B.orderDate,
F.currentLocationDate,
F.currentLocation
from
customer A
INNER JOIN order B
on A.customerId = B.customerId
INNER JOIN table(getShippingHistory(B.customerId)) as F
on B.orderId = F.orderId
where B.orderId = 35
This works great if I run this query without the where clause (or some other where clause that doesn't check for an ID). When I include the where clause, I get the following error:
Error during Prepare 58004(-901)[IBM][CLI Driver][DB2/LINUXX8664]
SQL0901N The SQL statement failed because of a non-severe system
error. Subsequent SQL statements can be processed. (Reason "Bad Plan;
Unresolved QNC found".) SQLSTATE=58004
I have tracked the issue down to fact that I'm using one of join criteria for the parameters (B.customerId). I have validated this fact by replacing B.customerId with a valid customerId, and the query works great. Problem is, I don't know the customerId when calling this query. I know only the orderId (in this example).
Any thoughts on how to restructure this so I can make only 1 call to get all the info? I know the plan is the problem b/c the customerId isn't getting resolved before the function is called.
So if I understand correctly, the function getShippingHistory(customerId) returns a table.
And if you call it with a single customer Id that table gets joined in your query above no problem at all.
But the way you have the query written above, you are asking db2 to call the function for every row returned by your query (i.e. every b.customerId that matches your join and where conditions).
So I'm not sure what behaviour you are expecting, because what you're asking for is a table back for every row in your query, and db2 (nor I) can figure out what the result is supposed to look like.
So in terms of restructuring your query, think about how you can change the getShippingHistory logic when multiple customer Ids are involved.
i found the best solution (given the current query structure) is to use a LEFT join instead of an INNER join in order force the LEFT part of the join to happen which will resolve the customerId to a value by the time it gets to the function call.
select
A.customerId,
A.firstname,
A.lastname,
B.orderId,
B.orderDate,
F.currentLocationDate,
F.currentLocation
from
customer A
INNER JOIN order B
on A.customerId = B.customerId
LEFT JOIN table(getShippingHistory(B.customerId)) as F
on B.orderId = F.orderId
where B.orderId = 35