My 'VIEW' is ordered. But when I use a select to show its result, the output is shuffled.
This behavior also happens with the following statement
SELECT * FROM (SELECT * FROM TABLE ORDER BY COLUMNA) AS DERIVEDTABLE
How to prevent this shuffling?
Unfortunately, views can't be ordered in SQL Server. You have to do it on the select statement reading the view.
Lame, I know.
you must only have an ORDER BY on the outermost select that uses the view:
SELECT * FROM (SELECT * FROM TABLE) AS DERIVEDTABLE ORDER BY YourColumn
unless you use TOP
SELECT * FROM (SELECT top 1000000 * FROM TABLE ORDER BY COLUMNA) AS DERIVEDTABLE
however the final sort order could still change depending on the outer query, unless you add another order by:
SELECT * FROM (SELECT top 1000000 * FROM TABLE ORDER BY COLUMNA) AS DERIVEDTABLE
ORDER BY COLUMNA
You cannot. The only way to specify an order is to ask for it at the outermost level of the query - that's the only place where ORDER BY is intended to order the final result set. Any other use of ORDER BY is only intended to assist with other operations (such as to define what the "TOP 10" are when using TOP).
SQL Server 2000 used to be able to be tricked into applying an ORDER BY in a view, and unfortunately, the view designers in Enterprise Manager and SSMS continue to pretend that this works. It doesn't. By not allowing the trick to work any longer, this presumably offers more opportunities for query optimization.
Related
I have the following view:
CREATE OR REPLACE VIEW {accountId}.last_assets AS
SELECT DISTINCT ON (coin) *
from {accountId}.{tableAssetsName}
ORDER BY coin, ts DESC;
The view returns, for each coin, the latest update.
and, in an unrelated question, a comment was:
Note: DISTINCT is always a red flag. (almost always)
I read about this and it seems to have a performance issue with Postgres. In my scenario, I don't have any issues with performance, but I would like to understand:
how could such a query be rewritten to not use DISTINCT then?
There is a substantial difference between DISTINCT and the Postgres proprietary DISTINCT ON () operator.
Does whoever wrote that "DISTINCT is always a red flag" actually know the difference between DISTINCT and DISTINCT ON in Postgres?
The problem that your view solves, is known as greatest-n-per-group and in Postgres distinct on is typically more efficient than the alternatives.
The problem can be solved with e.g. window functions:
CREATE OR REPLACE VIEW {accountId}.last_assets AS
SELECT ....
FROM (
SELECT *,
row_number() over (partition by coin order by ts desc) as rn
from {accountId}.{tableAssetsName}
) t
WHERE rn = 1;
But I wouldn't be surprised if that is actually slower than you current solution.
Is there a shorter way to filter a cte on rown_num = 1 rather than an external where clause? I vaguely recall doing this in teradata with a 'qualify' statement. Is there a less code way I can use in Postgres?
with
first_touch as (
select
l.session_id as last_session,
l.client_id,
f.session_id as first_session,
row_number() over(partition by f.client_id order by f.date asc) as rn
from ga_marketing.sessions l
join ga_marketing.sessions f on l.client_id = f.client_id
where l.date between '2021-06-01' and '2021-06-11'
)
select *
from first_touch
where rn = 1
I would rather somehow filter within the cte for rn=1 rather than outside. Is this possible? Is there a shorter way to write what I want?
You need at least two levels of select to do that with a Window function. You could do both levels in the CTE if you wanted and then have a third dummy select outside the CTE, but I don't see what the point of that would be, other than to make the dummy select appear cleaner (no WHERE clause, no column "rn"). That part would get shorter, but the CTE would get longer. Or you could just do away with the CTE altogether and write nested queries directly, which I guess would be "shorter" in the number keystrokes used. Or you could encapsulate different fragments into a view, to hide some of the levels from sight.
I think you could also write this using DISTINCT ON or JOIN LATERAL, rather than using a window function, but that doesn't seem to be what you are asking.
Consider this query
select *
from documents d
where exists (select 1 as [1]
from (
select *
from (
select *
from ProductMediaDocuments
where d.id = MediaDocuments_Id
) as [dummy1]
) as [s2]
where exists(
select *
from ProductSkus psk
where psk.Product_Id = s2.MediaProducts_Id
)
)
Could someone tell me how this is being processed by SQL Server? When statements appears in parentheses, this means it will execute first. But does this also apply for the above statement? In this case I don't think so, because the sub queries needs values of outer queries. So, how does this works under the hood?
That's completely up to the database engine.
Since SQL is a declarative language, you specify WHAT you want, but the HOW part is up to the DB Engine and it really depends on many factors like indexes presence, type, fragmentation; row cardinality, statistics.
That's just to mention few, because the list can goes on.
Of course you can look to the execution plan but the point is that you can't know HOW it will be executed just reading the query.
The execution plan will tell you what the engine actually does. That is, the physical processing order. AFAIK, the query planner will rewrite your query if it finds a better way to express it to itself or the engine. If your question is, "Why is my query not working the way I think it should." then that is where you should start.
The doc says the logical processing order is:
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
It also has this note:
The [preceding] steps show the logical processing order, or binding order, for a SELECT statement. This order determines when the objects defined in one step are made available to the clauses in subsequent steps. For example, if the query processor can bind to (access) the tables or views defined in the FROM clause, these objects and their columns are made available to all subsequent steps. Conversely, because the SELECT clause is step 8, any column aliases or derived columns defined in that clause cannot be referenced by preceding clauses. However, they can be referenced by subsequent clauses such as the ORDER BY clause. Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list.
FROM would include inline views (subqueries) or CTE aliases. Each time it finds a subquery, it should start over from the beginning and evaluate that query.
I simplified your code a bit.
SELECT *
FROM documents d
WHERE EXISTS ( SELECT 1
FROM ProductMediaDocuments s2
WHERE d.id = MediaDocuments_Id
AND EXISTS (
SELECT *
FROM ProductSkus psk
WHERE psk.Product_Id = s2.MediaProducts_Id
)
)
I think this code is clearer don't you??
SELECT d.*
FROM documents d
JOIN ProductMediaDocuments s2 ON d.id = MediaDocuments_Id
JOIN ProductSkus psk ON psk.Product_Id = s2.MediaProducts_Id
I have a query which run extremely slow when "checking is_member" in comparison to just loading the whole dataset. This view acts as a security check, it checks if you are a member of a particular group - ie group 1, then the next column will state what access it has - ie division 2.
This view then is joined with the Fact table, so that it will only retrieve division 2 rows.
The question is, does the is_member execute for each line of Fact data? Just my theory because it runs 1000 times faster without this view. And if anyone can suggest an alternative structure?
WITH group_security AS (SELECT DISTINCT division_cod FROM dbo.dim_group_security_division AS gsd
WHERE (IS_MEMBER(group_name) = 1))
SELECT dbo.dim_division.dim_division_key, dbo.dim_division.division_ID, dbo.dim_division.division_code, dbo.dim_division.division_name
FROM dbo.dim_division INNER JOIN
group_security ON dbo.dim_division.division_code = group_security.division_code OR group_security.division_code = 'ALL'
Since you JOIN on dbo.dim_division.division_code, do you have an index on this column?
Alternatively you could give this a try:
SELECT dim.dim_division_key,
dim.division_ID,
dim.division_code,
dim.division_name
FROM dim_division dim
WHERE EXISTS ( SELECT *
FROM dbo.dim_group_security_division gsd
WHERE gsd.division_code IN ('ALL', dim.division_code)
AND IS_MEMBER(gsd.group_name) = 1 )
This way the system can stop at the first 'match' in dim_group_security_division instead of having to find all and then aggregate the result because of the DISTINCT.
In this case, it might also be useful to have an index on gsd.division_code to speed things up a bit.
I have a query that uses a subquery and I am having a problem returning the expected results. The error I receive is..."Only one expression can be specified in the select list when the subquery is not introduced with EXISTS." How can I rewrite this to work?
SELECT
a.Part,
b.Location,
b.LeadTime
FROM
dbo.Parts a
LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
b.Location IN ('A','B','C')
AND
Date IN (SELECT Location, MAX(Date) FROM dbo.Vendor GROUP BY Location)
GROUP BY
a.Part,
b.Location,
b.LeadTime
ORDER BY
a.Part
I think something like this may be what you're looking for. You didn't say what version of SQL Server--this works in SQL 2005 and up:
SELECT
p.Part,
p.Location, -- from *p*, otherwise if no match we'll get a NULL
v.LeadTime
FROM
dbo.Parts p
OUTER APPLY (
SELECT TOP (1) * -- * here is okay because we specify columns outside
FROM dbo.Vendor v
WHERE p.Location = v.Location -- the correlation part
ORDER BY v.Date DESC
) v
WHERE
p.Location IN ('A','B','C')
ORDER BY
p.Part
;
Now, your query can be repaired as is by adding the "correlation" part to change your query into a correlated subquery as demonstrated in Kory's answer (you'd also remove the GROUP BY clause). However, that method still requires an additional and unnecessary join, hurting performance, plus it can only pull one column at a time. This method allows you to pull all the columns from the other table, and has no extra join.
Note: this gives logically the same results as Lamak's answer, however I prefer it for a few reasons:
When there is an index on the correlation columns (Location, here) this can be satisfied with seeks, but the Row_Number solution has to scan (I believe).
I prefer the way this expresses the intent of the query more directly and succinctly. In the Row_Number method, one must get out to the outer condition to see that we are only grabbing the rn = 1 values, then bop back into the CTE to see what that is.
Using CROSS APPLY or OUTER APPLY, all the other tables not involved in the single-inner-row-per-outer-row selection are outside where (to me) they belong. We aren't squishing concerns together. Using Row_Number feels a bit like throwing a DISTINCT on a query to fix duplication rather than dealing with the underlying issue. I guess this is basically the same issue as the previous point worded in a different way.
The moment you have TWO tables from which you wish to pull the most recent value, the Row_Number() solution blows up completely. With this syntax, you just easily add another APPLY clause, and it's crystal clear what you're doing. There is a way to use Row_Number for the multiple tables scenario by moving the other tables outside, but I still don't prefer that syntax.
Using this syntax allows you to perform additional joins based on whether the selected row exists or not (in the case that no matching row was found). In the Row_Number solution, you can only reasonably do that NOT NULL checking in the outer query--so you are forced to split up the query into multiple, separated parts (you don't want to be joining to values you will be discarding!).
P.S. I strongly encourage you to use aliases that hint at the table they represent. Please don't use a and b. I used p for Parts and v for Vendor--this helps you and others make sense of the query more quickly in the future.
If I understood you corrrectly, you want the rows with the max date for locations A, B and C. Now, assuming SQL Server 2005+, you can do this:
;WITH CTE AS
(
SELECT
a.Part,
b.Location,
b.LeadTime,
RN = ROW_NUMBER() OVER(PARTITION BY a.Part ORDER BY [Date] DESC)
FROM
dbo.Parts a
LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
b.Location IN ('A','B','C')
)
SELECT Part,
Location,
LeadTime
FROM CTE
WHERE RN = 1
ORDER BY Part
In your subquery you need to correlate the Location and Part to the outer query.
Example:
Date = (SELECT MAX(Date)
FROM dbo.Vender v
WHERE v.Location = b.Location
AND v.Part = b.Part
)
So this will bring back one date for each location and part