What does CASE WHEN EXISTS (select * from table) do? Is there a better way to write this? - tsql

I need to resolve an issue with performance of the following query (not my code). It's a section of a larger stored procedure. This particular part is taking a very long time to execute (I just cancelled it after 20 mins) and I don't understand the intent of the case statement in the code. My research to date has only shown examples of checking for a particular column using schema tables, or checking if any records exists using select 1...
This is doing select *. Can someone help me to understand the case when exists subquery part of this query and if there is a better (more efficient in terms of time to execute) way to write this?
INSERT INTO [PaymentStage06]
SELECT
CASE
WHEN EXISTS (SELECT *
FROM [transaction] AS [MPT])
THEN 'Y'
ELSE 'N'
END AS [DuplicateReleased],
[MPT].[participantid] AS [ParticipantID],
[MPT].[courseid] AS [CourseID],
[MPT].[session] AS [SessionID],
[MPT].[feedbackcompletedid] AS [TransactionID],
GETDATE() AS [Refreshed]
FROM
[transaction] AS [MPT]
Many thanks.
Andrew
I've looked for examples of "case when exists(select *" but haven't found anything concrete. My guess is it's either checking for the existence of ANY record in that table (in which case why not just select 1), OR it's checking for a match based on all columns, but both of those seem pointless when the outer query is working on the same table. I'm obviously missing the point :)
This is the estimated execution plan https://www.brentozar.com/pastetheplan/?id=H1VMU7Goj
According to the execution plan, it is performing a 'left semi join' operation between the inner and outer query. This appears to be like an intersect query... but still pointless because it's working on the 1 table.

Related

PostgreSQL: what's wrong with first_value(unique_column) OVER ()?

Pursuant to PostgreSQL: detecting the first/last rows of result set, I've been given reason to suspect that such a clause is dangerous or otherwise inappropriate, and want to understand that better. Take:
SELECT last_value(unique_column) OVER (), * FROM mytable;
unique_column is unique and not null. So what's wrong with using OVER () in this way? Is it dangerous/unreliable? Suboptimal? From what I can tell, this should return the value from the last row in the result set—at least, it has when I've tried it. I've been told that "last" doesn't make sense without sorting, but clearly there is a last row that is returned. I've also been told that OVER () means "anything goes", which suggests that the results are unreliable, but so far, every time I've run that kind of query, I've been consistently given the value from the end of the result set.
Now I have found a problem if I use ORDER BY:
SELECT last_value(unique_column) OVER (), * FROM mytable ORDER BY something_else;
But, my solution to that is to subquery:
SELECT last_value(unique_column) OVER (), * FROM (SELECT * FROM mytable ORDER BY something_else) sub;
It's as if OVER () means the analytic functions (like first_value() and last_value()) operate according to the order in which the engine happens the read the table/subquery. And, from what I can tell, you have enough control over the order in which the engine happens to read the table/subquery (without having to do unnecessary sorting).
I'm running PostgreSQL 9.6 in a Debian 9.5 environment.
You should provide ORDER BY inside OVER clause:
SELECT *,
last_value(unique_column)
OVER (ORDER BY sth_else ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM mytable
I should point out that in the last few months, this solution has worked out rather well, and I've not been shown an alternative, so I'm going to continue using it. However, I should point out that it is finicky and can fail if you make certain changes and do not take the analytics into consideration. (No doubt, I'm misusing the feature and it was not developed for this purpose). So I'll use this space to record the gotchas as I find them.
If you order your results, you've got a problem, but I've already explained that in the question.
I tried to use it in an outer join. Since this caused fields in the result set to be null (even though they are taken from fields in a table that cannot be null) this caused OVER() to return NULL. I have a few ideas about how to get around this, but they would make the query very ugly and possibly very inefficient.

ERROR: syntax error at or near "group"

Hello I'm writing an sql query But i am getting a syntax error on the line with the GROUP BY. What can possibly be the problem, help if you can please.
UPDATE intersection_points i
SET nbr_victimes = sum(tue+bl+bg)
FROM accident_ma a ,intersection_points i
WHERE (ST_DWithin(i.st_intersection,a.geom_acc, 10000) group by st_intersection)) ;
GROUP BY is its own clause, it's not part of a WHERE clause.
This is what you have:
WHERE (
ST_DWithin(i.st_intersection,a.geom_acc, 10000)
group by st_intersection
)
This is what you need:
WHERE ST_DWithin(i.st_intersection,a.geom_acc, 10000)
group by st_intersection
Edit: In response to comments, it sounds like your JOIN is a bit more complex than the UPDATE ... FROM syntax would need. Take a look at the "Notes" section on this page:
When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the from_list, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.
Because of this indeterminacy, referencing other tables only within sub-selects is safer, though often harder to read and slower than using a join.
Normally this would involve changing the syntax to something like:
UDPATE SomeTable
SET SomeColumn = 'Some Value'
WHERE AnotherColumn =
(SELECT AnotherColumn
FROM AnotherTable
-- etc.)
However, the use of ST_DWithin() in this query may complicate that quite a bit. Without much deeper knowledge of the table structures, relationships, and overall intent of this update there probably isn't much more help I can give. Essentially you're going to need to clarify for the database exactly what records need to be updated and how to update them, which may involve changing your query to this latter sub-select syntax in some way.
I don' t understand your data structure. I create the following tables from your query. Please check table structure.
if table's structure is this
your query must be
UPDATE intersection_points SET nbr_victimes = (SELECT SUM(a.tue+a.bl+a.bg) FROM accident_ma a WHERE st_dwithin(st_intersection, a.geom_acc, 1000));

Improve query oracle

How can I modify this query to improve it?
I think that doing a join It'd be better.
UPDATE t1 HIJA
SET IND_ESTADO = 'P'
WHERE IND_ESTADO = 'D'
AND NOT EXISTS
(SELECT COD_OPERACION
FROM t1 PADRE
WHERE PADRE.COD_SISTEMA_ORIGEN = HIJA.COD_SISTEMA_ORIGEN
AND PADRE.COD_OPERACION = HIJA.COD_OPERACION_DEPENDIENTE)
Best regards.
According to this article by Quassnoi:
Oracle's optimizer is able to see that NOT EXISTS, NOT IN and LEFT JOIN / IS NULL are semantically equivalent as long as the list values are declared as NOT NULL.
It uses same execution plan for all three methods, and they yield same results in same time.
In Oracle, it is safe to use any method of the three described above to select values from a table that are missing in another table.
However, if the values are not guaranteed to be NOT NULL, LEFT JOIN / IS NULL or NOT EXISTS should be used rather than NOT IN, since the latter will produce different results depending on whether or not there are NULL values in the subquery resultset.
So what you have is already fine. A JOIN would be as good, but not better.
If performance is a problem, there are several guidelines for re-writing a where not exists into a more efficient form:
When given the choice between not exists and not in, most DBAs prefer to use the not exists clause.
When SQL includes a not in clause, a subquery is generally used, while with not exists, a correlated subquery is used.
In many case a NOT IN will produce the same execution plan as a NOT EXISTS query or a not equal query (!=).
In some case a correlated NOT EXISTS subquery can be re-written with a standard outer join with a NOT NULL test.
Some NOT EXISTS subqueries can be tuned using the MINUS operator.
See Burleson for more information.

What is the execution order of a query with sub queries?

Consider this query
select *
from documents d
where exists (select 1 as [1]
from (
select *
from (
select *
from ProductMediaDocuments
where d.id = MediaDocuments_Id
) as [dummy1]
) as [s2]
where exists(
select *
from ProductSkus psk
where psk.Product_Id = s2.MediaProducts_Id
)
)
Could someone tell me how this is being processed by SQL Server? When statements appears in parentheses, this means it will execute first. But does this also apply for the above statement? In this case I don't think so, because the sub queries needs values of outer queries. So, how does this works under the hood?
That's completely up to the database engine.
Since SQL is a declarative language, you specify WHAT you want, but the HOW part is up to the DB Engine and it really depends on many factors like indexes presence, type, fragmentation; row cardinality, statistics.
That's just to mention few, because the list can goes on.
Of course you can look to the execution plan but the point is that you can't know HOW it will be executed just reading the query.
The execution plan will tell you what the engine actually does. That is, the physical processing order. AFAIK, the query planner will rewrite your query if it finds a better way to express it to itself or the engine. If your question is, "Why is my query not working the way I think it should." then that is where you should start.
The doc says the logical processing order is:
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
It also has this note:
The [preceding] steps show the logical processing order, or binding order, for a SELECT statement. This order determines when the objects defined in one step are made available to the clauses in subsequent steps. For example, if the query processor can bind to (access) the tables or views defined in the FROM clause, these objects and their columns are made available to all subsequent steps. Conversely, because the SELECT clause is step 8, any column aliases or derived columns defined in that clause cannot be referenced by preceding clauses. However, they can be referenced by subsequent clauses such as the ORDER BY clause. Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list.
FROM would include inline views (subqueries) or CTE aliases. Each time it finds a subquery, it should start over from the beginning and evaluate that query.
I simplified your code a bit.
SELECT *
FROM documents d
WHERE EXISTS ( SELECT 1
FROM ProductMediaDocuments s2
WHERE d.id = MediaDocuments_Id
AND EXISTS (
SELECT *
FROM ProductSkus psk
WHERE psk.Product_Id = s2.MediaProducts_Id
)
)
I think this code is clearer don't you??
SELECT d.*
FROM documents d
JOIN ProductMediaDocuments s2 ON d.id = MediaDocuments_Id
JOIN ProductSkus psk ON psk.Product_Id = s2.MediaProducts_Id

sybase - fails to use index unless string is hard-coded

I'm using Sybase 12.5.3 (ASE); I'm new to Sybase though I've worked with MSSQL pretty extensively. I'm running into a scenario where a stored procedure is really very slow. I've traced the issue to a single SELECT stmt for a relatively large table. Modifying that statement dramatically improves the performance of the procedure (and reverting it drastically slows it down; i.e., the SELECT stmt is definitely the culprit).
-- Sybase optimizes and uses multi-column index... fast!<br>
SELECT ID,status,dateTime
FROM myTable
WHERE status in ('NEW','SENT')
ORDER BY ID
-- Sybase does not use index and does very slow table scan<br>
SELECT ID,status,dateTime
FROM myTable
WHERE status in (select status from allowableStatusValues)
ORDER BY ID
The code above is an adapted/simplified version of the actual code. Note that I've already tried recompiling the procedure, updating statistics, etc.
I have no idea why Sybase ASE would choose an index only when strings are hard-coded and choose a table scan when choosing from another table. Someone please give me a clue, and thank you in advance.
1.The issue here is poor coding. In your release, poor code and poor table design are the main reasons (98%) the optimiser makes incorrect decisions (the two go hand-in-hand, I have not figured out the proportion of each). Both:
WHERE status IN ('NEW','SENT')
and
WHERE status IN (SELECT status FROM allowableStatusValues)
are substandard, because in both cases they cause ASE to create a worktable for the contents between the brackets, which can easily be avoided (and all consequential issues avoided with it). There is no possibility of statistics on a worktable, since the statistics on either t.status or s.status is missing (AdamH is correct re that point), it correctly chooses a table scan.
Subqueries have their place, but never as a substitute for a pure (the tables are related) join. The corrections are:
WHERE status = "NEW" OR status = "SENT"
and
FROM myTable t,
allowableStatusValues s
WHERE t.status = s.status
2.The statement
|Now you don't have to add an index to get statistics on a column, but it's probably the best way.
is incorrect. Never create Indices that you will not use. If you want statistics updated on a column, simply
UPDATE STATISTICS myTable (status)
3.It is important to ensure that you have current statistics on (a) all indexed columns and (b) all join columns.
4.Yes, there is no substitute for SHOWPLAN on every code segment that is intended for release, doubly so for any code with questionable performance. You can also SET NOEXEC ON, to avoid execution, eg. for large result sets.
An index hint will work around it, but is probably not the solution.
Firstly I'd like to know if there is an index on allowableStatusValues.status, if there is then sybase will have stats on it and will have a good idea on the number of values in there.
If not then the optimiser probably won't have a good idea how many different values Status may take. It's then having to make the assumption that you're going to be extracting almost all of the rows from myTable, and the best way of doing this is a table scan (if no covering index).
Now you don't have to add an index to get statistics on a column, but it's probably the best way.
If you do have an index on allowableStatusValues.status, then i'd wonder how good your stats are. Get yourself a copy of sp__optdiag. You probably also need to tune the values of "histogram tuning factor" and "number of histogram steps", increasing these slightly from the defaults will give you more detailed statistics which always helps the optimiser.
Does it still do a table scan if you replace the subquery with a join:
SELECT m.ID,m.status,m.dateTime
FROM myTable m
JOIN allowableStatusValues a on m.status = a.status
ORDER BY ID
Rather than relying on experimental observations of how long a query takes to run, I would highly recommend getting Sybase to show you the execution plans for each query, for example:
SET showplan ON
GO
-- query/procedure call goes here
SELECT id, status, datetime
FROM myTable
WHERE status IN('NEW','SENT')
ORDER BY id
GO
SET showplan OFF
GO
With SET showplan ON, Sybase generates execution plans for every statement it executes. These can be invaluable in helping to identify where queries are not making use of appropriate indexes. For stored procedures in Sybase, the execution plan for the entire procedure is generated when the stored procedure is first executed after being compiled.
If you post the plans for each of your queries we might be able to shed more light on the problem.
Amazingly, using an index hint resolves the issue (see the (index myIndexName) line below - re-written/simplififed code below:
-- using INDEX HINT
SELECT ID,status,dateTime
FROM myTable (index myIndexName)
WHERE status in (select status from allowableStatusValues)
ORDER BY ID
Weird that I have to use this technique to avoid a table scan, but there ya go.
Garrett, by showing only the simplified code, you have likely stripped out exactly the information that would illuminate the source of the problem.
My first guess would be a type mismatch between allowableStatusValues.status and myTable.status. However, that is not the only possibility. As ninesided stated, the complete query plans (using showplan and fmtonly flags), as well as the actual table definitions and stored procedure source, is much more likely to produce a useful answer.