ANDALSO options, stop evaluating when one fails - tsql

I have a SQL select statement that reads items. There are conditions for which items to display, but when one condition fails, I don't need to check the other.
For example:
where item like 'M%'
and item_class='B'
and fGetOnHand(item)>0
If either of the first two fails, I do NOT want to evaluate the last one (a call to a user-defined function).

From what I have read on this site, SQL Server's AND and OR operators are not guaranteed to short-circuit. This means the call to the UDF could be evaluated first, or not at all, should one of the other conditions be evaluated first and fail.
You might try rewriting your logic as a CASE expression, where the order of evaluation is fixed:
WHERE
    CASE
        WHEN item NOT LIKE 'M%' OR item_class <> 'B' THEN 0
        WHEN fGetOnHand(item) <= 0 THEN 0
        ELSE 1
    END = 1
The above logic forces the checks on item and item_class to happen first. Should either fail, the first branch of the CASE expression evaluates to 0 and the condition fails. Only if both checks pass is the UDF evaluated.
This is very verbose, but if the UDF call is a serious penalty, then phrasing your WHERE clause as above may be worth the trade-off of verbose code for better performance.
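Putting it together, a minimal sketch of the full query (the items table name is an assumption, since the original SELECT isn't shown):
SELECT item
FROM items -- hypothetical table name
WHERE
    CASE
        WHEN item NOT LIKE 'M%' OR item_class <> 'B' THEN 0
        WHEN fGetOnHand(item) <= 0 THEN 0
        ELSE 1
    END = 1;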

Skipping an iteration in for loop

Is there an effective way to skip an iteration in a for loop?
I have a big dataset that consists of option prices on the S&P 500 index. The dataset ranges from 1992 to 2009. In total, I have 3481 quoting dates, stored in a vector that I call QDvector. I'm only interested in the quoting dates from 2008 until 2009, which correspond to indices 3290 through 3481. For each quoting date, I run a certain program. However, in a few special cases, the program does not work due to a lack of stock data. How do I skip these iterations in the for loop?
For instance, suppose that I have
for index = 3290:3481
[...]
end
and suppose that I do not want to take the index == 3389 into account. How do I skip this iteration?
I can use a while loop, but I really do not want to take this index into consideration at all, since I also have to plot certain parameters and I want to skip the parameters corresponding to index == 3389.
I could remove the quoting date from the QDvector, but I would rather not take this approach, since I would have to change too many other variables as well.
I'm simply looking for a good way to skip certain iterations without any consequences.
Yes, the continue statement allows you to do that.
for index = 3290:3481
    [...]
    continue; % wherever applicable
end
Check for the index and execute your code whenever it's not found.
for index = 3290:3481
    if index ~= 3389 % note: MATLAB uses ~= for inequality, not !=
        [...]
    end
end
Without an else branch, nothing happens when the condition is false, effectively skipping the index.
Alternatively:
for index = 3290:3481
    if index == 3389
        continue
    else
        [...]
    end
end
This is slightly less efficient, since the check runs on every pass and in most cases falls through to the else branch. But over only 192 iterations, it probably won't be noticeable.
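If more than one quoting date turns out to be bad, a small generalization is to collect the offending indices in a vector and test with ismember. A sketch (the badIdx values are hypothetical):
badIdx = [3389 3401]; % hypothetical list of indices to skip
for index = 3290:3481
    if ismember(index, badIdx)
        continue % skip these iterations entirely
    end
    [...]
end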

OrientDB COALESCE Function With Unexpected Results

Using OrientDB 2.1.2, I was trying to use the built-in COALESCE function and ran into some strange results.
Goal: select the maximum value of a property based on certain conditions OR 0 if there is no value for that property given the conditions.
Here's what I tried to use to produce my results.
Attempt 1: Just selecting the Maximum value of a property based on some condition - This worked as I expected... a single result
Attempt 2: Same query as before but now I'm adding an extra condition that I know will cause no results to be returned - This also worked as I expected... no results found
Attempt 3: Using COALESCE to select 0 if the result from the second query returns no results - This is where the query fails (see below).
I would expect the second query to return no results, thereby qualifying as a NULL result, meaning that the COALESCE function should then go on to return 0. What happens instead is that COALESCE treats the result of the inner select (which, again, returns no rows) as a valid non-null value, so it never returns the intended 0.
Two questions for those who are familiar with using the OrientDB API:
Do you think this functionality is working properly or should an issue be filed with the orientdb issue tracker?
Is there another way to achieve my goal without using COALESCE or by using COALESCE in a different way?
Try rather:
select coalesce($a, 0) from ... let $a = (subquery) where ...
Or this variant, because the sub-select returns a result set while coalesce wants a single value:
select coalesce($a[0], 0) from ... let $a = (subquery) where ...
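For concreteness, a hypothetical version of the second variant (the Item class and qty property are assumptions, not from the original question):
select coalesce($a[0], 0) from Item
let $a = (select max(qty) from Item where qty > 100)
limit 1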

Pass initial condition as an argument to a custom aggregate

I want to create a function that takes an initial condition as an argument and uses a set of values to compute a final result. In my specific case (it has to do with geometry processing in PostGIS), it's important that each member of the set is processed against the current (which might be the initial) state one at a time, to keep the result clean. (I need to deal with sliver and gap issues, and have had a very difficult time doing so any way other than one element at a time.) The processing I need to do is already defined as a function that takes two appropriate arguments (where the first can be the current state and the second can be a value from the set).
So I want something similar to what you would expect is intended by this:
SELECT my_func('some initial condition', my_table.some_column) FROM my_table;
Aggregates seem like a natural fit for this, but I can't figure out how to get the function to accept an initial state. An iterative approach in PL/pgSQL would be fairly straightforward:
CREATE FUNCTION my_func(initial sometype, vals sometype[])
-- Returns, language, etc. ("values" is a reserved word, hence "vals")
AS $$
DECLARE
    current sometype := initial;
    v sometype;
BEGIN
    FOREACH v IN ARRAY vals LOOP
        current := SomeBinaryOperation(current, v);
    END LOOP;
    RETURN current;
END;
$$
But it would require rolling the values up into an array manually:
SELECT my_func('some initial condition', ARRAY_AGG(my_table.some_column)) FROM my_table;
You can create aggregates with multiple arguments, but the arguments that follow the first one are used as additional arguments to the transition function. I can see no way that one of them could be turned into an initial condition. (At least not without a remarkably hacky transition function that treats its extra argument as the initial condition while the state is still NULL. And that's only if the aggregate argument can be a constant instead of a column; a sketch of that hack follows.)
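A minimal sketch of that hack, assuming sometype and SomeBinaryOperation from above (the aggregate's state defaults to NULL, and the transition function seeds it from the extra argument on the first call):
CREATE FUNCTION my_trans(state sometype, val sometype, init sometype)
RETURNS sometype AS $$
    -- Seed the state from init on the first call, then fold normally.
    SELECT SomeBinaryOperation(coalesce(state, init), val);
$$ LANGUAGE sql;

CREATE AGGREGATE my_agg(sometype, sometype) (
    SFUNC = my_trans,
    STYPE = sometype -- initial state defaults to NULL
);

-- The initial condition is passed on every row but only matters once:
SELECT my_agg(some_column, 'some initial condition') FROM my_table;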
Am I best off just using the PL/pgSQL iterative approach, or is there a way to create an aggregate that accepts its initial condition as an argument? Or is there something I haven't thought of?
I'm on PostgreSQL 9.3 at the moment, but upgrading may be an option if there's new stuff that would help.

Dealing with division errors in PostgreSQL without procedural code - is it possible?

I have to produce (in PostgreSQL, if that matters) a table containing a column with the quotient of two sums, basically like this (quite simplified):
select name, sum(a)/sum(b), sum(c)/sum(d)
from a_complex_nested_select_query_with_many_zeros
group by name
order by name;
The table has tens of thousands of rows (not too big), but in a few cases, summing over b or d produces 0, which causes the whole query to fail with a "division by zero" error.
In researching how to deal with the exception, I was only able to find information on PL/pgSQL control structures, which appear to require the creation of a function (but I'm not sure).
My question is of course how to make this query work. Perhaps the answer has something to do with
Can an exception be caught in non-procedural SQL (PostgreSQL, perhaps)?
Is this a case where procedural code is necessary?
Can a CASE..WHEN..ELSE..END structure avoid the problem? I'm stuck on this because it looks like the SUM() calls would be repeated, but it is appealing because I do not know enough about Postgres to tell whether exception catching carries a performance penalty.
Is there a way to, again without a function, ensure SUM() is evaluated once in a CASE expression?
If a function is required, what would it look like?
EDIT By "repeating sum calls" I mean that I know I could write:
select name,
case when sum(b)=0 then null else sum(a)/sum(b) end,
case when sum(d)=0 then null else sum(c)/sum(d) end
and so on, but I am not sure whether that is a good approach. (I guess someone will answer with "why don't you profile it?", but I think there may be better approaches out there, somewhere.)
nullif returns null if its two arguments are equal, and a division by null evaluates to null:
select
name,
sum(a) / nullif(sum(b), 0),
sum(c) / nullif(sum(d), 0)
from a_complex_nested_select_query_with_many_zeros
group by name
order by name;
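If you would rather report 0 than NULL for the groups whose divisor sums to zero, you can wrap each division in coalesce (same query, one more layer):
select
name,
coalesce(sum(a) / nullif(sum(b), 0), 0),
coalesce(sum(c) / nullif(sum(d), 0), 0)
from a_complex_nested_select_query_with_many_zeros
group by name
order by name;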

Linq To Entities - Any VS First VS Exists

I am using Entity Framework and I need to check if a product with name = "xyz" exists ...
I think I can use Any(), Exists() or First().
Which one is the best option for this kind of situation? Which one has the best performance?
Thank You,
Miguel
Okay, I wasn't going to weigh in on this, but Diego's answer complicates things enough that I think some additional explanation is in order.
In most cases, .Any() will be faster. Here are some examples.
Workflows.Where(w => w.Activities.Any())
Workflows.Where(w => w.Activities.Any(a => a.Title == "xyz"))
In the above two examples, Entity Framework produces an optimal query. The .Any() call is part of a predicate, and Entity Framework handles this well. However, if we make the result of .Any() part of the result set like this:
Workflows.Select(w => w.Activities.Any(a => a.Title == "xyz"))
... suddenly Entity Framework decides to create two versions of the condition, so the query does as much as twice the work it really needs to. And the following query isn't any better:
Workflows.Select(w => w.Activities.Count(a => a.Title == "xyz") > 0)
Given the above query, Entity Framework will still create two versions of the condition, plus it will also require SQL Server to do an actual count, which means it doesn't get to short-circuit as soon as it finds an item.
But if you're just comparing these two queries:
Activities.Any(a => a.Title == "xyz")
Activities.Count(a => a.Title == "xyz") > 0
... which will be faster? It depends.
The first query produces an inefficient double-condition query, which means it will take up to twice as long as it has to.
The second query forces the database to check every item in the table without short-circuiting, which means it could take up to N times longer than it has to, depending on how many items need to be evaluated before finding a match. Let's assume the table has 10,000 items:
If no item in the table matches the condition, this query will take roughly half the time of the first query.
If the first item in the table matches the condition, this query will take roughly 5,000 times longer than the first query.
If one item in the table is a match, this query will take an average of 2,500 times longer than the first query.
If the query is able to leverage an index on the Title and key columns, this query will take roughly half the time of the first query.
So in summary, IF you are:
Using Entity Framework 6.1 or earlier (6.1.1 has a fix that improves the query), AND
Querying directly against the table (as opposed to doing a sub-query), AND
Using the result directly (as opposed to it being part of a predicate), AND
Either:
You have good indexes set up on the table you are querying, OR
You expect the item not to be found the majority of the time
THEN you can expect .Any() to take as much as twice as long as .Count(). For example, a query might take 100 milliseconds instead of 50. Or 10 instead of 5.
IN ANY OTHER CIRCUMSTANCE .Any() should be at least as fast, and possibly orders of magnitude faster than .Count().
Regardless, until you have determined that this is actually the source of poor performance in your product, you should care more about what's easy to understand. .Any() more clearly and concisely states what you are really trying to figure out, so stick with that.
Any translates into "Exists" at the database level. First translates into Select Top 1 ... Between these, Exists will outperform First because the actual object doesn't need to be fetched, only a Boolean result value.
At least you didn't ask about .Where(x => x.Count() > 0) which requires the entire match set to be evaluated and iterated over before you can determine that you have one record. Any short-circuits the request and can be significantly faster.
One would think Any() gives better results, because it translates to an EXISTS query... but EF is awfully broken, generating this (edited):
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS [C1]
FROM [MyTable] AS [Extent1]
WHERE Condition
)) THEN cast(1 as bit) WHEN ( NOT EXISTS (SELECT
1 AS [C1]
FROM [MyTable] AS [Extent2]
WHERE Condition
)) THEN cast(0 as bit) END AS [C1]
FROM ( SELECT 1 AS X ) AS [SingleRowTable1]
Instead of:
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS [C1]
FROM [MyTable] AS [Extent1]
WHERE Condition
)) THEN cast(1 as bit)
ELSE cast(0 as bit) END AS [C1]
FROM ( SELECT 1 AS X ) AS [SingleRowTable1]
...basically doubling the query cost (for simple queries; it's even worse for complex ones)
I've found using .Count(condition) > 0 is faster pretty much always (the cost is exactly the same as a properly-written EXISTS query)
Ok, I decided to try this out myself. Mind you, I'm using the OracleManagedDataAccess provider with the OracleEntityFramework, but I'm guessing it produces compliant SQL.
I found that First() was faster than Any() for a simple predicate. I'll show the two queries in EF and the SQL that was generated. Granted, this is a simplified example, but the question was asking whether Any, Exists, or First was faster for a simple predicate.
var any = db.Employees.Any(x => x.LAST_NAME.Equals("Davenski"));
So what does this resolve to in the database?
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS "C1"
FROM "MYSCHEMA"."EMPLOYEES" "Extent1"
WHERE ('Davenski' = "Extent1"."LAST_NAME")
)) THEN 1 ELSE 0 END AS "C1"
FROM ( SELECT 1 FROM DUAL ) "SingleRowTable1"
It's creating a CASE statement. As we know, ANY is merely syntactic sugar: it resolves to an EXISTS query at the database level, just as it does if you use ANY in the database itself. But this doesn't seem to be the most optimized SQL for this query.
In the above example, the EF construct Any() isn't needed here and merely complicates the query.
var first = db.Employees.Where(x => x.LAST_NAME.Equals("Davenski")).Select(x=>x.ID).First();
This resolves to in the database as:
SELECT
"Extent1"."ID" AS "ID"
FROM "MYSCHEMA"."EMPLOYEES" "Extent1"
WHERE ('Davenski' = "Extent1"."LAST_NAME") AND (ROWNUM <= (1) )
Now this looks like a more optimized query than the initial query. Why? It's not using a CASE ... THEN statement.
I ran these trivial examples several times, and in ALMOST every case (no pun intended), First() was faster.
In addition, I ran a raw SQL query, thinking this would be faster:
var sql = db.Database.SqlQuery<int>("SELECT ID FROM MYSCHEMA.EMPLOYEES WHERE LAST_NAME = 'Davenski' AND ROWNUM <= (1)").First();
The performance was actually the slowest, but similar to the Any EF construct.
Reflections:
EF's Any doesn't map exactly to how you might use ANY in the database. Without the CASE ... THEN statement, I could have written a more optimized query in Oracle than what EF produced.
ALWAYS check your generated SQL in a log file or in the debug output window.
If you're going to use ANY, remember it's syntactic sugar for EXISTS. Oracle also uses SOME, which is the same as ANY. You're generally going to use it in the predicate as a replacement for IN. In this case it generates a series of ORs in your WHERE clause. The real power of ANY or EXISTS is when you're using Subqueries and are merely testing for the EXISTENCE of related data.
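For instance, in a predicate these two forms are equivalent (deptno is a hypothetical column):
WHERE deptno = ANY (10, 20, 30)
-- same as:
WHERE deptno IN (10, 20, 30)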
Here's an example where ANY really makes sense. I'm testing for the EXISTENCE of related data. I don't want to get all of the records from the related table. Here I want to know if there are Surveys that have Comments.
var b = db.Survey.Where(x => x.Comments.Any()).ToList();
This is the generated SQL:
SELECT
"Extent1"."SURVEY_ID" AS "SURVEY_ID",
"Extent1"."SURVEY_DATE" AS "SURVEY_DATE"
FROM "MYSCHEMA"."SURVEY" "Extent1"
WHERE ( EXISTS (SELECT
1 AS "C1"
FROM "MYSCHEMA"."COMMENTS" "Extent2"
WHERE ("Extent1"."SURVEY_ID" = "Extent2"."SURVEY_ID")
))
This is optimized SQL!
I believe EF does a wonderful job generating SQL. But you have to understand how the EF constructs map to DB constructs, or else you can create some nasty queries.
And probably the best way to get a count of related data is to do an explicit Load with a Collection Query count. This is far better than the examples provided in prior posts. In this case you're not loading related entities, you're just obtaining a count. Here I'm just trying to find out how many Comments I have for a particular Survey.
var d = db.Survey.Find(1);
var e = db.Entry(d).Collection(f => f.Comments)
.Query()
.Count();
Any() and First() are used with IEnumerable<T>, which gives you the flexibility of evaluating things lazily. However, Exists() requires a List<T>.
I hope this clears things up for you and helps you decide which one to use.
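For illustration, a minimal sketch of the three options against a hypothetical Products set with a Name property (the names are assumptions, not from the original question):
// Translates to EXISTS; the database can stop at the first match.
bool viaAny = context.Products.Any(p => p.Name == "xyz");

// Translates to a COUNT; the database counts every match first.
bool viaCount = context.Products.Count(p => p.Name == "xyz") > 0;

// Exists() is defined on List<T>, so the whole table is materialized
// in memory first; usually the worst choice here.
bool viaExists = context.Products.ToList().Exists(p => p.Name == "xyz");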