Linq To Entities - Any VS First VS Exists - entity-framework

I am using Entity Framework and I need to check if a product with name = "xyz" exists ...
I think I can use Any(), Exists() or First().
Which one is the best option for this kind of situation? Which one has the best performance?
Thank You,
Miguel

Okay, I wasn't going to weigh in on this, but Diego's answer complicates things enough that I think some additional explanation is in order.
In most cases, .Any() will be faster. Here are some examples.
Workflows.Where(w => w.Activities.Any())
Workflows.Where(w => w.Activities.Any(a => a.Title == "xyz"))
In the above two examples, Entity Framework produces an optimal query. The .Any() call is part of a predicate, and Entity Framework handles this well. However, if we make the result of .Any() part of the result set like this:
Workflows.Select(w => w.Activities.Any(a => a.Title == "xyz"))
... suddenly Entity Framework decides to create two versions of the condition, so the query does as much as twice the work it really needed to. However, the following query isn't any better:
Workflows.Select(w => w.Activities.Count(a => a.Title == "xyz") > 0)
Given the above query, Entity Framework will still create two versions of the condition, plus it will also require SQL Server to do an actual count, which means it doesn't get to short-circuit as soon as it finds an item.
But if you're just comparing these two queries:
Activities.Any(a => a.Title == "xyz")
Activities.Count(a => a.Title == "xyz") > 0
... which will be faster? It depends.
The first query produces an inefficient double-condition query, which means it will take up to twice as long as it has to.
The second query forces the database to check every item in the table without short-circuiting, which means it could take up to N times longer than it has to, depending on how many items need to be evaluated before finding a match. Let's assume the table has 10,000 items:
If no item in the table matches the condition, this query will take roughly half the time as the first query.
If the first item in the table matches the condition, this query will take roughly 5,000 times longer than the first query.
If one item in the table is a match, this query will take an average of 2,500 times longer than the first query.
If the query is able to leverage an index on the Title and key columns, this query will take roughly half the time as the first query.
So in summary, IF you are:
Using Entity Framework 4 (since newer versions might improve the query structure) Entity Framework 6.1 or earlier (since 6.1.1 has a fix to improve the query), AND
Querying directly against the table (as opposed to doing a sub-query), AND
Using the result directly (as opposed to it being part of a predicate), AND
Either:
You have good indexes set up on the table you are querying, OR
You expect the item not to be found the majority of the time
THEN you can expect .Any() to take as much as twice as long as .Count(). For example, a query might take 100 milliseconds instead of 50. Or 10 instead of 5.
IN ANY OTHER CIRCUMSTANCE .Any() should be at least as fast, and possibly orders of magnitude faster than .Count().
Regardless, until you have determined that this is actually the source of poor performance in your product, you should care more about what's easy to understand. .Any() more clearly and concisely states what you are really trying to figure out, so stick with that.

Any translates into "Exists" at the database level. First translates into Select Top 1 ... Between these, Exists will out perform First because the actual object doesn't need to be fetched, only a Boolean result value.
At least you didn't ask about .Where(x => x.Count() > 0) which requires the entire match set to be evaluated and iterated over before you can determine that you have one record. Any short-circuits the request and can be significantly faster.

One would think Any() gives better results, because it translates to an EXISTS query... but EF is awfully broken, generating this (edited):
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS [C1]
FROM [MyTable] AS [Extent1]
WHERE Condition
)) THEN cast(1 as bit) WHEN ( NOT EXISTS (SELECT
1 AS [C1]
FROM [MyTable] AS [Extent2]
WHERE Condition
)) THEN cast(0 as bit) END AS [C1]
FROM ( SELECT 1 AS X ) AS [SingleRowTable1]
Instead of:
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS [C1]
FROM [MyTable] AS [Extent1]
WHERE Condition
)) THEN cast(1 as bit)
ELSE cast(0 as bit) END AS [C1]
FROM ( SELECT 1 AS X ) AS [SingleRowTable1]
...basically doubling the query cost (for simple queries; it's even worse for complex ones)
I've found using .Count(condition) > 0 is faster pretty much always (the cost is exactly the same as a properly-written EXISTS query)

Ok, I decided to try this out myself. Mind you, I'm using the OracleManagedDataAccess provider with the OracleEntityFramework, but I'm guessing it produces compliant SQL.
I found that First() was faster than Any() for a simple predicate. I'll show the two queries in EF and the SQL that was generated. Mind you, this is a simplified example, but the question was asking whether any, exists or first was faster for a simple predicate.
var any = db.Employees.Any(x => x.LAST_NAME.Equals("Davenski"));
So what does this resolve to in the database?
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS "C1"
FROM "MYSCHEMA"."EMPLOYEES" "Extent1"
WHERE ('Davenski' = "Extent1"."LAST_NAME")
)) THEN 1 ELSE 0 END AS "C1"
FROM ( SELECT 1 FROM DUAL ) "SingleRowTable1"
It's creating a CASE statement. As we know, ANY is merely syntatic sugar. It resolves to an EXISTS query at the database level. This happens if you use ANY at the database level as well. But this doesn't seem to be the most optimized SQL for this query.
In the above example, the EF construct Any() isn't needed here and merely complicates the query.
var first = db.Employees.Where(x => x.LAST_NAME.Equals("Davenski")).Select(x=>x.ID).First();
This resolves to in the database as:
SELECT
"Extent1"."ID" AS "ID"
FROM "MYSCHEMA"."EMPLOYEES" "Extent1"
WHERE ('Davenski' = "Extent1"."LAST_NAME") AND (ROWNUM <= (1) )
Now this looks like a more optimized query than the initial query. Why? It's not using a CASE ... THEN statement.
I ran these trivial examples several times, and in ALMOST every case, (no pun intended), the First() was faster.
In addition, I ran a raw SQL query, thinking this would be faster:
var sql = db.Database.SqlQuery<int>("SELECT ID FROM MYSCHEMA.EMPLOYEES WHERE LAST_NAME = 'Davenski' AND ROWNUM <= (1)").First();
The performance was actually the slowest, but similar to the Any EF construct.
Reflections:
EF Any doesn't exactly map to how you might use Any in the database. I could have written a more optimized query in Oracle with ANY than what EF produced without the CASE THEN statement.
ALWAYS check your generated SQL in a log file or in the debug output window.
If you're going to use ANY, remember it's syntactic sugar for EXISTS. Oracle also uses SOME, which is the same as ANY. You're generally going to use it in the predicate as a replacement for IN. In this case it generates a series of ORs in your WHERE clause. The real power of ANY or EXISTS is when you're using Subqueries and are merely testing for the EXISTENCE of related data.
Here's an example where ANY really makes sense. I'm testing for the EXISTENCE of related data. I don't want to get all of the records from the related table. Here I want to know if there are Surveys that have Comments.
var b = db.Survey.Where(x => x.Comments.Any()).ToList();
This is the generated SQL:
SELECT
"Extent1"."SURVEY_ID" AS "SURVEY_ID",
"Extent1"."SURVEY_DATE" AS "SURVEY_DATE"
FROM "MYSCHEMA"."SURVEY" "Extent1"
WHERE ( EXISTS (SELECT
1 AS "C1"
FROM "MYSCHEMA"."COMMENTS" "Extent2"
WHERE ("Extent1"."SURVEY_ID" = "Extent2"."SURVEY_ID")
))
This is optimized SQL!
I believe the EF does a wonderful job generating SQL. But you have to understand how the EF constructs map to DB constructs else you can create some nasty queries.
And probably the best way to get a count of related data is to do an explicit Load with a Collection Query count. This is far better than the examples provided in prior posts. In this case you're not loading related entities, you're just obtaining a count. Here I'm just trying to find out how many Comments I have for a particular Survey.
var d = db.Survey.Find(1);
var e = db.Entry(d).Collection(f => f.Comments)
.Query()
.Count();

Any() and First() is used with IEnumerable which gives you the flexibility for evaluating things lazily. However Exists() requires List.
I hope this clears things out for you and help you in deciding which one to use.

Related

where column in (single value) performance

I am writing dynamic sql code and it would be easier to use a generic where column in (<comma-seperated values>) clause, even when the clause might have 1 term (it will never have 0).
So, does this query:
select * from table where column in (value1)
have any different performance than
select * from table where column=value1
?
All my test result in the same execution plans, but if there is some knowledge/documentation that sets it to stone, it would be helpful.
This might not hold true for each and any RDBMS as well as for each an any query with its specific circumstances.
The engine will translate WHERE id IN(1,2,3) to WHERE id=1 OR id=2 OR id=3.
So your two ways to articulate the predicate will (probably) lead to exactly the same interpretation.
As always: We should not really bother about the way the engine "thinks". This was done pretty well by the developers :-) We tell - through a statement - what we want to get and not how we want to get this.
Some more details here, especially the first part.
I Think this will depend on platform you are using (optimizer of the given SQL engine).
I did a little test using MySQL Server and:
When I query select * from table where id = 1; i get 1 total, Query took 0.0043 seconds
When I query select * from table where id IN (1); i get 1 total, Query took 0.0039 seconds
I know this depends on Server and PC and what.. But The results are very close.
But you have to remember that IN is non-sargable (non search argument able), it will not use the index to resolve the query, = is sargable and support the index..
If you want the best one to use, You should test them in your environment because they both work so good!!

Entityframework 6.2.0 IQueryable.Count VS IQueryable.ToList().Count

This is a very weird problem
In short
var q = (some query).Count();
Gives my a number and
var q = (some query).ToList().Count();
Gives me entirely different number...
with mentioning that (some query) has two includes (joins)
is there a sane explanation for that???
EDIT: here is my query
var q = db.membership_renewals.Include(i => i.member).Include(i => i.sport).Where(w => w.isDeleted == false).Count();
this gives me a wrong number
and this:
var q = db.membership_renewals.Include(i => i.member).Include(i => i.sport).Where(w => w.isDeleted == false).ToList().Count();
Gives me accurate number..
EDIT 2
Wher I wrote my query as linq query it worked perfectly...
var q1 = (from d in db.membership_renewals where d.isDeleted == false join m in db.members on d.mr_memberId equals m.m_id join s in db.sports on d.mr_sportId equals s.s_id select d.mr_id).Count();
I think the problem that entity framework doesn't execute the joins in the original query but forced to execute them in (ToList())...
I Finally figured out what's going on...
The database tables are not linked together in the database (there are no relationship or constraints defined in the database itself) so the code doesn't execute the (inner join) part.
However my classes on the other hand are well written so when I perform (ToList()) it automatically ignores the unbound rows...
And when I wrote the linq query defining the relation ship keys (primary and foreign) it worked alright because now the database understands my relation between tables...
Thanks everyone you've been great....
My guess is IQueryable gives a smaller number cause not all the objects are loaded, kind of like a stream in Java, but IQueryable.toList().count() forces the Iqueryable to load all the data and it is traversed by the list constructor and stored in the list so IQueryable.toList().Count() is the accurate answer. This is based on 5 minutes of search on MSDN.
The idea is the underlying datastore of the IQueryable is a database iterator so it executes differently every time because it executes the query again on the database, so if you call it twice against the same table, and the data has changed you get different results. This is called delayed execution. But when you say IQueryable.ToList() you force the iterator to do the whole iteration once and dump the results in a list which is constant

Entity framework generates inefficient SQL

I wonder why is Entity framework generating such an inefficient SQL query. In my code I expected the WHERE to act upon the INCLUDE:
db.Employment.Where(x => x.Active).Include(x => x.Employee).Where(x => x.Employee.UserID == UserID)
but I ended up with a double SQL JOIN:
SELECT [x].[ID], [x].[Active], [x].[CurrencyID], [x].[DepartmentID], [x].[EmplEnd], [x].[EmplStart], [x].[EmployeeID], [x].[HolidayGroupID], [x].[HourlyCost], [x].[JobTitle], [x].[ManagerID], [x].[WorkScheduleGroupID], [e].[ID], [e].[Active], [e].[Address], [e].[BirthDate], [e].[CitizenshipID], [e].[City], [e].[CountryID], [e].[Email], [e].[FirstName], [e].[Gender], [e].[LastName], [e].[Note], [e].[Phone], [e].[PostalCode], [e].[TaxNumber], [e].[UserID]
FROM [Employment] AS [x]
INNER JOIN [Employee] AS [x.Employee] ON [x].[EmployeeID] = [x.Employee].[ID]
INNER JOIN [Employee] AS [e] ON [x].[EmployeeID] = [e].[ID]
WHERE ([x].[Active] = 1) AND ([x.Employee].[UserID] = #__UserID_0)
I found out that this query will create better SQL:
db.Employment.Where(x => x.Active).Where(x => x.Employee.UserID == UserID)
SELECT [x].[ID], [x].[Active], [x].[CurrencyID], [x].[DepartmentID], [x].[EmplEnd], [x].[EmplStart], [x].[EmployeeID], [x].[HolidayGroupID], [x].[HourlyCost], [x].[JobTitle], [x].[ManagerID], [x].[WorkScheduleGroupID]
FROM [Employment] AS [x]
INNER JOIN [Employee] AS [x.Employee] ON [x].[EmployeeID] = [x.Employee].[ID]
WHERE ([x].[Active] = 1) AND ([x.Employee].[UserID] = #__UserID_0)
However, the problem here that referenced entities are not retrieved from the DB.
Why don't two codes produce same SQLs?
The SQL is different because the statments are different.
Entity Framework does produce inefficient TSQL, it always has. By abstracting the subtleties that are necessary for SQL with good performance and replacing them with "belt and braces" nearly always work alternatives you sacrafice performance for utility.
If you need good performance, write the SQL yourself. Dapper works well for me. You can't realistically expect a "one size fits all" solution to come up with the best code for your specific situation. You can do this across the board or just where you need to.
Unless you have high volume or specific performance requirements get on with it and use whatever you find easiest. If you need to tune your queries to your database you are going to have learn the details of your database engine and implement the queries yourself. If you are expecting the next iteration of Entity Framework to be the magic bullet that allows you fast, efficient SQL data access with minimal knowledge, good luck.
P.S.
Off-topic but, NoSQL probably isn't the answer either, is just a different class of database.

COUNT(field) returns correct amount of rows but full SELECT query returns zero rows

I have a UDF in my database which basically tries to get a station (e.g. bus/train) based on some input data (geographic/name/type). Inside this function i try to check if there are any rows matching the given values:
SELECT
COUNT(s.id)
INTO
firsttry
FROM
geographic.stations AS s
WHERE
ST_DWithin(s.the_geom,plocation,0.0017)
AND
s.name <-> pname < 0.8
AND
s.type ~ stype;
The firsttry variable now contains the value 1. If i use the following (slightly extended) SELECT statement i get no results:
RETURN query SELECT
s.id, s.name, s.type, s.the_geom,
similarity(
regexp_replace(s.name::text,'(Hauptbahnhof|Hbf)','Hbf'),
regexp_replace(pname::text,'(Hauptbahnhof|Hbf)','Hbf')
)::double precision AS sml,
st_distance(s.the_geom,plocation) As dist from geographic.stations AS s
WHERE ST_DWithin(s.the_geom,plocation,0.0017) and s.name <-> pname < 0.8
AND s.type ~ stype
ORDER BY dist asc,sml desc LIMIT 1;
the parameters are as follows:
stype = '^railway'
pname = 'Amsterdam Science Park'
plocation = ST_GeomFromEWKT('SRID=4326;POINT(4.9492530 52.3531670)')
the tuple i need to be returned is:
id name type geom (displayed as ST_AsText)
909658;"Amsterdam Sciencepark";"railway_station";"POINT(4.9482893 52.352904)"
The same UDF returns quite well for a lot of other stations, but this is one (of more) which just won't work. Any suggestions?
P.S. The use of the <-> operator is coming from the pg_trgm module.
Some ideas on how to troubleshoot this:
Break your troubleshooting into steps. Start with the simplest query possible. No aggregates, just joins and no filters. Then add filters. Then add order by, then add aggregates. Look at exactly where the change occurs.
Try reindexing the database.
One possibility that occurs to me based on this is that it could be a corrupted index used in the second query but not the first. I have seen corrupted indexes in the past and usually they throw errors but at least in theory they should be able to create a problem like this.
If this is correct, your query will suddenly return rows if you remove the ORDER BY clause.
If you have a corrupted index, then you need to pay close attention to hardware. Is the RAM ECC? Is the processor overheating? How are you disks doing?
A second possibility is that there is a typo on a join condition of filter statement. Normally this is something I would suspect first but it is easy enough to weed out index problems to start there. If removing the ORDER BY doesn't change things, then chances are it is a typo. If you can't find a typo, then try reindexing.

T-SQL speed comparison between LEFT() vs. LIKE operator

I'm creating result paging based on first letter of certain nvarchar column and not the usual one, that usually pages on number of results.
And I'm not faced with a challenge whether to filter results using LIKE operator or equality (=) operator.
select *
from table
where name like #firstletter + '%'
vs.
select *
from table
where left(name, 1) = #firstletter
I've tried searching the net for speed comparison between the two, but it's hard to find any results, since most search results are related to LEFT JOINs and not LEFT function.
"Left" vs "Like" -- one should always use "Like" when possible where indexes are implemented because "Like" is not a function and therefore can utilize any indexes you may have on the data.
"Left", on the other hand, is function, and therefore cannot make use of indexes. This web page describes the usage differences with some examples. What this means is SQL server has to evaluate the function for every record that's returned.
"Substring" and other similar functions are also culprits.
Your best bet would be to measure the performance on real production data rather than trying to guess (or ask us). That's because performance can sometimes depend on the data you're processing, although in this case it seems unlikely (but I don't know that, hence why you should check).
If this is a query you will be doing a lot, you should consider another (indexed) column which contains the lowercased first letter of name and have it set by an insert/update trigger.
This will, at the cost of a minimal storage increase, make this query blindingly fast:
select * from table where name_first_char_lower = #firstletter
That's because most database are read far more often than written, and this will amortise the cost of the calculation (done only for writes) across all reads.
It introduces redundant data but it's okay to do that for performance as long as you understand (and mitigate, as in this suggestion) the consequences and need the extra performance.
I had a similar question, and ran tests on both. Here is my code.
where (VOUCHER like 'PCNSF%'
or voucher like 'PCLTF%'
or VOUCHER like 'PCACH%'
or VOUCHER like 'PCWP%'
or voucher like 'PCINT%')
Returned 1434 rows in 1 min 51 seconds.
vs
where (LEFT(VOUCHER,5) = 'PCNSF'
or LEFT(VOUCHER,5)='PCLTF'
or LEFT(VOUCHER,5) = 'PCACH'
or LEFT(VOUCHER,4)='PCWP'
or LEFT (VOUCHER,5) ='PCINT')
Returned 1434 rows in 1 min 27 seconds
My data is faster with the left 5. As an aside my overall query does hit some indexes.
I would always suggest to use like operator when the search column contains index. I tested the above query in my production environment with select count(column_name) from table_name where left(column_name,3)='AAA' OR left(column_name,3)= 'ABA' OR ... up to 9 OR clauses. My count displays 7301477 records with 4 secs in left and 1 second in like i.e where column_name like 'AAA%' OR Column_Name like 'ABA%' or ... up to 9 like clauses.
Calling a function in where clause is not a best practice. Refer http://blog.sqlauthority.com/2013/03/12/sql-server-avoid-using-function-in-where-clause-scan-to-seek/
Entity Framework Core users
You can use EF.Functions.Like(columnName, searchString + "%") instead of columnName.startsWith(...) and you'll get just a LIKE function in the generated SQL instead of all this 'LEFT' craziness!
Depending upon your needs you will probably need to preprocess searchString.
See also https://github.com/aspnet/EntityFrameworkCore/issues/7429
This function isn't present in Entity Framework (non core) EntityFunctions so I'm not sure how to do it for EF6.