I have a simple entity framework query.
It uses grouping:
source.GroupBy(x => x.Key)
.Select(x => x.Count(z => z.IsDefaultValue == false) > 0
? x.FirstOrDefault(z => z.IsDefaultValue == false)
: x.FirstOrDefault()
);
Execution plan for it looks like this:
Then I change the query:
source.GroupBy(x => x.Key)
.Select(x => x.Any(z => z.IsDefaultValue == false)
? x.FirstOrDefault(z => z.IsDefaultValue == false)
: x.FirstOrDefault()
);
Now I use Any instead of Count.
Its plan looks like this:
My question is: what query should I use?
What query is more efficient?
I don't understand anything about execution plans :(
What important information do you see on these execution plans?
EDIT: Please drag pictures in a new tab, it will be more readable.
My question is: what query should I use? What query is more efficient?
To find the answer to that, I would simply take both of the SQL queries that have been generated (you can see the full query in the plan's xml) and run them in one SQL Server Management Studio query window together, with the "Include actual execution plan" option on.
The "Query cost (relative to the batch)" on the resulting plan will then tell you which one is more efficient on your actual schema/data.
What about EF's Any() vs Count() > 0 ? Which is more efficient?
In general if you just want to know if any of the "things" match your criteria, then it's often suggested you use Any() instead of Count() > 0 and let LINQ worry about constructing the most efficient way. In this use case I would just check the SQL and see.
See answers like this
For a plain IEnumerable<T>, Any() will generally be quicker, as it can stop at the first matching element instead of enumerating the whole sequence.
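To make the in-memory difference concrete, here is a minimal sketch using LINQ to Objects only (no Entity Framework involved). The Numbers helper and the Measure method are hypothetical names introduced for illustration; they count how many elements each approach pulls from the sequence before it can answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AnyVersusCount
{
    // Hypothetical helper: yields 1..n and reports every element that is actually pulled.
    public static IEnumerable<int> Numbers(int n, Action onPull)
    {
        for (var i = 1; i <= n; i++)
        {
            onPull();
            yield return i;
        }
    }

    // Returns how many elements Any() and Count() > 0 each enumerated.
    public static (int AnyPulls, int CountPulls) Measure(int n)
    {
        var anyPulls = 0;
        var countPulls = 0;

        // Any() stops at the first element that satisfies the predicate.
        bool viaAny = Numbers(n, () => anyPulls++).Any(x => x > 0);

        // Count() has to walk the whole sequence before the comparison can run.
        bool viaCount = Numbers(n, () => countPulls++).Count(x => x > 0) > 0;

        if (viaAny != viaCount)
        {
            throw new InvalidOperationException("Both approaches should agree.");
        }

        return (anyPulls, countPulls);
    }

    public static void Main()
    {
        var result = Measure(1000);
        Console.WriteLine($"Any pulled {result.AnyPulls} item(s); Count pulled {result.CountPulls} item(s).");
    }
}
```

This only describes IEnumerable<T>. For an EF query the work happens in the database, so the generated SQL and its plan are what matter there.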
I'm running this linq query which is a little big.
var events = _context.Event.OrderByDescending(e => e.StartDate).Where(e => e.IsPresentation == true).Where(e => e.IsCanceled == false).Where(e => e.StartDate > new DateTime());
The page that outputs the data from this query is taking too long to load. Maybe it is because I'm using too many Where calls.
I had the same issue in a different query that used Include and ThenInclude, and I split that query to improve performance. I'm trying to figure out how to do the same thing here, because this query doesn't use any Include.
Overall, the performance of the query will largely depend on the table size, and availability of suitable indices.
A couple things I can note from that query:
This statement doesn't make a lot of sense: .Where(e => e.StartDate > new DateTime()). new DateTime() initializes a DateTime to 01/01/0001. Any date stored in a SQL Server DateTime column, for example, will be 01/01/1753 or later, so this condition is moot. If the database/entity DateTime value is null-able, then .Where(e => e.StartDate.HasValue) would be more applicable. If the DateTime value is not null-able, this condition can be left off entirely.
So if the field is null-able, the Linq expression would look more like:
var events = _context.Event
.Where(e => e.IsPresentation && !e.IsCanceled && e.StartDate.HasValue)
.OrderByDescending(e => e.StartDate)
.ToList();
If it's not null-able:
var events = _context.Event
.Where(e => e.IsPresentation && !e.IsCanceled)
.OrderByDescending(e => e.StartDate)
.ToList();
The next culprit to eliminate: lazy load proxy hits. Does the Event entity have navigation properties to any other entities? If this is something like a web application and you are serializing entities to send back to the client, navigation property EF proxies can absolutely kill performance. Any code after this call that touches a navigation property will result in extra individual DB calls to lazy load these navigation properties. For methods that return lists of entities this can be critical. If an Event has a reference to something like a User and you load 1000 events referencing roughly 1000 users, when a serializer goes to serialize those 1000 events, it will "touch" each of the user references. This leads to ~1000 extra SQL statements effectively doing SELECT * FROM tblUser WHERE UserId = 1, SELECT * FROM tblUser WHERE UserId = 2, and so on for each User ID in each Event.
If you need these related entities you can eager load them with Include(e => e.User), which will be faster than loading them individually, but this does mean loading a lot of extra data into memory that your client/code may not need.
You can avoid the lazy load hits by turning off lazy loading and proxies, but this will leave these entities with null references, which means any code expecting an Event entity with related details may get one of these partially loaded entities. (Not good: an entity should always be in a complete or complete-able state.)
The final option is to use Select to populate a view model for your results. This can speed up queries considerably because you can have EF compose a query that pulls back just the data you need from whatever entities, rather than pulling everything or triggering lazy loads.
For example if you just need an EventId, EventNumber, Name, StartDate, and a UserName to display for the events:
var events = _context.Event
.Where(e => e.IsPresentation && !e.IsCanceled && e.StartDate.HasValue)
.Select(e => new EventViewModel
{
EventId = e.EventId,
EventNumber = e.EventNumber,
Name = e.Name,
StartDate = e.StartDate,
UserName = e.User.Name
})
.OrderByDescending(e => e.StartDate)
.ToList();
This avoids any lazy load hits and reduces the query run to just the columns needed which can speed up queries significantly.
Next would be to look at the queries EF is running and their relevant execution plan. This can highlight missing/poor indexes, and any unexpected lazy load hits. The method for doing this would depend on your database, but involves running a Profiler against the DB to capture the SQL statements being run while you debug. From here you can capture the SQL statements that EF generates then run those manually against your database with any DB-side analysis tools. (such as SSMS with SQL Server to get an execution plan which can identify missing indexes) Serializer lazy load hits in a web application are detectable as a lot of extra SQL statements executing after your method appears to have completed, but before the data gets back to the client. This is the serializer "touching" proxies resulting in lots of additional queries that the server has to wait to complete before the data is returned to the client.
Lastly would be data volume. Any system that is expected to grow over time should consider the amount of data that can eventually be returned. Anything that returns lists of data over time should incorporate server-side pagination, where the client sends a page size and page number that the server translates into a .Skip(pageNumber * pageSize).Take(pageSize) operation (with the page number starting at 0). Most data grids and similar components should support server-side pagination and send these arguments to their data load methods. These controls will need to know the total row count to set up pagination, so you would need a method to return that count:
var rowCount = _context.Event
.Where(e => e.IsPresentation && !e.IsCanceled && e.StartDate.HasValue)
.Count();
Same conditions, no order by, and a .Count() with no ToList() etc. will compose an efficient Count query.
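The paging arithmetic described above can be sketched as follows. This is only an illustration, not the answer's own code: the Paging class and GetPage method are hypothetical names, and an in-memory IQueryable stands in for the EF query so that the example runs on its own. The method takes an already ordered query, because Skip/Take only yields stable pages when the ordering is explicit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Paging
{
    // pageNumber is zero-based, matching .Skip(pageNumber * pageSize).Take(pageSize).
    public static (List<T> Items, int TotalCount) GetPage<T>(
        IOrderedQueryable<T> ordered, int pageNumber, int pageSize)
    {
        // Total row count, so the client can work out how many pages exist.
        var totalCount = ordered.Count();

        // Only the requested page is materialized.
        var items = ordered
            .Skip(pageNumber * pageSize)
            .Take(pageSize)
            .ToList();

        return (items, totalCount);
    }

    public static void Main()
    {
        // Stand-in data: 25 "rows" ordered descending, like OrderByDescending(e => e.StartDate).
        var ordered = Enumerable.Range(1, 25).AsQueryable().OrderByDescending(n => n);

        var page = GetPage(ordered, pageNumber: 2, pageSize: 10);
        var joined = string.Join(", ", page.Items);
        Console.WriteLine($"{page.Items.Count} of {page.TotalCount}: {joined}");
    }
}
```

With 25 rows and a page size of 10, page 2 is the last, partial page and holds the five smallest values.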
This should give you a few things to check and tweak to eliminate your performance pitfalls.
Store the value in a variable, like var now = new DateTime()
Combine all of the conditions into one Where clause
Call OrderByDescending after the Where clause, so that only the filtered rows are ordered instead of all of them.
var now = new DateTime();
var events = _context.Event
.Where(e => e.IsPresentation && !e.IsCanceled && e.StartDate > now)
.OrderByDescending(e => e.StartDate);
Tips
You should re-arrange your conditions based on the actual data. For example:
.Where(e => 1 == 2 && 2 == 2 && 3 == 3)
As you can see, because the conditions are joined with && (and), there is no need to evaluate the remaining conditions && 2 == 2 && 3 == 3 once 1 == 2 is false.
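A minimal sketch of that short-circuit behaviour in plain C# follows. The ShortCircuit class and its Check helper are hypothetical and exist only to count evaluations. Note the assumption: this shows how && behaves when the predicate runs in memory. When the predicate is part of an EF query it is translated to SQL, and the database optimizer chooses the evaluation order.

```csharp
using System;

public static class ShortCircuit
{
    // Returns how many of the three conditions were actually evaluated.
    public static int CountEvaluations(bool firstCondition)
    {
        var calls = 0;

        // Local helper that records each evaluation before returning the value.
        bool Check(bool value)
        {
            calls++;
            return value;
        }

        // && stops evaluating as soon as one operand is false.
        _ = Check(firstCondition) && Check(true) && Check(true);

        return calls;
    }

    public static void Main()
    {
        Console.WriteLine($"First condition false: {CountEvaluations(false)} evaluation(s)");
        Console.WriteLine($"First condition true: {CountEvaluations(true)} evaluation(s)");
    }
}
```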
One thing is that sorting can go at the end, because there would be fewer items and therefore less time spent sorting.
But it really depends on your data distribution. If most of your data has e.IsPresentation == true, then the first Where does not reduce the data size for you, so you are again checking e.IsCanceled == false on, e.g., 95% of your data. But assume only 10% of your whole data has e.IsCanceled == false. If you apply e.IsPresentation == true second, on that 10%, it takes much less time than before. This is why, on big databases, the database engine often uses different query plans: the final result is the same, but the processing time is not. Hope it helps.
I have two almost identical queries. The only difference is the Where clause.
Following a solution rebuild, the first response time for both queries is 20 seconds. For all following requests:
.Where(x => x.EnquiryId == id); returns in < 1 second
.Where(x => ids.Contains(x.EnquiryId)); always takes 20 seconds, even with a single id in the collection
What am I doing wrong? How can I select on multiple ids without such an immense performance hit?
Bizarrely the following where clause also takes 20 seconds: .Where(x => x.EnquiryId == ids.FirstOrDefault());
edit: AzureSQL (live) and SQLExpress2017 (dev) on the backend. Query is slow on both live and my dev machine.
edit: In SQL Server Profiler I'm not sure what to look at, but for the two queries each has an RPC:Completed EventClass. One of these (I'm guessing the fast one) is 22 lines long. The other is nearly 7000 lines long. So I guess my next question is how can I advise EF to not create such awful SQL?
update: for anyone else who has this issue, I found I can bypass the bad SQL generation through the use of a union rather than a contains, ie instead of .Where(x => ids.Contains(x.EnquiryId)); use a loop foreach (var id in ids) { q = q.Union(query.Where(x => x.EnquiryId == id)); }
You could try to use Any (I did not profile this so I don't know if this is faster).
.Where(x => ids.Any(id => id == x.EnquiryId))
I wonder why is Entity framework generating such an inefficient SQL query. In my code I expected the WHERE to act upon the INCLUDE:
db.Employment.Where(x => x.Active).Include(x => x.Employee).Where(x => x.Employee.UserID == UserID)
but I ended up with a double SQL JOIN:
SELECT [x].[ID], [x].[Active], [x].[CurrencyID], [x].[DepartmentID], [x].[EmplEnd], [x].[EmplStart], [x].[EmployeeID], [x].[HolidayGroupID], [x].[HourlyCost], [x].[JobTitle], [x].[ManagerID], [x].[WorkScheduleGroupID], [e].[ID], [e].[Active], [e].[Address], [e].[BirthDate], [e].[CitizenshipID], [e].[City], [e].[CountryID], [e].[Email], [e].[FirstName], [e].[Gender], [e].[LastName], [e].[Note], [e].[Phone], [e].[PostalCode], [e].[TaxNumber], [e].[UserID]
FROM [Employment] AS [x]
INNER JOIN [Employee] AS [x.Employee] ON [x].[EmployeeID] = [x.Employee].[ID]
INNER JOIN [Employee] AS [e] ON [x].[EmployeeID] = [e].[ID]
WHERE ([x].[Active] = 1) AND ([x.Employee].[UserID] = @__UserID_0)
I found out that this query will create better SQL:
db.Employment.Where(x => x.Active).Where(x => x.Employee.UserID == UserID)
SELECT [x].[ID], [x].[Active], [x].[CurrencyID], [x].[DepartmentID], [x].[EmplEnd], [x].[EmplStart], [x].[EmployeeID], [x].[HolidayGroupID], [x].[HourlyCost], [x].[JobTitle], [x].[ManagerID], [x].[WorkScheduleGroupID]
FROM [Employment] AS [x]
INNER JOIN [Employee] AS [x.Employee] ON [x].[EmployeeID] = [x.Employee].[ID]
WHERE ([x].[Active] = 1) AND ([x.Employee].[UserID] = @__UserID_0)
However, the problem here is that the referenced entities are not retrieved from the DB.
Why don't the two queries produce the same SQL?
The SQL is different because the statements are different.
Entity Framework does produce inefficient T-SQL; it always has. By abstracting away the subtleties that well-performing SQL requires and replacing them with "belt and braces" alternatives that nearly always work, you sacrifice performance for utility.
If you need good performance, write the SQL yourself. Dapper works well for me. You can't realistically expect a "one size fits all" solution to come up with the best code for your specific situation. You can do this across the board or just where you need to.
Unless you have high volume or specific performance requirements, get on with it and use whatever you find easiest. If you need to tune your queries to your database, you are going to have to learn the details of your database engine and implement the queries yourself. If you are expecting the next iteration of Entity Framework to be the magic bullet that gives you fast, efficient SQL data access with minimal knowledge, good luck.
P.S.
Off-topic, but NoSQL probably isn't the answer either; it is just a different class of database.
I am using Entity Framework and I need to check if a product with name = "xyz" exists ...
I think I can use Any(), Exists() or First().
Which one is the best option for this kind of situation? Which one has the best performance?
Thank You,
Miguel
Okay, I wasn't going to weigh in on this, but Diego's answer complicates things enough that I think some additional explanation is in order.
In most cases, .Any() will be faster. Here are some examples.
Workflows.Where(w => w.Activities.Any())
Workflows.Where(w => w.Activities.Any(a => a.Title == "xyz"))
In the above two examples, Entity Framework produces an optimal query. The .Any() call is part of a predicate, and Entity Framework handles this well. However, if we make the result of .Any() part of the result set like this:
Workflows.Select(w => w.Activities.Any(a => a.Title == "xyz"))
... suddenly Entity Framework decides to create two versions of the condition, so the query does as much as twice the work it really needed to. However, the following query isn't any better:
Workflows.Select(w => w.Activities.Count(a => a.Title == "xyz") > 0)
Given the above query, Entity Framework will still create two versions of the condition, plus it will also require SQL Server to do an actual count, which means it doesn't get to short-circuit as soon as it finds an item.
But if you're just comparing these two queries:
Activities.Any(a => a.Title == "xyz")
Activities.Count(a => a.Title == "xyz") > 0
... which will be faster? It depends.
The first query produces an inefficient double-condition query, which means it will take up to twice as long as it has to.
The second query forces the database to check every item in the table without short-circuiting, which means it could take up to N times longer than it has to, depending on how many items need to be evaluated before finding a match. Let's assume the table has 10,000 items:
If no item in the table matches the condition, this query will take roughly half the time as the first query.
If the first item in the table matches the condition, this query will take roughly 5,000 times longer than the first query.
If one item in the table is a match, this query will take an average of 2,500 times longer than the first query.
If the query is able to leverage an index on the Title and key columns, this query will take roughly half the time as the first query.
So in summary, IF you are:
Using Entity Framework 6.1 or earlier (since 6.1.1 has a fix to improve the query), AND
Querying directly against the table (as opposed to doing a sub-query), AND
Using the result directly (as opposed to it being part of a predicate), AND
Either:
You have good indexes set up on the table you are querying, OR
You expect the item not to be found the majority of the time
THEN you can expect .Any() to take as much as twice as long as .Count(). For example, a query might take 100 milliseconds instead of 50. Or 10 instead of 5.
IN ANY OTHER CIRCUMSTANCE .Any() should be at least as fast, and possibly orders of magnitude faster than .Count().
Regardless, until you have determined that this is actually the source of poor performance in your product, you should care more about what's easy to understand. .Any() more clearly and concisely states what you are really trying to figure out, so stick with that.
Any translates into "Exists" at the database level. First translates into Select Top 1 ... Between these, Exists will outperform First because the actual object doesn't need to be fetched, only a Boolean result value.
At least you didn't ask about .Where(x => x.Count() > 0) which requires the entire match set to be evaluated and iterated over before you can determine that you have one record. Any short-circuits the request and can be significantly faster.
One would think Any() gives better results, because it translates to an EXISTS query... but EF is awfully broken, generating this (edited):
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS [C1]
FROM [MyTable] AS [Extent1]
WHERE Condition
)) THEN cast(1 as bit) WHEN ( NOT EXISTS (SELECT
1 AS [C1]
FROM [MyTable] AS [Extent2]
WHERE Condition
)) THEN cast(0 as bit) END AS [C1]
FROM ( SELECT 1 AS X ) AS [SingleRowTable1]
Instead of:
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS [C1]
FROM [MyTable] AS [Extent1]
WHERE Condition
)) THEN cast(1 as bit)
ELSE cast(0 as bit) END AS [C1]
FROM ( SELECT 1 AS X ) AS [SingleRowTable1]
...basically doubling the query cost (for simple queries; it's even worse for complex ones)
I've found using .Count(condition) > 0 is faster pretty much always (the cost is exactly the same as a properly-written EXISTS query)
Ok, I decided to try this out myself. Mind you, I'm using the OracleManagedDataAccess provider with the OracleEntityFramework, but I'm guessing it produces compliant SQL.
I found that First() was faster than Any() for a simple predicate. I'll show the two queries in EF and the SQL that was generated. Mind you, this is a simplified example, but the question was asking whether any, exists or first was faster for a simple predicate.
var any = db.Employees.Any(x => x.LAST_NAME.Equals("Davenski"));
So what does this resolve to in the database?
SELECT
CASE WHEN ( EXISTS (SELECT
1 AS "C1"
FROM "MYSCHEMA"."EMPLOYEES" "Extent1"
WHERE ('Davenski' = "Extent1"."LAST_NAME")
)) THEN 1 ELSE 0 END AS "C1"
FROM ( SELECT 1 FROM DUAL ) "SingleRowTable1"
It's creating a CASE statement. As we know, ANY is merely syntactic sugar. It resolves to an EXISTS query at the database level. This happens if you use ANY at the database level as well. But this doesn't seem to be the most optimized SQL for this query.
In the above example, the EF construct Any() isn't needed here and merely complicates the query.
var first = db.Employees.Where(x => x.LAST_NAME.Equals("Davenski")).Select(x=>x.ID).First();
In the database, this resolves to:
SELECT
"Extent1"."ID" AS "ID"
FROM "MYSCHEMA"."EMPLOYEES" "Extent1"
WHERE ('Davenski' = "Extent1"."LAST_NAME") AND (ROWNUM <= (1) )
Now this looks like a more optimized query than the initial query. Why? It's not using a CASE ... THEN statement.
I ran these trivial examples several times, and in ALMOST every case, (no pun intended), the First() was faster.
In addition, I ran a raw SQL query, thinking this would be faster:
var sql = db.Database.SqlQuery<int>("SELECT ID FROM MYSCHEMA.EMPLOYEES WHERE LAST_NAME = 'Davenski' AND ROWNUM <= (1)").First();
The performance was actually the slowest, but similar to the Any EF construct.
Reflections:
EF Any doesn't exactly map to how you might use Any in the database. I could have written a more optimized query in Oracle with ANY than what EF produced without the CASE THEN statement.
ALWAYS check your generated SQL in a log file or in the debug output window.
If you're going to use ANY, remember it's syntactic sugar for EXISTS. Oracle also uses SOME, which is the same as ANY. You're generally going to use it in the predicate as a replacement for IN. In this case it generates a series of ORs in your WHERE clause. The real power of ANY or EXISTS is when you're using Subqueries and are merely testing for the EXISTENCE of related data.
Here's an example where ANY really makes sense. I'm testing for the EXISTENCE of related data. I don't want to get all of the records from the related table. Here I want to know if there are Surveys that have Comments.
var b = db.Survey.Where(x => x.Comments.Any()).ToList();
This is the generated SQL:
SELECT
"Extent1"."SURVEY_ID" AS "SURVEY_ID",
"Extent1"."SURVEY_DATE" AS "SURVEY_DATE"
FROM "MYSCHEMA"."SURVEY" "Extent1"
WHERE ( EXISTS (SELECT
1 AS "C1"
FROM "MYSCHEMA"."COMMENTS" "Extent2"
WHERE ("Extent1"."SURVEY_ID" = "Extent2"."SURVEY_ID")
))
This is optimized SQL!
I believe the EF does a wonderful job generating SQL. But you have to understand how the EF constructs map to DB constructs else you can create some nasty queries.
And probably the best way to get a count of related data is to do an explicit Load with a Collection Query count. This is far better than the examples provided in prior posts. In this case you're not loading related entities, you're just obtaining a count. Here I'm just trying to find out how many Comments I have for a particular Survey.
var d = db.Survey.Find(1);
var e = db.Entry(d).Collection(f => f.Comments)
.Query()
.Count();
Any() and First() are used with IEnumerable, which gives you the flexibility of evaluating things lazily. Exists(), however, requires a List.
I hope this clears things up for you and helps you decide which one to use.
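As a small illustration of that distinction, here is a sketch in plain C# with no Entity Framework involved. The class and method names are hypothetical. List<T>.Exists is an instance method on List<T>, while Any() and FirstOrDefault() are LINQ extension methods that work on any IEnumerable<T>.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExistsVersusAny
{
    // Exists is defined on List<T>, so the argument has to be a materialized list.
    public static bool ViaExists(List<string> names, string wanted) =>
        names.Exists(n => n == wanted);

    // Any works on any IEnumerable<string> and stops at the first match.
    public static bool ViaAny(IEnumerable<string> names, string wanted) =>
        names.Any(n => n == wanted);

    // FirstOrDefault fetches the matching element itself; null means "not found".
    public static bool ViaFirstOrDefault(IEnumerable<string> names, string wanted) =>
        names.FirstOrDefault(n => n == wanted) != null;

    public static void Main()
    {
        var products = new List<string> { "abc", "xyz" };

        Console.WriteLine(ViaExists(products, "xyz"));
        Console.WriteLine(ViaAny(products.Where(p => p.Length == 3), "xyz"));
        Console.WriteLine(ViaFirstOrDefault(products, "missing"));
    }
}
```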
I have ADO.NET EF expression like:
db.Table1.Select(
x => new { ..., count = db.Table2.Count(y => y.ForeignKey.ID == x.ID) })
Do I understand correctly that it's translated into several SQL client-server requests and could be refactored for better performance?
Thank you in advance!
Yes - the expression will get translated (in the best way it can) to a SQL query.
And just like any T-SQL query, an EF (or L2SQL) query expression can be refactored for performance.
Why not run SQL profiler in the background to see what it is getting executed, and try and optimize the raw T-SQL first - which will help optimize the expression.
Or if you have LinqPad, just optimize the T-SQL query and get LinqPad to write your query for you.
Also, I'm not really sure why you have specified the delegate for the Count() expression.
You can simply do this:
var query= from c in db.Table1
select new { c.CustomerID, OrderCount = c.Table2s.Count() };
The answer is NO - this query will be translated into one client-to-RDBMS request.
RPM1984 advised using LinqPad. LinqPad showed that the query is translated into a very straightforward SQL expression. The approach with grouping is translated into a different SQL expression, but it is still executed in one request.
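For reference, the shape of a grouping approach can be sketched with LINQ to Objects. This is an assumption about what "grouping" means here, not the code from the thread: the RelatedCounts class and CountPerParent method are hypothetical, and plain integer ids stand in for Table1 rows and Table2 foreign keys. It shows only the query shape, not the SQL that EF would generate.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RelatedCounts
{
    // For every parent id, count the child rows whose foreign key points at it.
    public static List<(int Id, int Count)> CountPerParent(
        IEnumerable<int> parentIds, IEnumerable<int> childForeignKeys) =>
        parentIds
            .GroupJoin(
                childForeignKeys,
                id => id,   // key on the parent side
                fk => fk,   // key on the child side
                (id, children) => (Id: id, Count: children.Count()))
            .ToList();

    public static void Main()
    {
        var counts = CountPerParent(new[] { 1, 2, 3 }, new[] { 1, 1, 3 });
        foreach (var row in counts)
        {
            Console.WriteLine($"Parent {row.Id}: {row.Count} child row(s)");
        }
    }
}
```

GroupJoin keeps parents that have no children (with a count of 0), which matches what the Count-per-row projection in the question returns.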