Let's say I have an Entity Framework query:
var query = db.Entities
.FancyQueryStuff()
.Where(GetFilter()) // *
.OrderBy(GetSort()) // *
.Take(GetNumberOfRows()) // *
;
and find that this query is very slow. Testing reveals that the following rewrite is much faster:
var ids = db.Entities
.FancyQueryStuff()
.Where(GetFilter()) // *
.OrderBy(GetSort()) // *
.Take(GetNumberOfRows()) // *
.Select(x => x.Id)
.ToArray()
;
var query = db.Entities
.FancyQueryStuff()
.OrderBy(GetSort()) // *
.Where(x => ids.Contains(x.Id));
Whether that is quicker depends on a lot of things, including the sql database used, but I have a scenario in which this is the case with SQL Server and a particular query doing heavy joining.
Now the problem I have is that I want to use libraries that take IQueryables and apply Where, OrderBy, Take and Skip internally according to UI information they get from somewhere else (DevExpress/Telerik grids with paging, where the user clicks on captions to sort, etc.).
That means I have to write the query in a form where all the rows marked with an asterisk can be applied by a third-party framework.
With DevExtreme, for example, you have a method that takes the query plus a data structure representing the filter/sorting/paging in a custom format, and returns the query results you are supposed to pass to a client in an HTML application:
var result = DataSourceLoader.Load(query, loadOptions);
DataSourceLoader.Load applies everything of the kind I marked with an asterisk to the end of the query, executes it and returns the result.
I guess it's possible to do what I want with some heavy guns of linq magic (dynamic linq?), but before I try myself I thought maybe someone already has a snippet ready for this probably not too uncommon use case.
Related
I've been doing some research on how to set up a new GraphQL API project, but am running into some basic conceptual? problems in trying to find out how to do pagination and nested database queries efficiently.
I'd appreciate any pointers or advice!
Let's say we get a graphql query like so:
articles(limit: 10) {
title
content
comments(limit: 5) {
postedAt
text
}
}
A typical ORM, assuming eager loading of the nested type, could translate this type of query into an sql query like this, and then loop over the results to manually group the comments together and hydrate it all.
select a.title, a.content, c.posted_at, c.text
from articles as a
left join comments as c on c.article_id = a.id
limit ???
But so far, I've only ever seen ORMs like Doctrine (php) and Sequelize (js) fail in doing pagination correctly in these cases. They can't correctly handle page sizes, because there's no way to express the limit in this sql query's setup.
=> Am I correct in seeing this problem? Or am I missing something crucial, are ORMs able to do pagination with eagerly loaded data somehow?
So now I just recently came across the lateral join type in Postgres, which seems to solve this issue, provided we also add some json trickery:
select a.title, a.content, t.data as comments
from articles as a
join lateral (
select json_agg(sub.*) as data
from (
select c.posted_at, c.text
from comments as c
where c.article_id = a.id
limit 5
) sub
) t on true
limit 20;
(I think I've seen this kind of lateral + json trickery in how Hasura and Postgraphile transform to sql, so I don't think it's unwarranted / bad engineering.)
=> Is there any ORM out there (except hasura/postgraphile), possibly Postgres-specific, that use this kind of lateral and json stuff, instead of the typical method described above?
Lastly, my research has taught me that in building a graphql api, you'll typically find yourself data-loading (batching) nested queries, instead of eager-loading them from the "parent" query. So, for example, this would be without data-loading:
class ArticleResolver {
  comments(article) {
    // one query per article (N+1 pattern)
    return db.query("select ... from comments where ... = {article.id}");
  }
}
and then this would be with data-loading:
class ArticleResolver {
  // one batched query for all article ids collected by the loader
  commentsDataLoader = new DataLoader(articleIds => {
    return db.query("select ... from comments where ... in {articleIds}");
  });
  comments(article) {
    return this.commentsDataLoader.load(article.id);
  }
}
But, as soon as you want to start adding parameters like limit: 5 to nested queries, this data-loading query gets as complicated as the original question, so we're back where we were :)
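One simple (if wasteful) way to combine batching with a per-parent limit is to run the batched query without a limit and cap each group in application code. The following is only a sketch in Java under that assumption; CommentRow, groupWithLimit and the sample data are hypothetical names, not part of any library mentioned here, and the rows are assumed to arrive already sorted. It over-fetches, which is exactly what the lateral join avoids.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BatchedLimit {
    // Hypothetical row type: one comment row from a batched
    // "where article_id in (...)" query, already in the desired order.
    static class CommentRow {
        final long articleId;
        final String text;

        CommentRow(long articleId, String text) {
            this.articleId = articleId;
            this.text = text;
        }
    }

    // Groups rows by article id and keeps at most `limit` rows per article.
    // Articles without comments still get an (empty) list.
    static Map<Long, List<CommentRow>> groupWithLimit(
            List<Long> articleIds, List<CommentRow> rows, int limit) {
        Map<Long, List<CommentRow>> byArticle = new LinkedHashMap<>();
        for (Long id : articleIds) {
            byArticle.put(id, new ArrayList<>());
        }
        for (CommentRow row : rows) {
            List<CommentRow> bucket = byArticle.get(row.articleId);
            if (bucket != null && bucket.size() < limit) {
                bucket.add(row);
            }
        }
        return byArticle;
    }

    public static void main(String[] args) {
        List<Long> ids = List.of(1L, 2L, 3L);
        List<CommentRow> rows = List.of(
                new CommentRow(1L, "a"), new CommentRow(1L, "b"),
                new CommentRow(1L, "c"), new CommentRow(2L, "d"));
        // prints "1 -> 2", "2 -> 1", "3 -> 0"
        groupWithLimit(ids, rows, 2)
                .forEach((id, list) -> System.out.println(id + " -> " + list.size()));
    }
}
```

This keeps the resolver simple, but the database still returns every comment of every requested article, so it only makes sense while the groups stay small.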
=> Is there a conventional way, or some standard practice, for dealing with this setup? Is there any known way / library to easily write out resolvers like, for example, this:
class ArticleResolver
...
comments(article, limit) {
return db.somehowMagicallyDataloaded("select * from comments ... = {article.id} limit {limit}")
}
I'm running this LINQ query, which is a little big.
var events = _context.Event.OrderByDescending(e => e.StartDate).Where(e => e.IsPresentation == true).Where(e => e.IsCanceled == false).Where(e => e.StartDate > new DateTime());
And the page outputting the data from this query is taking too much time to load. Maybe because I'm using too many wheres.
I had the same issue using includes, and then includes, in a different query, but I divided the query to improve the performance. I'm trying to figure out how to do the same thing in this situation, because here I'm not using any include.
Overall, the performance of the query will largely depend on the table size, and availability of suitable indices.
A couple things I can note from that query:
This statement doesn't make a lot of sense: .Where(e => e.StartDate > new DateTime()). new DateTime() initializes a DateTime to 01/01/0001. Any date stored in a SQL Server DateTime column, for example, will be on or after 01/01/1753, so this condition is moot. (If the intent was to compare against the current time, that would be DateTime.Now.) If the database/entity DateTime value is nullable, then .Where(e => e.StartDate.HasValue) would be more applicable. If the DateTime value is not nullable then this condition can be left off entirely.
So if the field is null-able, the Linq expression would look more like:
var events = _context.Event
.Where(e => e.IsPresentation && !e.IsCanceled && e.StartDate.HasValue)
.OrderByDescending(e => e.StartDate)
.ToList();
If it's not null-able:
var events = _context.Event
.Where(e => e.IsPresentation && !e.IsCanceled)
.OrderByDescending(e => e.StartDate)
.ToList();
The next culprit to eliminate: lazy load proxy hits. Does the Event entity have navigation properties to any other entities? If this is something like a web application and you are serializing entities to send back to the client, navigation property EF proxies can absolutely kill performance. Any code after this call that touches a navigation property will result in extra individual DB calls to lazy load these navigation properties. For methods that return lists of entities this can be critical. If an Event has a reference to something like a User and you load 1000 events referencing roughly 1000 users, when a serializer goes to serialize those 1000 events, it will "touch" each of the user references. This leads to ~1000 extra SQL statements effectively doing SELECT * FROM tblUser WHERE UserId = 1, SELECT * FROM tblUser WHERE UserId = 2... etc. for each User ID in each Event.
If you need these related entities you can eager load them with Include(e => e.User), which will be faster than loading them individually, but this does mean loading a lot of extra data into memory that your client/code may not need. You can avoid the lazy load hits by turning off lazy loading and proxies, but this will leave these entities with null references, which means any code expecting an Event entity with related details may get one of these partially loaded entities. (Not good: the entity should always be in a complete or complete-able state.)
The final option is to use Select to populate a view model for your results. This can speed up queries considerably because you can have EF compose a query that pulls back just the data you need from whatever entities, rather than pulling everything or triggering lazy loads.
For example if you just need an EventId, EventNumber, Name, StartDate, and a UserName to display for the events:
var events = _context.Event
.Where(e => e.IsPresentation && !e.IsCanceled && e.StartDate.HasValue)
.Select(e => new EventViewModel
{
EventId = e.EventId,
EventNumber = e.EventNumber,
Name = e.Name,
StartDate = e.StartDate,
UserName = e.User.Name
})
.OrderByDescending(e => e.StartDate)
.ToList();
This avoids any lazy load hits and reduces the query run to just the columns needed which can speed up queries significantly.
Next would be to look at the queries EF is running and their relevant execution plan. This can highlight missing/poor indexes, and any unexpected lazy load hits. The method for doing this would depend on your database, but involves running a Profiler against the DB to capture the SQL statements being run while you debug. From here you can capture the SQL statements that EF generates then run those manually against your database with any DB-side analysis tools. (such as SSMS with SQL Server to get an execution plan which can identify missing indexes) Serializer lazy load hits in a web application are detectable as a lot of extra SQL statements executing after your method appears to have completed, but before the data gets back to the client. This is the serializer "touching" proxies resulting in lots of additional queries that the server has to wait to complete before the data is returned to the client.
Lastly would be data volume. Any system that is expected to grow over time should consider the amount of data that can eventually be returned. Anything that returns lists of data over time should incorporate server-side pagination where the client sends a page size and page # where the server can translate this into a .Skip(pageNumber * pageSize).Take(pageSize) operation. (/w page # starting at 0) Most data grid and like components should support server-side pagination to send these arguments to their data load methods. These controls will need to know the total row count to set up pagination so you would need a method to return that count:
var rowCount = _context.Event
.Where(e => e.IsPresentation && !e.IsCanceled && e.StartDate.HasValue)
.Count();
Same conditions, no order by, and a .Count() with no ToList() etc. will compose an efficient Count query.
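The paging arithmetic described above (zero-based page number, skip pageNumber * pageSize rows, take pageSize rows, plus a total count so the grid can work out the number of pages) can be sketched in memory. This is only an illustration in Java, with streams standing in for the LINQ operators; Paging, page and pageCount are hypothetical names.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class Paging {
    // Zero-based page number, mirroring Skip(pageNumber * pageSize).Take(pageSize).
    // The input is assumed to be sorted already; paging unsorted data is not stable.
    static <T> List<T> page(List<T> sortedRows, int pageNumber, int pageSize) {
        return sortedRows.stream()
                .skip((long) pageNumber * pageSize)
                .limit(pageSize)
                .collect(Collectors.toList());
    }

    // Number of pages a grid has to offer for a given total row count.
    static int pageCount(int rowCount, int pageSize) {
        return (rowCount + pageSize - 1) / pageSize; // ceiling division
    }

    public static void main(String[] args) {
        List<Integer> rows = IntStream.rangeClosed(1, 25).boxed().collect(Collectors.toList());
        System.out.println(page(rows, 2, 10));          // [21, 22, 23, 24, 25]
        System.out.println(pageCount(rows.size(), 10)); // 3
    }
}
```

The last page is simply shorter than pageSize, and a page number past the end yields an empty list rather than an error.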
This should give you a few things to check and tweak to eliminate your performance pitfalls.
You should store the value in a variable, like var now = DateTime.Now (note that new DateTime() is 01/01/0001, not the current time).
Combine all of the conditions into one Where clause.
Call OrderByDescending after the Where clause, so that only the filtered rows are ordered instead of all of them.
var now = DateTime.Now; // assuming the current time was intended; new DateTime() is 01/01/0001
var events = _context.Event
.Where(e => e.IsPresentation && !e.IsCanceled && e.StartDate > now)
.OrderByDescending(e => e.StartDate);
Tips
You should arrange your conditions based on the actual data. For example:
.Where(e => 1 == 2 && 2 == 2 && 3 == 3)
As you can see, there is no need to evaluate the remaining conditions (&& 2 == 2 && 3 == 3), because the first operand of the && chain is already false.
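A small in-memory sketch of that short-circuit behaviour, in Java (whose && behaves like C#'s here); ShortCircuit, expensiveCheck and matches are hypothetical names. Keep in mind that this describes how the language evaluates the expression in memory. When the predicate is translated to SQL, the database's optimizer chooses its own evaluation order, so reordering may make no difference there.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ShortCircuit {
    // Counts how often the expensive condition is actually evaluated.
    static final AtomicInteger expensiveCalls = new AtomicInteger();

    static boolean expensiveCheck(int value) {
        expensiveCalls.incrementAndGet();
        return value % 2 == 0;
    }

    // The cheap condition comes first; if it is false, && never
    // evaluates the expensive one.
    static boolean matches(boolean cheapFlag, int value) {
        return cheapFlag && expensiveCheck(value);
    }

    public static void main(String[] args) {
        matches(false, 4);
        System.out.println(expensiveCalls.get()); // 0: expensiveCheck was skipped
        matches(true, 4);
        System.out.println(expensiveCalls.get()); // 1: expensiveCheck ran once
    }
}
```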
One thing you can do is sort at the end, because there will be fewer items and therefore less time spent sorting.
But it really depends on your data distribution. If most of your data has e.IsPresentation == true, then the first Where does not reduce the data size, so you are again checking e.IsCanceled == false on, say, 95% of your data. Now assume only 10% of your whole data has e.IsCanceled == false. If you apply that condition first and then check e.IsPresentation == true on the remaining 10%, it takes much less time than before. This is why, on big databases, the database engine may use different query plans: the final result is the same, but the processing time is not. Hope it helps.
I have never yet used .AsEnumerable() in an EntityFramework query.
See the below example and tell me why they use .AsEnumerable() before Select ?
Could they not just use Select directly?
Please tell me the reason for the usage of .AsEnumerable() here in below query.
Why did they use .ToArray() instead of .ToList() ?
private IEnumerable<AutoCompleteData> GetAutoCompleteData(string searchTerm)
{
using (var context = new AdventureWorksEntities())
{
var results = context.Products
.Include("ProductSubcategory")
.Where(p => p.Name.Contains(searchTerm)
&& p.DiscontinuedDate == null)
.AsEnumerable()
.Select(p => new AutoCompleteData
{
Id = p.ProductID,
Text = BuildAutoCompleteText(p)
})
.ToArray();
return results;
}
}
The difference between AsEnumerable and AsQueryable is that the enumerable contains all information to create an enumerator. Once you've got the enumerator you can ask for the first element, and if there is one, you can get the next one.
The Queryable does not hold the information to create the enumerator. It holds an Expression and a Provider. The Provider knows which process must execute the Expression and which language this process uses. Quite often the other process is a database management system, and the language is SQL.
The result of a Queryable.Select(...) is still an IQueryable, meaning that the query is not performed yet. The Select function only changed the Expression.
Only if you ask for the Enumerator, either explicitly by calling GetEnumerator(), or implicitly by calling foreach, or one of the non-deferred execution functions like ToList(), ToDictionary(), FirstOrDefault(), Sum(), the Provider will translate the expression into the format that the execution process understands and execute the query. Once the data is transported to the local process the enumerator is created.
Alas, sometimes you want to call your own functions in your query. SQL does not know these functions, and thus the Provider can't translate such Expressions into SQL. In fact, the provider of DbContext does not even know all LINQ functions. See supported and unsupported LINQ methods.
That is the moment when you use AsEnumerable(). If you ask for the Enumerator (in your foreach for example), the Provider will translate the Expression up to AsEnumerable, send it to the execution process and transport the data to the local process. From that point on the sequence is a plain IEnumerable: the rest of the LINQ will be performed in local memory, and thus your local functions can be called.
You could of course use ToList() to fetch all data to local memory and continue your linq after that. But that would be a waste if you'd only want the first element, or every other one.
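To make the difference concrete, here is a small analogy in Java, whose streams are also lazily evaluated; it is not EF code, and DeferredExecution and its methods are hypothetical names. It counts how many source elements a projection touches: none while the pipeline is only being built, one when only the first element is requested, and all of them when everything is collected into a list. It only mirrors the in-memory part of the story; what a database provider actually fetches is up to that provider.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class DeferredExecution {
    // Building the pipeline alone executes nothing.
    static int touchedBeforeTerminal() {
        AtomicInteger touched = new AtomicInteger();
        Stream<Integer> pipeline = Stream.of(1, 2, 3, 4, 5)
                .map(x -> { touched.incrementAndGet(); return x * 10; });
        return touched.get(); // the pipeline was never consumed
    }

    // Asking for the first element pulls just one element through the projection.
    static int touchedByFindFirst() {
        AtomicInteger touched = new AtomicInteger();
        Optional<Integer> first = Stream.of(1, 2, 3, 4, 5)
                .map(x -> { touched.incrementAndGet(); return x * 10; })
                .findFirst();
        return touched.get();
    }

    // Materializing into a list pulls every element through the projection.
    static int touchedByToList() {
        AtomicInteger touched = new AtomicInteger();
        List<Integer> all = Stream.of(1, 2, 3, 4, 5)
                .map(x -> { touched.incrementAndGet(); return x * 10; })
                .collect(Collectors.toList());
        return touched.get();
    }

    public static void main(String[] args) {
        System.out.println(touchedBeforeTerminal()); // 0
        System.out.println(touchedByFindFirst());    // 1
        System.out.println(touchedByToList());       // 5
    }
}
```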
This brings me to the final remark: the transport of the data from the DBMS to your local memory is one of the slower parts. Try to limit this transport to only the data you'll actually use.
For example: if you have a one-to-many relation between a Teacher and his Students, don't fetch the Teacher and his Students, because you'll transport Student.TeacherId many times, and they will all have the same value as Teacher.Id. Instead, only select the data you really want to use
Not all Select projections, Where predicates, and aggregations can be translated from C# Expressions into native database queries. In your case, the full LINQ expression attempts to construct an AutoCompleteData object using a custom function, BuildAutoCompleteText, to set one of its properties. This cannot be trivially converted into native database code like SQL.
In your case, AsEnumerable marks the point where the work done in SQL ends: everything before it is executed in SQL.
i.e.
.Include("ProductSubcategory")
.Where(p => p.Name.Contains(searchTerm)
&& p.DiscontinuedDate == null)
will be executed in SQL, roughly as a JOIN to ProductSubcategory, and a WHERE predicate translated from your Products such as:
Product.Name LIKE '%' + @SearchTerm + '%' AND Product.DiscontinuedDate IS NULL
All work subsequent to the AsEnumerable (i.e. the projection of the results to AutoCompleteData objects) will be done in-memory with LINQ to objects.
ToArray and ToList will both execute (materialize) the result, but into different data structures. Since the return type is IEnumerable<AutoCompleteData>, either works for the caller. Note that AsEnumerable() does not execute the query by itself; the sequence is still deferred at that point. Because the context is disposed when the using block exits, the result has to be materialized (with ToArray or ToList) before returning; otherwise the caller would enumerate the query after the context has been disposed and get an exception. The price is that a caller who only needs .Any() or First() still pays for full materialisation.
Tell me the intention of the usage of .AsEnumerable() here in the below query?
In this particular example AsEnumerable() was used to bring the data back to the client, because EF has no idea how to map BuildAutoCompleteText() to a SQL query.
Could they not just use Select directly?
No, unless you define a custom function BuildAutoCompleteText on SQL Server and make EF aware of that function.
Why did they use .ToArray() instead of .ToList()?
In this case it does not matter; both implement IEnumerable<T>.
This is a very weird problem
In short
var q = (some query).Count();
Gives me a number, and
var q = (some query).ToList().Count();
Gives me entirely different number...
It's worth mentioning that (some query) has two includes (joins).
Is there a sane explanation for that?
EDIT: here is my query
var q = db.membership_renewals.Include(i => i.member).Include(i => i.sport).Where(w => w.isDeleted == false).Count();
this gives me a wrong number
and this:
var q = db.membership_renewals.Include(i => i.member).Include(i => i.sport).Where(w => w.isDeleted == false).ToList().Count();
Gives me accurate number..
EDIT 2
When I wrote my query as a LINQ query expression it worked perfectly...
var q1 = (from d in db.membership_renewals where d.isDeleted == false join m in db.members on d.mr_memberId equals m.m_id join s in db.sports on d.mr_sportId equals s.s_id select d.mr_id).Count();
I think the problem is that Entity Framework doesn't execute the joins in the original query, but is forced to execute them by ToList()...
I finally figured out what's going on...
The database tables are not linked together in the database (there are no relationships or constraints defined in the database itself), so the code doesn't execute the inner join part.
My classes, on the other hand, are well written, so when I call ToList() it automatically ignores the unbound rows...
And when I wrote the LINQ query defining the relationship keys (primary and foreign) it worked all right, because now the database understands the relations between my tables...
Thanks everyone you've been great....
My guess is that the IQueryable gives a smaller number because not all the objects are loaded, kind of like a stream in Java, whereas IQueryable.ToList().Count() forces the IQueryable to load all the data, which is then traversed by the list constructor and stored in the list, so IQueryable.ToList().Count() is the accurate answer. This is based on 5 minutes of searching on MSDN.
The idea is that the underlying datastore of the IQueryable is a database iterator, so it can produce different results every time, because it executes the query again on the database: if you call it twice against the same table and the data has changed, you get different results. This is called deferred execution. But when you call IQueryable.ToList() you force the iterator to do the whole iteration once and dump the results into a list, which does not change afterwards.
I'm using QueryDSL with JPA.
I want to query some properties of an entity, it's like this:
QPost post = QPost.post;
JPAQuery q = new JPAQuery(em);
List<Object[]> rows = q.from(post).where(...).list(post.id, post.name);
It works fine.
If i want to query a relation property, e.g. comments of a post:
List<Set<Comment>> rows = q.from(post).where(...).list(post.comments);
It's also fine.
But when I want to query relation and simple properties together, e.g.
List<Object[]> rows = q.from(post).where(...).list(post.id, post.name, post.comments);
Then something goes wrong: it generates SQL with bad syntax.
Then I realized that it's not possible to query them together in one SQL statement.
Is it possible that QueryDSL would somehow deal with relations and generate additional queries (just like what hibernate does with lazy relations), and load the results in?
Or should I just query twice, and then merge both result lists?
P.S. What I actually want is each post with its comments' ids. So a function to concat each post's comment ids would be better; is this kind of expression possible?
q.list(post.id, post.name, post.comments.all().id.join())
and generate a subquery sql like (select group_concat(c.id) from comments as c inner join post where c.id = post.id)
Querydsl JPA is restricted to the expressivity of JPQL, so what you are asking for is not possible with Querydsl JPA. You can though try to express it with Querydsl SQL. It should be possible. Also as you don't project entities, but literals and collections it might work just fine.
Alternatively you can load the Posts with only the Comment ids loaded and then project the id, name and comment ids to something else. This should work when accessors are annotated.
The simplest thing would be to query for Posts and use fetchJoin for comments, but I'm assuming that's too slow for your use case.
I think you ought to simply project required properties of posts and comments and group the results by hand (if required). E.g.
QPost post=...;
QComment comment=..;
List<Tuple> rows = q.from(post)
// Or leftJoin if you want also posts without comments
.innerJoin(comment).on(comment.postId.eq(post.id))
.orderBy(post.id) // Could be used to optimize grouping
.list(new QTuple(post.id, post.name, comment.id));
Map<Long, PostWithComments> results=...;
for (Tuple row : rows) {
PostWithComments res = results.get(row.get(post.id));
if (res == null) {
res = new PostWithComments(row.get(post.id), row.get(post.name));
results.put(res.getPostId(), res);
}
res.addCommentId(row.get(comment.id));
}
NOTE: You cannot use limit nor offset with this kind of queries.
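The grouping step, and the reason limit breaks it, can be shown without Querydsl. The following is a self-contained sketch in plain Java; GroupJoinedRows and commentIdsByPost are hypothetical names, and each row is assumed to be a {postId, commentId} pair as a post-comment join would produce. A limit on the joined rows counts rows, not posts, so it can cut a post's comments short and drop later posts entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupJoinedRows {
    // Each row is {postId, commentId}, one row per (post, comment) pair.
    static Map<Long, List<Long>> commentIdsByPost(List<long[]> rows) {
        Map<Long, List<Long>> result = new LinkedHashMap<>();
        for (long[] row : rows) {
            result.computeIfAbsent(row[0], id -> new ArrayList<>()).add(row[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<long[]> rows = List.of(
                new long[] {1, 10}, new long[] {1, 11}, new long[] {1, 12},
                new long[] {2, 20}, new long[] {2, 21});

        // All rows: both posts with all of their comment ids.
        System.out.println(commentIdsByPost(rows));               // {1=[10, 11, 12], 2=[20, 21]}

        // "limit 2" applied to the joined rows: post 1 is truncated, post 2 is gone.
        System.out.println(commentIdsByPost(rows.subList(0, 2))); // {1=[10, 11]}
    }
}
```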
As an alternative, it might be possible to tune your mappings so that 1) Comments are always lazy proxies so that (with property access) Comment.getId() is possible without initializing the actual object and 2) using batch fetch* on Post.comments to optimize collection fetching. This way you could just query for Posts and then access id's of their comments with little performance hit. In most cases you shouldn't even need those lazy proxies unless your Comment is very fat. That kind of code would certainly look nicer without low level row handling and you could also use limit and offset in your queries. Just keep an eye on your query log to make sure everything works as intended.
*) Batch fetching isn't directly supported by JPA, but Hibernate supports it through mapping and EclipseLink through query hints.
Maybe some day Querydsl will support this kind of results grouping post processing out-of-box...