How to combine LINQ contains query with count on all records and retrieve records with offset - entity-framework

I have a page with pagination and search on movies; the movies table has 388,262 records. Without search, the code below runs fast, but when search is used (vm.Search is filled) retrieving the records becomes slow.
So it is effectively running the contains/LIKE query twice: once to return the results with the offset, and once for the count without the offset. That is correct, but is there a way to combine the two for better performance?
The weird thing is, when I take the queries from the log and execute them directly against the database in SSMS (SQL Server Management Studio), they both finish in around 2 seconds, while the same queries take around 10 seconds when executed through code/LINQ.
The code where I do the LINQ:
public IActionResult Index(MovieIndexViewModel vm) {
    IQueryable<Movie> query = _movieRepository.GetQueryable().AsNoTracking()
        .Select(m => new Movie {
            movie_id = m.movie_id,
            title = m.title,
            description = m.description
        });
    if (!string.IsNullOrWhiteSpace(vm.Search)) {
        query = query.Where(m => m.title.Contains(vm.Search));
    }
    vm.Movies = query.Skip(_pageSize * (vm.Page - 1)).Take(_pageSize).ToList();
    vm.PageSize = _pageSize;
    vm.TotalItemCount = query.Count();
    return View(vm);
}
SQL log from the line with Skip and Take:
SELECT [m].[movie_id], [m].[title], [m].[description]
FROM [Movie] AS [m]
WHERE [m].[title] LIKE ('%' + @__vm_Search_0) + '%'
ORDER BY @@ROWCOUNT
OFFSET 0 ROWS FETCH NEXT 30 ROWS ONLY
SQL log from the line with Count:
SELECT COUNT(*)
FROM [Movie] AS [m]
WHERE [m].[title] LIKE ('%' + @__vm_Search_0) + '%'

The code below runs both queries concurrently. However, this will only be about one round trip to the SQL server faster, which should be on the order of 10 ms. Note also that EF Core does not allow concurrent operations on the same DbContext, so running the two queries in parallel like this requires two separate context instances.
public async Task<IActionResult> Index(MovieIndexViewModel vm) {
    IQueryable<Movie> query = _movieRepository.GetQueryable().AsNoTracking()
        .Select(m => new Movie {
            movie_id = m.movie_id,
            title = m.title,
            description = m.description
        });
    if (!string.IsNullOrWhiteSpace(vm.Search)) {
        query = query.Where(m => m.title.Contains(vm.Search));
    }
    var moviesTask = query.Skip(_pageSize * (vm.Page - 1)).Take(_pageSize).ToListAsync();
    var countTask = query.CountAsync();
    vm.Movies = await moviesTask;
    vm.PageSize = _pageSize;
    vm.TotalItemCount = await countTask;
    return View(vm);
}
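As an aside, the paging arithmetic that Skip/Take and the count feed into ("Displaying 1-30 of 388262") is easy to get off by one. A language-neutral sketch, written here in Java; the helper names are illustrative, not part of the code above:

```java
public class Paging {
    // Rows to skip for a 1-based page number: pageSize * (page - 1), as in the LINQ above.
    static int offset(int page, int pageSize) {
        return pageSize * (page - 1);
    }

    // Total pages implied by the item count, via ceiling division.
    static int totalPages(int totalItemCount, int pageSize) {
        return (totalItemCount + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        System.out.println(offset(1, 30));          // 0: the first page skips nothing
        System.out.println(offset(3, 30));          // 60
        System.out.println(totalPages(388262, 30)); // 12943 pages for the movies table
    }
}
```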
I suspect you aren't looking for a 10 ms performance tweak. Instead, focus on the query plan.
Things you should try:
Add .OrderBy(m => m.title) to the query
Add more indexes (note that a LIKE with a leading wildcard, '%term%', cannot use an index seek; for substring search at this scale, consider SQL Server full-text search)
Clear the query cache
Things to check:
Are you benchmarking accurately? Is that 8 seconds of JIT compilation followed by 2 seconds of actual run?
Any proxies or other HTTP middleware in the way?
DNS resolution?
Are you running in release or debug mode?

MongoDB bulkwrite with upsert option is taking more than a minute for 100k+ records. Is it possible to improve?

I have read the following two questions but did not see a satisfactory solution or suggestion:
Performance degrade with Mongo when using bulkwrite with upsert
Mongodb/Mongoose bulkwrite(upsert) performance issues
I have an API that will be receiving ~80k objects every 5 - 6 seconds.
These objects will need to be inserted if they do not exist and updated if they do exist.
The collection has a unique index (of type asc 1), which I use in the filter expression below.
Upserting 100k documents using the code below takes more than a minute, which is too long for my requirement (instantiating the objects takes less than a second).
MongoClient _mongoClient = new MongoClient("connection string");
IMongoDatabase _hotsauceOdds = _mongoClient.GetDatabase("DatabaseName");
var collection = _hotsauceOdds.GetCollection<Person>("TestMongo");

List<Person> peeps = new List<Person>();
for (int i = 0; i < 100000; i++)
{
    peeps.Add(new Person { Age = i, Name = Guid.NewGuid().ToString() });
}

var models = peeps.Select(item =>
    new ReplaceOneModel<Person>(
        new ExpressionFilterDefinition<Person>(doc => doc.Name == item.Name),
        item) { IsUpsert = true });
await collection.BulkWriteAsync(models);
I am testing this on the free tier of MongoDB Atlas.
Is it possible to make this operation run faster, or is this as good as it gets with MongoDB?

Mongo find query in batches

My aim is to query MongoDB from my Spring Boot application such that the query results are processed in batches, since the application may run out of memory if the returned result set is too big.
public void process(String productId) {
    MongoIterable<MongoOrder> orders = getCollection()
        .find(eq("product_id", productId))
        .batchSize(10000);
    while (orders.iterator().hasNext()) {
        // some processing
    }
}

private MongoCollection<MongoOrder> getCollection() {
    return mongoClient.getDatabase()
        .getCollection("orders", MongoOrder.class);
}
In the above snippet I am trying to query the Mongo collection in batches of 10k. Will this work, or does it bring all the matching objects into memory and then process them one by one?
Had this been SQL, I would have queried with limits like LIMIT 0, 10000, then LIMIT 10000, 10000 and so on. I want to do a similar thing here, with MongoDB.
The following snippet should work:
public void process(String productId) {
    int offset = 10000;
    int curr = 0;
    List<MongoOrder> orders = StreamSupport.stream(
            getCollection().find(eq("product_id", productId))
                .skip(curr).limit(offset).spliterator(), false)
        .collect(Collectors.toList());
    while (!CollectionUtils.isEmpty(orders)) {
        // some processing
        curr += offset;
        orders = StreamSupport.stream(
                getCollection().find(eq("product_id", productId))
                    .skip(curr).limit(offset).spliterator(), false)
            .collect(Collectors.toList());
    }
}
Here we query Mongo in batches of 10k, and the application holds only 10k objects in memory at a time, so we avoid running out of heap memory.
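One caveat worth knowing: skip(n) in MongoDB still walks over the skipped documents server-side, so later batches get progressively more expensive; iterating a single cursor with batchSize already streams results in chunks without that cost. The control flow of the skip/limit loop itself can be sketched against a plain in-memory list; the fetchBatch helper below stands in for the skip/limit query and is purely illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchDemo {
    // Stand-in for the skip/limit query: returns up to 'limit' items starting at 'skip'.
    static List<Integer> fetchBatch(List<Integer> source, int skip, int limit) {
        int from = Math.min(skip, source.size());
        int to = Math.min(skip + limit, source.size());
        return source.subList(from, to);
    }

    public static void main(String[] args) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            all.add(i);
        }
        int batchSize = 10;
        int curr = 0;
        int batches = 0, seen = 0;
        List<Integer> batch = fetchBatch(all, curr, batchSize);
        while (!batch.isEmpty()) {
            batches++;            // "some processing" would happen here
            seen += batch.size();
            curr += batchSize;    // advance the offset, like curr += offset above
            batch = fetchBatch(all, curr, batchSize);
        }
        System.out.println(batches + " batches, " + seen + " items"); // 3 batches, 25 items
    }
}
```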

Count in jpa without getting result [duplicate]

I like the idea of Named Queries in JPA for the static queries I'm going to run, but I often want both the count for a query and a result list from some subset of it. I'd rather not write two nearly identical NamedQueries. Ideally, what I'd like to have is something like:
@NamedQuery(name = "getAccounts", query = "SELECT a FROM Account a")
.
.
Query q = em.createNamedQuery("getAccounts");
List r = q.setFirstResult(s).setMaxResults(m).getResultList();
int count = q.getCount();
So let's say m is 10, s is 0 and there are 400 rows in Account. I would expect r to have a list of 10 items in it, but I'd want to know there are 400 rows total. I could write a second #NamedQuery:
@NamedQuery(name = "getAccountCount", query = "SELECT COUNT(a) FROM Account a")
but it seems a DRY violation to do that if I'm always just going to want the count. In this simple case it is easy to keep the two in sync, but if the query changes, it seems less than ideal that I have to update both #NamedQueries to keep the values in line.
A common use case here would be fetching some subset of the items, but needing some way of indicating total count ("Displaying 1-10 of 400").
So the solution I ended up using was to create two @NamedQuery annotations, one for the result set and one for the count, capturing the base query in a static string to stay DRY and keep the two queries consistent. For the above, I'd have something like:
@NamedQuery(name = "getAccounts", query = "SELECT a" + accountQuery)
@NamedQuery(name = "getAccounts.count", query = "SELECT COUNT(a)" + accountQuery)
.
static final String accountQuery = " FROM Account a";
.
Query q = em.createNamedQuery("getAccounts");
List r = q.setFirstResult(s).setMaxResults(m).getResultList();
int count = ((Long)em.createNamedQuery("getAccounts.count").getSingleResult()).intValue();
Obviously, with this example the query body is trivial and this is overkill. But with much more complex queries, you end up with a single definition of the query body and can be sure the two queries stay in sync. You also get the advantage that the queries are precompiled, and at least with EclipseLink you get validation at startup time instead of when you call the query.
By naming the two queries consistently, it is possible to wrap the code that runs both of them so that it only needs the base name of the query.
setFirstResult/setMaxResults do not return a subset of an already-fetched result set; the query hasn't even been run when you call these methods. They affect the generated SELECT that is executed when you call getResultList. If you want the total record count, you'll have to SELECT COUNT your entities in a separate query (typically before paginating).
For a complete example, check out Pagination of Data Sets in a Sample Application using JSF, Catalog Facade Stateless Session, and Java Persistence APIs.
You can also use introspection to read the named query annotations, like:
String getNamedQueryCode(Class<?> clazz, String namedQueryKey) {
    NamedQueries namedQueriesAnnotation = clazz.getAnnotation(NamedQueries.class);
    if (namedQueriesAnnotation == null) {
        return null;
    }
    String code = null;
    for (NamedQuery namedQuery : namedQueriesAnnotation.value()) {
        if (namedQuery.name().equals(namedQueryKey)) {
            code = namedQuery.query();
            break;
        }
    }
    // Not found on this class: look it up on a mapped superclass, if any.
    if (code == null) {
        Class<?> superClass = clazz.getSuperclass();
        if (superClass != null && superClass.getAnnotation(MappedSuperclass.class) != null) {
            code = getNamedQueryCode(superClass, namedQueryKey);
        }
    }
    return code; // null if not found
}

What are the ways to optimize Entity Framework queries with Contains()?

We load a large object graph from the DB.
The query has many Includes, and Where() uses Contains() to filter the final result.
Contains() is called on a collection of about a thousand entries.
The profiler shows monstrous, human-unreadable SQL.
The query cannot be precompiled because of Contains().
Are there any ways to optimize such queries?
Update
public List<Vulner> GetVulnersBySecurityObjectIds(int[] softwareIds, int[] productIds)
{
    var sw = new Stopwatch();
    var query = from vulner in _businessModel.DataModel.VulnerSet
                join vt in _businessModel.DataModel.ObjectVulnerTieSet
                        .Where(ovt => softwareIds.Contains(ovt.SecurityObjectId))
                    on vulner.Id equals vt.VulnerId
                select vulner;
    var result = ((ObjectQuery<Vulner>)query.OrderBy(v => v.Id).Distinct())
        .Include("Descriptions")
        .Include("Data")
        .Include("VulnerStatuses")
        .Include("GlobalIdentifiers")
        .Include("ObjectVulnerTies")
        .Include("Object.ProductObjectTies.Product")
        .Include("VulnerComment");
    // If specific products were passed, add filtering
    if (productIds.HasValues())
        result = (ObjectQuery<Vulner>)result.Where(
            v => v.Object.ProductObjectTies.Any(p => productIds.Contains(p.ProductId)));
    sw.Start();
    var str = result.ToTraceString();
    sw.Stop();
    Debug.WriteLine("Building the query took {0} seconds.", sw.Elapsed.TotalSeconds);
    sw.Restart();
    var list = result.ToList();
    sw.Stop();
    Debug.WriteLine("Fetching the vulnerabilities took {0} seconds.", sw.Elapsed.TotalSeconds);
    return list;
}
It's almost certain that splitting the query into pieces will perform better, despite the extra database round trips. It is always advisable to limit the number of Includes, because they not only blow up the size and complexity of the query (as you noticed) but also blow up the result set both in length and in width; moreover, they are often translated into outer joins.
Apart from that, using Contains the way you do is fine.
Sorry, it is hard to be more specific without knowing your data model and the size of the tables involved.
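One mitigation often suggested for large Contains() lists (which translate into huge IN (...) clauses) is to split the id collection into fixed-size chunks and issue one query per chunk, merging the results in memory. A minimal sketch of just the chunking step, in Java; the helper name is illustrative, and the queries themselves are omitted:

```java
import java.util.ArrayList;
import java.util.List;

public class Chunker {
    // Split 'items' into consecutive sublists of at most 'size' elements.
    static <T> List<List<T>> chunk(List<T> items, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            chunks.add(items.subList(i, Math.min(i + size, items.size())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            ids.add(i);
        }
        // A thousand ids in chunks of 300: run one Contains()/IN query per chunk.
        List<List<Integer>> parts = chunk(ids, 300);
        System.out.println(parts.size());        // 4
        System.out.println(parts.get(3).size()); // 100 (the final, partial chunk)
    }
}
```

Each chunk keeps the generated IN clause (and the parameter count) bounded, at the cost of a few extra round trips.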

Entity Framework Timeout

I have been trying to figure out how to optimize the following query for the past few days, without much luck. Right now my test DB returns about 300 records with very little nested data, but the query takes 4-5 seconds to run and the SQL generated by LINQ is awfully long (too long to include here). Any suggestions would be much appreciated.
To sum up the query: I'm trying to return a somewhat flattened "snapshot" of a client list with current status. A Party contains one or more Clients who have Roles (ASP.NET Role Provider). Journal returns the latest journal entry across the clients in a Party, and the same goes for Task and LastLoginDate, hence the OrderBy and FirstOrDefault calls.
Guid userID = ...; // some user ID
var parties = Parties.Where(p => p.BrokerID == userID).Select(p => new
{
    ID = p.ID,
    Title = p.Title,
    Goal = p.Goal,
    Groups = p.Groups,
    IsBuyer = p.Clients.Any(c => c.RolesInUser.Any(r => r.Role.LoweredName == "buyer")),
    IsSeller = p.Clients.Any(c => c.RolesInUser.Any(r => r.Role.LoweredName == "seller")),
    Journal = p.Clients.SelectMany(c => c.Journals).OrderByDescending(j => j.OccuredOn).Select(j => new
    {
        ID = j.ID,
        Title = j.Title,
        OccurredOn = j.OccuredOn,
        SubCatTitle = j.JournalSubcategory.Title
    }).FirstOrDefault(),
    LastLoginDate = p.Clients.OrderByDescending(c => c.LastLoginDate).Select(c => c.LastLoginDate).FirstOrDefault(),
    MarketingPlanCount = p.Clients.SelectMany(c => c.MarketingPlans).Count(),
    Task = p.Tasks.Where(t => t.DueDate != null && t.DueDate > DateTime.Now).OrderBy(t => t.DueDate).Select(t => new
    {
        ID = t.TaskID,
        DueDate = t.DueDate,
        Title = t.Title
    }).FirstOrDefault(),
    Clients = p.Clients.Select(c => new
    {
        ID = c.ID,
        FirstName = c.FirstName,
        MiddleName = c.MiddleName,
        LastName = c.LastName,
        Email = c.Email,
        LastLogin = c.LastLoginDate
    })
}).OrderBy(p => p.Title).ToList();
I think posting the SQL could give us some clues, since small things like whether OrderBy comes before or after the projection can make a big difference.
Regardless, try extracting the Clients in a separate query; that will probably simplify your query. Then Include the other tables like Journal and Tasks before projecting, and see how this affects the query:
// not sure what the exact query would be; project it using ToList()
var clients = GetClientsForParty();
var parties = Parties.Include("Journal").Include("Tasks")
    .Where(p => p.BrokerID == userID).Select(p => new
    {
        ....
        // then use the in-memory clients
        IsBuyer = clients.Any(c => c.RolesInUser.Any(r => r.Role.LoweredName == "buyer")),
        ...
    });
In all cases, install an EF profiler and look at how your query is affected. EF can be quite surprising. Things like putting OrderBy before the projection, and the same for all those FirstOrDefault or SingleOrDefault calls, can have a big effect.
And go back to the basics: if you are filtering on LoweredName, make sure it is indexed so the lookup is fast (even though that may not help if EF ends up not using the covering index because it queries so many other columns).
Also, since this query is for viewing data (you will not alter the data), don't forget to turn off entity tracking; that will give you a performance boost as well.
And last, don't forget that you can always write the SQL query directly and project to a ViewModel rather than an anonymous type (which I see as good practice anyhow). Create a class called PartyViewModel that contains the flattened view you are after, and use it with your hand-crafted SQL:
// use your optimized SQL query, or even call a stored procedure
var parties = db.Database.SqlQuery<PartyViewModel>("select * from .... join .... on");
I am writing a blog post about these issues with EF. The post is still not finished, but all in all: be patient, apply some of these tricks, observe and measure their effect, and you will get the result you want.