Performing multiple updates in EF - entity-framework

I'm currently using Entity Framework 4.3, Asp.Net 4.0, Sql Server 2008 Web.
I have method that loops through 10 to 20 loan providers. Each loan provider has a chance to bid for the loan - if they do the loop terminates - if not, they loop continues until all loan providers are consumed and a "no offers" message is returned.
I want to start logging this data - simply the lender and customer ID's. So for each customer who submits an application there maybe 1 to 20 inserts to perform. I can save these in a list and update them in 1 go.
As I'm using EF, I'm a little concerned about performance - I presume an insert will be done for each separate item added to the context.
Firstly - should I be worried about performance?
Secondly, are there any techniques I can use to make this process more efficient?
Thanks in advance.

This has nothing to do with Entity Framework. You must insert one row at a time in SQL, that's the way SQL works. Each insert is a separate statement. There is no way (outside of special bulk-insert methods, which would be pointless for 20 records) of doing it differently.
Now, yes, you can insert those 20 records in one request, which is what EF will do.
I don't understand you comment "As I'm using EF, I'm a little concerned about performance". EF is no worse performing than anything else in this respect. It's a simple 20 record insert, there's nothing special here or any complexity that could cause performance issues.

Related

Improve Hasura Subscription Performance

we developed a web app that relies on real-time interaction between our users. We use Angular for the frontend and Hasura with GraphQL on Postgres as our backend.
What we noticed is that when more than 300 users are active at the same time we experience crucial performance losses.
Therefore, we want to improve our subscriptions setup. We think that possible issues could be:
Too many subscriptions
too large and complex subscriptions, too many forks in the subscription
Concerning 1. each user has approximately 5-10 subscriptions active when using the web app. Concerning 2. we have subscriptions that are complex as we join up to 6 tables together.
The solutions we think of:
Use more queries and limit the use of subscriptions on fields that are totally necessary to be real-time.
Split up complex queries/subscriptions in multiple smaller ones.
Are we missing another possible cause? What else can we use to improve the overall performance?
Thank you for your input!
Preface
OP question is quite broad and impossible to be answered in a general case.
So what I describe here reflects my experience with optimization of subscriptions - it's for OP to decide is it reflects their situtation.
Short description of system
Users of system: uploads documents, extracts information, prepare new documents, converse during process (IM-like functionalitty), there are AI-bots that tries to reduce the burden of repetitive tasks, services that exchange data with external systems.
There are a lot of entities, a lot of interaction between both human and robot participants. Plus quite complex authorization rules: visibility of data depends on organization, departements and content of documents.
What was on start
At first it was:
programmer wrote a graphql-query for whole data needed for application
changed query to subscription
finish
It was OK for first 2-3 monthes then:
queries became more complex and then even more complex
amount of subscriptions grew
UI became lagging
DB instance is always near 100% load. Even during nigths and weekends. Because somebody did not close application
First we did optimization of queries itself but it did not suffice:
some things are rightfully costly: JOINs, existence predicates, data itself grew significantly
network part: you can optimize DB but just to transfer all needed data has it's cost
Optimization of subscriptions
Step I. Split subscriptions: subscribe for change date, query on change
Instead of complex subscription for whole data split into parts:
A. Subscription for a single field that indicates that entity was changed
E.g.
Instead of:
subscription{
document{
id
title
# other fields
pages{ # array relation
...
}
tasks{ # array relation
...
}
# multiple other array/object relations
# pagination and ordering
}
that returns thousands of rows.
Create a function that:
accepts hasura_session - so that results are individual per user
returns just one field: max_change_date
So it became:
subscription{
doc_change_date{
max_change_date
}
}
Always one row and always one field
B. Change of application logic
Query whole data
Subscribe for doc_change_date
memorize value of max_change_date
if max_change_date changed - requery data
Notes
It's absolutely OK if subscription function sometimes returns false positives.
There is no need to replicate all predicates from source query to subscription function.
E.g.
In our case: visibility of data depends on organizations and departments (and even more).
So if a user of one department creates/modifies document - this change is not visible to user of other department.
But those changes are like ones/twice in a minute per organization.
So for subscription function we can ignore those granularity and calculate max_change_date for whole organization.
It's beneficial to have faster and cruder subscription function: it will trigger refresh of data more frequently but whole cost will be less.
Step II. Multiplex subscriptions
The first step is a crucial one.
And hasura has a multiplexing of subscriptions: https://hasura.io/docs/latest/graphql/core/databases/postgres/subscriptions/execution-and-performance.html#subscription-multiplexing
So in theory hasura could be smart enough and solve your problems.
But if you think "explicit better than implicit" there is another step you can do.
In our case:
user(s) uploads documents
combines them in dossiers
create new document types
converse with other
So subscriptions becames: doc_change_date, dossier_change_date, msg_change_date and so on.
But actually it could be beneficial to have just one subscription: "hey! there are changes for you!"
So instead of multiple subscriptions application makes just one.
Note
We thought about 2 formats of multiplexed subscription:
A. Subscription returns just one field {max_change_date} that is accumulative for all entities
B. Subscription returns more granular result: {doc_change_date, dossier_change_date, msg_change_date}
Right now "A" works for us. But maybe we change to "B" in future.
Step III. What we would do differently with hasura 2.0
That's what we did not tried yet.
Hasura 2.0 allows registering VOLATILE functions for queries.
That allows creation of functions with memoization in DB:
you define a cache for function call presumably in a table
then on function call you first look in cache
if not exists: add values to cache
return result from cache
That allows further optimizations both for subscription functions and query functions.
Note
Actually it's possible to do that without waiting for hasura 2.0 but it requires trickery on postgresql side:
you create VOLATILE function that did real work
and another function that's defined as STABLE that calls VOLATILE function. This function could be registered in hasura
It works but that's trick is hard to recommend.
Who knows, maybe future postgresql versions or updates will make it impossible.
Summary
That's everything that I can say on the topic right now.
Actually I would be glad to read something similar a year ago.
If somebody sees some pitfalls - please comment, I would be glad to hear opinions and maybe alternative ways.
I hope that this explanation will help somebody or at least provoke thought how to deal with subscriptions in other ways.

Why does response time go up when the number of concurrent requests per second to my Asp.Net Core API goes up

I'm testing an endpoint under load. For 1 request per second, the average response time is around 200ms. The endpoint does a few DB lookups (all read) that are pretty fast and it's async throughout.
However when doing a few hundred requests per second (req/sec), the average response time goes up to over a second.
I've had a look at the best practices guide at:
https://learn.microsoft.com/en-us/aspnet/core/performance/performance-best-practices?view=aspnetcore-2.2
Some suggestions like "Avoid blocking calls" and "Minimize large object allocations" seem like they don't apply since I'm already using async throughout and my response size for a single request is less than 50 KB.
There are a couple though that seem like they might be useful, for example:
https://learn.microsoft.com/en-us/ef/core/what-is-new/ef-core-2.0#high-performance
https://learn.microsoft.com/en-us/aspnet/core/performance/performance-best-practices?view=aspnetcore-2.2#pool-http-connections-with-httpclientfactory
Questions:
Why would the average response time go up with an increased req/sec?
Are the suggestions above that I've marked as being 'might be useful' likely to help? I ask because while I would like to try out all, I have limited time available to me unfortunately, so I'd like to try out options that are most likely to help first.
Are there any other options worth considering?
I've had a look at these two existing threads, but neither answer my question:
Correlation between requests per second and response time?
ASP.NET Web API - Handle more requests per second
It will be hard to answer your specific issue without access to the code but the main things to consider is the size and complexity of the database queries being generated by EF. Using async/await will increase the responsiveness of your web server to start requests, but the request handling time under load will depend largely on the queries being run as the database becomes the contention point. You will want to ensure that all queries are as minimalist as possible. For example, there is a huge difference between the following 3 statements:
var someData = context.SomeTable.Include(x => x.SomeOtherTable)
.ToList()
.Where(x => x.SomeCriteriaMethod())
.ToList();
var someData = context.SomeTable.Include(x => x.SomeOtherTable)
.Where(x => x.SomeField == someField && x.SomeOtherTable.SomeOtherField == someOtherField)
.ToList();
var someData = context.SomeTable
.Where(x => x.SomeField == someField && x.SomeOtherTable.SomeOtherField == someOtherField)
.Select(x => new SomeViewModel
{
SomeTableId = x.SomeTableId,
SomeField = x.SomeField,
SomeOtherField = x.SomeOtherTable.SomeOtherField
}).ToList();
Examples like the first above are extremely inefficient as they end up loading all data from the related tables from the database before filtering rows. Even though your web server may only pass back a few rows, it has requested everything from the database. These kinds of scenarios creep into apps when developers face scenarios where they want to filter on a value that EF cannot translate to SQL (such as a Function) so they solve it by putting a ToList call, or it can be introduced as a byproduct of poor separation such as a repository pattern that returns an IEnumerable.
The second example is a little better where they avoid using the read-all ToList() call, but the calls are still loading back entire rows for data that isn't necessary. This ties up resources on the database and web servers.
The third example demonstrates refining queries to just return the absolute minimum of data that the consumer needs. This can make better use of indexes and execution plans on the database server.
Other performance pitfalls that you can face under load are things like lazy loads. Databases will execute a finite number of concurrent requests, so if it turns out some queries are kicking off additional lazy load requests, when there is no load, these are executed immediately. Under load though, they are queued up along with other queries and lazy load requests which can tie down data pulls.
Ultimately you should run an SQL profiler against your database to capture the kinds and numbers of SQL queries being executed. When executing in a test environment, pay close attention to the Read count and CPU cost rather than the total execution time. As a general rule of thumb higher read and CPU cost queries will be far more susceptible to execution time blow-out under load. They require more resources to run, and "touch" more rows meaning more waiting for row/table locks.
Another thing to watch out for are "heavy" queries in very large data systems that will need to touch a lot of rows, such as reports, and in some cases, highly customizable search queries. If these should be needed, you should consider planning your database design to include a read-only replica to run reports or large search expressions against to avoid row lock scenarios in your primary database that can degrade responsiveness for the typical read and write queries.
Edit: Identifying lazy load queries.
These show up in a profiler where you query against a top level table, but then see a number of additional queries for related tables following it.
For example, say you have a table called Order, with a related table called Product, another called Customer and another called Address for a delivery address. To read all orders for a date range you'd expect to see a query such as:
SELECT [OrderId], [Quantity], [OrderDate] [ProductId], [CustomerId], [DeliveryAddressId] FROM [dbo].[Orders] WHERE [OrderDate] >= '2019-01-01' AND [OrderDate] < '2020-01-01'
Where you just wanted to load Orders and return them.
When the serializer iterates over the fields, it finds a Product, Customer, and Address referenced, and by attempting to read those fields, will trip lazy loading resulting in:
SELECT [CustomerId], [Name] FROM [dbo].[Customers] WHERE [CustomerId] = 22
SELECT [ProductId], [Name], [Price] FROM [dbo].[Products] WHERE [ProductId] = 1023
SELECT [AddressId], [StreetNumber], [City], [State], [PostCode] FROM [dbo].[Addresses] WHERE [AddressId] = 1211
If your original query returned 100 Orders, you would see potentially 100x the above set of queries, one set for each order as a lazy load hit on 1 order row would attempt to look up a related customer by customer ID, a related product by Product ID, and a related Address by Delivery Address ID. This can, and does get costly. It may not be visible when run on a test environment, but that adds up to a lot of potential queries.
If eager loaded using .Include() for the related entities, EF will compose JOIN statements to get the related rows all in one hit which is considerably faster than fetching each individual related entity. Still, that can result in pulling a lot of data you don't need. The best way to avoid this extra cost is to leverage projection through Select to retrieve just the columns you need.

Entity framework performance issue, saveChanges is very slow

Recently, I am doing some simple EF work. Very simple,
first,
List<Book> books = entity.Books.WHERE(c=>c.processed==false)ToList();
then
foreach(var book in books)
{
//DoSomelogic, update some properties,
book.ISBN = "ISBN " + randomNumber();
book.processed = true;
entity.saveChanges(book)
}
I put entity.saveChanges within the foreach because it is a large list, around 100k records, and if this record is processed without problem, then flag this record, set book.processed = true, if the process is interrupted by exception, then next time, I don't have to process these good records again.
Everything seems ok to me. It is fast when you process hundreds of records. Then when we move to 100k records, entity.saveChanges is very very slow. around 1-3 seconds per record. Then we keep the entity model, but replace entity.saveChanges with classic SqlHelper.ExecuteNonQuery("update_book", sqlparams). And it is very fast.
Could anyone tell me why entity framework process that slow? And if I still want to use entity.saveChanges, what is the best way to improve the performance?
Thank you
Turn off change tracking before you perform your inserts. This will improve your performance significantly (magnitudes of order). Putting SaveChanges() outside your loop will help as well, but turning off change tracking will help even more.
using (var context = new CustomerContext())
{
context.Configuration.AutoDetectChangesEnabled = false;
// A loop to add all your new entities
context.SaveChanges();
}
See this page for some more information.
I would take the SaveChanges(book) outside of the foreach. Since book is on the entity as a list, you can put this outside and EF will work better with the resulting code.
The list is an attribute on the entity, and EF is designed to optimize updates/creates/deletes on the back end database. If you do this, I'd be curious whether it helps.
I too may advise you to take the SaveChanges() out of the loop, as it does 'n' number of updates to the database, thus the context will have 'n' times to iterate through the checkpoints and validations required.
var books = entity.Books.Where(c => c.processed == false).ToList();
books.Foreach(b =>
{
b.ISBN = "ISBN " + randomNumber();
b.processed = true;
//DoSomelogic, update some properties
});
entity.SaveChanges();
The Entity Framework, in my opinion, is a poor choice for BULK operations both from a performance and a memory consumption standpoint. Once you get beyond a few thousand records, the SaveChanges method really starts to break down.
You can try to partition your work over into smaller transactions, but again, I think you're working too hard to create this.
A much better approach is to leverage the BULK operations that are already provided by your DBMS. SQL Server provides BULK COPY via .NET. Oracle provides BULK COPY for the Oracle.DataAccess or unmanaged data access. For Oracle.ManagedDataAccess, the BULK COPY library unfortunately isn't available. But I can create an Oracle Stored Procedure using BULK COLLECT/FOR ALL that allows me to insert thousands of records within seconds with a much lower memory footprint within your application. Within the .NET app you can implement PLSQL Associative arrays as parameters, etc.
The benefit of leveraging the BULK facilities within your DBMS is reducing the context switches between your application, the query processor and the database engine.
I'm sure other database vendors provide something similar.
"AsNoTracking" works for me
ex:
Item itemctx = ctx.Items.AsNoTracking().Single(i=>i.idItem == item.idItem);
ctx.Entry(itemctx).CurrentValues.SetValues(item);
itemctx.images = item.images;
ctx.SaveChanges();
Without "AsNoTracking" the updates is very slow.
I just execute the insert command directly.
//the Id property is the primary key, so need to have this update automatically
db.Database.ExecuteSqlCommand("SET IDENTITY_INSERT[dbo].[MyTable] ON");
foreach (var p in itemsToSave)
{
db.Database.ExecuteSqlCommand("INSERT INTO[dbo].[MyTable]([Property1], [Property2]) VALUES(#Property1, #Property2)",
new SqlParameter("#Property1", p.Property1),
new SqlParameter("#Property2", p.Property2)"
}
db.Database.ExecuteSqlCommand("SET IDENTITY_INSERT[dbo].[MyTable] OFF");
Works really quickly. EF is just impossibly slow with bulk updates, completely unusable in my case for more than a dozen or so items.
Use this Nuget package:
Z.EntityFramework.Extensions
It has extension methods that you can call on the DbContext like DbContext.BulkSaveChanges which works amazingly fast.
Note: this is NOT a free package but it has a trial period.

Doctrine: avoid collision in update

I have a product table accesed by many applications, with several users in each one. I want to avoid collisions, but in a very small portion of code I have detected collisions can occur.
$item = $em->getRepository('MyProjectProductBundle:Item')
->findOneBy(array('product'=>$this, 'state'=>1));
if ($item)
{
$item->setState(3);
$item->setDateSold(new \DateTime("now"));
$item->setDateSent(new \DateTime("now"));
$dateC = new \DateTime("now");
$dateC->add(new \DateInterval('P1Y'));
$item->setDateGuarantee($dateC);
$em->persist($item);
$em->flush();
//...after this, set up customer data, etc.
}
One option could be make 2 persist() and flush(), the first one just after the state change, but before doing it I would like to know if there is a way that offers more guarantee.
I don't think a transaction is a solution, as there are actually many other actions involved in the process so, wrapping them in a transaction would force many rollbacks and failed sellings, making it worse.
Tha database is Postgress.
Any other ideas?
My first thought would be to look at optimistic locking. The end result is that if someone changes the underlying data out from under you, doctrine will throw an exception on flush. However, this might not be easy, since you say you have multiple applications operating on a central database -- it's not clear if you can control those applications or not, and you'll need to, because they'll all need to play along with the optimistic locking scheme and update the version column when they run updates.

entity framework and some general doubts with the optimistic concurrency exception

I have some doubts about optimistic concurrency exception.
Well, For example, I retrieve some data from the database, I modify some registers and then submit the changes. If someone update the information of the registers between my request and my update I get an optimistic exception. The classic concurrency problem.
My first doubt is the following. EF to decide if the information is changed or not, retrieves the data from the database, and compare the original data that I obtain with the data that is retrieved from the database. If exists differences, then the optimistic concurrency exception is thrown.
If when I catch the optimistic concurrency exception, I decide if the client wins or the store wins. in this step EF retrieves again the information or use the data from the first retrieved? Because if retrieve again the data, it would be inefficient.
The second doubt is how to control the optimistic concurrency exception. In the catch block of code, I decide if the client wins or the store wins. If the client wins, then I call again saveChanges. But between the time that I decide that the client wins and the savechanges, other user could change the data, so I get again an optimistic concurrency exception. In theory, It could be an infinity loop.
would it be a good idea to use a transaction (scope) to ensure that the client update the information in the database? Other solution could be use a loop to try N times to update the data, if it is not possible, exit and say it to the user.
would the transaction a good idea? does it consume a lot of resources of the database? Although the transaction blocks for a moment the database, it ensures that the operation of update is finished. The loop of N times to try to complete the operation, call the database N times, and perhaps it could need more resources.
Thanks.
Daimroc.
EDIT: I forgot to ask. is it possible set the context to use client wins by default instead to wait to the concurrency exception?
My first doubt is the following. EF to decide if the information is
changed or not, retrieves the data from the database ...
It doesn't retrieve any additional data from database. It takes original values of your entity used for concurrency handling and use them in where condition of update command. The update command is followed by selecting number of modified rows. If the number is 0 it either means that record doesn't exists or somebody has changed it.
The second doubt is how to control the optimistic concurrency exception.
You simply call Refresh and SaveChanges. You can repeat pattern few times if needed. If you have so much highly concurrent application that multiple threads are fighting to update same record within fraction of seconds you most probably need to architect your data storage in different way.
Would it be a good idea to use a transaction (scope) to ensure that the client update the information in the database?
SaveChanges always uses database transaction. TransactionScope will not add you any additional value unless you want to use transaction over multiple calls to SaveChanges, distributed transaction or change for example isolation level of the transaction.
Is it possible set the context to use client wins by default instead
to wait to the concurrency exception?
It is set by default. Simply don't mark any of your properties with ConcurrencyMode.Fixed and you will have no concurrency handling = client wins.