In one of my projects, Entity Framework is used.
I came across a situation where performance is terrible: multiple records need to be inserted into a table one by one, roughly 50 to 500 per operation (I'm not sure of the exact number, but it can be a lot).
At first, I used:
dbcontext.Adds.AddRange(alist);
dbcontext.SaveChanges();
to do the insert. But I soon found out that if even one object has invalid data and cannot be inserted correctly, none of the data gets inserted! All the records are saved together, so I cannot bypass the incorrect ones.
Here is my solution:
foreach (var a in alist)
{
    //...
    try
    {
        dbcontext.Adds.Add(a);
        dbcontext.SaveChanges();
    }
    catch (Exception ex)
    {
        // just log and bypass one by one
        log4net.LogManager.GetLogger("NOTSAVE").Info(a.ToString());
    }
}
Now, it works properly.
But there is a big issue: the performance is terrible! The client may wait several seconds for each action, and customer feedback has gotten worse.
Does anyone know another solution to improve performance? Ideally each operation should complete in 500 ms to 1 second, but now, once the record count exceeds 100, it can take more than 1 second each time, causing obvious and frequent pauses for the customer with each operation in the UI.
The first and most important thing you can do is turn off AutoDetectChangesEnabled; it has a huge impact on performance.
dbcontext.Configuration.AutoDetectChangesEnabled = false;

foreach (var a in alist)
{
    //...
    try
    {
        dbcontext.Adds.Add(a);
        dbcontext.SaveChanges();
    }
    catch (Exception ex)
    {
        // just log and bypass one by one
        log4net.LogManager.GetLogger("NOTSAVE").Info(a.ToString());
    }
}

dbcontext.Configuration.AutoDetectChangesEnabled = true;
Read this article for more info: EntityFramework Performance and AutoDetectChanges
You can use third party libraries:
Entity Framework Extensions
EntityFramework.Utilities
You can also use System.Data.SqlClient.SqlBulkCopy in ADO.NET, which is really fast.
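For the insert-only path, a minimal SqlBulkCopy sketch might look like the following; the column names, entity properties, and destination table name are assumptions, since they depend on your schema:

using System.Data;
using System.Data.SqlClient;

// Build a DataTable whose columns match the destination table.
// "Id" and "Name" are placeholder columns; use the real columns of your table.
var table = new DataTable();
table.Columns.Add("Id", typeof(int));
table.Columns.Add("Name", typeof(string));

foreach (var a in alist)
{
    table.Rows.Add(a.Id, a.Name); // assumed entity properties
}

// Reuse the EF context's connection string so the copy targets the same database.
using (var connection = new SqlConnection(dbcontext.Database.Connection.ConnectionString))
{
    connection.Open();
    using (var bulkCopy = new SqlBulkCopy(connection))
    {
        bulkCopy.DestinationTableName = "dbo.Adds"; // assumed destination table name
        bulkCopy.BatchSize = 500;
        bulkCopy.WriteToServer(table);              // one bulk round trip instead of one insert per row
    }
}

Keep in mind that SqlBulkCopy inserts the whole batch in one operation, so the skip-bad-rows-one-by-one behavior from the loop above has to be replaced by validating the records before the copy (or by falling back to the per-row path only for a failed batch).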
I know there are similar questions here, but they either tell me to switch back to a regular RDBMS if I need transactions, or to use atomic operations, or to use two-phase commit. The second solution seems the best choice. The third I don't wish to follow because it seems that many things could go wrong and I can't test it in every aspect. I'm having a hard time refactoring my project to perform atomic operations. I don't know whether this comes from my limited viewpoint (I have only worked with SQL databases so far), or whether it actually can't be done.
We would like to pilot test MongoDB at our company. We have chosen a relatively simple project - an SMS gateway. It allows our software to send SMS messages to the cellular network and the gateway does the dirty work: actually communicating with the providers via different communication protocols. The gateway also manages the billing of the messages. Every customer who applies for the service has to buy some credits. The system automatically decreases the user's balance when a message is sent and denies the access if the balance is insufficient. Also because we are customers of third party SMS providers, we may also have our own balances with them. We have to keep track of those as well.
I started thinking about how I can store the required data with MongoDB if I cut down some complexity (external billing, queued SMS sending). Coming from the SQL world, I would create a separate table for users, another one for SMS messages, and one for storing the transactions regarding the users' balance. Let's say I create separate collections for all of those in MongoDB.
Imagine an SMS sending task with the following steps in this simplified system:
check if the user has sufficient balance; deny access if there's not enough credit
send and store the message in the SMS collection with the details and cost (in the live system the message would have a status attribute and a task would pick it up for delivery and set the price of the SMS according to its current state)
decrease the user's balance by the cost of the sent message
log the transaction in the transaction collection
Now what's the problem with that? MongoDB can do atomic updates only on one document. In the previous flow it could happen that some kind of error creeps in and the message gets stored in the database but the user's balance is not updated and/or the transaction is not logged.
I came up with two ideas:
Create a single collection for the users, and store the balance as a field, user related transactions and messages as sub documents in the user's document. Because we can update documents atomically, this actually solves the transaction problem. Disadvantages: if the user sends many SMS messages, the size of the document could become large and the 4MB document limit could be reached. Maybe I can create history documents in such scenarios, but I don't think this would be a good idea. Also I don't know how fast the system would be if I push more and more data to the same big document.
Create one collection for users, and one for transactions. There can be two kinds of transactions: credit purchases with a positive balance change and sent messages with a negative balance change. A transaction may have a subdocument; for example, for sent messages the details of the SMS can be embedded in the transaction. Disadvantages: I don't store the current user balance, so I have to calculate it every time a user tries to send a message to tell whether the message can go through or not. I'm afraid this calculation could become slow as the number of stored transactions grows.
I'm a little bit confused about which method to pick. Are there other solutions? I couldn't find any best practices online about how to work around these kinds of problems. I guess many programmers who are trying to become familiar with the NoSQL world are facing similar problems in the beginning.
As of 4.0, MongoDB will have multi-document ACID transactions. The plan is to enable those in replica set deployments first, followed by the sharded clusters. Transactions in MongoDB will feel just like transactions developers are familiar with from relational databases - they'll be multi-statement, with similar semantics and syntax (like start_transaction and commit_transaction). Importantly, the changes to MongoDB that enable transactions do not impact performance for workloads that do not require them.
For more details see here.
Having distributed transactions doesn't mean that you should model your data as in tabular relational databases. Embrace the power of the document model and follow the recommended data modeling practices.
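As a hedged sketch of what those multi-document transactions can look like from the .NET/C# driver (the database, collection, and field names are assumptions, and this requires MongoDB 4.0+ on a replica set deployment):

using MongoDB.Bson;
using MongoDB.Driver;

// Assumed names for illustration: an "sms" database with "users" and "transactions" collections.
var client = new MongoClient("mongodb://localhost:27017");
var users = client.GetDatabase("sms").GetCollection<BsonDocument>("users");
var txLog = client.GetDatabase("sms").GetCollection<BsonDocument>("transactions");

using (var session = client.StartSession())
{
    session.StartTransaction();
    try
    {
        // Both writes commit or abort together, covering the "balance updated but
        // transaction not logged" failure mode from the question.
        users.UpdateOne(session,
            Builders<BsonDocument>.Filter.Eq("_id", "user-1"),
            Builders<BsonDocument>.Update.Inc("balance", -1));

        txLog.InsertOne(session, new BsonDocument
        {
            { "user", "user-1" },
            { "type", "sms" },
            { "amount", -1 }
        });

        session.CommitTransaction();
    }
    catch
    {
        session.AbortTransaction();
        throw;
    }
}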
Check this out, by Tokutek. They develop a plugin for Mongo that promises not only transactions but also a boost in performance.
To get to the point: if transactional integrity is a must, then don't use MongoDB; use only components in the system that support transactions. It is extremely hard to build ACID-like functionality on top of non-ACID-compliant components. Depending on the individual use cases, it may make sense to separate actions into transactional and non-transactional actions in some way...
Now what's the problem with that? MongoDB can do atomic updates only on one document. In the previous flow it could happen that some kind of error creeps in and the message gets stored in the database but the user's balance is not reduced and/or the transaction is not logged.
This is not really a problem. The error you mention is either a logical error (a bug) or an I/O error (network or disk failure). Such errors can leave both transactionless and transactional stores in an inconsistent state. For example, if the SMS has already been sent but an error occurs while storing the message, the SMS sending can't be rolled back, which means it won't be logged, the user balance won't be reduced, etc.
The real problem here is that the user can take advantage of a race condition and send more messages than his balance allows. This also applies to an RDBMS, unless you do the SMS sending inside a transaction with locking on the balance field (which would be a significant bottleneck). A possible solution for MongoDB is to use findAndModify first to reduce the balance and check the result: if it's negative, disallow sending and refund the amount (an atomic increment). If it's positive, continue sending and, if sending fails, refund the amount. A balance history collection can also be maintained to help fix/verify the balance field.
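A minimal sketch of that findAndModify idea, here using the MongoDB .NET/C# driver (the collection, field names, and method name are assumptions for illustration):

using MongoDB.Bson;
using MongoDB.Driver;

// Hypothetical helper: atomically charge "cost" credits to the user, refunding if the balance would go negative.
static bool TryChargeUser(IMongoCollection<BsonDocument> usersCollection, ObjectId userId, int cost)
{
    var filter = Builders<BsonDocument>.Filter.Eq("_id", userId);
    var decrement = Builders<BsonDocument>.Update.Inc("balance", -cost);
    var options = new FindOneAndUpdateOptions<BsonDocument>
    {
        ReturnDocument = ReturnDocument.After // return the document as it is after the $inc
    };

    // Atomically decrement the balance and read the new value in one operation.
    var updated = usersCollection.FindOneAndUpdate(filter, decrement, options);

    if (updated == null || updated["balance"].ToInt32() < 0)
    {
        // Balance went negative (or the user doesn't exist): refund and deny the send.
        if (updated != null)
            usersCollection.UpdateOne(filter, Builders<BsonDocument>.Update.Inc("balance", cost));
        return false;
    }

    return true; // proceed with sending; refund the same way if sending later fails
}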
The project is simple, but you have to support transactions for payments, which makes the whole thing difficult. A complex portal system with hundreds of collections (forum, chat, ads, etc.) is, in some respects, simpler, because if you lose a forum or chat entry, nobody really cares. If, on the other hand, you lose a payment transaction, that's a serious issue.
So, if you really want a pilot project for MongoDB, choose one which is simple in that respect.
Transactions are absent in MongoDB for valid reasons. This is one of those things that make MongoDB faster.
In your case, if transactions are a must, MongoDB doesn't seem like a good fit.
Maybe an RDBMS + MongoDB would work, but that adds complexity and makes the application harder to manage and support.
This is probably the best blog post I have found on implementing transaction-like features for MongoDB:
Syncing Flag: best for just copying data over from a master document
Job Queue: very general purpose, solves 95% of cases. Most systems need to have at least one job queue around anyway!
Two Phase Commit: this technique ensures that each entity always has all the information needed to get to a consistent state
Log Reconciliation: the most robust technique, ideal for financial systems
Versioning: provides isolation and supports complex structures
Read this for more info: https://dzone.com/articles/how-implement-robust-and
This is late, but I think it will help in the future. I use Redis to build a queue to solve this problem.
Requirement:
The image below shows 2 actions that need to execute concurrently, but phase 2 and phase 3 of action 1 need to finish before phase 2 of action 2 starts, or the other way around (a phase can be a REST API request, a database request, or JavaScript code execution...).
How a queue helps you
The queue makes sure that every block of code between lock() and release(), across the different functions, will not run at the same time; the blocks stay isolated from each other.
function action1() {
    phase1();
    queue.lock("action_domain");
    phase2();
    phase3();
    queue.release("action_domain");
}

function action2() {
    phase1();
    queue.lock("action_domain");
    phase2();
    queue.release("action_domain");
}
How to build a queue
I will only focus on how to avoid the race condition part when building a queue on the backend side. If you don't know the basic idea of a queue, look here.
The code below only shows the concept; you need to implement it properly.
function lock() {
    if (isRunning()) {
        addIsolateCodeToQueue(); // use a callback, delegate, function pointer... depending on your language
    } else {
        setStateToRunning();
        pickOneAndExecute();
    }
}

function release() {
    setStateToRelease();
    pickOneAndExecute();
}
But you need isRunning(), setStateToRelease(), and setStateToRunning() to be isolated themselves, or else you face a race condition again. To do this I chose Redis, for its atomicity and scalability.
The Redis documentation says this about its transactions:
All the commands in a transaction are serialized and executed sequentially. It can never happen that a request issued by another client is served in the middle of the execution of a Redis transaction. This guarantees that the commands are executed as a single isolated operation.
P.S.:
I use Redis because my service already uses it; you can use any other mechanism that supports isolation to do this.
The action_domain in my code above is for the case where you only want action 1 called by user A to block action 2 of user A, without blocking other users. The idea is to use a unique lock key for each user.
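If you are on .NET, a minimal sketch of the same per-user isolation idea using StackExchange.Redis might look like the code below. It uses a simple lock-with-retry instead of a full job queue, and the key format, token, timeouts, and class names are assumptions for illustration.

using System;
using System.Threading;
using StackExchange.Redis;

// Minimal sketch (assumed names and timeouts); not production-ready error handling.
class ActionLock
{
    private readonly IDatabase _db;

    public ActionLock(IDatabase db) => _db = db;

    public void RunIsolated(string userId, Action criticalSection)
    {
        string key = $"lock:action_domain:{userId}"; // one lock per user, as described above
        string token = Guid.NewGuid().ToString();    // identifies this holder so only it can release

        // LockTake is an atomic SET NX with an expiry, so two callers can never both acquire it.
        while (!_db.LockTake(key, token, TimeSpan.FromSeconds(30)))
        {
            Thread.Sleep(50); // crude retry; a real queue would enqueue the work instead
        }

        try
        {
            criticalSection(); // e.g. phase2() / phase3()
        }
        finally
        {
            _db.LockRelease(key, token); // released only by the holder that took it
        }
    }
}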
Transactions are now available in MongoDB 4.0. Here is a sample:
// Runs the txnFunc and retries if TransientTransactionError encountered
function runTransactionWithRetry(txnFunc, session) {
    while (true) {
        try {
            txnFunc(session); // performs transaction
            break;
        } catch (error) {
            // If transient error, retry the whole transaction
            if (error.hasOwnProperty("errorLabels") && error.errorLabels.includes("TransientTransactionError")) {
                print("TransientTransactionError, retrying transaction ...");
                continue;
            } else {
                throw error;
            }
        }
    }
}

// Retries commit if UnknownTransactionCommitResult encountered
function commitWithRetry(session) {
    while (true) {
        try {
            session.commitTransaction(); // Uses write concern set at transaction start.
            print("Transaction committed.");
            break;
        } catch (error) {
            // Can retry commit
            if (error.hasOwnProperty("errorLabels") && error.errorLabels.includes("UnknownTransactionCommitResult")) {
                print("UnknownTransactionCommitResult, retrying commit operation ...");
                continue;
            } else {
                print("Error during commit ...");
                throw error;
            }
        }
    }
}

// Updates two collections in a transaction
function updateEmployeeInfo(session) {
    employeesCollection = session.getDatabase("hr").employees;
    eventsCollection = session.getDatabase("reporting").events;

    session.startTransaction({ readConcern: { level: "snapshot" }, writeConcern: { w: "majority" } });

    try {
        employeesCollection.updateOne({ employee: 3 }, { $set: { status: "Inactive" } });
        eventsCollection.insertOne({ employee: 3, status: { new: "Inactive", old: "Active" } });
    } catch (error) {
        print("Caught exception during transaction, aborting.");
        session.abortTransaction();
        throw error;
    }

    commitWithRetry(session);
}
// Start a session.
session = db.getMongo().startSession({ readPreference: { mode: "primary" } });

try {
    runTransactionWithRetry(updateEmployeeInfo, session);
} catch (error) {
    // Do something with error
} finally {
    session.endSession();
}
I have heard that when committing transactions using an Entity Manager, it is good practice to try again if the commit fails, since it may be an issue where the object was changed while the transaction was processing.
Does this seem like a proper retry implementation?
int loopCount = 1;
boolean transactionCommitted = false;
while (!transactionCommitted && loopCount < 3) {
    EntityManager em = EMF.getInstance().getEntityManager();
    EntityTransaction tx = em.getTransaction();
    try {
        tx.begin();
        Player playerToEdit = em.find(Player.class, id);
        playerToEdit.setLastName(lastName);
        tx.commit();
        transactionCommitted = true;
    } catch (Exception e) {
        if (loopCount == 2) {
            // throw an exception, retry already occurred?
        }
    } finally {
        if (tx.isActive()) {
            tx.rollback();
        }
        em.close();
    }
    loopCount++;
}
As you are catching "Exception" in your catch block, you retry the update even in situations where the update can never succeed. For example, if the entity can't be found in the database, you try to find it twice.
You should catch the more specific exception, "OptimisticLockException". This exception is thrown when the version of the entity doesn't match the version stored in the database. A version field in the entity is required to implement this locking strategy.
There are other locking strategies for highly concurrent applications, but most of the time an optimistic locking strategy is the most appropriate.
As a small detail, using a constant for the number of retries instead of a "magic number" improves code readability and makes it easier to change the number of retries in the future.
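The same retry pattern, expressed as a hedged C#/Entity Framework 6 sketch rather than JPA (this is not the poster's code): retry only on the optimistic-concurrency exception and use a named constant instead of a magic number. MyContext, Players, and the [Timestamp] row-version column are assumptions for illustration.

using System;
using System.Data.Entity.Infrastructure; // DbUpdateConcurrencyException (EF6)

static class PlayerUpdater
{
    private const int MaxRetries = 3; // named constant instead of a "magic number"

    // Assumes a hypothetical MyContext with a Players DbSet whose entity carries a
    // [Timestamp] row-version column, so SaveChanges can detect concurrent updates.
    public static void RenamePlayer(int id, string lastName)
    {
        for (int attempt = 1; attempt <= MaxRetries; attempt++)
        {
            using (var context = new MyContext())
            {
                var player = context.Players.Find(id);
                if (player == null)
                    throw new InvalidOperationException("Player not found"); // no point retrying this

                player.LastName = lastName;
                try
                {
                    context.SaveChanges(); // throws if another user changed the row in the meantime
                    return;
                }
                catch (DbUpdateConcurrencyException)
                {
                    if (attempt == MaxRetries)
                        throw; // give up and surface the conflict to the caller
                    // Otherwise loop: the next iteration re-reads the current state and reapplies the change.
                }
            }
        }
    }
}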
Generally it is a bad idea to try to submit the same changes again. Even if the thrown exception is an OptimisticLockException, retrying is not a good idea, as it could mean one user silently overwrites changes another user has made. Imagine the following scenario:
User 1 changes entityX and commits it.
User 2 changes some fields of the same entityX and tries to commit it. The EntityManager throws an exception.
The correct approach here is to surface the exception to the user, so that he re-reads the entity and then tries the same modifications again.
And now the most important argument why this is dangerous:
Hibernate, at least, is known to do BAD things if you try to reuse the EntityManager after it has thrown an exception. Before your data gets corrupted or your application stops working as you want, take a look at this article or this one.
Our e-commerce application, built on ATG, allows multiple users to update the same Order. Since the cache mode for Order is simple, this has resulted in a large number of ConcurrentUpdateException and InvalidVersionException errors. We were considering locked cache mode, but we are skeptical about using locked caching, as Orders are updated very frequently and locking might result in deadlocks and has its own performance implications.
Is there a way we can continue using simple cache mode and minimize the occurrences of ConcurrentUpdateException and InvalidVersionException?
My experience has been that you have to use locked caching with orders on any medium- to high-volume ATG website. Also, remember that the end-user experience is bad when this happens, as the user either gets an error message (if the error handling is good) or something like an "internal server error".
The reasons I believe you need to use locked caching for orders are:
You can't guarantee that a user has not got multiple sessions open at the same time which are updating the shopping cart (which is just an incomplete Order). I have also seen examples where customers share their logins with family members etc and then wonder why all these items keep magically appearing in their shopping cart.
There are a number of processes which update the order including things like scenarios and customer service agents using the CSC module.
You could have code which updates orders in a non-safe way.
Some things which might help include:
Always use the OrderManager to load/update an order. Sounds obvious but I have seen a lot of updating orders via the repository.
Make sure that any updates are inside a transaction block.
Try to consolidate any background processes which might update orders to run on a small subset of your ATG instances (this will help reduce concurrency)
The ATG help has this to say about it:
A multi-server application might require locked caching, where only one Oracle ATG Web Commerce instance at a time has write access to the cached data of a given item type. You can use locked caching to prevent multiple servers from trying to update the same item simultaneously—for example, Commerce order items, which can be updated by customers on an external-facing server and by customer service agents on an internal-facing server. By restricting write access, locked caching ensures a consistent view of cached data among all Oracle ATG Web Commerce instances.
That said, converting to locked caching will almost certainly require performance testing and tuning of the order repository caches. It can and does result in deadlocks (I have seen that many times), but if configured correctly the deadlocks are infrequent.
Not sure what version of ATG you are using but for 10.2 there is a good explanation here of how you can get everything "in sync".
There is actually a best-practices approach that was recommended in the legacy ATG Community a long time ago. I'm pasting it here.
When you are using the Order object with synchronization and transactions, there is a specific usage pattern that is critical to follow. Not following the expected pattern can lead to unnecessary ConcurrentUpdateExceptions, InvalidVersionExceptions, and deadlocks. The following sequence must be strictly adhered to in your code:
1. Obtain local-lock on profile ID.
2. Begin Transaction.
3. Synchronize on Order.
4. Perform ALL modifications to the order object.
5. Call OrderManager.updateOrder.
6. End Synchronization.
7. End Transaction.
8. Release local-lock on profile ID.
Steps 1, 2, 7, 8 are done for you in the beforeSet() and afterSet() methods for ATG form handlers where order updates are expected. These include form handlers that extend PurchaseProcessFormHandler and OrderModifierFormHandler (deprecated). If your code accesses/modifies the order outside of a PurchaseProcessFormHandler, it will likely need to obtain the local-lock manually. The lock fetching can be done using the TransactionLockService.
So, if you have extended an ATG form handler based on PurchaseProcessFormHandler, and have written custom code in a handleXXX() method that updates an order, your code should look like:
synchronized( order )
{
    // Do order updates
    orderManager.updateOrder( order );
}
If you have written custom code updating an order outside of a PurchaseProcessFormHandler (e.g. CouponFormHandler, droplet, pipeline servlet, fulfillment-related), your code should look like:
ClientLockManager lockManager = getLocalLockManager(); // Should be configured as /atg/commerce/order/LocalLockManager
boolean acquireLock = false;
try
{
    acquireLock = !lockManager.hasWriteLock( profileId, Thread.currentThread() );
    if ( acquireLock )
        lockManager.acquireWriteLock( profileId, Thread.currentThread() );

    TransactionDemarcation td = new TransactionDemarcation();
    td.begin( transactionManager );
    boolean shouldRollback = false;
    try
    {
        synchronized( order )
        {
            // do order updates
            orderManager.updateOrder( order );
        }
    }
    catch ( ... e )
    {
        shouldRollback = true;
        throw e;
    }
    finally
    {
        try
        {
            td.end( shouldRollback );
        }
        catch ( Throwable th )
        {
            logError( th );
        }
    }
}
finally
{
    try
    {
        if ( acquireLock )
            lockManager.releaseWriteLock( profileId, Thread.currentThread(), true );
    }
    catch ( Throwable th )
    {
        logError( th );
    }
}
This pattern is only useful to prevent ConcurrentUpdateExceptions, InvalidVersionExceptions, and deadlocks when multiple threads attempt to update the same order on the same ATG instance. This should be adequate for most situations on a commerce site since session stickiness will confine updates to the same order to the same ATG instance.
I have two apps: one is an ASP.NET app and the other is a Windows service running in the background.
The Windows service performs some tasks (reads and updates) on the database, while users can perform other operations on the database through the ASP.NET app. So I am worried about the following: in the Windows service I collect some records that satisfy a condition and then I iterate over them, something like:
IQueryable<EntityA> collection = context.EntitiesA.Where(<condition>);
foreach (EntityA entity in collection)
{
    // do some stuff
}
So, if a user modifies a record that is used later in the loop iteration, which value for that record does EF take into account? The original one, retrieved when this was executed:
context.EntitiesA.Where(<condition>)
or the new one, modified by the user and located in the database?
As far as I know, during iteration EF takes each record on demand, I mean one by one, so when reading the next record for the next iteration, does that record correspond to the one collected from:
context.EntitiesA.Where(<condition>)
or the one located in the database (the one the user has just modified)?
Thanks!
There are a couple of processes that come into play here in terms of how this will work in EF.
Queries are only executed on enumeration (this is sometimes referred to as query materialisation); at that point the whole query will be performed.
Lazy loading only affects navigation properties in your example above. The result set of the Where statement will be pulled down in one go.
So what does this mean in your case:
// Nothing happens here; you are just describing what will happen later.
// To make the query execute here, do a .ToArray() or similar; to prevent people
// from adding to the SQL resulting from this, use .AsEnumerable().
IQueryable<EntityA> collection = context.EntitiesA.Where(<condition>);

// When it first hits this foreach, a
// SELECT {cols} FROM [YourTable] WHERE [YourCondition] will be performed.
foreach (EntityA entity in collection)
{
    // Data here will be from the point in time the foreach started (e.g. if the database
    // was updated during the enumeration, you will have out-of-date data).
    // do some stuff
}
If you're truly concerned that this can happen, then get a list of IDs up front and process them individually with a new DbContext for each (or, say, one per batch of 10). Something like:
IList<int> collection = context.EntitiesA.Where(...).Select(k => k.id).ToList();
foreach (int entityId in collection)
{
    using (Context context = new Context())
    {
        EntityA entity = context.EntitiesA.Find(entityId);
        // do some stuff
        context.SaveChanges();
    }
}
I think the answer to your question is 'it depends'. The problem you are describing is called a 'non-repeatable read' and can be prevented by setting a proper transaction isolation level. But that comes at a cost in performance and potential deadlocks.
For more details you can read this
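As a hedged sketch of that suggestion (the isolation level, the scope placement, and the Context/EntitiesA names reused from the question are assumptions), raising the isolation level around the query and the loop with System.Transactions might look like this:

using System.Linq;
using System.Transactions;

// Wrap the query and the iteration in one transaction with a stricter isolation level,
// so rows read at the start cannot be changed underneath the loop by the ASP.NET app.
var options = new TransactionOptions
{
    IsolationLevel = IsolationLevel.RepeatableRead, // or Snapshot/Serializable, depending on needs
    Timeout = TransactionManager.DefaultTimeout
};

using (var scope = new TransactionScope(TransactionScopeOption.Required, options))
using (var context = new Context()) // hypothetical DbContext from the question
{
    foreach (EntityA entity in context.EntitiesA.Where(e => /* condition */ true))
    {
        // do some stuff; reads stay consistent for the lifetime of the scope
    }

    context.SaveChanges();
    scope.Complete(); // commit; disposing the scope without Complete() rolls everything back
}

Holding locks for the whole loop is exactly the performance and deadlock trade-off mentioned above, so keep the work inside the scope as short as possible.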
I'm using Grails as a poor man's ETL tool for migrating some relatively small DB objects from one DB to the next. I have a controller that reads data from one DB (MySQL) and writes it into another (PostgreSQL). They use similar domain objects, but not exactly the same ones, due to limitations in the multi-datasource support in Grails 2.1.X.
Below you'll see my controller and service code:
class GeoETLController {

    def zipcodeService

    def migrateZipCode() {
        def zc = zipcodeService.readMysql();
        zipcodeService.writePgSql(zc);
        render(["success": true] as JSON)
    }
}
And the service:
class ZipcodeService {

    def sessionFactory
    def propertyInstanceMap = org.codehaus.groovy.grails.plugins.DomainClassGrailsPlugin.PROPERTY_INSTANCE_MAP

    def readMysql() {
        def zipcode_mysql = Zipcode.list();
        println("read, " + zipcode_mysql.size());
        return zipcode_mysql;
    }

    def writePgSql(zipcodes) {
        List<PGZipcode> zips = new ArrayList<PGZipcode>();
        println("attempting to save, " + zipcodes.size());
        def cntr = 0;
        zipcodes.each({ Zipcode zipcode ->
            cntr++;
            def props = zipcode.properties;
            PGZipcode zipcode_pg = new PGZipcode(zipcode.properties);
            if (!zipcode_pg.save(flush: false)) {
                zipcode_pg.errors.each {
                    println it
                }
            }
            zips.add(zipcode_pg)
            if (zips.size() % 100 == 0) {
                println("gorm begin" + new Date());
                // clear session here.
                this.cleanUpGorm();
                println("gorm complete" + new Date());
            }
        });
        // Save remaining
        this.cleanUpGorm();
        println("Final ." + new Date());
    }

    def cleanUpGorm() {
        def session = sessionFactory.currentSession
        session.flush()
        session.clear()
        propertyInstanceMap.get().clear()
    }
}
Much of this is taken from my own code and then tweaked to try and get similar performance as seen in http://naleid.com/blog/2009/10/01/batch-import-performance-with-grails-and-mysql/
So, in reviewing my code, whenever zipcode_pg.save() is invoked, an insert statement is created and sent to the database. Good for db consistency, bad for bulk operations.
What is the cause of my instant flushes (note: my DataSource and Config groovy files have NO relevant changes)? At this rate, it takes about 7 seconds to process each batch of 100 (roughly 14 inserts per second), which, when you are dealing with tens of thousands of rows, is just a long time...
Appreciate the suggestions.
NOTE: I considered using a pure ETL tool, but with so much domain and service logic already built, I figured using Grails would be a good reuse of resources. However, I didn't imagine this quality of bulk operations.
Without seeing your domain objects, this is just a hunch, but I might try specifying validate:false as well in your save() call. Validate() is called by save(), unless you tell Grails not to do that. For example, if you have a unique constraint on any field in your PGZipcode domain object, Hibernate has to do an insert on every new record to leverage the DBMS's unique function and perform a proper validation. Other constraints might require DBMS queries as well, but only unique jumps to mind right now.
From Grails Persistence: Transaction Write-Behind
Hibernate caches database updates where possible, only actually pushing the changes when it knows that a flush is required, or when a flush is triggered programmatically. One common case where Hibernate will flush cached updates is when performing queries since the cached information might be included in the query results. But as long as you're doing non-conflicting saves, updates, and deletes, they'll be batched until the session is flushed.
Alternatively, you might try setting the Hibernate session's flush mode explicitly:
sessionFactory.currentSession.setFlushMode(FlushMode.MANUAL);
I'm under the impression the default flush mode might be AUTO.