Do JPA's flush and JDBC batch inserts work the same internally?

As per my understanding, the flush() method of JPA's EntityManager syncs the data in the persistence context with the database in a single DB network call. It would thus avoid multiple DB calls when somebody tries to persist a large number of records. Why can't I consider this a batch equivalent of a JDBC batch insert (I know flush() may not be implemented for that purpose)? After all, a JDBC batch insert works with the same idea: it makes only a single DB call for all the statements added to the statement object.
From a performance point of view, are the two comparable? Do they work with the same technique? Internally, on the database side, will both generate the same number of queries?
Could somebody please explain the difference?

the flush() method of JPA's EntityManager syncs the data in the persistence context with the database in a single DB network call
No, not at all. That isn't possible. A flush could delete from several tables, insert into several tables, and update several tables. That can't be done in a single network call.
A flush can, however, use JDBC batch statements to execute multiple similar inserts or updates.
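For illustration, a minimal sketch of that combination, assuming Hibernate as the JPA provider with JDBC batching enabled (the persistence unit name and the Person entity are made up):

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.EntityTransaction;
import javax.persistence.Persistence;

public class BulkInsertSketch {

    // should match the hibernate.jdbc.batch_size property, e.g. in persistence.xml:
    // <property name="hibernate.jdbc.batch_size" value="50"/>
    private static final int BATCH_SIZE = 50;

    public static void main(String[] args) {
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("examplePU"); // hypothetical unit
        EntityManager em = emf.createEntityManager();
        EntityTransaction tx = em.getTransaction();
        tx.begin();
        for (int i = 1; i <= 100_000; i++) {
            em.persist(new Person("name-" + i)); // Person is a hypothetical entity
            if (i % BATCH_SIZE == 0) {
                em.flush();  // sends the pending inserts, grouped into JDBC batches
                em.clear();  // detaches them so the persistence context does not grow unbounded
            }
        }
        tx.commit();
        em.close();
        emf.close();
    }
}

Even then, a flush is not a single network call: each group of similar statements becomes its own JDBC batch, and each batch is its own round trip.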


Is optimistic locking equivalent to Select For Update?

It is my first time using EF Core and DDD concepts. Our database is Microsoft SQL Server. We use optimistic concurrency based on the RowVersion for user requests; this handles concurrent reads and writes by users.
With the DDD paradigm, user changes are not written directly to the database, nor is the logic handled in the database with a stored procedure. It is a three-step process:
get aggregate from repository that pulls it from the database
update aggregate through domain commands that implement business logic
save aggregate back to repository that writes it to the database
The separation of read and write in the application logic can lead again to race conditions between parallel commands.
Since the time between read and write in the backend is normally fairly short, those race conditions can be handled with optimistic and also pessimistic locking.
To my understanding, optimistic concurrency using RowVersion is sufficient for the lost update problem, but not for write skew, as shown in Martin Kleppmann's book "Designing Data-Intensive Applications". That would require locking the read records.
To prevent write skew a common solution is to lock the records in step 1 with FOR UPDATE or in SQL Server with the hints UPDLOCK and HOLDLOCK.
EF Core supports neither FOR UPDATE nor SQL Server's WITH table hints.
If I'm not able to lock records with EF Core does it mean there is no way to prevent write skew except using Raw SQL or Stored Procedures?
If I use RowVersion, I first check the RowVersion after getting the aggregate from the database. If it doesn't match, I can fail fast. If it matches, it is checked through EF Core in step 3 when updating the database. Is this pattern sufficient to eliminate all race conditions except write skew?
Since the write skew race condition occurs when the read and the write are on different records, it seems that a transaction that makes a decision based on a read could always be added later during development. In a complex system I would not feel safe if it is not just simple CRUD access. Is there another solution when using EF Core to prevent write skew without locking the records for update?
If you tell EF Core about the RowVersion attribute, it will use it in any update statement. But you have to be careful to preserve the RowVersion value from your original data retrieval. The usual work pattern would retrieve the data, let the user potentially edit it, and then save it. When the user saves, you would normally have EF retrieve the entity again, apply the user's changes to it, and save the updates. EF uses the RowVersion in a WHERE clause to ensure nothing has changed since you read the data. This is the tricky part: you want to make sure the RowVersion used there is still the one from your initial data retrieval, not from the second retrieval used to update the entity before saving.
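This thread is about EF Core, but since the surrounding questions are JPA-centric, here is the analogous idea expressed with JPA's @Version, purely as an illustrative sketch (the Account entity, its fields and the repository are made up): the version value from the initial read travels with the user's edit, and a stale version makes the update fail.

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Entity
class Account {                  // hypothetical aggregate root
    @Id Long id;
    String owner;
    @Version long version;       // rough analog of SQL Server's RowVersion
}

interface AccountRepository extends JpaRepository<Account, Long> { }

@Service
class AccountService {

    private final AccountRepository repository;

    AccountService(AccountRepository repository) {
        this.repository = repository;
    }

    // versionSeenByUser is the version from the *initial* read (the one the
    // user's edit was based on), not from the re-read done below.
    @Transactional
    public void applyUserEdit(Long id, String newOwner, long versionSeenByUser) {
        Account account = repository.findById(id).orElseThrow();
        if (account.version != versionSeenByUser) {
            // fail fast: the row changed between the user's read and this update
            throw new IllegalStateException("concurrent modification");
        }
        account.owner = newOwner;
        // If another transaction slips in after this check, the version column in the
        // generated UPDATE's WHERE clause no longer matches, and the flush fails with
        // an optimistic locking exception.
    }
}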

Handle SQL exception for large data insert

I have a Spring 2.5 application that takes a large (275K) file and parses it. Each record is then inserted into a Postgres DB. There is a unique column (not the primary key/@Id) that will kick out the attempted record insert. This results in a DataIntegrityViolationException, which seems natural enough.
The problem I have is that this kills the process. Is there a good way to continue processing the entire file and just log the exception and move on to the next record to insert? I tried wrapping the repository.save(record) in a try/catch, but it still kills the process with a transaction rollback.
A ConstraintViolationException will be wrapped in a PersistenceException, and Hibernate will generally mark the transaction for rollback - even if the exception was registered not to cause a rollback at the Spring transaction-handling level, e.g. via @Transactional(noRollbackFor = PersistenceException.class).
So there needs to be a different solution. Some ideas:
explicitly check whether a corresponding row is already present (one additional select per item)
try every insert in a dedicated transaction, e.g. by annotating a corresponding service method with @Transactional(propagation = Propagation.REQUIRES_NEW) (one additional transaction per item; see the sketch below)
handle the constraint violation in a custom DB statement (e.g. ON CONFLICT DO NOTHING / other "upsert" / "merge" behavior the DB offers)
The 1st and the 2nd option should offer some potential for parallelization, since the selects / inserts can be issued independently of each other and there is no need to wait for unrelated DB round trips.
The 3rd option could be the fastest, as it requires no selects and the fewest DB round trips, and statements could be batched; however, it probably also needs the most custom setup: Spring JPA bulk upserts is slow (1,000 entities took 20 seconds). (Reporting back how many or even which entities were actually inserted would likely increase the complexity further: How can I get the INSERTED and UPDATED rows for an UPSERT operation in postgres.)
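A rough sketch of the 2nd option, with made-up entity and repository names (ImportRecord is assumed to be a JPA entity and ImportRecordRepository a Spring Data JpaRepository): the caller loops over the parsed records and logs failures instead of letting one duplicate roll back the whole import.

import org.springframework.dao.DataIntegrityViolationException;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

@Service
class SingleRecordInserter {

    private final ImportRecordRepository repository;

    SingleRecordInserter(ImportRecordRepository repository) {
        this.repository = repository;
    }

    // Each insert runs in its own transaction, so a constraint violation only
    // rolls back this single record, not the surrounding import. saveAndFlush
    // forces the INSERT here, so the violation surfaces inside this method
    // (translated to DataIntegrityViolationException) rather than at some later commit.
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void insertOne(ImportRecord record) {
        repository.saveAndFlush(record);
    }
}

@Service
class FileImportService {

    private final SingleRecordInserter inserter;

    FileImportService(SingleRecordInserter inserter) {
        this.inserter = inserter;
    }

    // Note: insertOne must be called through the Spring proxy (i.e. on another
    // bean, as here), otherwise REQUIRES_NEW has no effect.
    public void importAll(Iterable<ImportRecord> records) {
        for (ImportRecord record : records) {
            try {
                inserter.insertOne(record);
            } catch (DataIntegrityViolationException e) {
                // duplicate (or other constraint violation): log and continue
                System.err.println("Skipping record: " + e.getMessage());
            }
        }
    }
}

The 3rd option would instead push the conflict handling into the statement itself, e.g. a native INSERT ... ON CONFLICT DO NOTHING, which avoids the per-item transaction overhead.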

EF consistency between two or more reads

On this page in Microsoft's documentation on EF it is stated literally:
Entity Framework does not wrap queries in a transaction
If I am right, this means that SQL reads are not wrapped in transactions and thus every select in our code is executed independently. But if this is so, can we ensure that two reads are consistent with each other? In the typical scenario, is there a guarantee that the sum of the loaded amount of A and the loaded amount of B will be right (in some connection) if a transfer between A and B is started (in a different connection) between the read of A and the read of B? Would Entity Framework be able to solve this case in some way?
The built-in solution in EF is client-side optimistic concurrency. On update EF will build a query that ensures that the row to be updated has not been changed since it was read.
Properties configured as concurrency tokens are used to implement optimistic concurrency control: whenever an update or delete operation is performed during SaveChanges, the value of the concurrency token on the database is compared against the original value read by EF Core. If the values match, the operation can complete. If the values do not match, EF Core assumes that another user has performed a conflicting operation and aborts the current transaction.
You can also opt in to Transactions at whatever isolation level you choose, which may provide similar protections. Or use Raw SQL queries with lock hints for your target database.

Is it possible to configure Hibernate to flush only but never commit (a kind of commit simulation)?

I need to migrate from an old PostgreSQL database with an old schema (58 tables) to a new database with a new schema (40 tables). The two schemas are completely different.
It is not a simple migration (copy and paste), but rather copy-transform-paste.
I decided to write a batch job using Spring Batch, Spring Data and JPA. So I have two DataSources and a chained transaction. My Spring config is mainly made up of a chunk task with a JpaPagingItemReader and an ItemWriterAdapter.
For performance reasons I also configured a Partitioner, which allows me to partition my source tables into several sub-tables, and a chunkSize of 500000.
Everything works smoothly, but considering the size of my old tables it takes me a week to migrate all the data.
I want to run a test that consists of running my batch without committing: Hibernate would generate all the SQL statements (into a ".sql" file) but not commit the data to the database.
This would allow me to see whether the commit is costly in execution time.
Is it possible to configure Hibernate to flush only but never commit? A kind of commit simulation?
Thanks.
Usually, the costly part is foreign key and unique key checks as well as index maintenance, but since you don't write how you fetch data, it could very well be the case that you are accessing your data in an inefficient manner.
In general, I would recommend creating a dump with pg_dump, restoring that, and then trying to do the migration in an SQL-only way. This way, no data has to flow around but can stay on the machine, which is generally much more efficient.
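If you still want to measure the literal "flush but never commit" scenario, one way to approximate it, assuming a TransactionTemplate backed by the same JPA transaction manager is available (a sketch, not a drop-in solution for the Spring Batch setup above), is to mark the transaction rollback-only after the flush:

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.stereotype.Component;
import org.springframework.transaction.support.TransactionTemplate;

@Component
public class DryRunWriter {

    @PersistenceContext
    private EntityManager entityManager;

    private final TransactionTemplate transactionTemplate;

    public DryRunWriter(TransactionTemplate transactionTemplate) {
        this.transactionTemplate = transactionTemplate;
    }

    public void writeWithoutCommit(Iterable<?> items) {
        transactionTemplate.execute(status -> {
            for (Object item : items) {
                entityManager.persist(item);
            }
            // flush() makes Hibernate generate and execute all INSERT statements;
            // enabling SQL logging (e.g. for org.hibernate.SQL) captures them.
            entityManager.flush();
            // Rollback-only: constraint checks and index maintenance still happen
            // on the database, but the data is discarded instead of committed.
            status.setRollbackOnly();
            return null;
        });
    }
}

Keep in mind that under rollback the database still does most of the insert work (constraint checks, index maintenance), so what this isolates is essentially just the cost of the final commit.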

Why does fetching from a DB cursor always return the same result set with MyBatis and Spring transactions?

My setup is a Postgres database connected via the JDBC driver to a Tomcat server (which is responsible for connection pooling), which in turn serves this data source via JNDI to a Spring application.
In the java application I use MyBatis and MyBatis-Spring for querying the database.
Now I want to page through a table using a cursor as shown in this simple example http://www.postgresql.org/docs/9.3/static/sql-fetch.html.
Since a cursor needs to run within a DB transaction, I annotated the relevant method with Spring's @Transactional annotation, backed by a DataSourceTransactionManager (see http://mybatis.github.io/spring/transactions.html).
This is where the crazy part starts. At runtime, every FETCH FORWARD 1000 FROM CURSOR query issued by the MyBatis mapper returns one and the same result set. It seems the cursor position gets rolled back on every call, so it returns the first 1000 rows of the table every time.
Why do the following fetches not return the next chunks of records?
I figured out that MyBatis uses a caching mechanism which, in my eyes, isn't quite intelligent: https://mybatis.github.io/mybatis-3/configuration.html.
In fact, by default MyBatis caches all queries executed during a session, where session means a transaction or a connection. So with auto-commit this is no problem, but it is a problem with a cursor, where the fetch statement does not change within a transaction.
So once the first data had been fetched from the cursor, the result was cached in memory and no subsequent fetches were actually sent to the DB.
The solution is the following line in the mybatis-config.xml
<setting name="localCacheScope" value="STATEMENT"/>
So the local session cache is used just for statement execution; no data is shared between two different calls to the same SqlSession.
To me this seems like a bug, since the default caching scope makes no sense for DB cursors.
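If the MyBatis configuration is built programmatically (for example through mybatis-spring's SqlSessionFactoryBean, assuming a version that exposes setConfiguration) rather than via mybatis-config.xml, a sketch of the equivalent setting looks like this:

import javax.sql.DataSource;
import org.apache.ibatis.session.LocalCacheScope;
import org.mybatis.spring.SqlSessionFactoryBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MyBatisConfig {

    @Bean
    public SqlSessionFactoryBean sqlSessionFactory(DataSource dataSource) {
        SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
        factoryBean.setDataSource(dataSource);

        // Same effect as <setting name="localCacheScope" value="STATEMENT"/>:
        // the local cache lives only for a single statement execution, so each
        // FETCH FORWARD within the same transaction actually goes to the database.
        org.apache.ibatis.session.Configuration mybatisConfig =
                new org.apache.ibatis.session.Configuration();
        mybatisConfig.setLocalCacheScope(LocalCacheScope.STATEMENT);
        factoryBean.setConfiguration(mybatisConfig);

        return factoryBean;
    }
}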