Currently I'm working on a piece of software that consists of two parts.
One part is a kind of company-wide framework for data processing (a sort of self-written process engine). It uses JTA UserTransactions and calls a sub-processor that is written by our project.
Our "sub-processor" is an application in its own right. It uses container-managed persistence via JPA (WebSphere with OpenJPA).
A typical workflow is:
process engine loads process data -> starts user transaction -> calls sub-processor -> writes process data -> ends user transaction
We are now seeing the following incorrect behavior:
The user transaction is committed in the process engine and all the metadata of the process is stored in the DB, BUT the data the entity manager holds inside our sub-processor application is not written to the DB.
Is there some manual communication necessary to commit the content of our entity manager?
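For illustration, the intended pattern looks roughly like this (a heavily simplified sketch, not our real code; the class, method, and persistence unit names are invented):

// Sketch only: class, method, and persistence unit names are invented.
import javax.annotation.Resource;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import javax.persistence.PersistenceContext;
import javax.transaction.UserTransaction;

@Entity
class ProcessEntity {
    @Id
    Long id;
    String state;
}

// Our "sub-processor": container-managed persistence against a JTA persistence unit.
// Its EntityManager is expected to join the active JTA transaction automatically,
// so its changes should be flushed when the surrounding UserTransaction commits.
class SubProcessor {
    @PersistenceContext(unitName = "subProcessorPU")
    EntityManager em;

    void process(long entityId) {
        ProcessEntity entity = em.find(ProcessEntity.class, entityId);
        entity.state = "PROCESSED"; // expected to reach the DB at commit
    }
}

// The process engine: drives the workflow with a JTA UserTransaction.
class ProcessEngine {
    @Resource
    UserTransaction userTransaction;

    SubProcessor subProcessor; // provided by the container / our project

    void runProcess(long entityId) throws Exception {
        // process engine loads process data (omitted here)
        userTransaction.begin();            // start user transaction
        try {
            subProcessor.process(entityId); // sub-processor updates JPA entities
            // process engine writes its own process metadata (omitted here)
            userTransaction.commit();       // end user transaction
        } catch (Exception e) {
            userTransaction.rollback();
            throw e;
        }
    }
}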
The problem we observed had nothing to do with JTA or transactions.
We were trying to clear a BLOB column, which was not possible; I will create a separate question for that.
Why is the persistence context called a persistence context?
Is it called a context because it acts as a stepping stone until the object is permanently stored in the DB?
(Is "context" used in a dynamic sense, conveying movement from one place to another, rather than a static one?)
A context in computing terms is defined as follows:
In computer science, a task context is the minimal set of data used by a task that must be saved to allow a task to be interrupted, and later continued from the same point.
(Wikipedia: Context (computing))
A persistence context is a specific context relating to database persistence. Just like any other context, it stores the required state relating to database persistence.
Is it called a context because it acts as a stepping stone until the object is permanently stored in the DB?
JPA works with transactions; this can sometimes be hidden when using web frameworks that automatically begin and end a transaction for each HTTP request. The persistence context acts as a kind of cache during a transaction, storing anything read from the database. Any updates you make are also held in the context until the transaction finishes or you flush manually, at which point they are persisted to the database.
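As a rough illustration, here is a minimal sketch using an application-managed EntityManager (the Customer entity and the "demo-pu" persistence unit are invented for the example, and it assumes a Customer row with id 1 already exists):

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;

@Entity
class Customer {
    @Id
    Long id;
    String name;
}

public class PersistenceContextDemo {
    public static void main(String[] args) {
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("demo-pu");
        EntityManager em = emf.createEntityManager();

        em.getTransaction().begin();

        // The first find hits the database and puts the entity into the persistence context.
        Customer first = em.find(Customer.class, 1L);

        // A second find for the same id within the same context is served from the
        // context (it acts as a first-level cache), returning the very same instance.
        Customer second = em.find(Customer.class, 1L); // first == second

        // The change is only tracked by the persistence context for now...
        first.name = "Updated name";

        // ...and is written to the database when we flush or when the transaction commits.
        em.getTransaction().commit();

        em.close();
        emf.close();
    }
}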
I came across an old project using OpenJPA with DB2 running on WebSphere Liberty 18. In the persistence.xml file there is a persistence unit with the following declaration:
<persistence-unit name="my-pu" transaction-type="RESOURCE_LOCAL">
<provider>org.apache.openjpa.persistence.PersistenceProviderImpl</provider>
<jta-data-source>jdbc/my-data-source</jta-data-source>
</persistence-unit>
Given that we are using RESOURCE_LOCAL transactions and code that manually manages transactions is scattered throughout the whole application, shouldn't the data source be declared as "non-jta-data-source"? Interestingly, the application seems to work fine despite that. Any ideas why it works?
<non-jta-data-source> specifies a data source that refuses to enlist in JTA transactions. This means that if you do userTransaction.begin (or take advantage of any API by which the container starts a transaction for you) and you perform some operations on that data source (which is marked with transactional="false" in Liberty), those operations will not be part of the encompassing JTA transaction and can be committed or rolled back independently. It's definitely an advanced pattern, and if you don't know what you are doing, or temporarily forget that the data source doesn't enlist, you can end up writing code that corrupts your data.
At this point, you may be wondering why JPA even has such an option. I expect it isn't intended for the end user's usage of the JPA programming model at all, but is really for the JPA persistence provider (Hibernate/EclipseLink/OpenJPA) implementation. For example, consider the scenario where a JTA transaction is active on the thread and you perform an operation via JPA for which the persistence provider needs to generate a unique key for you, and the provider needs to run a database command to reserve the next block of unique keys. The provider can't just do that within your transaction, because you might end up rolling it back, and then the same block of unique keys could be given out twice and errors would occur. The persistence provider really needs to suspend your transaction, run its own transaction, and then resume yours.
In my opinion suspend/resume would have been the natural solution here, but the JTA spec doesn't provide a standard way to obtain the TransactionManager, so my guess is that the JPA spec invented its own solution for situations like this: requiring a data source that bypasses transaction enlistment as an alternative. A JPA provider can run its own transactional operations on the non-jta-data-source while your JTA transaction continues on, unimpacted by it.
You'll also notice with the example I chose that it doesn't apply to a number of paths through JPA. If your JPA entity is configured to have the database generate the unique keys instead, then the persistence provider doesn't need to perform its own database operations on a non-jta-data-source. If you aren't using JTA transactions, then the persistence provider doesn't need to worry about enlisting in your transaction because it can just use a different connection, so it doesn't need a non-jta-data-source there either.
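For comparison, the two styles of declaration typically look something like this (just a sketch; the unit names and the second JNDI name are made up):

<!-- JTA persistence unit: the provider enlists in container-managed transactions -->
<persistence-unit name="jta-pu" transaction-type="JTA">
    <provider>org.apache.openjpa.persistence.PersistenceProviderImpl</provider>
    <jta-data-source>jdbc/my-data-source</jta-data-source>
    <!-- optional: a data source the provider can use outside the JTA transaction,
         e.g. to reserve blocks of generated keys -->
    <non-jta-data-source>jdbc/my-non-jta-data-source</non-jta-data-source>
</persistence-unit>

<!-- RESOURCE_LOCAL persistence unit: the application manages EntityTransactions itself -->
<persistence-unit name="resource-local-pu" transaction-type="RESOURCE_LOCAL">
    <provider>org.apache.openjpa.persistence.PersistenceProviderImpl</provider>
    <non-jta-data-source>jdbc/my-data-source</non-jta-data-source>
</persistence-unit>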
I'm getting a PK violation exception when using an EF Core 2.1 DbContext in an Azure QueueTrigger function. My guess is that this is due to DbContext not being thread-safe and the Azure Function running different instances in parallel. I have read quite a few posts, but I can't find a good approach to solving this.
Here is my scenario (producer-consumer pattern):
I have a scheduled Azure Function that calls an API to get Projects from different external systems. To get all the required info for a project, I need to run different queries against other external services, so I'm decoupling this into another Azure Function: the scheduled function just queues one message per Project, such as "Sync Project ID 101".
Another QueueTrigger function fires every time a message is queued, which means different instances running in parallel. This function must gather all the data for a specific Project, and that means more calls to other external services / APIs to aggregate all the info about that Project. IMHO it's good to do it that way, as I can process multiple Projects in parallel and scale the function if I need to.
Once I have all this Project info, I want to persist it in a SQL DB using EF Core (and here comes the issue).
Project data includes the Users in the Project, and each user has a specific GUID as its PK (coming from the external system). That means I can have repeated user IDs across different function instances, and here is the problem: when I try to persist the user info in a SQL table, I can get a PK duplication exception, as multiple function instances may try to insert the same user at the same time (when instance A checks whether the user exists it gets false, but another instance B is actually adding that user, so when instance A tries the insert, it fails).
I guess I could lock the DbContext somehow, but I'm not sure that is a good idea, as I also have a website running queries against the SQL DB (read-only queries for now, but there could be updates in the future too).
Another idea could be to send the entire Project info to another queue / blob file and have another function, running in Singleton mode, insert the data into SQL.
I've created this project, which simplifies my scenario but is enough to reproduce the issue and understand the problem:
https://github.com/luismanez/queuetrigger-efcore-multithreading
Any other ideas or recommended approaches? (I'm open to changing the architecture if there's something better.)
Many thanks!
A "more easy" way could be to do some kind of upsert in the database. There is a sample of how to do that with EF Core: https://www.flexlabs.org/2018/02/adding-upsert-support-for-entity-framework-core
From the Spring Guides:
For starters, the @EnableBatchProcessing annotation adds many critical beans that support jobs and saves you a lot of leg work. This example uses a memory-based database (provided by @EnableBatchProcessing), meaning that when it's done, the data is gone.
How can I have the execution state backed by a database (or some other persistent store), so that if the application crashes, the job can resume from its previous state?
My solution until now has been to make my ItemReader a JdbcCursorItemReader that reads records from a table where column X is NULL, and my ItemWriter a JdbcBatchItemWriter that updates each record with data in column X, making it non-null (so that it won't be picked up on the next execution). However, this seems really hackish and I believe there's a more elegant way. Can anyone please shed some light?
When using the @EnableBatchProcessing annotation, if you provide a DataSource bean definition called dataSource, Spring Batch will use that database for the job repository store instead of the in-memory map. You can read more about this functionality in the documentation here: http://docs.spring.io/spring-batch/trunk/apidocs/org/springframework/batch/core/configuration/annotation/EnableBatchProcessing.html
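Roughly, a configuration along these lines should do it (a sketch only; the driver, URL, and credentials are placeholders, and the Spring Batch metadata tables need to exist in that database):

import javax.sql.DataSource;

import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

@Configuration
@EnableBatchProcessing
public class BatchConfiguration {

    // Because a DataSource bean named "dataSource" is available, Spring Batch
    // builds its JobRepository on top of this database instead of the in-memory
    // map, so job and step execution state survives a crash and a restarted job
    // can pick up from its previous state.
    @Bean
    public DataSource dataSource() {
        DriverManagerDataSource ds = new DriverManagerDataSource();
        ds.setDriverClassName("org.postgresql.Driver");         // placeholder driver
        ds.setUrl("jdbc:postgresql://localhost:5432/batchdb");  // placeholder URL
        ds.setUsername("batch");                                // placeholder credentials
        ds.setPassword("secret");
        return ds;
    }

    // Job and step beans are defined here as usual.
}

The schema scripts for the metadata tables (BATCH_JOB_INSTANCE, BATCH_STEP_EXECUTION, and so on) ship with Spring Batch.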
I have created several custom activities that update tables in my DB (in this case SQL Server Compact), using Entity Framework 4 with POCOs.
If I put more than one of these inside a WF4 TransactionScope activity, I run into problems: EF disposes the DB connection after the first activity has finished, and when the next DB activity tries to perform a DB update, a new connection is opened. At that moment an exception is thrown:
System.Activities.WorkflowApplicationAbortedException : The workflow has been aborted.
----> System.Data.EntityException : The underlying provider failed on Open.
----> System.InvalidOperationException : The connection object can not be enlisted in transaction scope.
Do I have to keep the EF connection open during the whole transaction scope? How can I do that? Should I create an explicit custom activity for that, or is there a standard way?
My current workaround goes like this: I created a new code activity that creates our ObjectContext and explicitly calls dbContext.Connection.Open(). It returns the ObjectContext, which is then stored in a workflow variable. That variable is passed to all the DB-related activities as an InArgument<>. Inside my DB activities, I use this ObjectContext if it is passed in; otherwise I create a new one.
This does work, but I'm not satisfied with the solution: it needs the new InArgument for every DB-related activity. In the workflow designer, I have to insert that special OpenDatabaseConnection activity inside the transaction scope and then make sure that the correct variable is passed into all DB activities. This seems very inelegant and error-prone, especially if other team members have to use these DB activities.
What would be a better way to handle this?
The problem is that when you open a second connection in the same transaction scope, an attempt is made to promote the transaction to a distributed transaction (even though there's nothing distributed about it since you connect to the same database). SQL Server CE doesn't support this scenario.
What I would do is create a custom 'container' activity that opens (and closes) the connection and makes it available to child activities. This is still not optimal, but at least you no longer need to pass InArguments around. You get the following activity tree:
TransactionScope
  InitializeConnection
  Sequence
    CustomDataActivity1
    CustomDataActivity2
    CustomDataActivity3
InitializeConnection is a NativeActivity that uses NativeActivityContext.Properties to expose the connection (or the ObjectContext) to child activities.
Make sure you implement proper error handling to ensure you close the connection at all times.
NOTE: Distributed transactions are supported by the full SQL Server only through a Windows service called MSDTC (Microsoft Distributed Transaction Coordinator). You can find this one in your 'Local Services'. Since SQL Server CE is a database that should be able to operate completely standalone, it makes sense that it has no dependency on MSDTC. Therefore it has no support for distributed transactions.