Bulk indexing JPA entities modified during a Spring transaction to an Elasticsearch index - jpa

I have a JPA entity class that is also an Elasticsearch document. The environment is a Spring Boot application using Spring Data JPA and Spring Data Elasticsearch.
@Entity
@Document(indexName = "...") // etc.
@EntityListeners(MyJpaEntityListener.class)
public class MyEntity {
    // ID, constructor and stuff follow here
}
When an instance of this entity gets created, updated or deleted, it gets reindexed to Elasticsearch. This is currently achieved with a JPA EntityListener which reacts on PostPersist, PostUpdate and PostRemove events.
public class MyJpaEntityListener {

    @PostPersist
    @PostUpdate
    public void postPersistOrUpdate(MyEntity entity) {
        // Elasticsearch indexing code goes here
    }

    @PostRemove
    public void postRemove(MyEntity entity) {
        // Elasticsearch removal code goes here
    }
}
That's all working fine at the moment when a single entity or a few entities are modified during a single transaction: each modification triggers a separate index operation. But if a lot of entities are modified inside one transaction, it gets slow.
I would like to bulk-index all modified entities at the end of (or after) a transaction's commit. I took a look at TransactionalEventListeners, AOP and the TransactionSynchronizationManager but wasn't able to come up with a good setup so far.
How can I collect all modified entities per transaction in an elegant way, without doing it by hand in every service method myself?
And how can I trigger a bulk index at the end of a transaction with the entities collected in that transaction?
Thanks for your time and help!

One different and, in my opinion, elegant approach, since you don't mix Elasticsearch-related code into your services and entities, is to use Spring aspects with @AfterReturning on the service layer's transactional methods.
The pointcut expression can be adjusted to catch all the service methods you want.
@Order(1) guarantees that this code runs after the transaction has committed.
The code below is just a sample; you have to adapt it to work with your project.
@Aspect
@Component
@Order(1)
public class StoreDataToElasticAspect {

    @Autowired
    private SampleElasticsearchRepository elasticsearchRepository;

    @AfterReturning(pointcut = "execution(* com.example.DatabaseService.bulkInsert(..))")
    public void synonymsInserted(JoinPoint joinPoint) {
        Object[] args = joinPoint.getArgs();
        // Create Elasticsearch documents from the method parameters.
        // You can also inject database services if more information is needed for the documents.
        List<String> ids = (List<String>) args[0];
        // Build the batch of documents from the ids (elided in this sample).
        elasticsearchRepository.save(batch);
    }
}
And here is an example with a logging aspect.
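The question also mentions the TransactionSynchronizationManager; for comparison, here is a minimal, untested sketch of that direction (it assumes Spring 5.3+, where TransactionSynchronization has default methods; on older versions, extend TransactionSynchronizationAdapter instead). The listener binds a per-transaction set of dirty entities and registers one synchronization that bulk-indexes them after commit; bulkIndex() is a hypothetical stand-in for your Elasticsearch bulk call.
import java.util.LinkedHashSet;
import java.util.Set;
import javax.persistence.PostPersist;
import javax.persistence.PostUpdate;
import org.springframework.transaction.support.TransactionSynchronization;
import org.springframework.transaction.support.TransactionSynchronizationManager;

public class MyJpaEntityListener {

    @PostPersist
    @PostUpdate
    public void postPersistOrUpdate(MyEntity entity) {
        if (!TransactionSynchronizationManager.isSynchronizationActive()) {
            // No Spring-managed transaction active; handle directly
            // (e.g. index the single entity immediately) - omitted in this sketch.
            return;
        }
        @SuppressWarnings("unchecked")
        Set<MyEntity> dirty = (Set<MyEntity>)
                TransactionSynchronizationManager.getResource(MyJpaEntityListener.class);
        if (dirty == null) {
            Set<MyEntity> collected = new LinkedHashSet<>();
            TransactionSynchronizationManager.bindResource(MyJpaEntityListener.class, collected);
            TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronization() {
                @Override
                public void afterCommit() {
                    bulkIndex(collected); // one bulk request instead of one per entity
                }

                @Override
                public void afterCompletion(int status) {
                    // Clean up on commit and rollback alike.
                    TransactionSynchronizationManager.unbindResourceIfPossible(MyJpaEntityListener.class);
                }
            });
            dirty = collected;
        }
        dirty.add(entity);
    }

    private void bulkIndex(Set<MyEntity> entities) {
        // Hypothetical placeholder for the Elasticsearch bulk indexing call.
    }
}
This works even though JPA instantiates the listener itself, because TransactionSynchronizationManager is accessed statically. PostRemove events would need the same collection with a deletion marker so the bulk request can distinguish index from delete operations.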

Related

JPA cache behaviour when invoke count() method on Spring Data JPA Repository

I'm writing a transactional JUnit-based IT test for a Spring Data JPA repository.
To check the number of rows in the table I use a separate JdbcTemplate.
I noticed that in a transactional context, invoking org.springframework.data.repository.CrudRepository#save(S) doesn't take effect: the SQL insert is not performed and the number of rows in the table does not increase.
But if I invoke org.springframework.data.repository.CrudRepository#count after the save(S), then the SQL insert is performed and the number of rows increases.
I guess this is the behavior of the JPA cache, but how does it work in detail?
Code with Spring Boot:
@RunWith(SpringRunner.class)
@SpringBootTest
public class ErrorMessageEntityRepositoryTest {

    @Autowired
    private ErrorMessageEntityRepository errorMessageEntityRepository;

    @Autowired
    private JdbcTemplate jdbcTemplate;

    @Test
    @Transactional
    public void save() {
        ErrorMessageEntity errorMessageEntity = aDefaultErrorMessageEntity().withUuid(null).build();
        assertTrue(TestTransaction.isActive());
        int sizeBefore = JdbcTestUtils.countRowsInTable(jdbcTemplate, "error_message");
        ErrorMessageEntity saved = errorMessageEntityRepository.save(errorMessageEntity);
        errorMessageEntityRepository.count(); // [!!!!] if this line is commented out, the test fails
        int sizeAfter = JdbcTestUtils.countRowsInTable(jdbcTemplate, "error_message");
        Assert.assertEquals(sizeBefore + 1, sizeAfter);
    }
}
Entity:
@Entity(name = "error_message")
public class ErrorMessageEntity {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private UUID uuid;

    @NotNull
    private String details;
}
Repository:
public interface ErrorMessageEntityRepository extends CrudRepository<ErrorMessageEntity, UUID> {}
You are correct: this is a result of how JPA works.
JPA tries to delay SQL statement execution as long as possible.
When saving new instances, this means it will only perform an insert if it is required in order to get an id for the entity.
Only when a flush event occurs will all changes stored in the persistence context be flushed to the database. There are three triggers for that event to happen:
1. The closing of the persistence context will flush all the changes. In a typical setup, this is tied to a transaction commit.
2. Explicitly calling flush on the EntityManager, which you might do directly or, when using Spring Data JPA, via saveAndFlush.
3. Before executing a query, since you typically want to see your changes in the query result.
Number 3 is the effect you are seeing.
Note that the details are a little more complicated since you can configure a lot of this stuff. As usual, Vlad Mihalcea has written an excellent post about it.
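For completeness, a sketch (adapted from the test above, untested) of trigger number 2: inject the EntityManager into the test and flush explicitly, so the insert is visible to the JDBC count without the count() query. If the repository extended JpaRepository instead of CrudRepository, saveAndFlush(…) would do the same in one call.
@Autowired
private ErrorMessageEntityRepository errorMessageEntityRepository;

@Autowired
private JdbcTemplate jdbcTemplate;

@PersistenceContext // same persistence context the repository uses
private EntityManager entityManager;

@Test
@Transactional
public void save() {
    ErrorMessageEntity errorMessageEntity = aDefaultErrorMessageEntity().withUuid(null).build();
    int sizeBefore = JdbcTestUtils.countRowsInTable(jdbcTemplate, "error_message");
    errorMessageEntityRepository.save(errorMessageEntity);
    entityManager.flush(); // push the pending INSERT to the database; no count() needed
    int sizeAfter = JdbcTestUtils.countRowsInTable(jdbcTemplate, "error_message");
    Assert.assertEquals(sizeBefore + 1, sizeAfter);
}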
To keep test data from polluting the database, Spring's test framework rolls the transaction back by default when a test is transactional, i.e. @Rollback defaults to true. If you want the test data to be kept instead, you can set @Rollback(value = false). If you are using a MySQL database and the transaction is still not rolled back even though rollback is enabled, check whether the table's storage engine is InnoDB, because other engines such as MyISAM and Memory do not support transactions.
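For illustration, a minimal sketch of that annotation in place (the test body is hypothetical):
@Test
@Transactional
@Rollback(false) // commit instead of the default rollback; the inserted row survives the test
public void saveAndKeepData() {
    errorMessageEntityRepository.save(aDefaultErrorMessageEntity().withUuid(null).build());
}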

Is it possible to persist/merge entities in parallel, in a single transaction, using JavaEE 7?

Suppose I have an EJB, CrudService, with methods to perform CRUD on a single entity (so not collections). CrudService has an EntityManager injected into it, and it cannot be modified.
CrudService looks something like this:
@Stateless
public class CrudService {

    @PersistenceContext(name = "TxTestPU")
    private EntityManager em;

    public Integer createPost(Post post) {
        post = em.merge(post);
        return post.getId();
    }

    public void updatePost(Post post) {
        em.merge(post);
    }

    public Post readPost(Integer id) {
        return em.find(Post.class, id);
    }

    public void deletePost(Post post) {
        em.remove(post);
    }
}
I would like to be able to create/update a collection of Post entities, in parallel, in a single transaction. An approach that does not work, because the container creates a new transaction for each thread in the pool, is the following:
@Stateless
public class BusinessBean {

    @Inject
    private CrudService crudService;

    public void savePosts(Collection<Post> posts) {
        posts.parallelStream().forEach(post ->
                crudService.createPost(post));
    }
}
Is there a way to do it?
The code runs on WildFly, with a Hibernate persistence unit and a PostgreSQL database.
The straight "here is an answer" answer.
Not generically. Have a look at the answers to this question: Is it discouraged using Java 8 parallel streams inside a Java EE container?
The annoying "XY problem" answer.
How would you expect this to work? Most databases don't support multiple parallel transactions on a single database connection, and I don't believe PG supports it either: https://stackoverflow.com/a/289057/924597
So something or somebody (the JEE container, the JDBC driver, etc.) would have to open multiple DB connections to achieve this, which I think is what you're saying is happening? If you do this across many different business actions, it will likely exhaust your connection pool pretty quickly.
In the spirit of this being an "XY problem" answer - what problem are you trying to solve?
If it's just a raw throughput problem, consider batching your inserts (see the sketch below).
If it's a bulk-insert problem, consider making an end-run around your container and using a different tool; JEE containers aren't usually meant for, or good at, this kind of thing.
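To make the batching suggestion concrete, here is an untested sketch of a single-threaded, single-transaction bulk insert that flushes and clears the persistence context in chunks. It reuses Post and the TxTestPU unit from the question; BulkPostService and BATCH_SIZE are made up for the example. For real batching you would also set hibernate.jdbc.batch_size to a matching value, and note that Hibernate cannot batch inserts for entities using IDENTITY id generation.
import java.util.Collection;
import javax.ejb.Stateless;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

@Stateless
public class BulkPostService {

    private static final int BATCH_SIZE = 50; // keep in sync with hibernate.jdbc.batch_size

    @PersistenceContext(name = "TxTestPU")
    private EntityManager em;

    // Runs in one container-managed transaction (EJB default: REQUIRED).
    public void savePosts(Collection<Post> posts) {
        int i = 0;
        for (Post post : posts) {
            em.persist(post);
            if (++i % BATCH_SIZE == 0) {
                em.flush(); // send the batched INSERTs to the database
                em.clear(); // detach the written entities to keep memory flat
            }
        }
        // Remaining entities are flushed when the container commits the transaction.
    }
}
This attacks throughput rather than parallelism: one connection, one transaction, fewer round trips.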

Performance problems using a Spring Batch FieldSetMapper to map into an object that will be written with a JpaItemWriter?

We are writing a set of Spring Batch jobs that read values from text files and use that information to update objects that are read and written from the database using JPA. These jobs are not run in a web container, but on an application server. My problem seems to be with how the EntityManager is configured.
The code reads files from various vendors that update an order's status. The text file specifies the customer by name and the order by date/time. If the customer doesn't exist, the line from the text file is skipped. If the order exists, we update it. If not, then we create it.
We currently use DeltaSpike to get instances of our DAO objects like this:
DependentProvider<CustomerDaoImpl> provider = BeanProvider.getDependent(CustomerDaoImpl.class);
ICustomerDao custDao = provider.get();
I cache the DAO objects in my mapper so I am only getting them once. But every call to BeanProvider.getDependent() creates a new EntityManager through "Spring Batch Magic." The EntityManager is specified thusly:
@Configuration
public class BaseBatchConfiguration {

    @Bean
    @Produces
    public EntityManager entityManager() {
        Map<String, String> properties = new HashMap<String, String>();
        properties.put("hibernate.connection.url", System.getProperty("JDBC_URL"));
        properties.put("hibernate.default_schema", System.getProperty("APP_SCHEMA"));
        properties.put("hibernate.connection.username", System.getProperty("APP_DB_ID"));
        properties.put("hibernate.connection.password", System.getProperty("APP_DB_PWD"));
        EntityManagerFactory emf = Persistence.createEntityManagerFactory(System.getProperty("PU_NAME"), properties);
        return emf.createEntityManager();
    }
}
I tried caching the EntityManager, but a new instance of the BaseBatchConfiguration class is used every time. This means that each DAO gets created with its own EntityManager, so no real object caching is taking place across DAOs (a customer read by CustomerDaoImpl isn't cached and reused when OrderDaoImpl loads an order that references that same customer).
This is causing a lot of unwanted object loading as we process through the text file.
Is there some other way we should be declaring our EntityManager?
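One possible direction, sketched under the assumption that the batch step is single-threaded and one shared persistence context is acceptable: cache the factory and the EntityManager statically, so that new instances of the configuration class keep handing out the same objects. This is an illustration of the idea, not a drop-in fix.
@Configuration
public class BaseBatchConfiguration {

    // Static, so they survive re-instantiation of this configuration class.
    private static EntityManagerFactory emf;
    private static EntityManager em;

    @Bean
    @Produces
    public EntityManager entityManager() {
        synchronized (BaseBatchConfiguration.class) {
            if (em == null) {
                Map<String, String> properties = new HashMap<>();
                properties.put("hibernate.connection.url", System.getProperty("JDBC_URL"));
                properties.put("hibernate.default_schema", System.getProperty("APP_SCHEMA"));
                properties.put("hibernate.connection.username", System.getProperty("APP_DB_ID"));
                properties.put("hibernate.connection.password", System.getProperty("APP_DB_PWD"));
                emf = Persistence.createEntityManagerFactory(System.getProperty("PU_NAME"), properties);
                // One shared persistence context: a customer loaded by CustomerDaoImpl
                // is found in the first-level cache when OrderDaoImpl needs it.
                // WARNING: EntityManager is not thread-safe; this assumes a single thread.
                em = emf.createEntityManager();
            }
            return em;
        }
    }
}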

Why do changes to my JPA entity not get persisted to the database?

In a Spring Boot application, I have an entity Task with a state that changes during execution:
@Entity
public class Task {

    public enum State {
        PENDING,
        RUNNING,
        DONE
    }

    @Id @GeneratedValue
    private long id;
    private String name;
    private State state = State.PENDING;

    // Constructor and other setters omitted
    public void setState(State state) {
        this.state = state; // THIS SHOULD BE WRITTEN TO THE DATABASE
    }

    public void start() {
        this.setState(State.RUNNING);
        // do useful stuff
        try { Thread.sleep(2000); } catch (InterruptedException e) {}
        this.setState(State.DONE);
    }
}
If the state changes, the object should be saved to the database. I'm using this Spring Data interface as my repository:
public interface TaskRepository extends CrudRepository<Task,Long> {}
And this code to create and start a Task:
Task t1 = new Task("Task 1");
Task persisted = taskRepository.save(t1);
persisted.start();
From my understanding, persisted is now attached to a persistence session, and if the object changes, these changes should be stored in the database. But this is not happening: when reloading it, the state is PENDING.
Any ideas what I'm doing wrong here?
tl;dr
Attaching an instance to a persistence context does not mean every change to the state of the object gets persisted directly. Change detection only occurs on certain events during the lifecycle of the persistence context.
Details
You seem to have misunderstood the way change detection works. A very central concept of JPA is the so-called persistence context. It is basically an implementation of the unit-of-work pattern. You can add entities to it in two ways: by loading them from the database (executing a query or issuing an EntityManager.find(…)) or by actively adding them to the persistence context. This is what the call to the save(…) method effectively does.
An important point to realize here is that "adding an entity to the persistence context" is not the same as "storing it in the database". The persistence provider is free to postpone the database interaction for as long as it thinks reasonable. Providers usually do that to be able to batch up modifying operations on the data. In a lot of cases, however, an initial save(…) (which translates to an EntityManager.persist(…)) will be executed directly, e.g. if you're using auto id increment.
That said, the entity has now become a managed entity. That means the persistence context is aware of it and will persist the changes made to the entity transparently if events occur that require that to happen. The two most important ones are the following:
The persistence context gets closed. In Spring environments, the lifecycle of the persistence context is usually bound to a transaction. In your particular example, the repositories have a default transaction (and thus persistence context) boundary. If you need the entity to stay managed longer, you need to extend the transaction lifecycle (usually by introducing a service layer that has @Transactional annotations). In web applications we often see the Open EntityManager In View pattern, which is basically a request-bound lifecycle.
The persistence context is flushed. This can happen either manually (by calling EntityManager.flush()) or transparently. E.g., if the persistence provider needs to issue a query, it will usually flush the persistence context to make sure currently pending changes can be found by the query. Imagine you loaded a user, changed his address to a new place and then issued a query to find users by their addresses. The provider will be smart enough to flush the address change first and execute the query afterwards.
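Applied to the Task example from the question, a minimal sketch (the service class is made up for illustration): keep the entity managed for the whole state change by running it inside a transactional service method, so the flush at commit writes the RUNNING/DONE transitions.
@Service
public class TaskService {

    @Autowired
    private TaskRepository taskRepository;

    @Transactional
    public void createAndStart() {
        Task task = taskRepository.save(new Task("Task 1"));
        // 'task' stays managed until the method returns; the state changes
        // made inside start() are flushed when the transaction commits.
        task.start();
    }
}
Without a surrounding transaction, the equivalent fix is to call taskRepository.save(persisted) again after start(), so the changed state is written explicitly.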

Repository pattern with EF4 CTP5

I'm trying to implement the repository pattern with EF4 CTP5. I came up with something, but I'm no expert in EF, so I want to know if what I did is good.
This is my db context:
public class Db : DbContext
{
    public DbSet<User> Users { get; set; }
    public DbSet<Role> Roles { get; set; }
}
And the repository (simplified):
public class Repo<T> : IRepo<T> where T : Entity, new()
{
    private readonly DbContext context;

    public Repo()
    {
        context = new Db();
    }

    public IEnumerable<T> GetAll()
    {
        return context.Set<T>().AsEnumerable();
    }

    public long Insert(T o)
    {
        context.Set<T>().Add(o);
        context.SaveChanges();
        return o.Id;
    }
}
You need to step back and think about what the repository should be doing. A repository is used for retrieving, adding, and updating records. The repository you created barely handles the first case, handles the second case inefficiently, and doesn't handle the third case at all.
Most generic repositories have an interface along the lines of
public interface IRepository<T> where T : class
{
    IQueryable<T> Get();
    void Add(T item);
    void Delete(T item);
    void CommitChanges();
}
For retrieving records, you can't just return the whole set with AsEnumerable(), because that will load every database record for that table into memory. If you only want Users with the username of username1, you don't need to download every user from the database; that would be a very large database performance hit, and a large client performance hit, for no benefit at all.
Instead, as you can see from the interface I posted above, you want to return an IQueryable<T>. IQueryables allow whatever class calls the repository to use LINQ and add filters to the database query; once the IQueryable is executed, it runs completely on the database, retrieving only the records you want. The database is much better at sorting and filtering data than your systems, so it's best to do as much on the DB as you can.
Now, in regards to inserting data, you have the right idea, but you don't want to call SaveChanges() immediately. The reason is that it's best to call SaveChanges() after all your db operations have been queued. For example, if you want to create a user and his profile in one action, you can't via your method, because each Insert call writes the data to the database right then.
Instead, you want to separate out the SaveChanges() call into the CommitChanges method I have above.
This is also needed to handle updating data in your database. In order to detect changes to an entity's data, Entity Framework keeps track of all records it has received and watches them to see if any changes have been made. However, you still have to tell Entity Framework to send all changed data up to the database. This happens with the context.SaveChanges() call. Therefore, you need this to be a separate call so you are able to actually update edited data, which your current implementation does not handle.
Edit:
Your comment made me realize another issue: you are creating a data context inside the repository, and this isn't good. You really should have all (or most) of your repositories sharing the same instance of your data context.
Entity Framework keeps track of which context an entity is tracked in, and will throw an exception if you attempt to update an entity in one context from another. This can occur in your situation when you start editing entities related to one another. It also means that your SaveChanges() call is not transactional; each entity is updated/added/deleted in its own transaction, which can get messy.
My solution to this in my repositories is that the DbContext is passed into the repository in the constructor.
I may get voted down for this, but DbContext already is a repository. When you expose your domain models as collection properties of your concrete DbContext, EF CTP5 creates a repository for you. It presents a collection-like interface for access to domain models, whilst allowing you to pass queries (as LINQ, or spec objects) for filtering of results.
If you need an interface, CTP5 doesn't provide one for you. I've wrapped my own around the DbContext and simply exposed the publicly available members from the object. It's an adapter for testability and DI.
I'll comment for clarification if what I said isn't apparently obvious.