We have a Spring Boot project that uses Spring-JPA for data access. We have a couple of tables where we create/update rows once (or a few times, all within minutes). We don't update rows that are older than a day. These tables (like audit table) can get very large and we want to use Postgres' table partitioning features to help break up the data by month. So the main table always has this calendar month's data but if the query requires retrieval from previous months it would somehow read it from other partitions.
Two questions:
1) Is this a good idea for archiving older data but still leave it query-able?
2) Does Spring-JPA work with partitioned tables? Or do we have to figure out how to break up the query, run native queries and concatenate the result sets?
Thanks.
I have been working with Postgres partitioning with Hibernate and Spring JPA for some time, so I think I can try to answer your questions.
1) Is this a good idea for archiving older data but still leave it query-able?
If you index the partitions and do not need to re-index the tables frequently, partitioning the data may give you faster queries.
You can also use Postgres's CLUSTER command to physically order a table by one of its indexes, so the data is fetched faster.
Because the tables holding older data are no longer updated, that physical ordering stays intact and keeps improving performance.
2) Does Spring-JPA work with partitioned tables? Or do we have to figure out how to break up the query, run native queries and concatenate the result sets?
Spring JPA works out of the box with partitioned tables. It retrieves the data from the master table as well as the child tables and returns the combined result set.
Note: an issue with partitioned tables
The only issue you will face with a partitioned table is insertion.
Let me explain. When you partition a table this way, you create a trigger on the master table that redirects each row to a child table, and that trigger returns null. This is the key to the insertion issue with Spring JPA / Hibernate: because the trigger returns null, the database reports that zero rows were inserted into the master table.
When you try to insert a row using Spring JPA or Hibernate, you will get the error below:
Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1
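To make the reasoning above concrete, here is a minimal model of the mismatch in plain Java. It is only an illustration under stated assumptions: the class and method names are hypothetical, and this is not Hibernate's actual code. It shows the general pattern of an ORM comparing the reported row count with the expected one.

```java
/** Simplified model of the row-count check; names are hypothetical, not Hibernate's API. */
final class RowCountCheckSketch {

    private RowCountCheckSketch() {}

    /** Models the row count reported for an INSERT issued against the master table. */
    static int reportedRowCount(boolean triggerReturnsNull) {
        // A BEFORE INSERT trigger that returns NULL cancels the insert on the
        // master table, so 0 rows are reported even though the trigger itself
        // already wrote the row into a child table.
        return triggerReturnsNull ? 0 : 1;
    }

    /** Stand-in for the ORM's check that exactly the expected number of rows changed. */
    static void checkRowCount(int actual, int expected) {
        if (actual != expected) {
            throw new IllegalStateException(
                    "unexpected row count: actual " + actual + "; expected: " + expected);
        }
    }

    public static void main(String[] args) {
        // Passes: a plain table reports one inserted row.
        checkRowCount(reportedRowCount(false), 1);
        try {
            // Fails: the redirecting trigger makes the master table report zero rows.
            checkRowCount(reportedRowCount(true), 1);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A custom batcher sidesteps this by not applying that expectation to inserts into the partitioned table.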
To overcome this issue, you need to override the batcher implementation.
In Hibernate, you can provide a custom batcher factory implementation using the configuration below:
hibernate.jdbc.factory_class=path.to.my.batcher.factory.implementation
In Spring JPA, you can achieve the same with a custom batch builder implementation, using the configuration below:
hibernate.jdbc.batch.builder=path.to.my.batch.builder.implementation
References :
Custom Batch Builder/Batch in Spring-JPA
Demo Application
In addition to @Anil Agrawal's answer:
If you are using Spring Boot 2, you need to define the custom batch builder using this property:
spring.jpa.properties.hibernate.jdbc.batch.builder=net.xyz.jdbc.CustomBatchBuilder
You do not have to break down the JDBC query with postgres 11+.
If you execute select on the main table with plain jdbc, the DB would return the aggregated results from the partitioned tables.
In other words, the work is done by the Postgres DB, so Spring JPA will simply get the result and map it to objects as if there were no partitioning.
For inserts into a partitioned table to work, you need to make sure that your partitions have already been created. I think Spring Data will not create them for you.
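As a rough sketch of how the monthly partitions could be created ahead of time, the helper below builds the DDL for one month of a range-partitioned table (Postgres declarative partitioning). The table name audit_log and the naming scheme audit_log_2024_12 are assumptions for illustration only; how the statement gets executed (a migration tool, a scheduled job, plain JDBC) is left open.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

/** Builds the DDL for one monthly partition; table names are hypothetical. */
final class MonthlyPartitionDdl {

    // Suffix such as 2024_12; the underscore is emitted literally.
    private static final DateTimeFormatter SUFFIX = DateTimeFormatter.ofPattern("uuuu_MM");

    private MonthlyPartitionDdl() {}

    /**
     * Returns a CREATE TABLE ... PARTITION OF statement for the given month.
     * parentTable is concatenated into the SQL, so it must be a trusted
     * identifier, never user input.
     */
    static String createPartitionSql(String parentTable, YearMonth month) {
        String partition = parentTable + "_" + month.format(SUFFIX);
        // Range bounds: the lower bound is inclusive, the upper bound exclusive.
        return "CREATE TABLE IF NOT EXISTS " + partition
                + " PARTITION OF " + parentTable
                + " FOR VALUES FROM ('" + month.atDay(1) + "')"
                + " TO ('" + month.plusMonths(1).atDay(1) + "')";
    }

    public static void main(String[] args) {
        // Create next month's partition before any row for it arrives.
        System.out.println(createPartitionSql("audit_log", YearMonth.now().plusMonths(1)));
    }
}
```

Running something like this once a month, for the upcoming month, keeps inserts from failing because no matching partition exists.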
Related
When I use a secondary table, saving the entities takes twice as long as it does without a secondary table.
However, the inserts into the secondary table (PostgreSQL) take less than 1ms according to the Postgres logs. So I guess it's something in Hibernate itself. Is there any known performance issue with secondary tables in Hibernate? I'm using Hibernate 2.1.
I am trying to find the best solution for building a database relation. I need to create a table that will contain data split across other tables in different databases. All the tables have exactly the same structure (same number of columns, same names and types).
In a single database, I would create a parent table with partitions. However, the volume of data is too big for a single database, which is why I am trying to split it. From the Postgres documentation, I think what I am trying to do is "Multiple-Server Parallel Query Execution".
At the moment, the only solution I can think of is to build an API over the database addresses and use it to fetch data across the network into the main parent database when needed. I also found an external Postgres extension called Citus that might do the job, but I don't know how to implement a unique key across multiple databases (or shards, as Citus calls them).
Is there any better way to do it?
Citus would most likely solve your problem. It lets you use a unique key across shards if the key is the distribution column, or if it is a composite key that contains the distribution column.
You can also use a distributed partitioned table in Citus. That is a table partitioned on one column (a timestamp?) and hash-distributed on another column (like the one you use in your existing approach). Citus would handle query parallelization and data collection for you.
I am working on a Spring Batch application in Spring Boot that will run as two different instances, and I have a scenario in which I have to retrieve unique rows from a table. By unique I mean one row per instance. For example,
id language
1 java
2 python
If I have two rows and call a SELECT query with LIMIT 1, the first instance should get id 1 and the second instance should get id 2. So far I have tried the JPA lock @Lock(value = LockModeType.PESSIMISTIC_WRITE). This doesn't work; each time I get the same row. I have also tried JdbcTemplate with SELECT * FROM some_table LIMIT 1 FOR UPDATE SKIP LOCKED. This is also not working. My Postgres version is 10.3. Is there a way to achieve this?
Number of instances of my application might grow in the future. So I want to handle this as well.
Thanks in advance.
You want each instance to process a different partition of your table. In this case, I would recommend using a partitioned step.
For example, you can partition the table by even/odd IDs and make each instance process one partition. This is IMO better than locking the table and using LIMIT 1 to force each instance to read one row (as you mentioned, that doesn't work, and even if it did, it would perform very poorly).
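The idea of giving each instance its own slice of the table can be sketched in plain Java as below. This is only an illustration of range-based splitting on a numeric ID column, not the code of the Spring Batch sample; the class name, the IdRange type and the ceiling-division sizing are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits an ID interval into contiguous ranges, one per worker; names are hypothetical. */
final class IdRangePartitioner {

    /** Inclusive ID range handled by a single worker. */
    static final class IdRange {
        final long minId;
        final long maxId;

        IdRange(long minId, long maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }

        @Override
        public String toString() {
            return "[" + minId + ", " + maxId + "]";
        }
    }

    private IdRangePartitioner() {}

    /**
     * Splits [minId, maxId] into at most gridSize non-overlapping ranges.
     * Fewer ranges are returned when there are not enough IDs to fill them.
     */
    static List<IdRange> split(long minId, long maxId, int gridSize) {
        if (gridSize < 1 || maxId < minId) {
            throw new IllegalArgumentException("gridSize must be >= 1 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long chunk = (total + gridSize - 1) / gridSize; // ceiling division
        List<IdRange> ranges = new ArrayList<>();
        for (long start = minId; start <= maxId; start += chunk) {
            ranges.add(new IdRange(start, Math.min(start + chunk - 1, maxId)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Two workers over IDs 1..2, as in the question's example table.
        System.out.println(split(1, 2, 2));
    }
}
```

Each worker would then run its query with a WHERE id BETWEEN minId AND maxId clause, so no two workers ever read the same row and no locking is needed. Raising gridSize covers the case where the number of instances grows.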
You can find a sample job of how to partition a table here: https://github.com/spring-projects/spring-batch/blob/master/spring-batch-samples/src/main/resources/jobs/partitionJdbcJob.xml along with the corresponding partitioner here: https://github.com/spring-projects/spring-batch/blob/master/spring-batch-samples/src/main/java/org/springframework/batch/sample/common/ColumnRangePartitioner.java
I have a set of tables with 20 million records on a Postgres server. At the moment I am migrating some table data from one server to another using insert and update queries, with dependent tables, inside functions. It takes around 2 hours even after optimizing the queries. I need a solution to migrate the data faster, using MongoDB or Cassandra. How?
Try putting your updates and inserts into a file and then loading the file. As I understand it, PostgreSQL optimises loading the file contents. It has always worked for me, although I haven't used it with that quantity of data.
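One way to read this advice is to use PostgreSQL's COPY command, which bulk-loads rows from a file. The sketch below, under that assumption, writes rows to a CSV file in the form COPY's csv format accepts; the class name and the sample data are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Writes rows as CSV suitable for COPY ... WITH (FORMAT csv); names are hypothetical. */
final class CopyCsvWriter {

    private CopyCsvWriter() {}

    /** Formats one row: values are quoted, embedded quotes doubled, null left empty. */
    static String toCsvLine(List<String> fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                line.append(',');
            }
            String value = fields.get(i);
            if (value != null) {
                // An unquoted empty field is read as NULL by COPY in csv format,
                // so only non-null values are written, always inside quotes.
                line.append('"').append(value.replace("\"", "\"\"")).append('"');
            }
        }
        return line.toString();
    }

    /** Writes all rows to the given file, one CSV line per row. */
    static void writeCsv(Path target, List<List<String>> rows) throws IOException {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            lines.add(toCsvLine(row));
        }
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("migration", ".csv");
        writeCsv(file, Arrays.asList(
                Arrays.asList("1", "java"),
                Arrays.asList("2", null)));
        System.out.println("wrote " + file);
    }
}
```

The file could then be loaded on the target server with something like COPY target_table FROM '/path/to/file.csv' WITH (FORMAT csv), or with psql's \copy when the file lives on the client; target_table and the path are placeholders.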
I have just started using JPA (with the EclipseLink implementation). I have some very simple select queries, like
(1) entityManager.find(SomeEntity.class, SomeEntityPK);
(2) entityManager.createQuery("Select x from SomeEntity x where x.isDefault = true").getResultList();
The number of records in SomeEntity table is approx 50 (very small table).
Query (1) initially takes 3s, but subsequent hits take just 200ms. Obviously the cache is at play.
However, Query (2) takes 2s on every invocation, and I wonder why the cache is not used. As I understand it, queries that do not use the Id or an index always hit the DB, while entity relationships are served from the cache.
Is there any way to improve the performance? A simple JDBC select just takes <300ms to fetch data for Query (2).
[UPDATE]
I think I have solved the issue. One of the columns in the table 'SomeEntity' was an Oracle XMLType. Due to some issue, I had to remove this field and use a CLOB field to store the XML data instead, and voila, JPA suddenly started caching the query result. I don't know why JPA doesn't cache XMLType, though.
EclipseLink has a number of caches at different levels that can be used. I think the query cache is what you might be looking for. It is described here
http://docs.oracle.com/cd/E25054_01/core.1111/e10108/toplink.htm#BCGEGHGE
And explained a bit here
http://wiki.eclipse.org/Introduction_to_EclipseLink_Queries_%28ELUG%29#How_to_Cache_Query_Results_in_the_Query_Cache