How to implement batch update using Spring Data JPA?
I have a Goods entity, and for different user levels there are different prices, e.g.
goodsId  level  price
1        1      10
1        2      9
1        3      8
When updating goods I want to batch update these prices, like below:
#Query(value = "update GoodsPrice set price = :price where goodsId=:goodsId and level=:level")
void batchUpdate(List<GoodsPrice> goodsPriceList);
but it throws an exception:
Caused by: java.lang.IllegalArgumentException: Name for parameter binding must not be null or empty! For named parameters you need to use @Param for query method parameters on Java versions < 8.
So how do I implement batch update correctly using Spring Data JPA?
I think this is not possible with Spring Data JPA, according to the docs. You have to look at plain JDBC; it offers a few methods for batch inserts/updates.
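For example, with Spring's JdbcTemplate the whole list can be pushed to the database as one JDBC batch. Below is a minimal sketch; the table, column, and getter names are assumptions based on the entity in the question, not something from the original post.
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import org.springframework.jdbc.core.BatchPreparedStatementSetter;
import org.springframework.jdbc.core.JdbcTemplate;

public class GoodsPriceBatchUpdater {

    // Sends one UPDATE per list element, all grouped into a single JDBC batch.
    public int[] batchUpdatePrices(JdbcTemplate jdbcTemplate, List<GoodsPrice> goodsPriceList) {
        String sql = "update goods_price set price = ? where goods_id = ? and level = ?";
        return jdbcTemplate.batchUpdate(sql, new BatchPreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps, int i) throws SQLException {
                GoodsPrice gp = goodsPriceList.get(i);
                ps.setLong(1, gp.getPrice());
                ps.setLong(2, gp.getGoodsId());
                ps.setInt(3, gp.getLevel());
            }

            @Override
            public int getBatchSize() {
                return goodsPriceList.size();
            }
        });
    }
}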
However, you can do it with Hibernate fairly easily.
Following this example: Spring Data JPA Batch Inserts, I have created my own way of updating without having to deal with the EntityManager.
The way I did it is to first retrieve all the data I want to update; in your case, that would be WHERE goodsId = :goodsId AND level = :level. Then I loop through the whole list and set the data I want:
List<GoodsPrice> goodsPriceList = goodsRepository.findAllByGoodsIdAndLevel(goodsId, level);
for (GoodsPrice goods : goodsPriceList) {
    goods.setPrice(newPrice); // whatever price applies to this entry
}
goodsRepository.saveAll(goodsPriceList);
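For reference, the repository this relies on would look something like the sketch below (the interface name, method signature, and parameter types are assumptions); running the read-modify-saveAll sequence inside a single @Transactional method lets Hibernate flush the updates as JDBC batches.
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;

public interface GoodsRepository extends JpaRepository<GoodsPrice, Long> {
    // Derived query: generates WHERE goods_id = ?1 AND level = ?2
    List<GoodsPrice> findAllByGoodsIdAndLevel(Long goodsId, Integer level);
}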
Some of the following properties are needed for batched inserts or updates; having generate_statistics on lets you see whether it is really batching:
# for logging purposes, to make sure it is working
spring.jpa.properties.hibernate.generate_statistics=true
# essentially the key to turn it on
spring.jpa.properties.hibernate.jdbc.batch_size=4
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
and the log output looks like this:
27315 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
603684 nanoseconds spent preparing 4 JDBC statements;
3268688 nanoseconds spent executing 3 JDBC statements;
4028317 nanoseconds spent executing 2 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
6392912 nanoseconds spent executing 1 flushes (flushing a total of 3 entities and 0 collections);
0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
If you're using Hibernate, you do have an option if you're willing to manage the transactions yourself.
Here is a test example:
int entityCount = 50;
int batchSize = 25;

EntityManager entityManager = null;
EntityTransaction transaction = null;

try {
    entityManager = entityManagerFactory().createEntityManager();
    transaction = entityManager.getTransaction();
    transaction.begin();

    for (int i = 0; i < entityCount; ++i) {
        if (i > 0 && i % batchSize == 0) {
            entityManager.flush();
            entityManager.clear();

            transaction.commit();
            transaction.begin();
        }

        Post post = new Post(String.format("Post %d", i + 1));
        entityManager.persist(post);
    }

    transaction.commit();
} catch (RuntimeException e) {
    if (transaction != null && transaction.isActive()) {
        transaction.rollback();
    }
    throw e;
} finally {
    if (entityManager != null) {
        entityManager.close();
    }
}
It would also be recommended to set the following properties to something that fits your needs.
<property name="hibernate.jdbc.batch_size" value="25"/>
<property name="hibernate.order_inserts" value="true"/>
<property name="hibernate.order_updates" value="true"/>
All of this was taken from the following article.
The best way to do batch processing with JPA and Hibernate
One more link to add to this conversation for further reference:
Spring Data JPA batch insert/update
A few things need to be considered when trying a batch insert/update using Spring Data JPA:
The GenerationType used for @Id creation on the entity (IDENTITY prevents Hibernate from batching inserts; see the sketch after this list)
Setting the hibernate.jdbc.batch_size attribute in the Hibernate properties
Setting order_inserts (to make batch_size effective for insert statements) and/or order_updates (to make batch_size effective for update statements) in the Hibernate properties
Setting batch_versioned_data so that updates of versioned (@Version) entities can also be batched
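As an illustration of the first point, here is a minimal entity sketch (class, sequence, and field names are illustrative, reusing the GoodsPrice entity from the question). With GenerationType.IDENTITY, Hibernate must execute each INSERT immediately to obtain the generated key, so insert batching is silently disabled:
import javax.persistence.*;

@Entity
public class GoodsPrice {

    // SEQUENCE (or TABLE) keeps insert batching possible;
    // IDENTITY would force Hibernate to flush every INSERT individually.
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "goods_price_seq")
    @SequenceGenerator(name = "goods_price_seq", sequenceName = "goods_price_seq", allocationSize = 50)
    private Long id;

    private Long goodsId;
    private Integer level;
    private Long price;

    // getters and setters omitted
}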
We can also enable query batch processing programmatically. If a bulk update needs to process thousands of records, we can use the following code to achieve it.
You can define your own batch size and call an updateEntityUtil() method to trigger the update query, either via Spring Data JPA or a native query.
Code snippet:
// Total number of records to be updated
int totalSize = updateEntityDetails.size();
// Number of full batches to execute
int batches = totalSize / batchSize;
// Size of the last (partial) batch
int lastBatchSize = totalSize % batchSize;
// Batch-processing indexes (inclusive)
int batchStartIndex = 0, batchEndIndex = batchSize - 1;
// Count of modified records in the database
int modifiedRecordsCount = 0;

while (batches-- > 0) {
    // Call updateEntityUtil to update values in the database
    modifiedRecordsCount += updateEntityUtil(batchStartIndex, batchEndIndex, updateEntityDetails);
    // Move on to the next batch
    batchStartIndex = batchEndIndex + 1;
    batchEndIndex = batchEndIndex + batchSize;
}

// Execute the last, partial batch
if (lastBatchSize > 0) {
    modifiedRecordsCount += updateEntityUtil(totalSize - lastBatchSize, totalSize - 1, updateEntityDetails);
}
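The snippet above leaves updateEntityUtil() undefined; a minimal sketch of what it might look like with a Spring Data JPA @Modifying query is shown below. The entity, repository, field, and value names (EntityDetail, price, the 100L placeholder) are assumptions for illustration, not from the original answer.
import java.util.List;
import java.util.stream.Collectors;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Modifying;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

interface EntityDetailRepository extends JpaRepository<EntityDetail, Long> {
    // One bulk JPQL UPDATE per batch of ids
    @Modifying
    @Query("update EntityDetail e set e.price = :price where e.id in :ids")
    int updatePrice(@Param("ids") List<Long> ids, @Param("price") Long price);
}

@Service
class EntityDetailService {

    private final EntityDetailRepository repository;

    EntityDetailService(EntityDetailRepository repository) {
        this.repository = repository;
    }

    // Updates the slice [startIndex, endIndex] (both inclusive, matching the
    // batching loop above) and returns the number of rows the database modified.
    @Transactional
    public int updateEntityUtil(int startIndex, int endIndex, List<EntityDetail> details) {
        List<Long> ids = details.subList(startIndex, endIndex + 1).stream()
                .map(EntityDetail::getId)
                .collect(Collectors.toList());
        return repository.updatePrice(ids, 100L); // 100L stands in for the real new value
    }
}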
Related
I wanted to design a complete end-to-end workflow orchestration engine.
It has the following requirements:
Linear workflow
Parallel workflow - I want to execute n activities in parallel. After validating the results from all the activities, I want to proceed to the next state or fail the workflow
Batch - say I have 30 activities to be completed, but I want this done in a batch fashion: if the window size is 5, I want to execute 5 activities at a time. After executing all the activities and validating the results, proceed further or fail the workflow
Loop - run an activity repeatedly until some condition is met
Child Workflow
Polling
All of 1-5 are easily supported in Cadence workflow. I am not sure what you mean by Polling; if you can provide more details, I will update this answer to help you.
Here is a sample that executes activities in a linear + parallel/batch + loop fashion:
@Override
public long calculate(long a, long b, long c) {
    LOGGER.info("workflow start...");
    long result = 0;

    // Async.function takes a method reference and activity parameters and returns a Promise.
    Promise<Long> ab = Async.function(activities::multiple, a, b);
    Promise<Long> ac = Async.function(activities::multiple, a, c);
    Promise<Long> bc = Async.function(activities::multiple, b, c);

    // Promise#get blocks until the result is ready.
    this.abPlusAcPlusBc = result = ab.get() + ac.get() + bc.get();

    // Wait up to 2 minutes for a human input to decide the factor N for g(n), based on a*b+a*c+b*c.
    // The waiting timer is durable, independent of the workers' liveness.
    final boolean received = Workflow.await(Duration.ofMinutes(2), () -> this.factorForGn > 1);
    if (!received) {
        this.factorForGn = 10;
    }

    long fi_1 = 0; // f(0)
    long fi_2 = 1; // f(1)
    this.currentG = 1; // current g = f(0)*f(0) + f(1)*f(1)
    long i = 2;
    for (; i < this.factorForGn; i++) {
        // get the next Fibonacci number
        long fi = fi_1 + fi_2;
        fi_2 = fi_1;
        fi_1 = fi;

        this.currentG += activities.multiple(fi, fi);
    }
    result += this.currentG;
    return result;
}
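For the batch requirement specifically, the same Async.function / Promise#get calls can be used to process activities in fixed-size windows. The sketch below is a fragment meant to live inside a workflow method like the one above; the window size, the input array, and the reuse of activities::multiple are illustrative assumptions.
// Process 30 work items in windows of 5: launch one window of activities
// asynchronously, then block on all of its promises before starting the next.
int windowSize = 5;
long[] inputs = new long[30];

for (int start = 0; start < inputs.length; start += windowSize) {
    List<Promise<Long>> window = new ArrayList<>();
    for (int j = start; j < Math.min(start + windowSize, inputs.length); j++) {
        window.add(Async.function(activities::multiple, inputs[j], inputs[j]));
    }
    // Wait for every activity in this window to complete before moving on;
    // results can be validated here to decide whether to proceed or fail.
    for (Promise<Long> p : window) {
        long value = p.get();
        // ... validate 'value' as needed
    }
}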
And this is the sample of using ChildWorkflow.
I have a question regarding Postgres autovacuum / vacuum settings.
I have a table with 4.5 billion rows and there was a period of time with a lot of updates resulting in ~ 1.5 billion dead tuples. At this point autovacuum was taking a long time (days) to complete.
When looking at the pg_stat_progress_vacuum view I noticed that:
max_dead_tuples = 178956970
resulting in multiple index rescans (index_vacuum_count)
According to the docs, max_dead_tuples is the number of dead tuples that we can store before needing to perform an index vacuum cycle, based on maintenance_work_mem.
According to this, one dead tuple requires 6 bytes of space.
So 6 B x 178956970 = ~1 GB.
But my settings are
maintenance_work_mem = 20GB
autovacuum_work_mem = -1
So what am I missing? Why didn't all my 1.5 billion dead tuples fit within max_dead_tuples, since 20GB should give enough space, and why were multiple index vacuum cycles necessary?
There is a hard-coded limit of 1GB on the memory that holds the dead tuples in one VACUUM cycle; see the source:
/*
 * Return the maximum number of dead tuples we can record.
 */
static long
compute_max_dead_tuples(BlockNumber relblocks, bool useindex)
{
    long    maxtuples;
    int     vac_work_mem = IsAutoVacuumWorkerProcess() &&
        autovacuum_work_mem != -1 ?
        autovacuum_work_mem : maintenance_work_mem;

    if (useindex)
    {
        maxtuples = MAXDEADTUPLES(vac_work_mem * 1024L);
        maxtuples = Min(maxtuples, INT_MAX);
        maxtuples = Min(maxtuples, MAXDEADTUPLES(MaxAllocSize));

        /* curious coding here to ensure the multiplication can't overflow */
        if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
            maxtuples = relblocks * LAZY_ALLOC_TUPLES;

        /* stay sane if small maintenance_work_mem */
        maxtuples = Max(maxtuples, MaxHeapTuplesPerPage);
    }
    else
        maxtuples = MaxHeapTuplesPerPage;

    return maxtuples;
}
MaxAllocSize is defined in src/include/utils/memutils.h as
#define MaxAllocSize ((Size) 0x3fffffff) /* 1 gigabyte - 1 */
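That cap matches the max_dead_tuples value from the question: each dead tuple pointer (ItemPointerData) takes the 6 bytes mentioned above, so the array can hold at most roughly
1073741823 / 6 ≈ 178,956,970 dead tuples,
which is the figure pg_stat_progress_vacuum reported, regardless of the 20GB maintenance_work_mem.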
You could lobby on the pgsql-hackers list to increase the limit.
I am using a discrete event simulator in AnyLogic. I am having an issue with some code which updates a variable in my simulation. I store both the datetime at which the agent leaves the source block and the datetime at which it enters the sink block. I am trying to record the number of "rule breaks" for all agents. A rule break is defined below (there are two ways to break):
1) If the agent is received before a certain time (called SDC) and the agent is not completed by 5pm the same day, then the agent has broken the rule
2) If the agent is not completed by the next day at a certain time (called NDC), then the agent has broken the rule
I record a zero or a one for each agent, depending on whether it breaks either rule, in the variable called RuleBreak. However, in my simulation runs the variable does not update at all. I hope I am just missing something small. Would appreciate any help! (code below)
Calendar received = Calendar.getInstance();
received.setTime(ReceivedDate);

Calendar completion = Calendar.getInstance();
completion.setTime(Completion);

Calendar SD_at_5 = Calendar.getInstance();
SD_at_5.setTime(ReceivedDate);
SD_at_5.set(Calendar.HOUR_OF_DAY, 17);
SD_at_5.set(Calendar.MINUTE, 0);
SD_at_5.set(Calendar.SECOND, 0);

Calendar Tomorrow_at_NDC = Calendar.getInstance();
Tomorrow_at_NDC.setTime(ReceivedDate);
if(Tomorrow_at_NDC.get(Calendar.DAY_OF_WEEK) == 6)
    Tomorrow_at_NDC.add(Calendar.DATE, 3);
else
    Tomorrow_at_NDC.add(Calendar.DATE, 1);
Tomorrow_at_NDC.add(Calendar.DATE, 1);
Tomorrow_at_NDC.set(Calendar.HOUR_OF_DAY, NDC);
Tomorrow_at_NDC.set(Calendar.MINUTE, 0);
Tomorrow_at_NDC.set(Calendar.SECOND, 0);

int Either_rule_break = 0;

double time_diff_SDC = differenceInCalendarUnits(TimeUnits.SECOND, completion.getTime(), SD_at_5.getTime());
double time_diff_NDC = differenceInCalendarUnits(TimeUnits.SECOND, completion.getTime(), Tomorrow_at_NDC.getTime());

if((received.get(Calendar.HOUR_OF_DAY) < SDC) && (time_diff_SDC <= 0))
    Either_rule_break = Either_rule_break + 1;
else
    Either_rule_break = Either_rule_break + 0;

if((received.get(Calendar.HOUR_OF_DAY) >= SDC) && (time_diff_NDC <= 0))
    Either_rule_break = Either_rule_break + 1;
else
    Either_rule_break = Either_rule_break + 0;

if((Either_rule_break >= 1))
    RuleBreak = RuleBreak + 1;
else
    RuleBreak = RuleBreak + 0;
You haven't really explained where this code is used and what it receives. I assume the code is in a function, called in the sink's on-enter action, where ReceivedDate and Completion are Date instances stored per agent (source exit time and sink entry time, as dates, captured via AnyLogic's date() function).
And it looks like your SDC hour-of-day is stored in SDC and your NDC hour-of-day in NDC (with RuleBreak being a variable in Main or similar storing the total number of rule-breaks).
Your calculations look OK except that the Tomorrow_at_NDC Calendar calculation seems wrong: you add 1 day plus 1 day (if not Friday) or 3 days plus 1 day (if Friday; in a java.util.Calendar, day-of-week 6 is Friday because Sunday is 1).
(Your Java is also very 'inefficient', with unnecessary extra local variables and logic performed when it isn't needed; e.g., there is no point doing all the calendar preparation and the check for your type 1 rule-break if the receive time is after the SDC hour.)
But are you sure there are any rule-breaks; how have you set up your model to ensure that there are (to test it)? Plus, is RuleBreak definitely a variable outside of the agents that flow through your DES (i.e., in Main or similar)? Plus, are Completion and ReceivedDate definitely stored per agent so that, for example, if your function were called checkForRuleBreaks you would be doing something like the below in your sink's on-enter action:
agent.Completion = date(); // Agent received date set earlier in Source action
checkForRuleBreaks(agent.ReceivedDate, agent.Completion);
(In fact, you don't need to store the completion date in the agent at all since that will always be the current sim-date inside your function and so you can just calculate it there.)
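A minimal plain-Java sketch of such a function is below, intended as the body of an AnyLogic function. It assumes receivedDate is the java.util.Date captured at the Source, SDC and NDC are hour-of-day ints, and the completion time is taken at the sink; the function and parameter names are illustrative, and the caller adds the returned value to the RuleBreak counter on Main.
// Returns 1 if the agent broke either rule, 0 otherwise.
int checkForRuleBreaks(Date receivedDate, Date completionDate, int SDC, int NDC) {
    Calendar received = Calendar.getInstance();
    received.setTime(receivedDate);

    // Build the deadline the agent must meet.
    Calendar deadline = Calendar.getInstance();
    deadline.setTime(receivedDate);
    if (received.get(Calendar.HOUR_OF_DAY) < SDC) {
        // Rule 1: received before SDC, so it must be completed by 5pm the same day.
        deadline.set(Calendar.HOUR_OF_DAY, 17);
    } else {
        // Rule 2: otherwise it must be completed by NDC the next working day
        // (skip the weekend if received on a Friday).
        deadline.add(Calendar.DATE, received.get(Calendar.DAY_OF_WEEK) == Calendar.FRIDAY ? 3 : 1);
        deadline.set(Calendar.HOUR_OF_DAY, NDC);
    }
    deadline.set(Calendar.MINUTE, 0);
    deadline.set(Calendar.SECOND, 0);
    deadline.set(Calendar.MILLISECOND, 0);

    return completionDate.after(deadline.getTime()) ? 1 : 0;
}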
int rq_begin = 0, rq_end = 0;
int av_begin = 0, av_end = 0;

#define MAX_DUR 10
#define RQ_DUR 5

proctype Writer() {
    do
    :: (av_end < rq_end) -> av_end++;
       if
       :: (av_end - av_begin) > MAX_DUR -> av_begin = av_end - MAX_DUR;
       :: else -> skip;
       fi
       printf("available span: [%d,%d]\n", av_begin, av_end);
    od
}

proctype Reader() {
    do
    :: d_step {
           rq_begin++;
           rq_end = rq_begin + RQ_DUR;
       }
       printf("requested span: [%d,%d]\n", rq_begin, rq_end);
       (rq_begin >= av_begin && rq_end <= av_end);
       printf("got requested span\n");
    od
}

init {
    run Writer();
    run Reader();
}
This system (only an example) should model a reader/writer queue where the reader requests a certain span of frames [rq_begin,rq_end], and the writer should then make at least this span available. [av_begin,av_end] is the span of available frames.
The 4 values are absolute frame indices; rq_begin gets incremented infinitely as the reader reads the next span of frames.
The system cannot be directly verified because the values are unbounded (generating infinitely many states). Does Promela/Spin (or similar software) have support for verifying a system like this, and for automatically transforming it so that it becomes finite?
For example, if all 4 values were incremented by the same amount, the situation would still be the same. Or it could be reformulated into a model which instead has variables for the differences of these values, for example av_end - rq_end.
I'm using Promela/Spin to verify a more complex queuing system which uses absolute frame indices like this.
I am building an app that will store stock tick data. One of my methods needs to take an array of items, currently c. 100,000 items but eventually over 1,000,000 at a time, and add them to an Entity Framework database. This is my current add logic:
var tickContext = new DataAccess();

// array is created, add it to the entity database
for (int j = 0; j < tickDataArray.Length; j++)
{
    MarketTickData temp2 = new MarketTickData();
    if (j > 1)
        temp2 = tickDataArray[j - 1];
    MarketTickData temp = tickDataArray[j];

    tickContext.TickBarData.Add(tickDataArray[j]);
    tickContext.Configuration.AutoDetectChangesEnabled = false;
    tickContext.Configuration.ValidateOnSaveEnabled = false;

    if (j % 200 == 0)
    {
        tickContext.SaveChanges();
        tickContext.Dispose();
        tickContext = new DataAccess();
    }
}

// add remaining items to database
tickContext.SaveChanges();
tickContext.Configuration.AutoDetectChangesEnabled = true;
tickContext.Configuration.ValidateOnSaveEnabled = true;
When I test my logic with various context sizes before saving, I am not seeing huge improvements in performance. With a context size of 50 my adds take 47 seconds, with 1000 they take 54 seconds, and the best is 200 items at 44 seconds. But it is still relatively slow.
Is there a way to improve this?