JPA processing rows one by one with locking - PostgreSQL

I have a problem with JPA (Postgres db). Let's assume we have a table like:
id | data
1 | aaa
2 | bbb
3 | ccc
4 | ddd
It is kind of a queue and I want to process it like a queue (please do not ask why). I have multiple applications that will 'consume' that data, and I would like to allow them to work concurrently.
What I need to do is read one row from the table (it can be any row, not necessarily the first one), process it and finally remove it from the table. A second application/pod should not access the row currently being processed by the first one.
Also, when the first pod processes the row, I do not want to lock the entire table.
Let's assume I have two pods: P1 and P2:
P1 starts processing row number 1,
P2 wants to execute the same logic, but it shouldn't see row number 1 (which is currently being processed by P1); it should retrieve and handle row number 2.
How to achieve that with JPA? I tried some JPA queries with @Lock(LockModeType.PESSIMISTIC_WRITE), but then I am locking the entire table (the second app waits until I leave the @Transactional block in the first one).
If there were a query to delete a row and return the deleted data, that would be the easiest approach, given a proper transaction isolation level.
Any ideas?
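For reference (not part of the original question), one commonly cited way to get this behaviour is to keep the pessimistic lock but ask Hibernate to emit FOR UPDATE SKIP LOCKED via a lock-timeout hint of -2, so other pods simply skip rows that are already locked. A minimal Kotlin/Spring Data sketch, where the QueueItem entity, table and method names are my own assumptions:

import jakarta.persistence.Entity
import jakarta.persistence.Id
import jakarta.persistence.LockModeType
import jakarta.persistence.QueryHint
import org.springframework.data.jpa.repository.Lock
import org.springframework.data.jpa.repository.QueryHints
import org.springframework.data.repository.CrudRepository
import org.springframework.stereotype.Service
import org.springframework.transaction.annotation.Transactional

@Entity(name = "queue_item") // hypothetical mapping of the id | data table above
class QueueItem(@Id val id: Long = 0, val data: String = "")

interface QueueItemRepository : CrudRepository<QueueItem, Long> {
    // Hibernate translates a lock timeout of -2 into FOR UPDATE SKIP LOCKED on PostgreSQL
    // (use the "javax.persistence.lock.timeout" hint name on older javax-based stacks).
    @Lock(LockModeType.PESSIMISTIC_WRITE)
    @QueryHints(QueryHint(name = "jakarta.persistence.lock.timeout", value = "-2"))
    fun findFirstByOrderByIdAsc(): QueueItem?
}

@Service
class QueueConsumer(private val repo: QueueItemRepository) {
    // The row stays locked (and is skipped by other pods) only while this transaction is open,
    // so fetch, process and delete all happen inside it.
    @Transactional
    fun consumeOne(): Boolean {
        val item = repo.findFirstByOrderByIdAsc() ?: return false
        // ... process item.data ...
        repo.delete(item)
        return true
    }
}

With this, P1 and P2 each claim a different row without ever blocking each other, and only the claimed row is locked, not the table.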

Related

Efficient Way To Update Multiple Records In PostgreSQL

I have to update multiple records in a table in the most efficient way possible, with the least latency and without heavy CPU utilisation. The number of records to update at a time can range from 1 to 1000.
We do not want to lock the database when this update occurs as other services are utilising it.
Note: There are no dependencies generated from this table towards any other table in the system.
After looking in many places, I've narrowed it down to a few ways to do the task:
simple-update: a plain UPDATE against the table using the already known ids, either as
multiple UPDATE queries (one query for each individual record), or
a single UPDATE ... FROM query (one query for all records), as mentioned here (a sketch of this form follows below)
delete-then-insert: first delete the outdated data, then insert the updated data with new ids (since there are no dependencies on these records, new ids are acceptable)
insert-then-delete: first insert the updated records with new ids, then delete the outdated data using the old ids (since there are no dependencies on these records, new ids are acceptable)
temp-table: first insert the updated records into a temporary table, then update the original table with the records from the temporary table, and finally remove the temporary table
We must not drop the existing table and create a new one in its place
We must not truncate the existing table because we have a huge number of records that we cannot store in the buffer memory
I'm open to any more suggestions.
Also, what will be the impact of making the update all at once vs doing it in batches of 100, 200 or 500?
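For illustration, the single-query UPDATE ... FROM form mentioned in the simple-update option above could be issued from Kotlin with Spring's JdbcTemplate; the table and column names (my_table, id, data) are placeholders, not the real schema:

import org.springframework.jdbc.core.JdbcTemplate

// Sketch only: builds one UPDATE ... FROM (VALUES ...) statement covering all rows.
fun updateAllAtOnce(jdbc: JdbcTemplate, rows: List<Pair<Long, String>>) {
    if (rows.isEmpty()) return
    val valuesSql = rows.joinToString(", ") { "(?::bigint, ?::text)" }
    val sql = """
        UPDATE my_table AS t
        SET    data = v.data
        FROM   (VALUES $valuesSql) AS v(id, data)
        WHERE  t.id = v.id
    """.trimIndent()
    val params = rows.flatMap { (id, data) -> listOf<Any>(id, data) }.toTypedArray()
    jdbc.update(sql, *params)
}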
References:
https://dba.stackexchange.com/questions/75631
https://dba.stackexchange.com/questions/230257
As mentioned by @Frank Heikens in the comments, I'm sure that different people will have different statistics based on their system design. I did some checks and found some insights to share from one of my development systems.
Configurations of the system used:
AWS
Engine: PostgreSQL
Engine version: 12.8
Instance class: db.m6g.xlarge
Instance vCPU: 4
Instance RAM: 16GB
Storage: 1000 GiB
I used a Lambda function and the pg package to write data into a table (default FILLFACTOR) that contains 3,409,304 records.
Both lambda function and database were in the same region.
UPDATE 1000 records in the database with a single query
Run | Time taken
1 | 143.78ms
2 | 115.277ms
3 | 98.358ms
4 | 98.065ms
5 | 114.78ms
6 | 111.261ms
7 | 107.883ms
8 | 89.091ms
9 | 88.42ms
10 | 88.95ms
UPDATE 1000 records in the database in 2 concurrent batches of 500 records each (one query per batch)
Run | Time taken
1 | 43.786ms
2 | 48.099ms
3 | 45.677ms
4 | 40.578ms
5 | 41.424ms
6 | 44.052ms
7 | 42.155ms
8 | 37.231ms
9 | 38.875ms
10 | 39.231ms
DELETE + INSERT 1000 records into the database
Run | Time taken
1 | 230.961ms
2 | 153.159ms
3 | 157.534ms
4 | 151.055ms
5 | 132.865ms
6 | 153.485ms
7 | 131.588ms
8 | 135.99ms
9 | 287.143ms
10 | 175.562ms
I did not go on to test updating records via another buffer table because I had already found my answer.
Looking at the database metrics graphs provided by AWS, it was clear that DELETE + INSERT was more CPU intensive, and from the statistics shared above, DELETE + INSERT also took more time than UPDATE.
If the updates are done concurrently in batches, yes, they will be faster, depending on the number of connections (a connection pool is recommended).
Using a buffer table, TRUNCATE, and other such methods might be more suitable if you need to update almost all the records in a giant table, though I currently do not have metrics to support this. For a limited number of records, however, UPDATE is a fine choice to proceed with.
Also, be mindful that if a DELETE + INSERT is not executed properly and fails partway, you might lose records, and if an INSERT + DELETE fails you might end up with duplicate records.
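Purely as an illustration of the concurrent-batch idea (this is not the Lambda code used for the numbers above, and my_table, id and data are placeholder names), one way to push batches of 500 through a pooled DataSource from Kotlin coroutines, one single-query UPDATE per batch:

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.runBlocking
import org.springframework.jdbc.core.JdbcTemplate
import javax.sql.DataSource

// Sketch only: each coroutine borrows its own connection from the pool behind the DataSource
// (e.g. HikariCP) and runs one UPDATE ... FROM statement for its batch.
fun updateInConcurrentBatches(dataSource: DataSource, rows: List<Pair<Long, String>>) = runBlocking {
    val jdbc = JdbcTemplate(dataSource)
    rows.chunked(500).map { batch ->
        async(Dispatchers.IO) {
            val valuesSql = batch.joinToString(", ") { "(?::bigint, ?::text)" }
            jdbc.update(
                "UPDATE my_table AS t SET data = v.data " +
                    "FROM (VALUES $valuesSql) AS v(id, data) WHERE t.id = v.id",
                *batch.flatMap { (id, data) -> listOf<Any>(id, data) }.toTypedArray()
            )
        }
    }.awaitAll()
}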

Redirecting Data in Postgres

I have a project where all the data is going into one big Postgres table like:
Time | ItemID | ItemType | Value
2021-09-16 | 1 | A | 2
2021-09-16 | 2 | B | 3
2021-09-17 | 3 | A | 3
My issue is that this table is becoming very large. Since there are only 2 ItemTypes, I'd like to have MyTableA and MyTableB and then have a 3rd table with a one-to-one mapping of ItemID to ItemType.
What is the most performant way to insert the data and redirect it to the respective table? I am currently thinking about creating a view with an INSTEAD OF trigger and then using two insert statements with joins to get the desired filtering. Is there a better way? Perhaps maintaining the ItemID_A/B in an array somewhere? Or should I figure out a way to do this client side?
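On the client-side option mentioned at the end, a minimal sketch of routing each insert to the right table in application code; all table and column names here (my_table_a, my_table_b, item_id) are made up:

import org.springframework.jdbc.core.JdbcTemplate
import java.time.LocalDate

// Pick the target table in the client instead of a database-side INSTEAD OF trigger.
data class Reading(val time: LocalDate, val itemId: Long, val itemType: String, val value: Int)

fun insertRouted(jdbc: JdbcTemplate, row: Reading) {
    val table = if (row.itemType == "A") "my_table_a" else "my_table_b"
    jdbc.update(
        "INSERT INTO $table (time, item_id, value) VALUES (?, ?, ?)",
        row.time, row.itemId, row.value
    )
}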

Postgresql sequence: lock strategy to prevent record skipping on a table queue

I have a table that acts like a queue (let's call it queue) and has a sequence from 1..N.
Some triggers insert into this queue (the triggers run inside transactions).
Then external machines keep track of the sequence number and ask the remote database: give me sequences greater than 10 (for example).
The problem:
In some cases transactions 1 and 2 begin (the numbers are just examples), but transaction 2 ends before transaction 1, and in between a host has asked the queue for sequences greater than N, so transaction 1's sequences are skipped.
How to prevent this?
I would proceed like this:
add a column state to the table that you change as soon as you process an entry
get the next entry with
SELECT ... FROM queuetab
WHERE state = 'new'
ORDER BY seq
LIMIT 1
FOR UPDATE SKIP LOCKED;
update state in the row you found and process it
As long as you do the last two actions in a single transaction, that will make sure that you are never blocked, get the first available entry and never skip an entry.
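If it helps, a minimal Kotlin/Spring sketch of "the last two actions in a single transaction"; the queuetab table and the seq/state columns come from the answer above, while the other details (the 'done' value, selecting only seq) are assumptions:

import org.springframework.jdbc.core.JdbcTemplate
import org.springframework.stereotype.Service
import org.springframework.transaction.annotation.Transactional

@Service
class QueueProcessor(private val jdbc: JdbcTemplate) {

    // Claim-and-mark in one transaction so the row lock (and its SKIP LOCKED invisibility
    // to other workers) holds until the state change is committed.
    @Transactional
    fun processNext(): Boolean {
        val seq = jdbc.query(
            """
            SELECT seq FROM queuetab
            WHERE state = 'new'
            ORDER BY seq
            LIMIT 1
            FOR UPDATE SKIP LOCKED
            """.trimIndent()
        ) { rs, _ -> rs.getLong("seq") }.firstOrNull() ?: return false

        // ... process the entry identified by seq ...

        jdbc.update("UPDATE queuetab SET state = 'done' WHERE seq = ?", seq)
        return true
    }
}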

Kotlin JPA One-To-Many @ElementCollection attempts to save duplicates resulting in constraint violations

I have the following entity defined (simplified)...
@Entity(name = "metrics")
data class MetricsEntity(
    val name: String,
    // ... other properties omitted for clarity
    @ElementCollection(fetch = FetchType.EAGER)
    @MapKeyColumn(name = "event_name")
    @Column(name = "event_count")
    @CollectionTable(name = "metric_event", joinColumns = [JoinColumn(name = "metrics_id")])
    val events: MutableMap<String, Int>,
)
The idea here is that for every entry in the metrics table we record counts of events and we end up with a table containing entries like this...
metrics_id | event_name | event_count
------------+-----------------------------------+-------------
15647624 | Launched | 1
15647624 | Registration_successful | 10
15647624 | Registration_failed | 2
15647624 | History_viewed | 1
In the code we load the metrics entities using something like this...
val metrics = metricsRepository.findByProperties(properties)
...to get a single metric. The selection criteria here have been simplified, but suffice to say we get one metric instance from this query. The repository here is defined as...
interface MetricsRepository : CrudRepository<MetricsEntity, Long> {
...
}
Now we either update the events map to add a new count of one or increment an existing count using the following code...
metrics.events[eventName] = (metrics.events[eventName] ?: 0) + 1
metricsRepository.save(metrics)
This works the vast majority of the time but every now and again the call to save throws a constraint violation called metric_event_constraint on the above table which is defined as...
ALTER TABLE metric_event ADD CONSTRAINT metric_event_constraint UNIQUE (metrics_id, event_name);
This seems to suggest that the save operation is saving a new row when a row already exists. Looking at the log, it suggests that I am getting collisions between multiple threads trying to modify the same count...
08:58:18.466 INFO 3 --- [TaskExecutor-11] incr count for event Launched, count 1
08:58:18.487 INFO 3 --- [cTaskExecutor-4] incr count for event Launched, count 1
08:58:18.618 ERROR 3 --- [cTaskExecutor-4] incr count failed for request
08:58:24.623 INFO 3 --- [TaskExecutor-94] incr count for event Launched, count 2
08:59:14.951 INFO 3 --- [askExecutor-126] incr count for event Launched, count 3
...here the first event works and increments the count, the second one fails and doesn't (constraint violation caught) and the 3rd and 4th work fine. Total count is 3 when we wanted 4. It looks to me like the 1st and 2nd event collided.
So the question is firstly do you think that this summary is correct? Secondly, how do I make it work ;)? My assumption is that I need to lock the metric entity so how would I go about that using the framework of crud repositories and entity classes used?
The database behind this is Postgres.
Regards,
Mark
So the question is firstly do you think that this summary is correct?
That's quite possible.
how do I make it work ;)?
You can lock the entity while fetching (using the second argument to EntityManager.find() if interacting with the EntityManager directly, or the @Lock annotation on a Spring Data repository method). For your scenario, I'm guessing you'll want optimistic locking, which means you'll need to add a @Version field to your entity (you might also want to retry with a pessimistic lock if the optimistic one fails, but if collisions are rare, purely optimistic locking is probably the way to go).
Of course, if you want both modification attempts to succeed, you'll need to catch the locking exception and retry the operation (I believe Spring repositories will wrap it in an ObjectOptimisticLockingFailureException, but you'll have to check).
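As a rough illustration (not a drop-in fix): give MetricsEntity a version property, for example @Version var version: Long = 0, and wrap the read-modify-save in a small retry helper along these lines:

import org.springframework.dao.OptimisticLockingFailureException

// Each attempt should re-load the entity in a fresh transaction/persistence context so that the
// current @Version value is read; Spring surfaces the conflict as an
// ObjectOptimisticLockingFailureException, a subclass of the type caught here.
fun <T> saveWithRetry(attempts: Int = 3, load: () -> T, mutate: (T) -> Unit, save: (T) -> Unit) {
    repeat(attempts) {
        try {
            val entity = load()
            mutate(entity)
            save(entity)
            return
        } catch (e: OptimisticLockingFailureException) {
            // another thread updated the row first; loop and retry with a fresh copy
        }
    }
}

// Usage, reusing the names from the question:
// saveWithRetry(
//     load = { metricsRepository.findByProperties(properties) },
//     mutate = { it.events[eventName] = (it.events[eventName] ?: 0) + 1 },
//     save = { metricsRepository.save(it) }
// )

With the version column in place, a stale save fails instead of silently inserting a duplicate collection row, and the retry re-reads the current count, so both increments end up applied.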

How to delete data from an RDBMS using Talend ELT jobs?

What is the best way to delete from a table using Talend?
I'm currently using a tELTJDBCoutput with the action on Delete.
It looks like Talend always generates a DELETE ... WHERE EXISTS (<your generated query>) query.
So I am wondering if we have to use the field values or just put a fixed value of 1 (even in only one field) in the tELTMap mapping.
To me, putting real values looks useless, since in the WHERE EXISTS only the WHERE clause matters.
Is there a better way to delete using ELT components?
My current job is set up like so:
The tELTMAP component with real data values looks like:
But I can also do the same thing with the following configuration:
Am I missing the reason why we should put something in the fields?
The following answer is a demonstration of how to perform deletes using ETL operations, where the data is extracted from the database, read into memory, transformed and then fed back into the database. After clarification, the OP specifically wants information on how this would differ for ELT operations.
If you need to delete certain records from a table then you can use the normal database output components.
In the following example, the use case is to take an updated data set, check which records are no longer in the new data set compared to the old data set, and then delete the relevant rows from the old data set. This might be used for refreshing data from one live system to a non-live system, or some other use case where you need to manually move data deltas from one database to another.
We set up our job like so:
Which has two tMySqlConnection components that connect to two different databases (potentially on different hosts), one containing our new data set and one containing our old data set.
We then select the relevant data from the old data set and inner join it using a tMap against the new data set, capturing any rejects from the inner join (rows that exist in the old data set but not in the new data set):
We are only interested in the key for the output as we will delete with a WHERE query on this unique key. Notice as well that the key has been selected for the id field. This needs to be done for updates and deletes.
And then we simply need to tell Talend to delete these rows from the relevant table by configuring our tMySqlOutput component properly:
Alternatively you can simply specify some constraint that would be used to delete the records as if you had built the DELETE statement manually. This can then be fed in as the key via a main link to your tMySqlOutput component.
For instance I might want to read in a CSV with a list of email addresses, first names and last names of people who are opting out of being contacted and then make all of these fields a key and connect this to the tMySqlOutput and Talend will generate a DELETE for every row that matches the email address, first name and last name of the records in the database.
In the first example shown in your question:
you are specifically only selecting (for the deletion) products where SOME_TABLE.CODE_COUNTRY is equal to JS_OPP.CODE_COUNTRY and SOME_TABLE.FK_USER is equal to JS_OPP.FK_USER in your WHERE clause, and then the data you send to the delete statement sets CODE_COUNTRY equal to JS_OPP.CODE_COUNTRY and FK_USER equal to JS_OPP.FK_USER.
If you were to put a tLogRow (or some other output) directly after your tELTxMap you would be presented with something that looks like:
.----------+---------.
| tLogRow_1 |
|=-----------+------=|
|CODE_COUNTRY|FK_USER|
|=-----------+------=|
|GBR |1 |
|GBR |2 |
|USA |3 |
'------------+-------'
In your second example:
You are setting CODE_COUNTRY to an integer of 1 (your database will then translate this to a VARCHAR "1"). This would then mean the output from the component would instead look like:
.------------.
|tLogRow_1 |
|=-----------|
|CODE_COUNTRY|
|=-----------|
|1 |
|1 |
|1 |
'------------'
In your use case this would mean that the deletion should only delete the rows where the CODE_COUNTRY is equal to "1".
You might want to test this a bit further though because the ELT components are sometimes a little less straightforward than they seem to be.