spring batch partitioning performance issue - spring-batch

We have a Spring Batch job where we are trying to process around 10 million records. Doing this in a single thread would be far too slow to meet our SLA.
To improve performance, we have developed a POC where the master step creates partitions, each partition representing one unique prod id. The number of unique prod ids can range anywhere from 500 to 4500; in the POC we have 500 of them. Each partition is given a prod id and the worker step processes it. End to end, all of this works fine.
What we noticed is that the master step takes more than 5 minutes to send the partition info out as step execution requests. In other words, there is a gap of more than 5 minutes between the master step generating the partitions and the step being executed for the first partition.
What might be causing this slowness? What does the Spring Batch framework do during these 5 minutes?
Here are the 3 SELECTs which are executed over and over during those 5 minutes:
SELECT JOB_EXECUTION_ID, START_TIME, END_TIME, STATUS, EXIT_CODE, EXIT_MESSAGE, CREATE_TIME, LAST_UPDATED, VERSION, JOB_CONFIGURATION_LOCATION from BATCH_JOB_EXECUTION where JOB_INSTANCE_ID = ? order by JOB_EXECUTION_ID desc;
SELECT JOB_EXECUTION_ID, KEY_NAME, TYPE_CD, STRING_VAL, DATE_VAL, LONG_VAL, DOUBLE_VAL, IDENTIFYING from BATCH_JOB_EXECUTION_PARAMS where JOB_EXECUTION_ID = ?;
SELECT STEP_EXECUTION_ID, STEP_NAME, START_TIME, END_TIME, STATUS, COMMIT_COUNT, READ_COUNT, FILTER_COUNT, WRITE_COUNT, EXIT_CODE, EXIT_MESSAGE, READ_SKIP_COUNT, WRITE_SKIP_COUNT, PROCESS_SKIP_COUNT, ROLLBACK_COUNT, LAST_UPDATED, VERSION from BATCH_STEP_EXECUTION where JOB_EXECUTION_ID = ? order by STEP_EXECUTION_ID;

Take a look at your job repository's configuration. Once the Partitioner has created the ExecutionContexts for each slave step, the master creates a StepExecution for each one before sending it to the slave to be processed. So that lag is probably due to the insertion of all of those StepExecutions into your job repository. As a follow-up, make sure you're using the latest version; an optimization was made to that not too long ago (batch inserting the executions instead of doing it one by one).
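If you want to see where the time goes, enable SQL logging on the job repository's DataSource and watch the metadata writes. Purely as an illustration (this is a sketch of the idea, not the framework's actual statement text, and all values are made up), persisting 500 partition StepExecutions one at a time means 500 round trips, whereas the batched optimization collapses them into a single multi-row insert:
-- One statement per partition (older behaviour): one round trip each
INSERT INTO BATCH_STEP_EXECUTION (STEP_EXECUTION_ID, VERSION, STEP_NAME, JOB_EXECUTION_ID, STATUS)
VALUES (1001, 0, 'workerStep:partition0', 99, 'STARTING');
INSERT INTO BATCH_STEP_EXECUTION (STEP_EXECUTION_ID, VERSION, STEP_NAME, JOB_EXECUTION_ID, STATUS)
VALUES (1002, 0, 'workerStep:partition1', 99, 'STARTING');
-- ... repeated for every partition

-- Batched (newer behaviour): one statement for all partitions
INSERT INTO BATCH_STEP_EXECUTION (STEP_EXECUTION_ID, VERSION, STEP_NAME, JOB_EXECUTION_ID, STATUS)
VALUES (1001, 0, 'workerStep:partition0', 99, 'STARTING'),
       (1002, 0, 'workerStep:partition1', 99, 'STARTING');
       -- ... one row per partition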

Related

How to lock row for multi worker rate limit?

I have multiple workers distributed across multiple nodes that scrape HTML. I need to specify a rate limit in the system so no domain gets more than 1 request every 5 seconds.
Each worker has access to a shared database (PostgreSQL) so I created a table with 2 columns:
domain key, last scan date
In the worker code I want to check the last scan date before making a request. The problem is that thousands of workers could receive the same domain at almost the same instant if tasks are distributed round robin, so if they all read the row at once they will all see no recent scan and all fire off requests. I need a way to lock the row so that the first worker to check engages the lock, makes the scan, then updates the scan date and releases the lock. All the other workers can then see that a lock exists on the row and reject the task so it gets re-scheduled.
I read the manual page on locks and found it very confusing. It said all the explicit locks are table locks, and I didn't really understand what it means about conflicts. I am going to need multiple workers to be able to lock/unlock different rows at the same time, and also to check whether a lock exists before placing one, so that a worker doesn't hang waiting for the lock to be released and can move on to the next task.
What type of lock do I need? Are there any good examples showing this type of lock?
If I just wrap each process in a transaction will that work?
Your core code would be the block:
begin;
set transaction isolation level read committed; -- should already be the default
select domain_key
from your_table
where last_scan_date < now() - interval '5 seconds'
limit 1
for update skip locked;
-- Do your stuff here, and issue a rollback if it fails
update your_table
set last_scan_date = <value goes here>
where domain_key = <value goes here>;
commit;
I expect this will be used from a host language. The following example snippet of a worker is in Python:
import psycopg2

conn = psycopg2.connect('<db connect parameters>')
conn.autocommit = False
c = conn.cursor()
c.execute("set transaction isolation level read committed;")
c.execute("""
    select domain_key
    from your_table
    where last_scan_date < now() - interval '5 seconds'
    order by last_scan_date
    limit 1
    for update skip locked
""")
row = c.fetchone()  # None when no domain is due or every due row is locked
if row:
    domain_key = row[0]
    result = process_url(domain_key)  # <-- This is your scraping routine
    if result == 'Ok':
        c.execute("""
            update your_table
            set last_scan_date = now()
            where domain_key = %s
        """, (domain_key,))
        conn.commit()
    else:
        conn.rollback()
else:
    conn.rollback()  # nothing claimed; end the transaction and move on

Postgresql query construction

I need to write a query that polls a database table only if I (the application process) am the leader. I plan on implementing leader election via database reservation (lock a table and update the leader record, if available, every so often). How can I combine the leader election query with the polling query so that the polling query is guaranteed not to run in a process that is not the leader? This needs to be a DB-only solution (for a variety of reasons).
I'm thinking something like
SELECT *
FROM outbound_messages
WHERE status = 'READY'
AND 'JVM1' IN (SELECT jvm_name
FROM leader
WHERE leader_status = 'active')
Will this work?
This seems overly complicated.
In PostgreSQL you can do that easily with advisory locks. Choose a bigint lock number (I chose 42) and query like this:
WITH ok(ok) AS (
    SELECT pg_try_advisory_lock(42)
)
SELECT o.*
FROM outbound_messages o
CROSS JOIN ok
WHERE ok AND o.status = 'READY';
Only the first caller will obtain the lock and get a result.
You can release the lock by ending your session or calling
SELECT pg_advisory_unlock(42);
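If you want leadership to persist beyond a single polling query, a variation on the same idea (just a sketch, using the same assumed table and lock number) is to take the session-level advisory lock once when the process starts and keep polling only while you hold it:
-- Run once at startup; the lock is tied to the session (connection),
-- so only one process at a time gets 'true' back.
SELECT pg_try_advisory_lock(42) AS i_am_leader;

-- Only if the call above returned true, run the normal polling query:
SELECT *
FROM outbound_messages
WHERE status = 'READY';

-- Step down explicitly (or simply close the connection):
SELECT pg_advisory_unlock(42);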

job queue with multiple consumers did the same job twice

Actually a lot of things might be covered here: Job queue as SQL table with multiple consumers (PostgreSQL)
However I just wanted to ask for my specific query.
Currently I have a job queue that should emit a new job for every consumer; however, we found out that we sometimes got the same job twice on different consumers (probably a race condition).
This was our query (run inside a transaction):
UPDATE invoice_job SET status = 'working', date_time_start = now(),
node = $ip
WHERE id = (SELECT id FROM invoice_job WHERE status = 'created' ORDER BY id LIMIT 1)
RETURNING *
The table is currently pretty simple: it has a status field (which can be "created", "working", or "done"), a date_time_start field, a created field (not used in the query), an id field, and a node field (where the job was run).
However, this emitted the same job twice at one point.
I have now changed the query to:
UPDATE invoice_job SET status = 'working', date_time_start = now(),
node = $ip
WHERE id = (SELECT id FROM invoice_job WHERE status = 'created' ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED)
RETURNING *
Would that actually help, and emit each job only once?
Your solution with FOR UPDATE SKIP LOCKED is fine. It'll ensure a row is locked by exactly one session before being updated for processing. No transaction can choose a row already locked by another transaction, and when the lock is released on commit, subsequent SELECT clauses will no longer match the row.
The original failed because the subquery's SELECT can choose the same row concurrently in multiple sessions, each of which then tries to UPDATE the row. There's no WHERE clause in the UPDATE that'd make that fail; it's perfectly fine for two concurrent sessions to UPDATE invoice_job SET status = 'working' WHERE node = 42 or whatever. The second update will happily run and commit as soon as the first update commits.
You could also make it safe by repeating the WHERE clause in the UPDATE
UPDATE invoice_job SET status = 'working', date_time_start = now(),
node = $ip
WHERE id = (SELECT id FROM invoice_job WHERE status = 'created' ORDER BY id LIMIT 1)
AND status = 'created'
RETURNING *
... but this will often return zero rows under high concurrency.
In fact it will return zero rows for all but one of a set of concurrent executions, so it's no better than a serial queue worker. This is true of most of the other "clever" tricks people use to try to do concurrent queues, and one of the main reasons SKIP LOCKED was introduced.
The fact that you only noticed this problem now tells me that you would actually be fine with a simple, serial queue dispatch where you LOCK TABLE before picking the first row. But SKIP LOCKED will scale better if your workload grows.
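For completeness, that serial dispatch might look like this (a sketch under the assumption that invoice_job is as described; 'worker-1' stands in for your $ip value):
BEGIN;
-- Only one dispatcher at a time can hold this lock, so job picking is serialized.
LOCK TABLE invoice_job IN EXCLUSIVE MODE;
UPDATE invoice_job SET status = 'working', date_time_start = now(),
node = 'worker-1'
WHERE id = (SELECT id FROM invoice_job WHERE status = 'created' ORDER BY id LIMIT 1)
RETURNING *;
COMMIT;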

Postgres 9.4 detects Deadlock when read-modify-write on single table

We have an application with a simple table
given_entity {
    UUID id;
    TimeStamp due_time;
    TimeStamp process_time;
}
This is a Spring Boot (1.2.5.RELEASE) application that uses spring-data-jpa 1.2.5.RELEASE with hibernate-4.3.10.FINAL as the JPA provider.
We have 5 instances of this application, each of them running a scheduler every 2 seconds that queries the database for rows with a due_time within the last 2 minutes that are not yet processed:
SELECT * FROM given_entity
WHERE process_time is null and due_time between now() and NOW() - INTERVAL '2 minutes'
FOR UPDATE
The requirement is that each row of the above table gets successfully processed by exactly one of the application instances.
The application instance then processes these rows and updates their process_time field in one transaction.
This may or may not take more than 2 seconds, which is the scheduler interval.
Also, we don't have any index on this table other than the PK index.
A second point worth noting is that these instances might also insert rows into this table; that insert is invoked separately by clients.
Problem: in the logs I see this message from PostgreSQL (rarely, but it happens):
ERROR: deadlock detected
Detail: Process 10625 waits for ShareLock on transaction 25382449; blocked by process 10012.
Process 10012 waits for ShareLock on transaction 25382448; blocked by process 12238.
Process 12238 waits for AccessExclusiveLock on tuple (1371,45) of relation 19118 of database 19113; blocked by process 10625.
Hint: See server log for query details.
Where: while locking tuple (1371,45) in relation "given_entity"
Question:
How does this happen?
I checked the PostgreSQL documentation on locks and searched the internet. I didn't find anything that says a deadlock is possible on just one simple table.
I also couldn't reproduce this error in a test.
Process A tries to lock row 1 followed by row 2. Meanwhile, process B tries to lock row 2 then row 1. That's all it takes to trigger a deadlock.
The problem is that the row locks are acquired in an indeterminate order, because the SELECT returns its rows in an indeterminate order. And avoiding this is just a matter of ensuring that all processes agree on an order when locking rows, i.e.:
SELECT * FROM given_entity
WHERE process_time is null and due_time between now() and NOW() - INTERVAL '2 minutes'
ORDER BY id
FOR UPDATE
In Postgres 9.5+, you can simply ignore any row which is locked by another process using FOR UPDATE SKIP LOCKED.
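Applied to the query above, that would look like this (a sketch; note that BETWEEN needs the lower bound first, so the interval expression goes before now()):
SELECT * FROM given_entity
WHERE process_time IS NULL
  AND due_time BETWEEN now() - INTERVAL '2 minutes' AND now()
ORDER BY id
FOR UPDATE SKIP LOCKED;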
This can easily happen.
There are probably several rows that satisfy the condition
due_time BETWEEN now() AND now() - INTERVAL '2 minutes'
so it can easily happen that the SELECT ... FOR UPDATE finds and locks one row and then is blocked locking the next row. Remember – for a deadlock it is not necessary that more than one table is involved, it is enough that more than one lockable resource is involved. In your case, those are two different rows in the given_entity table.
It may even be that the deadlock happens between two of your SELECT ... FOR UPDATE statements.
Since you say that there is none but the primary key index on the table, the query has to perform a sequential scan. In PostgreSQL, there is no fixed order for rows returned from a sequential scan. Rather, if two sequential scans run concurrently, the second one will “piggy-back” on the first and will start scanning the table at the current location of the first sequential scan.
You can check if that is the case by setting the parameter synchronize_seqscans to off and seeing if the deadlocks vanish. Another option would be to take a SHARE ROW EXCLUSIVE lock on the table before you run the statement.
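Both experiments are short; as a sketch (the table lock must be taken in the same transaction as the SELECT ... FOR UPDATE):
-- Option 1: rule out synchronized sequential scans for this session.
SET synchronize_seqscans = off;

-- Option 2: let only one scheduler at a time lock rows in the table
-- (SHARE ROW EXCLUSIVE conflicts with itself).
BEGIN;
LOCK TABLE given_entity IN SHARE ROW EXCLUSIVE MODE;
SELECT * FROM given_entity
WHERE process_time IS NULL
  AND due_time BETWEEN now() - INTERVAL '2 minutes' AND now()
FOR UPDATE;
-- ... process the rows and update process_time ...
COMMIT;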
Switch on Hibernate batch updates in your application.properties:
hibernate.jdbc.batch_size=100
hibernate.order_updates=true
hibernate.order_inserts=true
hibernate.jdbc.fetch_size=400
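Note that in a Spring Boot application, Hibernate settings in application.properties usually need the spring.jpa.properties. prefix to actually reach Hibernate (assuming the default Spring Boot JPA auto-configuration), e.g.:
spring.jpa.properties.hibernate.jdbc.batch_size=100
spring.jpa.properties.hibernate.order_updates=true
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.jdbc.fetch_size=400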

Postgres deadlock with read_committed isolation

We have noticed a rare occurrence of a deadlock on a PostgreSQL 9.2 server in the following situation:
T1 starts the batch operation:
UPDATE BB bb SET status = 'PROCESSING', chunk_id = 0 WHERE bb.status ='PENDING'
AND bb.bulk_id = 1 AND bb.user_id IN (SELECT user_id FROM BB WHERE bulk_id = 1
AND chunk_id IS NULL AND status ='PENDING' LIMIT 2000)
When T1 commits after a few hundred milliseconds or so (BB has many millions of rows), multiple threads begin new Transactions (one transaction per thread) that read items from BB, do some processing and update them in batches of 50 or so with the queries:
For select:
SELECT * FROM (
    SELECT *, RANK() OVER (ORDER BY user_id) AS rno
    FROM BB WHERE status = 'PROCESSING' AND bulk_id = 1
) ranked WHERE rno = $1
And Update:
UPDATE BB set datetime=$1, status='DONE', message_id=$2 WHERE bulk_id=1 AND user_id=$3
(user_id, bulk_id have a UNIQUE constraint).
Due to a problem external to this situation, another transaction T2 executes the same query as T1 almost immediately after T1 has committed (the initial batch operation in which items are marked as 'PROCESSING').
UPDATE BB bb SET status = 'PROCESSING', chunk_id = 0 WHERE bb.status ='PENDING'
AND bb.bulk_id = 1 AND bb.user_id IN (SELECT user_id FROM BB WHERE bulk_id = 1
AND chunk_id IS NULL AND status ='PENDING' LIMIT 2000)
However, although these items are already marked as 'PROCESSING', this query deadlocks with some of the updates (which, as I said, are done in batches) from the worker threads. To my understanding this should not happen with the READ COMMITTED isolation level (the default) that we use. I am sure that T1 has committed, because the worker threads only execute after it has done so.
Edit: one thing I should clear up is that T2 starts after T1 but before T1 commits. However, due to a write-exclusive tuple lock that we acquire with a SELECT ... FOR UPDATE on the same row (one that is not affected by any of the above queries), T2 waits for T1 to commit before it runs the batch update query.
When T1 commits after a few hundred milliseconds or so (BB has many millions of rows), multiple threads begin new Transactions (one transaction per thread) that read items from BB, do some processing and update them in batches of 50 or so with the queries:
This strikes me as a concurrency problem. I think you are far better off having one transaction read the rows and hand them off to worker processes, and then update them in batches when they come back. Your fundamental problem is that these rows effectively hold uncertain state while they sit inside long transactions. You have to handle rollbacks and so forth separately, and consequently the locking is a real problem.
Now, if that solution is not possible, I would have a separate locking table. In this case, each thread spins up separately, locks the locking table, claims a bunch of rows, inserts records into the locking table, and commits. In this way each thread has claimed its own records. Then they can work on their record sets, update them, etc. You may want to have a process which periodically clears out stale locks. A sketch of this follows below.
In essence your problem is that rows go from state A -> processing -> state B and may be rolled back. Since the other threads have no way of knowing which rows are being processed, and by which threads, you can't safely allocate records. One option is to change the model to:
state A -> claimed state -> processing -> state B. However, you have to have some way of ensuring that rows are effectively allocated and that the threads know which rows have been allocated to which threads.
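A minimal sketch of such a claim table, under my own assumptions (the table name bb_claim, the worker label 'worker-7', the batch size of 50 and the ten-minute staleness cutoff are illustrative, not from the original setup):
-- Claim table: records which worker owns which row, and since when.
CREATE TABLE bb_claim (
    bulk_id    bigint NOT NULL,
    user_id    bigint NOT NULL,
    claimed_by text   NOT NULL,
    claimed_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (bulk_id, user_id)
);

-- Each worker claims a batch: serialize claimers on the claim table,
-- record the claims, and commit straight away so the claims are visible.
BEGIN;
LOCK TABLE bb_claim IN EXCLUSIVE MODE;
INSERT INTO bb_claim (bulk_id, user_id, claimed_by)
SELECT bulk_id, user_id, 'worker-7'
FROM BB
WHERE bulk_id = 1 AND status = 'PROCESSING'
  AND (bulk_id, user_id) NOT IN (SELECT bulk_id, user_id FROM bb_claim)
LIMIT 50;
COMMIT;

-- The worker then updates only the rows it claimed, and releases the claims.
UPDATE BB SET status = 'DONE', datetime = now()
WHERE bulk_id = 1
  AND user_id IN (SELECT user_id FROM bb_claim
                  WHERE bulk_id = 1 AND claimed_by = 'worker-7');
DELETE FROM bb_claim WHERE claimed_by = 'worker-7';

-- Periodically clear out stale claims (e.g. from crashed workers).
DELETE FROM bb_claim WHERE claimed_at < now() - INTERVAL '10 minutes';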