How to implement a merge statement for DB2 ZOS using a JdbcItemWriter in Spring Batch?

I have a requirement to read 50K records from one database and then insert or update those records in another database. The read takes a couple of seconds, but the inserts/updates for those 50K records are taking up to 23 minutes, even with multithreading. I have been playing with page, fetch and batch sizes but the performance isn't improving much.
Is there a way to implement this kind of statement in a JdbcItemWriter?
-- hv_nrows = 3
-- hv_activity(1) = 'D'; hv_description(1) = 'Dance'; hv_date(1) = '03/01/07'
-- hv_activity(2) = 'S'; hv_description(2) = 'Singing'; hv_date(2) = '03/17/07'
-- hv_activity(3) = 'T'; hv_description(3) = 'Tai-chi'; hv_date(3) = '05/01/07'
-- hv_group = 'A';
-- note that hv_group is not an array; all 3 rows use the same value
MERGE INTO RECORDS AR
USING (VALUES (:hv_activity, :hv_description, :hv_date, :hv_group)
FOR :hv_nrows ROWS)
AS AC (ACTIVITY, DESCRIPTION, DATE, GROUP)
ON AR.ACTIVITY = AC.ACTIVITY AND AR.GROUP = AC.GROUP
WHEN MATCHED
THEN UPDATE SET (DESCRIPTION, DATE, LAST_MODIFIED)
= (AC.DESCRIPTION, AC.DATE, CURRENT TIMESTAMP)
WHEN NOT MATCHED
THEN INSERT (GROUP, ACTIVITY, DESCRIPTION, DATE, LAST_MODIFIED)
VALUES (AC.GROUP, AC.ACTIVITY, AC.DESCRIPTION, AC.DATE, CURRENT TIMESTAMP)
NOT ATOMIC CONTINUE ON SQLEXCEPTION;
My idea is to send a bunch of rows to be merged at once and see if the performance improves.
I tried something like this:
MERGE INTO TEST.DEST_TABLE A
USING (VALUES
('00000031955190','0107737793'),
('00000118659978','0107828212'),
('00000118978436','0095878120'),
('00000122944473','0106845043')
) AS B(CDFILIAC,CDRFC)
ON A.CDFILIAC = B.CDFILIAC
WHEN MATCHED THEN
UPDATE SET
A.CDRFC = B.CDRFC
And while it works for DB2 LUW, it doesn't for DB2 z/OS.

Is there a way to implement this kind of statement in a JdbcItemWriter?
The JdbcBatchItemWriter delegates to org.springframework.jdbc.core.namedparam.NamedParameterJdbcOperations. So if NamedParameterJdbcOperations from Spring Framework supports this kind of query, so does Spring Batch. Otherwise, you would need to create a custom writer.
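For example, a per-item MERGE with named parameters can be handed to the JdbcBatchItemWriter (e.g. built with JdbcBatchItemWriterBuilder and beanMapped()), and the writer then executes each chunk as a single JDBC batch. A minimal sketch, reusing the table and column names from the second example in the question and assuming your DB2 for z/OS version accepts a single-row VALUES source:
-- one MERGE per item; :cdfiliac and :cdrfc are bound from the item's properties
MERGE INTO TEST.DEST_TABLE A
USING (VALUES (:cdfiliac, :cdrfc)) AS B (CDFILIAC, CDRFC)
ON A.CDFILIAC = B.CDFILIAC
WHEN MATCHED THEN
UPDATE SET A.CDRFC = B.CDRFC
WHEN NOT MATCHED THEN
INSERT (CDFILIAC, CDRFC)
VALUES (B.CDFILIAC, B.CDRFC)
The chunk size then controls how many of these statements are sent to the server per batch.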
while it works for DB2 LUW, it doesn't for DB2 z/OS.
Spring Batch cannot help at this level, as this seems to be DB specific.

Related

How to improve EF Core Performance if interacting with Oracle and PostgreSql in a single request to handle more than 200k records

I have a .NET Core Web API endpoint which extracts data from an Oracle DB and saves it into a PostgreSQL DB. I am using the latest Entity Framework Core, with Oracle.ManagedDataAccess.Core and Npgsql.EntityFrameworkCore.PostgreSQL to connect to the respective databases.
Internally, in the service layer that the API uses to reach my infrastructure layer, I have separate methods to call:
Call OracleRepository - in order to pull records from 5 tables (each in a different method inside the repository):
Fetch Table A data
Fetch Table B data, and so on up to Table E
Call PostgreSqlRepository - in order to store the data of each table (fetched from Oracle DB) into the PostgreSQL DB using the Code First approach (again, each in a different method inside the repository).
No of records in each table:
A - 6.7k
B - 113k
C - 56k
D - 5.8k
E - 5.3k
Now all the above steps take around 45 seconds to complete. Any suggestions to improve the performance here?
Is there a way to fetch data asynchronously from the DB and store it? I have used the Transient lifetime for both the Oracle and PostgreSQL repositories and for all other services in my .NET Core application.
Note: each time, I am truncating my PostgreSQL tables before inserting data (using the RemoveRange method of EF Core).
When bulk-loading tables in relational databases, there's something to keep in mind.
By default, each INSERT statement runs in its own database transaction. The ACID rules of the database require things to work correctly even if many concurrent sessions are querying the table you're loading, so that automatic commit after every statement is time-consuming.
You can work around this when bulk loading. The best way is with the COPY command. To use it, extract your data from your source database (Oracle) and write it into a temporary CSV flat file on your file system; that part is fast. Then use a statement like this to copy the file into the table. COPY is designed to minimize that per-statement transaction overhead.
COPY A FROM '/the/path/to/your/file.csv' DELIMITER ',' CSV HEADER;
If you can't use COPY, know this: if you batch up your INSERTs in explicit transactions, like this,
START TRANSACTION;
INSERT INTO A (col, col) VALUES (val, val);
INSERT INTO A (col, col) VALUES (val, val);
INSERT INTO A (col, col) VALUES (val, val);
INSERT INTO A (col, col) VALUES (val, val);
INSERT INTO A (col, col) VALUES (val, val);
COMMIT;
things work a lot faster than if you do the inserts one by one. Batches of about 100 rows will typically get you a tenfold performance improvement.
In EF Core, you can do transactions with code like the following. It's only a sketch: it lacks exception handling and other things you need, and WhateverContext, Table and Row stand in for your own context and entity types.
using var context = new WhateverContext();
const int transactionLength = 100;
var rows = transactionLength;
var transaction = context.Database.BeginTransaction();
foreach (var row in yourInput) {   // yourInput: whatever sequence you read from Oracle
    context.Table.Add(new Row { /* whatever */ });
    if (--rows <= 0) {
        /* finish current transaction, start new one. */
        context.SaveChanges();
        transaction.Commit();
        transaction.Dispose();
        transaction = context.Database.BeginTransaction();
        rows = transactionLength;
    }
}
/* finish current transaction */
context.SaveChanges();
transaction.Commit();
transaction.Dispose();

How to store an array of values into a variable

I have a function which carries out complex load balancing, and I need to first find out the list of idle servers. After that, I have to iterate over a subset of that list, and finally I have to do a lot of complex things, so I don't want to constantly query the list over and over again. See below for an example (note that this is PSEUDOCODE ONLY).
CREATE OR REPLACE FUNCTION load_balance (p_company_id BIGINT, p_num_min_idle_servers BIGINT)
RETURNS VOID
AS $$
DECLARE
v_idle_server_ids BIGINT [];
v_num_idle_servers BIGINT;
v_num_idle_servers_to_retire BIGINT;
BEGIN
PERFORM * FROM server FOR UPDATE;
SELECT
count(server.id)
INTO
v_num_idle_servers
FROM
server
WHERE
server.company_id = p_company_id
AND
server.connection_count = 0
AND
server.state = 'up';
v_num_idle_servers_to_retire := v_num_idle_servers - p_num_min_idle_servers;
SELECT
server.id
INTO
v_idle_server_ids
FROM
server
WHERE
server.company_id = p_company_id
AND
server.connection_count = 0
AND
server.state = 'up'
ORDER BY
server.id;
FOR i IN 1..v_num_idle_servers_to_retire LOOP
UPDATE
server
SET
state = 'down'
WHERE
server.id = v_idle_server_ids[i];
END LOOP;
END;
$$ LANGUAGE plpgsql;
Question: I was thinking of getting the list of servers and looping over them one by one. Is this possible in postgres?
Note: I tried putting it all in one single query but it gets very, VERY complicated as there are multiple joins and subqueries. For example, a system might have three applications running on three different servers, where the applications know their load but the servers know their company affiliation, so I would need to join the system to the applications and the applications to the servers.
Rule of thumb: if you're looping in SQL there's probably a better way.
You want to set state = 'down' until you have a certain number of idle servers.
We can do this in a single statement: use a Common Table Expression to query your idle servers and feed that to an UPDATE ... FROM. If you do this a lot you can turn the CTE into a view.
But we need to limit how many we take down based on how many idle servers there are. We can do that with a LIMIT, but UPDATE ... FROM doesn't take one, so we need a second CTE to limit the results.
Here I've hard coded company_id 1 and p_num_min_idle_servers 2.
with idle_servers as (
select id
from server
where server.company_id=1
and connection_count=0
and state = 'up'
),
idle_servers_to_take_down as (
select id
from idle_servers
-- limit doesn't work in update from
limit (select count(*) - 2 from idle_servers)
)
update server
set state = 'down'
from idle_servers_to_take_down
where server.id = idle_servers_to_take_down.id
This has the advantage of being done in one statement, avoiding race conditions, without having to lock the whole table.
Try it.

Postgresql and dblink: how do I do an UPDATE FROM?

Here's what works already, but it's using a loop:
(I am updating the nickname and slug fields on the remote table for each row in a local table)
DECLARE
row_ record;
rdbname_ varchar;
....
/* select from local */
FOR row_ IN SELECT rdbname, objectvalue1 as keyhash, cvalue1 as slug, cvalue2 as nickname
FROM bme_tag
where rdbname = rdbname_
and tagtype = 'NAME'
and wkseq = 0
LOOP
/* update remote */
PERFORM dblink_exec('sysdb',
format(
'update bme_usergroup
set nickname = %L
,slug = %L
where rdbname = %L
and wkseq = 0
and keyhash = %L'
, row_.nickname, row_.slug, row_.rdbname, row_.keyhash)
);
END LOOP;
Now, what I would like to do instead is to do a bulk UPDATE (remote) FROM (local)
PERFORM dblink_exec('sysdb',
'update (remote)bme_usergroup
set nickname = bme_tag.cvalue2, slug=bme_tag.cvalue1
from (local).bme_tag s
where bme_usergroup.rdbname = %L
and bme_usergroup.wkseq = 0
and bme_usergroup.keyhash = s.keyhash
and bme_usergroup.rdbname = s.rdbname
)
I've gotten this far by looking at various solutions (postgresql: INSERT INTO ... (SELECT * ...)) and I know how to separate the remote and local tables of the query in the context of SELECT, DELETE and even INSERT/SELECT. And I can do that direct update with bind variables too. But how about UPDATE FROM?
If it's not possible, should I look into Postgres's FOREIGN TABLE or something similar?
The local and remote db are both on the same Postgres server. One additional bit of information, if it matters, is that either database may be dropped and restored separately from the other, and I'd prefer a lightweight solution that doesn't take a lot of configuration each time to reestablish communication.
Yes, you should use foreign tables with postgres_fdw.
That way you could just write your UPDATE statement like you would for a local table.
This should definitely be faster, but you might still be exchanging a lot of data between the databases.
If that's an option, it will probably be fastest to run the statement on the database where the updated table is and define the other table as a foreign table. That way you will probably avoid fetching and then sending the table data.
Use EXPLAIN to see what exactly happens!
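For illustration, a minimal sketch of that setup with postgres_fdw, run on the database holding bme_usergroup (the server name, connection options, user mapping and column types are placeholders to adapt to your environment):
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER local_src FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'localhost', dbname 'local_db');
CREATE USER MAPPING FOR CURRENT_USER SERVER local_src
OPTIONS (user 'app_user', password 'secret');
-- expose the local bme_tag table here as a foreign table
CREATE FOREIGN TABLE bme_tag_ft (
rdbname varchar,
wkseq integer,
tagtype varchar,
objectvalue1 varchar,
cvalue1 varchar,
cvalue2 varchar
) SERVER local_src OPTIONS (table_name 'bme_tag');
-- now a plain UPDATE ... FROM works across the two databases
UPDATE bme_usergroup u
SET nickname = s.cvalue2,
slug = s.cvalue1
FROM bme_tag_ft s
WHERE u.rdbname = s.rdbname
AND u.keyhash = s.objectvalue1
AND u.wkseq = 0
AND s.tagtype = 'NAME'
AND s.wkseq = 0;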

How to use UPDATE ... FROM in SQLAlchemy?

I would like to write this kind of statement in SQLAlchemy / Postgres:
UPDATE slots
SET user = 'joe'
FROM (SELECT id FROM slots WHERE user IS NULL
ORDER BY id LIMIT 1000) AS available
WHERE slots.id = available.id
RETURNING *;
Namely, I would like to update a limited number of rows matching specified criteria.
I was able to do it this way:
limited_slots = select([slots.c.id]).\
where(slots.c.user==None).\
order_by(slots.c.id).\
limit(1000)
stmt = slots.update().returning(slots).\
values(user='joe').\
where(slots.c.id.in_(limited_slots))
I don't think it's as efficient as the original SQL query; however, if the database memory is large enough to hold all related segments, it shouldn't make much difference.
It's been a while since I used sqlalchemy so consider the following as pseudocode:
for i in session.query(Slots).filter(Slots.user == None):
    i.user = "Joe"
    session.add(i)
session.commit()
I recommend the sqlalchemy ORM tutorial.

Performance issues with Replace Function in T-SQL

I have a large table that I am working on, and for nearly all of the columns I need to use the REPLACE function to remove single and double quotes. The code looks like this:
SET QUOTED_IDENTIFIER ON
Update dbo.transactions set transaction_name1 = Replace(transaction_name1,'''','')
Update dbo.transactions set transaction_name2 = Replace(transaction_name2,'''','')
Update dbo.transactions set transaction_name3 = Replace(transaction_name3,'''','')
Update dbo.transactions set transaction_name4 = Replace(transaction_name4,'''','')
Update dbo.transactions set transaction_name5 = Replace(transaction_name5,'''','')
I have not put an index on the table, as I was not sure exactly which column would be any good given that I'm updating nearly all the columns. If I sorted the table in ascending order by the primary key, would that help increase performance?
Other than that, the statements have been running for over 2 hours with no error messages, and I wondered if there is a solution to this performance issue besides the usual hardware changes. Any advice on ways of increasing the performance of the script would be appreciated.
Cheers,
Peter
You can make this a single UPDATE statement:
UPDATE transactions SET
transaction_name1 = Replace(transaction_name1,'''',''),
transaction_name2 = Replace(transaction_name2,'''','')
... (and so on)
That would likely improve the performance by something approaching a factor of 5.
Edit:
Since this is a one shot thing on a huge dataset (90MM rows), I suggest adding in a where clause and running it in batches.
If your transactions have a primary key, partition the updates on that, doing maybe 500k at once.
Do this in a loop with explicit transactions to keep your log use to a minimum:
DECLARE @BaseID INT, @BatchSize INT
SELECT @BaseID = MAX(YourKey), @BatchSize = 500000 FROM transactions
WHILE @BaseID > 0 BEGIN
PRINT 'Updating from ' + CAST(@BaseID AS VARCHAR(20))
-- perform update
UPDATE transactions SET
transaction_name1 = Replace(transaction_name1,'''',''),
transaction_name2 = Replace(transaction_name2,'''','')
-- ... (and so on)
WHERE YourKey BETWEEN @BaseID - @BatchSize AND @BaseID
SET @BaseID = @BaseID - @BatchSize - 1
END
Another note:
If the quotes must not appear in your data, you can create a check constraint to keep them out. It's a last ditch effort as any app attempting to put them in would need to handle a database exception, but it will keep your data clean. Something like this might do it:
ALTER TABLE transactions
ADD CONSTRAINT CK_NoQuotes CHECK(
CHARINDEX('''',transaction_name1)=0 AND
CHARINDEX('''',transaction_name2)=0 AND
-- and so on...
)
You can combine the statements; that might be a bit faster:
SET QUOTED_IDENTIFIER ON
UPDATE dbo.transactions
SET transaction_name1 = REPLACE(transaction_name1,'''',''),
transaction_name2 = REPLACE(transaction_name2,'''',''),
transaction_name3 = REPLACE(transaction_name3,'''',''),
transaction_name4 = REPLACE(transaction_name4,'''',''),
transaction_name5 = REPLACE(transaction_name5,'''','')
Also, check out the estimated execution plan.
It might give you useful advice on how to optimize your database/query (it's a little square button in the toolbar of SQL Server Management Studio).
You might try making this only a single UPDATE and only updating the rows that need it:
UPDATE dbo.transactions
SET transaction_name1 = REPLACE(transaction_name1,'''',''),
transaction_name2 = REPLACE(transaction_name2,'''',''),
transaction_name3 = REPLACE(transaction_name3,'''',''),
transaction_name4 = REPLACE(transaction_name4,'''',''),
transaction_name5 = REPLACE(transaction_name5,'''','')
WHERE
CHARINDEX('''',transaction_name1)>0
OR CHARINDEX('''',transaction_name2)>0
OR CHARINDEX('''',transaction_name3)>0
OR CHARINDEX('''',transaction_name4)>0
OR CHARINDEX('''',transaction_name5)>0