Performance issues with Replace Function in T-SQL - tsql

I have a large table that I am working on, and for nearly all of the columns I need to use REPLACE to remove single and double quotes. The code looks like this:
SET QUOTED_IDENTIFIER ON
Update dbo.transactions set transaction_name1 = Replace(transaction_name1,'''','')
Update dbo.transactions set transaction_name2 = Replace(transaction_name2,'''','')
Update dbo.transactions set transaction_name3 = Replace(transaction_name3,'''','')
Update dbo.transactions set transaction_name4 = Replace(transaction_name4,'''','')
Update dbo.transactions set transaction_name5 = Replace(transaction_name5,'''','')
I have not put an index on the table, as I was not sure which column would be any good given that I'm updating nearly all of them. If I sorted the table ascending by the primary key, would that help performance?
Other than that, the statements have been running for over two hours with no error messages, and I wondered if there is a solution to this performance issue, besides the usual hardware changes. Any advice on speeding up the script would be appreciated.
Cheers,
Peter

You can make this a single UPDATE statement:
UPDATE transactions SET
transaction_name1 = Replace(transaction_name1,'''',''),
transaction_name2 = Replace(transaction_name2,'''','')
... (and so on)
That would likely improve the performance by something approaching a factor of 5.
Edit:
Since this is a one-shot operation on a huge dataset (90MM rows), I suggest adding a WHERE clause and running it in batches.
If your transactions have a primary key, partition the updates on that, doing maybe 500k at once.
Do this in a loop with explicit transactions to keep your log use to a minimum:
DECLARE @BaseID INT, @BatchSize INT
SELECT @BaseID = MAX(YourKey), @BatchSize = 500000 FROM transactions
WHILE @BaseID > 0 BEGIN
    PRINT 'Updating from ' + CAST(@BaseID AS VARCHAR(20))
    BEGIN TRANSACTION
        -- perform update
        UPDATE transactions SET
            transaction_name1 = Replace(transaction_name1,'''',''),
            transaction_name2 = Replace(transaction_name2,'''','')
            -- ... (and so on)
        WHERE YourKey BETWEEN @BaseID - @BatchSize AND @BaseID
    COMMIT
    SET @BaseID = @BaseID - @BatchSize - 1
END
Another note:
If the quotes must not appear in your data, you can create a check constraint to keep them out. It's a last-ditch measure, since any app attempting to insert them would then need to handle a database exception, but it will keep your data clean. Something like this might do it:
ALTER TABLE transactions
ADD CONSTRAINT CK_NoQuotes CHECK(
CHARINDEX('''',transaction_name1)=0 AND
CHARINDEX('''',transaction_name2)=0 AND
-- and so on...
)

You can combine the statements into one; that might be a bit faster:
SET QUOTED_IDENTIFIER ON
UPDATE dbo.transactions
SET transaction_name1 = REPLACE(transaction_name1,'''',''),
transaction_name2 = REPLACE(transaction_name2,'''',''),
transaction_name3 = REPLACE(transaction_name3,'''',''),
transaction_name4 = REPLACE(transaction_name4,'''',''),
transaction_name5 = REPLACE(transaction_name5,'''','')
Also, check out the estimated execution plan.
It may give you useful advice on how to optimize your database / query (it's the little square button in the toolbar of SQL Server Management Studio).

You might try making this a single UPDATE and only updating the rows that need it:
UPDATE dbo.transactions
SET transaction_name1 = REPLACE(transaction_name1,'''',''),
transaction_name2 = REPLACE(transaction_name2,'''',''),
transaction_name3 = REPLACE(transaction_name3,'''',''),
transaction_name4 = REPLACE(transaction_name4,'''',''),
transaction_name5 = REPLACE(transaction_name5,'''','')
WHERE
CHARINDEX('''',transaction_name1)>0
OR CHARINDEX('''',transaction_name2)>0
OR CHARINDEX('''',transaction_name3)>0
OR CHARINDEX('''',transaction_name4)>0
OR CHARINDEX('''',transaction_name5)>0


Postgresql and dblink: how do I do an UPDATE FROM?

Here's what works already, but it's using a loop:
(I am updating the nickname and slug fields on the remote table for each row in a local table)
DECLARE
row_ record;
rdbname_ varchar;
....
/* select from local */
FOR row_ IN SELECT rdbname, objectvalue1 as keyhash, cvalue1 as slug, cvalue2 as nickname
FROM bme_tag
where rdbname = rdbname_
and tagtype = 'NAME'
and wkseq = 0
LOOP
/* update remote */
PERFORM dblink_exec('sysdb',
format(
'update bme_usergroup
set nickname = %L
,slug = %L
where rdbname = %L
and wkseq = 0
and keyhash = %L'
, row_.nickname, row_.slug, row_.rdbname, row_.keyhash)
);
END LOOP;
Now, what I would like to do instead is a bulk UPDATE (remote) FROM (local):
PERFORM dblink_exec('sysdb',
'update (remote)bme_usergroup
set nickname = bme_tag.cvalue2, slug=bme_tag.cvalue1
from (local).bme_tag s
where bme_usergroup.rdbname = %L
and bme_usergroup.wkseq = 0
and bme_usergroup.keyhash = s.keyhash
and bme_usergroup.rdbname = s.rdbname
)
I've gotten this far by looking at various solutions (postgresql: INSERT INTO ... (SELECT * ...)), and I know how to separate the remote and local tables of the query in the context of SELECT, DELETE and even INSERT/SELECT. I can also do the direct update with bind variables. But how about UPDATE FROM?
If it's not possible, should I look into Postgres's FOREIGN TABLE or something similar?
The local and remote db are both on the same Postgres server. One additional bit of information, if it matters, is that either database may be dropped and restored separately from the other, and I'd prefer a lightweight solution that doesn't take a lot of configuration each time to reestablish communication.
Yes, you should use foreign tables with postgres_fdw.
That way you could just write your UPDATE statement like you would for a local table.
This should definitely be faster, but you might still be exchanging a lot of data between the databases.
If that's an option, it will probably be fastest to run the statement on the database where the updated table is and define the other table as a foreign table. That way you will probably avoid fetching and then sending the table data.
Use EXPLAIN to see what exactly happens!
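For illustration, a minimal sketch of that setup, run from the database that holds bme_tag (server name, credentials, and column types are assumptions; adjust them to your cluster):
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER sysdb_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', dbname 'sysdb');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER sysdb_server
    OPTIONS (user 'app_user', password 'secret');

-- expose the remote table locally (column types are guesses)
CREATE FOREIGN TABLE bme_usergroup_remote (
    rdbname  text,
    wkseq    integer,
    keyhash  text,
    nickname text,
    slug     text
) SERVER sysdb_server OPTIONS (schema_name 'public', table_name 'bme_usergroup');

-- now an ordinary UPDATE ... FROM works across the two databases
UPDATE bme_usergroup_remote u
SET    nickname = s.cvalue2,
       slug     = s.cvalue1
FROM   bme_tag s
WHERE  u.rdbname = s.rdbname
AND    u.keyhash = s.objectvalue1
AND    u.wkseq   = 0
AND    s.tagtype = 'NAME'
AND    s.wkseq   = 0;
As suggested above, it may be faster to do it the other way around: run the UPDATE in the database that owns bme_usergroup and declare bme_tag as the foreign table, so the bulk of the updated table's data never leaves its own database. EXPLAIN will show what actually happens.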

Force a "lock" with Postgres and GO

I am new to Postgres, so this may be obvious (or very difficult, I am not sure).
I would like to force a table or row to be "locked" for at least a few seconds at a time, which will cause a second operation to "wait".
I am using golang with "github.com/lib/pq" to interact with the database.
The reason I need this is that I am working on a project that monitors PostgreSQL. Thanks for any help.
You can also use select ... for update to lock a row or rows for the length of the transaction.
Basically, it's like:
begin;
select * from foo where quatloos = 100 for update;
update foo set feens = feens + 1 where quatloos = 100;
commit;
This takes an exclusive row-level lock on the rows of foo where quatloos = 100. Once the SELECT ... FOR UPDATE has run, any other transaction attempting to modify those rows (or lock them FOR UPDATE) will block until this transaction commits or rolls back.
Ideally, these locks should be held for as short a time as possible.
See: https://www.postgresql.org/docs/current/static/explicit-locking.html
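If you want to hold a lock on a whole table for a few seconds (for example, to test how your monitoring reacts), a minimal sketch; the lock mode and the 5-second duration are arbitrary choices:
BEGIN;
LOCK TABLE foo IN ACCESS EXCLUSIVE MODE;  -- blocks all other access to foo
SELECT pg_sleep(5);                       -- hold the lock for ~5 seconds
COMMIT;                                   -- releases the lock
Any other session that touches foo while this transaction is open will wait until the COMMIT.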

Getting Affected Rows by UPDATE statement in RAW plpgsql

This has been asked multiple times here and here, but none of the answers are suitable in my case because I do not want to execute my update statement in a PL/PgSQL function and use GET DIAGNOSTICS integer_var = ROW_COUNT.
I have to do this in raw SQL.
For instance, in MS SQL Server we have @@ROWCOUNT, which could be used like the following:
UPDATE <target_table>
SET Property0 = Value0
WHERE <predicate>;
SELECT <computed_value_columns>
FROM <target>
WHERE @@ROWCOUNT > 0;
In one round trip to the database I know whether the update was successful and get the calculated values back.
What could be used instead of @@ROWCOUNT?
Can someone confirm that this is in fact impossible at this time?
Thanks in advance.
EDIT 1: I confirm that I need to use raw SQL (I wrote "raw plpgsql" in the original description).
In an attempt to make my question clearer, please consider that the update statement affects only one row, and think about optimistic concurrency:
The client did a SELECT statement first.
It builds the UPDATE and knows which database-computed columns are to be included in the SELECT clause. Among other things, the predicate includes a timestamp that is recomputed each time the row is updated.
So, if one row is returned then everything is OK. If no row is returned then we know that there was a previous update, and the client may need to refresh its data before trying the update again. This is why we need to know how many rows were affected by the update statement before returning computed columns. No row should be returned if the update fails.
What you want is not currently possible in the form that you describe, but I think you can do what you want with UPDATE ... RETURNING. See UPDATE ... RETURNING in the manual.
UPDATE <target_table>
SET Property0 = Value0
WHERE <predicate>
RETURNING Property0;
It's hard to be sure, since the example you've provided is so abstract as to be somewhat meaningless.
You can also use a wCTE, which allows more complex cases:
WITH updated_rows AS (
    UPDATE <target_table>
    SET Property0 = Value0
    WHERE <predicate>
    RETURNING row_id, Property0
)
SELECT row_id, some_computed_value_from_property
FROM updated_rows;
See common table expressions (WITH queries) and depesz's article on wCTEs.
UPDATE: based on some added detail in the question, here's a demo using UPDATE ... RETURNING:
CREATE TABLE upret_demo(
id serial primary key,
somecol text not null,
last_updated timestamptz
);
INSERT INTO upret_demo (somecol, last_updated) VALUES ('blah',current_timestamp);
UPDATE upret_demo
SET
somecol = 'newvalue',
last_updated = current_timestamp
WHERE last_updated = '2012-12-03 19:36:15.045159+08' -- Change to your timestamp
RETURNING
somecol || '_computed' AS a,
'totally_new_computed_column' AS b;
Output when run the 1st time:
a | b
-------------------+-----------------------------
newvalue_computed | totally_new_computed_column
(1 row)
When run again, it'll have no effect and return no rows.
If you have more complex calculations to do in the result set, you can use a wCTE so you can JOIN on the results of the update and do other complex things.
WITH upd_row AS (
UPDATE upret_demo SET
somecol = 'newvalue',
last_updated = current_timestamp
WHERE last_updated = '2012-12-03 19:36:15.045159+08'
RETURNING id, somecol, last_updated
)
SELECT
'row_'||id||'_'||somecol||', updated '||last_updated AS calc1,
repeat('x',4) AS calc2
FROM upd_row;
In other words: Use UPDATE ... RETURNING, either directly to produce the calculated rows, or in a writeable CTE for more complex cases.
Generally, the answer to this question depends on the type of driver used.
If the application uses libpq, the PQcmdTuples() function does what is needed; libraries built on top of libpq need a wrapper around this function.
For JDBC, the Statement.executeUpdate() method seems to do the job.
ODBC provides the SQLRowCount() function for a similar purpose.

How do I do large non-blocking updates in PostgreSQL?

I want to do a large update on a table in PostgreSQL, but I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update. I want to know if there is an easy way in the psql console to make these types of operations faster.
For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:
UPDATE orders SET status = null;
To avoid being diverted to an off-topic discussion, let's assume that all the values of status for the 35 million rows are currently set to the same (non-null) value, thus rendering an index useless.
The problem with this statement is that it takes a very long time to go into effect (solely because of the locking), and all changed rows are locked until the entire update is complete. This update might take 5 hours, whereas something like
UPDATE orders SET status = null WHERE (order_id > 0 and order_id < 1000000);
might take 1 minute. Over 35 million rows, doing the above and breaking it into 35 chunks would take only 35 minutes and save me 4 hours and 25 minutes.
I could break it down even further with a script (using pseudocode here):
for (i = 0 to 3500) {
    db_operation("UPDATE orders SET status = null " +
                 "WHERE (order_id > " + (i*1000) +
                 " AND order_id < " + ((i+1)*1000) + ")");
}
This operation might complete in only a few minutes, rather than 35.
So that comes down to what I'm really asking. I don't want to write a freaking script to break down operations every single time I want to do a big one-time update like this. Is there a way to accomplish what I want entirely within SQL?
Column / Row
... I don't need the transactional integrity to be maintained across
the entire operation, because I know that the column I'm changing is
not going to be written to or read during the update.
Any UPDATE in PostgreSQL's MVCC model writes a new version of the whole row. If concurrent transactions change any column of the same row, time-consuming concurrency issues arise. Details in the manual. Knowing the same column won't be touched by concurrent transactions avoids some possible complications, but not others.
Index
To avoid being diverted to an off-topic discussion, let's assume that
all the values of status for the 35 million rows are currently set
to the same (non-null) value, thus rendering an index useless.
When updating the whole table (or major parts of it) Postgres never uses an index. A sequential scan is faster when all or most rows have to be read. On the contrary: Index maintenance means additional cost for the UPDATE.
Performance
For example, let's say I have a table called "orders" with 35 million
rows, and I want to do this:
UPDATE orders SET status = null;
I understand you are aiming for a more general solution (see below). But to address the actual question asked: this can be dealt with in a matter of milliseconds, regardless of table size:
ALTER TABLE orders DROP column status
, ADD column status text;
The manual (up to Postgres 10):
When a column is added with ADD COLUMN, all existing rows in the table
are initialized with the column's default value (NULL if no DEFAULT
clause is specified). If there is no DEFAULT clause, this is merely a metadata change [...]
The manual (since Postgres 11):
When a column is added with ADD COLUMN and a non-volatile DEFAULT
is specified, the default is evaluated at the time of the statement
and the result stored in the table's metadata. That value will be used
for the column for all existing rows. If no DEFAULT is specified,
NULL is used. In neither case is a rewrite of the table required.
Adding a column with a volatile DEFAULT or changing the type of an
existing column will require the entire table and its indexes to be
rewritten. [...]
And:
The DROP COLUMN form does not physically remove the column, but
simply makes it invisible to SQL operations. Subsequent insert and
update operations in the table will store a null value for the column.
Thus, dropping a column is quick but it will not immediately reduce
the on-disk size of your table, as the space occupied by the dropped
column is not reclaimed. The space will be reclaimed over time as
existing rows are updated.
Make sure you don't have objects depending on the column (foreign key constraints, indexes, views, ...); you would need to drop and recreate those. Barring that, tiny operations on the system catalog table pg_attribute do the job. It requires an exclusive lock on the table, which may be a problem under heavy concurrent load (as Buurman emphasizes in his comment). Apart from that, the operation is a matter of milliseconds.
If you have a column default you want to keep, add it back in a separate command. Doing it in the same command applies it to all rows immediately. See:
Add new column without table lock?
To actually apply the default, consider doing it in batches:
Does PostgreSQL optimize adding columns with non-NULL DEFAULTs?
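Putting the pieces together, a minimal sketch of the drop-and-re-add trick with a default kept (the column name, default value, and batch range are assumptions):
ALTER TABLE orders DROP COLUMN status, ADD COLUMN status text;  -- metadata-only change, as above
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';   -- separate command: only affects future rows

-- then apply the default to existing rows in batches, e.g.:
UPDATE orders SET status = DEFAULT
WHERE  status IS NULL
AND    order_id BETWEEN 1 AND 1000000;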
General solution
dblink has been mentioned in another answer. It allows access to "remote" Postgres databases in implicit separate connections. The "remote" database can be the current one, thereby achieving "autonomous transactions": what the function writes in the "remote" db is committed and can't be rolled back.
This makes it possible to run a single function that updates a big table in smaller parts, with each part committed separately. It avoids building up transaction overhead for very large numbers of rows and, more importantly, releases locks after each part. That lets concurrent operations proceed without much delay and makes deadlocks less likely.
If you don't have concurrent access, this is hardly useful - except to avoid ROLLBACK after an exception. Also consider SAVEPOINT for that case.
Disclaimer
First of all, lots of small transactions are actually more expensive. This only makes sense for big tables. The sweet spot depends on many factors.
If you are not sure what you are doing: a single transaction is the safe method. For this to work properly, concurrent operations on the table have to play along. For instance: concurrent writes can move a row to a partition that's supposedly already processed. Or concurrent reads can see inconsistent intermediary states. You have been warned.
Step-by-step instructions
The additional module dblink needs to be installed first:
How to use (install) dblink in PostgreSQL?
Setting up the connection with dblink depends very much on the setup of your DB cluster and the security policies in place. It can be tricky. A related, later answer with more on how to connect with dblink:
Persistent inserts in a UDF even if the function aborts
Create a FOREIGN SERVER and a USER MAPPING as instructed there to simplify and streamline the connection (unless you have one already).
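For orientation, a minimal sketch of such a setup (server name, credentials, and connection options are assumptions; adapt them to your cluster and security policy):
CREATE EXTENSION IF NOT EXISTS dblink;

CREATE SERVER myserver
    FOREIGN DATA WRAPPER dblink_fdw
    OPTIONS (host 'localhost', dbname 'mydb');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER myserver
    OPTIONS (user 'postgres', password 'secret');

-- quick test of the connection used by the function below
SELECT dblink_connect('myserver');
SELECT dblink_disconnect();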
Assuming a serial PRIMARY KEY with or without some gaps.
CREATE OR REPLACE FUNCTION f_update_in_steps()
RETURNS void AS
$func$
DECLARE
_step int; -- size of step
_cur int; -- current ID (starting with minimum)
_max int; -- maximum ID
BEGIN
SELECT INTO _cur, _max min(order_id), max(order_id) FROM orders;
-- 100 slices (steps) hard coded
_step := ((_max - _cur) / 100) + 1; -- rounded, possibly a bit too small
-- +1 to avoid endless loop for 0
PERFORM dblink_connect('myserver'); -- your foreign server as instructed above
FOR i IN 0..200 LOOP -- 200 >> 100 to make sure we exceed _max
PERFORM dblink_exec(
$$UPDATE public.orders
SET status = 'foo'
WHERE order_id >= $$ || _cur || $$
AND order_id < $$ || _cur + _step || $$
AND status IS DISTINCT FROM 'foo'$$); -- avoid empty update
_cur := _cur + _step;
EXIT WHEN _cur > _max; -- stop when done (never loop till 200)
END LOOP;
PERFORM dblink_disconnect();
END
$func$ LANGUAGE plpgsql;
Call:
SELECT f_update_in_steps();
You can parameterize any part according to your needs: the table name, column name, value, ... just be sure to sanitize identifiers to avoid SQL injection:
Table name as a PostgreSQL function parameter
Avoid empty UPDATEs:
How do I (or can I) SELECT DISTINCT on multiple columns?
Postgres uses MVCC (multi-version concurrency control), thus avoiding any locking if you are the only writer; any number of concurrent readers can work on the table, and there won't be any locking.
So if it really takes 5h, it must be for a different reason (e.g. that you do have concurrent writes, contrary to your claim that you don't).
You could move this column to a separate table, like this:
create table order_status (
order_id int not null references orders(order_id) primary key,
status int not null
);
Then your operation of setting status=NULL will be instant:
truncate order_status;
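If existing queries expect a status column on orders, a view can preserve that interface. A sketch of that design choice (the view name is an assumption):
CREATE VIEW orders_with_status AS
SELECT o.*, s.status
FROM   orders o
LEFT   JOIN order_status s ON s.order_id = o.order_id;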
First of all - are you sure that you need to update all rows?
Perhaps some of the rows already have status NULL?
If so, then:
UPDATE orders SET status = null WHERE status is not null;
As for partitioning the change - that's not possible in pure SQL; all the updates run in a single transaction.
One possible way to do it in "pure SQL" would be to install dblink, connect to the same database using dblink, and then issue a lot of updates over dblink, but it seems like overkill for such a simple task.
Usually just adding a proper WHERE clause solves the problem. If it doesn't, just partition it manually. Writing a full script is too much - you can usually make it a simple one-liner:
perl -e '
for (my $i = 0; $i <= 3500000; $i += 1000) {
printf "UPDATE orders SET status = null WHERE status is not null
and order_id between %u and %u;\n",
$i, $i+999
}
'
I wrapped lines here for readability, generally it's a single line. Output of above command can be fed to psql directly:
perl -e '...' | psql -U ... -d ...
Or first to file and then to psql (in case you'd need the file later on):
perl -e '...' > updates.partitioned.sql
psql -U ... -d ... -f updates.partitioned.sql
I am by no means a DBA, but a database design where you'd frequently have to update 35 million rows might have… issues.
A simple WHERE status IS NOT NULL might speed things up quite a bit (provided you have an index on status). Not knowing the actual use case, I'm assuming that if this is run frequently, a great part of the 35 million rows might already have a null status.
However, you can loop within a plpgsql function via the LOOP statement. I'll just cook up a small example:
CREATE OR REPLACE FUNCTION nullstatus(count INTEGER) RETURNS integer AS $$
DECLARE
    i INTEGER := 0;
BEGIN
    FOR i IN 0..(count/1000 + 1) LOOP
        UPDATE orders SET status = null WHERE (order_id > (i*1000) AND order_id < ((i+1)*1000));
        RAISE NOTICE 'Count: % and i: %', count, i;
    END LOOP;
    RETURN 1;
END;
$$ LANGUAGE plpgsql;
It can then be run by doing something akin to:
SELECT nullstatus(35000000);
You might want to select the row count, but beware that the exact row count can take a lot of time. The PostgreSQL wiki has an article about slow counting and how to avoid it.
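One technique from that article, for reference: the planner's row estimate in pg_class is available instantly (it is an estimate, not an exact count):
SELECT reltuples::bigint AS estimated_rows
FROM   pg_class
WHERE  relname = 'orders';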
Also, the RAISE NOTICE part is just there to keep track of how far along the script is. If you're not monitoring the notices, or do not care, it would be better to leave it out.
Are you sure this is because of locking? I don't think so; there are many other possible reasons. To find out, you can always try to do just the locking. Try this:
BEGIN;
SELECT NOW();
SELECT * FROM orders FOR UPDATE;
SELECT NOW();
ROLLBACK;
To understand what's really happening, you should run EXPLAIN first (EXPLAIN UPDATE orders SET status ...) and/or EXPLAIN ANALYZE. Maybe you'll find out that you don't have enough memory to do the UPDATE efficiently. If so, SET work_mem TO 'xxxMB'; might be a simple solution.
Also, tail the PostgreSQL log to see whether any performance-related problems occur.
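A sketch of that diagnostic session (the memory value is an arbitrary example; note that EXPLAIN ANALYZE actually executes the UPDATE, hence the ROLLBACK):
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders SET status = null;
ROLLBACK;                        -- discard the changes made while analyzing

SET work_mem TO '256MB';         -- session-level setting; size it to your RAM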
I would use CTAS (CREATE TABLE ... AS SELECT):
begin;
create table T as select col1, col2, ..., <new value>, colN from orders;
drop table orders;
alter table T rename to orders;
commit;
Some options that haven't been mentioned:
Use the new-table trick. In your case you'd probably have to write some triggers so that changes to the original table also get propagated to your table copy (Percona, for example, has tooling that does it the trigger way). Another option might be the "create a new column, then replace the old one with it" trick, to avoid locks (it's unclear whether that helps with speed).
Possibly calculate the max ID, then generate "all the queries you need" and pass them in as a single query, like update X set Y = NULL where ID < 10000 and ID >= 0; update X set Y = NULL where ID < 20000 and ID >= 10000; ... Then it might not do as much locking and still be all SQL, though you do have extra logic up front to do it :( (a sketch of generating such a batch follows below).
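A minimal sketch of generating that batch of statements inside the database (the 10000-row batch size and the 3500-slice upper bound, taken from the 35 million rows in the question, are assumptions):
SELECT format(
         'UPDATE orders SET status = NULL WHERE order_id >= %s AND order_id < %s;',
         i * 10000, (i + 1) * 10000)
FROM   generate_series(0, 3500) AS i;
-- in psql, end the query without ';' and type \gexec to run each generated statement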
PostgreSQL version 11 handles this for you automatically with the Fast ALTER TABLE ADD COLUMN with a non-NULL default feature. Please do upgrade to version 11 if possible.
An explanation is provided in this blog post.

How do I avoid using cursors in Sybase (T-SQL)?

Imagine the scene: you're updating some legacy Sybase code and come across a cursor. The stored procedure builds up a result set in a #temporary table which is all ready to be returned, except that one of the columns isn't terribly human-readable: it's an alphanumeric code.
What we need to do is figure out the distinct values of this code, call another stored procedure to cross-reference these discrete values, and then update the result set with the newly deciphered values:
declare c_lookup_codes cursor for
    select distinct lookup_code
    from #workinprogress

open c_lookup_codes

while (1=1)
begin
    fetch c_lookup_codes into @lookup_code

    if @@sqlstatus <> 0
    begin
        break
    end

    exec proc_code_xref @lookup_code, @xref_code OUTPUT

    update #workinprogress
    set xref = @xref_code
    where lookup_code = @lookup_code
end
Now then, whilst this may give some folks palpitations, it does work. My question is: how best would one avoid this kind of thing?
_NB: for the purposes of this example you can also imagine that the result set is in the region of 500k rows, that there are 100 distinct values of lookup_code, and finally, that it is not possible to have a table with the xref values in it, as the logic in proc_code_xref is too arcane._
You have to have an xref table if you want to take out the cursor. Assuming you know the 100 distinct lookup values (and that they're static), it's simple to generate one by calling proc_code_xref 100 times and inserting the results into a table; after that, the 500k-row update becomes a single set-based statement (see the sketch below).
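A minimal sketch of what the runtime path then looks like (the table name code_xref and its population step are assumptions):
-- code_xref(lookup_code, xref) is populated once, by calling proc_code_xref
-- for each of the ~100 distinct codes and inserting the result
update #workinprogress
set    xref = x.xref
from   #workinprogress, code_xref x
where  #workinprogress.lookup_code = x.lookup_code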
Unless you are willing to duplicate the code in the xref proc, there is no way to avoid using a cursor.
They say that if you must use a cursor, then you must have done something wrong ;-) Here's a solution without a cursor:
declare @lookup_code char(8)
declare @xref_code char(8)

select distinct lookup_code
into #lookup_codes
from #workinprogress

while (1=1)
begin
    select @lookup_code = lookup_code from #lookup_codes
    if @@rowcount = 0 break

    exec proc_code_xref @lookup_code, @xref_code OUTPUT

    update #workinprogress
    set xref = @xref_code
    where lookup_code = @lookup_code

    delete #lookup_codes
    where lookup_code = @lookup_code
end