Merge to insert / delete / update entire row without naming EVERY variable - tsql

I want to do something like
use mydb
go
begin tran
merge dbo.aTestTarget as T
using dbo.aTestSource as S
on (T.link = S.link)
when not matched by target and (s.code like '*I%') then
-- is there a way to do this sort of thing?
insert (T.*) values (S.*)
when matched and ...
rollback tran
go
Is there some way to do this WITHOUT defining EVERY column? I have a number of tables with 20 to 50 fields.

No, there is not. Using the * syntax is a bad practice anyway because it makes for fragile code that will be hard to maintain.
However, in SSMS you can drag and drop the Columns folder under a table into the editor to get a comma-separated list of all columns for that table. That makes typing a little easier. :)
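For reference, here is a sketch of what the fully spelled-out MERGE from the question ends up looking like; link, code and name are hypothetical columns standing in for your 20 to 50 real ones, and those column lists are exactly what the drag-and-drop trick helps you paste:
MERGE dbo.aTestTarget AS T
USING dbo.aTestSource AS S
    ON (T.link = S.link)
WHEN NOT MATCHED BY TARGET AND (S.code LIKE '*I%') THEN
    INSERT (link, code, name)
    VALUES (S.link, S.code, S.name)
WHEN MATCHED THEN
    UPDATE SET code = S.code,
               name = S.name;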

Related

Best way to prevent duplicate data on copy csv postgresql

I have a postgresql/postgis table with 5 columns. I'll be inserting/appending data into the database from a csv file every 10 minutes or so via the copy command. There will likely be some duplicate rows of data, so I'd like to copy the data from the csv file to the postgresql table but prevent any duplicate entries from getting into the table from the csv file. There are three columns, where if they are all equal, that will mean the entry is a duplicate. They are "latitude", "longitude" and "time". Should I make a composite key from all three columns? If I do that, will it just throw an error upon trying to copy the csv file into the database? I'm going to be copying the csv file automatically so I would want it to go ahead and copy the rest of the file that aren't duplicates and not copy the duplicates. Is there a way to do this?
Also, I of course want it to look for duplicates in the most efficient way. I don't need to look through the whole table (which will be quite large) for duplicates...just the past 20 minutes or so via the timestamp on the row. And I've indexed the db with the time column.
Thanks for any help!
Upsert
The answer by Linoff is correct, but it can be simplified a bit with the new "UPSERT" feature of Postgres 9.5 (a.k.a. MERGE). That new feature is implemented in Postgres as the INSERT ... ON CONFLICT syntax.
Rather than explicitly check for violation of the unique index, we can let the ON CONFLICT clause detect the violation. Then we DO NOTHING, meaning we abandon the effort to INSERT without bothering to attempt an UPDATE. So if we cannot insert, we just move on to the next row.
We get the same results as Linoff’s code but lose the WHERE clause.
INSERT INTO bigtable(col1, … )
SELECT col1, …
FROM stagingtable st
ON CONFLICT idx_bigtable_col1_col2_col
DO NOTHING
;
I think I would take the following approach.
First, create an index on the three columns that you care about:
create unique index idx_bigtable_col1_col2_col3 on bigtable(col1, col2, col3);
Then, load the data into a staging table using copy. Finally, you can do:
insert into bigtable(col1, . . . )
select col1, . . .
from stagingtable st
where (col1, col2, col3) not in (select col1, col2, col3 from bigtable);
Assuming no other data modifications are going on, this should accomplish what you want. Checking for duplicates using the index should be ok from a performance perspective.
An alternative method is to emulate MySQL's "on duplicate key update" to ignore such records. Bill Karwin suggests implementing a rule in an answer to this question. The documentation for rules is here. Something similar could also be done with triggers.
The method posted by Basil Bourque was great, but there was a slight syntax error.
Based on the documentation, I modified it to the following, which works:
INSERT INTO bigtable(col1, … )
SELECT col1, …
FROM stagingtable st
ON CONFLICT (col1, col2, col3)
DO NOTHING
;
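Putting the pieces together for the original CSV question: a sketch assuming the table is called positions, the two remaining columns are col4 and col5, and the CSV file has a header row (ON CONFLICT needs Postgres 9.5 or later).
-- One-time setup: the unique index that ON CONFLICT will rely on.
CREATE UNIQUE INDEX idx_positions_lat_lon_time
    ON positions (latitude, longitude, "time");

-- Every 10 minutes: load into a staging table, then insert, skipping duplicates.
CREATE TEMP TABLE staging (LIKE positions INCLUDING DEFAULTS);

COPY staging (latitude, longitude, "time", col4, col5)     -- col4, col5 are placeholders
FROM '/path/to/latest.csv' WITH (FORMAT csv, HEADER true); -- adjust path and options

INSERT INTO positions (latitude, longitude, "time", col4, col5)
SELECT latitude, longitude, "time", col4, col5
FROM staging
ON CONFLICT (latitude, longitude, "time") DO NOTHING;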

How can I change all occurrences of a particular value in any column in PostgreSQL?

I have three different values in my database that represent a null: an actual null, an empty string, and the string {x:Null}. These values appear across multiple columns.
{x:Null} is normalized on the web front-end, so all these values look exactly the same although they end up ordered differently in a sort. How can I write a query that will take these values and make them actual nulls across every column and every table?
Bonus points if you can tell me how to make sure these other empty values are always inserted as nulls going forward. (Disclaimer: I have no power to grant any actual bonus points. ;)
You can query the information_schema to get a list of all tables and columns with a string type.
SELECT table_name, column_name
FROM information_schema.columns
WHERE data_type IN ('text', 'character', 'character varying')
NOTE: Double check first what values data_type actually has; I'm not sure if it will be character or char or what.
Then I would write a small program to update each column in each table. Here it is sketched out in Perl.
# Assumes $dbh is a connected DBI handle and $sth has executed the query above.
while ( my ($table, $column) = $sth->fetchrow_array ) {
    # quote_identifier() produces properly double-quoted PostgreSQL identifiers
    my $q_table  = $dbh->quote_identifier($table);
    my $q_column = $dbh->quote_identifier($column);
    $dbh->do(qq[
        UPDATE $q_table
        SET    $q_column = NULL
        WHERE  $q_column = '{x:Null}'
           OR  $q_column = ''
    ]);
}
Be sure to quote the identifiers $table and $column, as quote_identifier does in my sample.
Going forward, you'll have to set CONSTRAINTS on each and every column. You can use the information_schema.columns to do this as well. Something like
ALTER TABLE $q_table ADD CHECK ($q_column NOT IN ('{x:Null}', ''))
You could use a trigger to change the values to NULL, but I don't like data stores that silently change basic data for application purposes.
For new columns and tables, you'll have to remember to add that constraint. Same caveats about data_type apply.
However, it's probably a bad idea to say that no column can ever be an empty string. You might want to be a bit more selective.
Another thing to note: NULL is a funny thing, it's not true and it's not false. You might be better off deciding that an empty string is the thing to set empty values to.
I don't think this approach is maintainable. It's scribbling an application rule all over the data layer. What if you have some data that doesn't follow that rule? And it will have to be continuously maintained for any new data schema added. Perhaps instead you should put this at your ORM layer. Or write a few stored procedures to take care of this.
Using the information_schema.columns table, write a procedural-language routine which iterates through all applicable tables and columns, executing an UPDATE ... SET column = NULL ... WHERE column IN ('', '{x:Null}') for each eligible column.
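A minimal sketch of such a routine as an anonymous DO block, assuming the placeholders to clean are exactly '' and '{x:Null}', and skipping views and system schemas:
DO $$
DECLARE
    rec record;
BEGIN
    FOR rec IN
        SELECT c.table_schema, c.table_name, c.column_name
        FROM   information_schema.columns c
        JOIN   information_schema.tables t
               ON t.table_schema = c.table_schema AND t.table_name = c.table_name
        WHERE  c.data_type IN ('text', 'character', 'character varying')
        AND    t.table_type = 'BASE TABLE'
        AND    c.table_schema NOT IN ('pg_catalog', 'information_schema')
    LOOP
        EXECUTE format(
            'UPDATE %I.%I SET %I = NULL WHERE %I IN (%L, %L)',
            rec.table_schema, rec.table_name,
            rec.column_name, rec.column_name, '', '{x:Null}');
    END LOOP;
END
$$;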
As for inserting these values as NULL going forward, you would have to set triggers on your tables to intercept these values and replace them with NULL.
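For example, a hedged sketch of such a trigger for one hypothetical table contacts and column notes; you would need one per column you want to clean:
CREATE OR REPLACE FUNCTION nullify_placeholders() RETURNS trigger AS $$
BEGIN
    IF NEW.notes IN ('', '{x:Null}') THEN
        NEW.notes := NULL;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER contacts_nullify_placeholders
BEFORE INSERT OR UPDATE ON contacts
FOR EACH ROW EXECUTE PROCEDURE nullify_placeholders();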
I don't think there is any query that would do this thing for every table and every column. In principle, what you want to do is
UPDATE table SET column=NULL WHERE column='' OR column='{x:Null}';
You could try selecting data from the pg_attribute and pg_class columns to get the names of the tables and names of the columns and then generating automatically the queries. Be sure to select only those columns that contain textual data.
What if somebody has entered a genuine string '{x:Null}'? You would then change it into NULL.
However, you have made a real mistake by letting the situation get as bad as it currently is. You should always normalize data before putting it into a database.

Insert and update records in one TSQL statement?

I have a table BigTable and a table LittleTable. I want to move a copy of some records from BigTable into LittleTable and then (for these records) set BigTable.ExportedFlag to T (indicating that a copy of the record has been moved to little table).
Is there any way to do this in one statement?
I know I can use a transaction to:
move the records from big table based on a where clause
update big table, setting exported to T, based on this same where clause.
I've also looked into a MERGE statement, which does not seem quite right, because I don't want to change values in little table, just move records to little table.
I've looked into an OUTPUT clause after the update statement but can't find a useful example. I don't understand why Pinal Dave is using Inserted.ID, Inserted.TEXTVal, Deleted.ID, Deleted.TEXTVal instead of Updated.TextVal. Is the update considered an insertion or deletion?
I found this post TSQL: UPDATE with INSERT INTO SELECT FROM saying "AFAIK, you cannot update two different tables with a single sql statement."
Is there a clean single statement to do this? I am looking for a correct, maintainable SQL statement. Do I have to wrap two statements in a single transaction?
You can use the OUTPUT clause, as long as LittleTable meets the requirements to be the target of an OUTPUT ... INTO:
UPDATE BigTable
SET ExportedFlag = 'T'
OUTPUT inserted.Col1, inserted.Col2 INTO LittleTable(Col1,Col2)
WHERE <some_criteria>
It makes no difference whether you use INSERTED or DELETED. The only column for which they differ is the one you are updating (deleted.ExportedFlag has the before value and inserted.ExportedFlag will be 'T').
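If LittleTable does not meet those requirements (for example it has enabled triggers or participates in a foreign key relationship), the fallback is the two-statement transaction you described. A sketch, reusing the <some_criteria> placeholder and hypothetical columns Col1 and Col2:
BEGIN TRAN;

    INSERT INTO LittleTable (Col1, Col2)
    SELECT Col1, Col2
    FROM BigTable WITH (UPDLOCK, HOLDLOCK)  -- hold locks so the same rows stay stable for the UPDATE below
    WHERE <some_criteria>;

    UPDATE BigTable
    SET ExportedFlag = 'T'
    WHERE <some_criteria>;

COMMIT TRAN;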

syntax for COPY in postgresql

INSERT INTO contacts_lists (contact_id, list_id)
SELECT contact_id, 67544
FROM plain_contacts
Here I want to use the COPY command instead of the INSERT command to reduce the time it takes to insert values. I fetched the data using a SELECT operation. How can I insert it into a table using the COPY command in PostgreSQL? Could you please give an example? Or is there any other suggestion for reducing the time it takes to insert the values?
As your rows are already in the database (because you apparently can SELECT them), using COPY will not increase the speed in any way.
To be able to use COPY you would first have to write the values into a text file, which is then read into the database. But if you can SELECT them, writing to a text file is a completely unnecessary step and will slow down your insert, not speed it up.
Your statement is as fast as it gets. The only thing that might speed it up (apart from buying a faster hard disk) is to remove any potential index on contacts_lists that contains the column contact_id or list_id and re-create the index once the insert is finished.
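A sketch of that drop-and-recreate-index approach; the index name contacts_lists_contact_id_idx is only an assumed example:
-- assumed index name; use whatever index actually exists on contacts_lists
DROP INDEX IF EXISTS contacts_lists_contact_id_idx;

INSERT INTO contacts_lists (contact_id, list_id)
SELECT contact_id, 67544
FROM plain_contacts;

CREATE INDEX contacts_lists_contact_id_idx ON contacts_lists (contact_id);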
You can find the syntax described in many places, I'm sure. One of those is this wiki article.
It looks like it would basically be:
COPY (SELECT contact_id, 67544 FROM plain_contacts) TO 'some_file'
And
COPY contacts_lists (contact_id, list_id) FROM 'some_file'
But I'm just reading from the resources that Google turned up. Give it a try and post back if you need help with a specific problem.

How do I do large non-blocking updates in PostgreSQL?

I want to do a large update on a table in PostgreSQL, but I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update. I want to know if there is an easy way in the psql console to make these types of operations faster.
For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:
UPDATE orders SET status = null;
To avoid being diverted to an offtopic discussion, let's assume that all the values of status for the 35 million rows are currently set to the same (non-null) value, thus rendering an index useless.
The problem with this statement is that it takes a very long time to complete (solely because of the locking), and all changed rows are locked until the entire update is finished. This update might take 5 hours, whereas something like
UPDATE orders SET status = null WHERE (order_id > 0 and order_id < 1000000);
might take 1 minute. Over 35 million rows, doing the above and breaking it into 35 chunks would take only 35 minutes and save me 4 hours and 25 minutes.
I could break it down even further with a script (using pseudocode here):
for (i = 0 to 3500) {
    db_operation("UPDATE orders SET status = null " +
                 "WHERE (order_id > " + (i*10000) + " AND order_id < " + ((i+1)*10000) + ")");
}
This operation might complete in only a few minutes, rather than 35.
So that comes down to what I'm really asking. I don't want to write a freaking script to break down operations every single time I want to do a big one-time update like this. Is there a way to accomplish what I want entirely within SQL?
Column / Row
... I don't need the transactional integrity to be maintained across
the entire operation, because I know that the column I'm changing is
not going to be written to or read during the update.
Any UPDATE in PostgreSQL's MVCC model writes a new version of the whole row. If concurrent transactions change any column of the same row, time-consuming concurrency issues arise. Details in the manual. Knowing the same column won't be touched by concurrent transactions avoids some possible complications, but not others.
Index
To avoid being diverted to an offtopic discussion, let's assume that
all the values of status for the 35 million rows are currently set
to the same (non-null) value, thus rendering an index useless.
When updating the whole table (or major parts of it) Postgres never uses an index. A sequential scan is faster when all or most rows have to be read. On the contrary: Index maintenance means additional cost for the UPDATE.
Performance
For example, let's say I have a table called "orders" with 35 million
rows, and I want to do this:
UPDATE orders SET status = null;
I understand you are aiming for a more general solution (see below). But to address the actual question asked: this can be dealt with in a matter of milliseconds, regardless of table size:
ALTER TABLE orders DROP column status
, ADD column status text;
The manual (up to Postgres 10):
When a column is added with ADD COLUMN, all existing rows in the table
are initialized with the column's default value (NULL if no DEFAULT
clause is specified). If there is no DEFAULT clause, this is merely a metadata change [...]
The manual (since Postgres 11):
When a column is added with ADD COLUMN and a non-volatile DEFAULT
is specified, the default is evaluated at the time of the statement
and the result stored in the table's metadata. That value will be used
for the column for all existing rows. If no DEFAULT is specified,
NULL is used. In neither case is a rewrite of the table required.
Adding a column with a volatile DEFAULT or changing the type of an
existing column will require the entire table and its indexes to be
rewritten. [...]
And:
The DROP COLUMN form does not physically remove the column, but
simply makes it invisible to SQL operations. Subsequent insert and
update operations in the table will store a null value for the column.
Thus, dropping a column is quick but it will not immediately reduce
the on-disk size of your table, as the space occupied by the dropped
column is not reclaimed. The space will be reclaimed over time as
existing rows are updated.
Make sure you don't have objects depending on the column (foreign key constraints, indices, views, ...). You would need to drop / recreate those. Barring that, tiny operations on the system catalog table pg_attribute do the job, and the operation takes only milliseconds. It does require an exclusive lock on the table, which may be a problem under heavy concurrent load (like Buurman emphasizes in his comment).
If you have a column default you want to keep, add it back in a separate command. Doing it in the same command applies it to all rows immediately. See:
Add new column without table lock?
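A minimal sketch of that two-step approach (the default value 'new' is only an assumed example):
ALTER TABLE orders DROP COLUMN status, ADD COLUMN status text;
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'new';  -- only affects rows inserted from now on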
To actually apply the default, consider doing it in batches:
Does PostgreSQL optimize adding columns with non-NULL DEFAULTs?
General solution
dblink has been mentioned in another answer. It allows access to "remote" Postgres databases in implicit separate connections. The "remote" database can be the current one, thereby achieving "autonomous transactions": what the function writes in the "remote" db is committed and can't be rolled back.
This allows you to run a single function that updates a big table in smaller parts, with each part committed separately. It avoids building up transaction overhead for very big numbers of rows and, more importantly, releases locks after each part. This lets concurrent operations proceed without much delay and makes deadlocks less likely.
If you don't have concurrent access, this is hardly useful - except to avoid ROLLBACK after an exception. Also consider SAVEPOINT for that case.
Disclaimer
First of all, lots of small transactions are actually more expensive. This only makes sense for big tables. The sweet spot depends on many factors.
If you are not sure what you are doing: a single transaction is the safe method. For this to work properly, concurrent operations on the table have to play along. For instance: concurrent writes can move a row to a partition that's supposedly already processed. Or concurrent reads can see inconsistent intermediary states. You have been warned.
Step-by-step instructions
The additional module dblink needs to be installed first:
How to use (install) dblink in PostgreSQL?
Setting up the connection with dblink very much depends on the setup of your DB cluster and the security policies in place. It can be tricky. A related, later answer with more on how to connect with dblink:
Persistent inserts in a UDF even if the function aborts
Create a FOREIGN SERVER and a USER MAPPING as instructed there to simplify and streamline the connection (unless you have one already).
Assuming a serial PRIMARY KEY with or without some gaps.
CREATE OR REPLACE FUNCTION f_update_in_steps()
  RETURNS void AS
$func$
DECLARE
   _step int;   -- size of step
   _cur  int;   -- current ID (starting with minimum)
   _max  int;   -- maximum ID
BEGIN
   SELECT INTO _cur, _max  min(order_id), max(order_id) FROM orders;
                                        -- 100 slices (steps) hard coded
   _step := ((_max - _cur) / 100) + 1;  -- rounded, possibly a bit too small
                                        -- +1 to avoid endless loop for 0
   PERFORM dblink_connect('myserver');  -- your foreign server as instructed above

   FOR i IN 0..200 LOOP                 -- 200 >> 100 to make sure we exceed _max
      PERFORM dblink_exec(
         $$UPDATE public.orders
           SET    status = 'foo'
           WHERE  order_id >= $$ || _cur || $$
           AND    order_id <  $$ || _cur + _step || $$
           AND    status IS DISTINCT FROM 'foo'$$);  -- avoid empty update

      _cur := _cur + _step;
      EXIT WHEN _cur > _max;            -- stop when done (never loop till 200)
   END LOOP;

   PERFORM dblink_disconnect();
END
$func$ LANGUAGE plpgsql;
Call:
SELECT f_update_in_steps();
You can parameterize any part according to your needs: the table name, column name, value, ... just be sure to sanitize identifiers to avoid SQL injection:
Table name as a PostgreSQL function parameter
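For instance, a hedged variation of the dblink_exec call above in which the table and column arrive as parameters (_tbl and _col here are hypothetical text variables) and are quoted with format():
PERFORM dblink_exec(format(
   $$UPDATE %I
     SET    %I = 'foo'
     WHERE  order_id >= %s
     AND    order_id <  %s
     AND    %I IS DISTINCT FROM 'foo'$$,
   _tbl, _col, _cur, _cur + _step, _col));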
Avoid empty UPDATEs:
How do I (or can I) SELECT DISTINCT on multiple columns?
Postgres uses MVCC (multi-version concurrency control), thus avoiding any locking if you are the only writer; any number of concurrent readers can work on the table, and there won't be any locking.
So if it really takes 5h, it must be for a different reason (e.g. that you do have concurrent writes, contrary to your claim that you don't).
You should delegate this column to another table like this:
create table order_status (
order_id int not null references orders(order_id) primary key,
status int not null
);
Then your operation of setting status=NULL will be instant:
truncate order_status;
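Reads then go through a join; a minimal sketch (a NULL status in the result simply means there is no status row for that order):
SELECT o.order_id, s.status
FROM orders o
LEFT JOIN order_status s ON s.order_id = o.order_id;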
First of all - are you sure that you need to update all rows?
Perhaps some of the rows already have status NULL?
If so, then:
UPDATE orders SET status = null WHERE status is not null;
As for partitioning the change - that's not possible in pure SQL. All updates run in a single transaction.
One possible way to do it in "pure sql" would be to install dblink, connect to the same database using dblink, and then issue a lot of updates over dblink, but it seems like overkill for such a simple task.
Usually just adding a proper WHERE clause solves the problem. If it doesn't - just partition it manually. Writing a full script is too much - you can usually make it in a simple one-liner:
perl -e '
for (my $i = 0; $i <= 3500000; $i += 1000) {
printf "UPDATE orders SET status = null WHERE status is not null
and order_id between %u and %u;\n",
$i, $i+999
}
'
I wrapped the lines here for readability; generally it's a single line. The output of the above command can be fed to psql directly:
perl -e '...' | psql -U ... -d ...
Or first to file and then to psql (in case you'd need the file later on):
perl -e '...' > updates.partitioned.sql
psql -U ... -d ... -f updates.partitioned.sql
I am by no means a DBA, but a database design where you'd frequently have to update 35 million rows might have… issues.
A simple WHERE status IS NOT NULL might speed up things quite a bit (provided you have an index on status) – not knowing the actual use case, I'm assuming if this is run frequently, a great part of the 35 million rows might already have a null status.
However, you can make loops inside a PL/pgSQL function via the LOOP statement. I'll just cook up a small example:
CREATE OR REPLACE FUNCTION nullstatus(count INTEGER) RETURNS integer AS $$
DECLARE
    i INTEGER := 0;
BEGIN
    FOR i IN 0..(count/1000 + 1) LOOP
        UPDATE orders SET status = null WHERE (order_id > (i*1000) and order_id < ((i+1)*1000));
        RAISE NOTICE 'Count: % and i: %', count, i;
    END LOOP;
    RETURN 1;
END;
$$ LANGUAGE plpgsql;
It can then be run by doing something akin to:
SELECT nullstatus(35000000);
You might want to select the row count, but beware that the exact row count can take a lot of time. The PostgreSQL wiki has an article about slow counting and how to avoid it.
Also, the RAISE NOTICE part is just there to keep track on how far along the script is. If you're not monitoring the notices, or do not care, it would be better to leave it out.
Are you sure this is because of locking? I don't think so; there are many other possible reasons. To find out, you can always try doing just the locking. Try this:
BEGIN;
SELECT NOW();
SELECT * FROM orders FOR UPDATE;
SELECT NOW();
ROLLBACK;
To understand what's really happening you should run an EXPLAIN first (EXPLAIN UPDATE orders SET status...) and/or EXPLAIN ANALYZE. Maybe you'll find out that you don't have enough memory to do the UPDATE efficiently. If so, SET work_mem TO 'xxxMB'; might be a simple solution.
Also, tail the PostgreSQL log to see whether any performance-related problems occur.
I would use CTAS:
begin;
create table T as select col1, col2, ..., <new value>, colN from orders;
drop table orders;
alter table T rename to orders;
commit;
Some options that haven't been mentioned:
Use the new-table trick. Probably what you'd have to do in your case is write some triggers so that changes to the original table also get propagated to your table copy, something like that... (Percona is an example of something that does it the trigger way). Another option might be the "create a new column then replace the old one with it" trick, to avoid locks (unclear if it helps with speed).
Possibly calculate the max ID, then generate "all the queries you need" and pass them in as a single batch, like update X set Y = NULL where ID < 10000 and ID >= 0; update X set Y = NULL where ID < 20000 and ID >= 10000; ... Then it might not do as much locking, and still be all SQL, though you do have extra logic up front to do it :(
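One way to keep that "generate all the queries" idea inside the database tooling, sketched under the assumption that psql 9.6 or later is available: build the statements with generate_series and run each one as its own command (and, with autocommit, its own transaction) via \gexec:
SELECT format(
    'UPDATE orders SET status = NULL WHERE order_id >= %s AND order_id < %s',
    g, g + 10000)
FROM generate_series(0, 35000000, 10000) AS g
\gexec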
PostgreSQL version 11 handles this for you automatically with the Fast ALTER TABLE ADD COLUMN with a non-NULL default feature. Please do upgrade to version 11 if possible.
An explanation is provided in this blog post.