Syntax for COPY in PostgreSQL

INSERT INTO contacts_lists (contact_id, list_id)
SELECT contact_id, 67544
FROM plain_contacts
Here I want to use the COPY command instead of INSERT to reduce the time it takes to insert values. I fetch the data with a SELECT. How can I insert the result into a table using COPY in PostgreSQL? Could you give an example, or suggest any other way to reduce the insert time?

As your rows are already in the database (you can apparently SELECT them), using COPY will not increase the speed in any way.
To use COPY you would first have to write the values to a text file, which is then read back into the database. But if you can SELECT them, writing them to a text file is a completely unnecessary step that will slow your insert down, not speed it up.
Your statement is as fast as it gets. The only thing that might speed it up (apart from buying a faster hard disk) is to drop any index on contacts_lists that contains the column contact_id or list_id and re-create it once the insert is finished, as sketched below.
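A minimal sketch of that drop/re-create pattern (the index name and definition here are hypothetical, since the question doesn't show them):
DROP INDEX IF EXISTS idx_contacts_lists_contact_id;
INSERT INTO contacts_lists (contact_id, list_id)
SELECT contact_id, 67544
FROM plain_contacts;
-- re-create the index once the bulk insert is done
CREATE INDEX idx_contacts_lists_contact_id ON contacts_lists (contact_id);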

You can find the syntax described in many places, I'm sure. One of those is this wiki article.
It looks like it would basically be:
COPY (SELECT contact_id, 67544 FROM plain_contacts) TO '/tmp/some_file'
And
COPY contacts_lists (contact_id, list_id) FROM '/tmp/some_file'
But I'm just reading from the resources that Google turned up. Give it a try and post back if you need help with a specific problem.

Related

Best way to prevent duplicate data on copy csv postgresql

This is more of a conceptual question because I'm planning how best to achieve our goals here.
I have a postgresql/postgis table with five columns. I'll be inserting/appending data into the table from a CSV file every 10 minutes or so via the COPY command. There will likely be some duplicate rows, so I'd like to copy the data from the CSV file into the table while preventing any duplicates from getting in. Three columns together determine whether a row is a duplicate: if "latitude", "longitude" and "time" are all equal, the entry is a duplicate. Should I make a composite key from all three columns? If I do, will it just throw an error when I try to copy the CSV file in? I'm going to be copying the file automatically, so I'd want it to load the rows that aren't duplicates and skip the ones that are. Is there a way to do this?
Also, I of course want it to look for duplicates in the most efficient way. I don't need to check the whole table (which will be quite large), just the past 20 minutes or so via the timestamp on the row, and I've indexed the table on the time column.
Thanks for any help!
Upsert
The answer by Linoff is correct but can be simplified a bit by the new "UPSERT" feature of Postgres 9.5 (a.k.a. MERGE). That new feature is implemented in Postgres as the INSERT ... ON CONFLICT syntax.
Rather than explicitly checking for a violation of the unique index, we let the ON CONFLICT clause detect the violation. Then we DO NOTHING, meaning we abandon the INSERT for that row without bothering to attempt an UPDATE, and simply move on to the next row.
We get the same results as Linoff's code but lose the WHERE clause.
INSERT INTO bigtable(col1, … )
SELECT col1, …
FROM stagingtable st
ON CONFLICT (col1, col2, col3)
DO NOTHING
;
I think I would take the following approach.
First, create an index on the three columns that you care about:
create unique index idx_bigtable_col1_col2_col3 on bigtable(col1, col2, col3);
Then, load the data into a staging table using copy. Finally, you can do:
insert into bigtable(col1, . . . )
select col1, . . .
from stagingtable st
where (col1, col2, col3) not in (select col1, col2, col3 from bigtable);
Assuming no other data modifications are going on, this should accomplish what you want. Checking for duplicates via the index should be fine from a performance perspective. One caveat: if the staging data itself can contain the same key twice, the second row will still violate the unique index; selecting distinct col1, col2, col3 (or de-duplicating the staging table first) avoids that.
An alternative method is to emulate MySQL's "on duplicate key update" behaviour and ignore such records. Bill Karwin suggests implementing a rule in an answer to this question; the documentation for rules is here. Something similar could also be done with triggers.
The method posted by Basil Bourque was great, but the original version of his answer had a slight syntax error: it passed the name of the index to ON CONFLICT rather than a column list.
Based on the documentation, the working form is the following:
INSERT INTO bigtable(col1, … )
SELECT col1, …
FROM stagingtable st
ON CONFLICT (col1, col2, col3)
DO NOTHING
;
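Putting the pieces together for the original CSV question, a minimal end-to-end sketch (assuming Postgres 9.5 or later; the table name and the two unnamed columns are hypothetical, while the three key columns come from the question):
-- one-time setup: a unique index that defines what "duplicate" means
CREATE UNIQUE INDEX idx_readings_lat_lon_time ON readings (latitude, longitude, "time");
-- every 10 minutes: load the CSV into an empty staging table...
TRUNCATE staging_readings;
COPY staging_readings (latitude, longitude, "time", sensor_id, reading)
FROM '/path/to/batch.csv' WITH (FORMAT csv, HEADER true);
-- ...then move only the non-duplicates across; ON CONFLICT DO NOTHING
-- also skips duplicates occurring within the batch itself
INSERT INTO readings (latitude, longitude, "time", sensor_id, reading)
SELECT latitude, longitude, "time", sensor_id, reading
FROM staging_readings
ON CONFLICT (latitude, longitude, "time") DO NOTHING;
This keeps the fast bulk COPY, while the single INSERT applies the duplicate check in one pass.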

Merge to insert / delete / update entire row without naming EVERY variable

I want to do something like
use mydb
go
begin tran
merge dbo.aTestTarget as T
using dbo.aTestSource as S
on (T.link = S.link)
when not matched by target and (s.code like '*I%') then
-- is there a way to do this sort of thing?
insert (T.*) values (S.*)
when matched and ...
rollback tran
go
Is there some way to do this WITHOUT defining EVERY column? I have a number of tables with 20 to 50 fields.
No, there is not. Using the * syntax is a bad practice anyway because it makes for fragile code that will be hard to maintain.
However, in SSMS you can drag and drop the Columns folder under a table into the query editor to get a comma-separated list of all of that table's columns. That makes typing a little easier. :)
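If you'd rather generate that list with a query, the following sketch works on SQL Server 2017 and later (which have STRING_AGG); the table name is the one from the question:
SELECT STRING_AGG(c.name, ', ') WITHIN GROUP (ORDER BY c.column_id)
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'dbo.aTestTarget');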

Insert and update records in one TSQL statement?

I have a table BigTable and a table LittleTable. I want to move a copy of some records from BigTable into LittleTable and then (for these records) set BigTable.ExportedFlag to T (indicating that a copy of the record has been moved to little table).
Is there any way to do this in one statement?
I know I can use a transaction to:
copy the records into LittleTable based on a WHERE clause
update BigTable, setting ExportedFlag to T, based on the same WHERE clause
I've also looked into a MERGE statement, which does not seem quite right because I don't want to change values in LittleTable, just move records into it.
I've looked into an OUTPUT clause after the UPDATE statement but can't find a useful example. I don't understand why Pinal Dave uses Inserted.ID, Inserted.TEXTVal, Deleted.ID, Deleted.TEXTVal instead of Updated.TextVal. Is the update considered an insertion or a deletion?
I found this post TSQL: UPDATE with INSERT INTO SELECT FROM saying "AFAIK, you cannot update two different tables with a single sql statement."
Is there a clean single statement to do this? I am looking for a correct, maintainable SQL statement. Do I have to wrap two statements in a single transaction?
You can use the OUTPUT clause, as long as LittleTable meets the requirements to be the target of an OUTPUT ... INTO (no enabled triggers, not on either side of a foreign key, and no CHECK constraints or enabled rules):
UPDATE BigTable
SET ExportedFlag = 'T'
OUTPUT inserted.Col1, inserted.Col2 INTO LittleTable(Col1,Col2)
WHERE <some_criteria>
It makes no difference whether you use INSERTED or DELETED here. The only column that will differ between the two is the one you are updating: deleted.ExportedFlag holds the before value, while inserted.ExportedFlag will be 'T'.
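For a concrete, self-contained sketch (the column names are invented for illustration):
-- hypothetical schema
CREATE TABLE BigTable (Id int, Payload varchar(50), ExportedFlag char(1));
CREATE TABLE LittleTable (Id int, Payload varchar(50));
-- copy the matching rows into LittleTable and flag them in one statement
UPDATE BigTable
SET ExportedFlag = 'T'
OUTPUT inserted.Id, inserted.Payload INTO LittleTable (Id, Payload)
WHERE ExportedFlag IS NULL;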

Sybase stored procedure - how do I create an index on a #table?

I have a stored procedure which creates and works with a temporary #table.
Some of the queries would be tremendously optimized if that temporary #table had an index created on it.
However, creating an index within the stored procedure fails:
create procedure test1 as
SELECT f1, f2, f3
INTO #table1
FROM main_table
WHERE 1 = 2
-- insert rows into #table1
create index my_idx on #table1 (f1)
SELECT f1, f2, f3 FROM #table1 (index my_idx) WHERE f1 = 11 -- "QUERY X"
When I call the above, the query plan for "QUERY X" shows a table scan.
If I simply run the code above outside the stored procedure, the messages show the following warning:
Index 'my_idx' specified as optimizer hint in the FROM clause of table '#table1' does not exist. Optimizer will choose another index instead.
This can be resolved when running ad hoc (outside the stored procedure) by splitting the code above into two batches, adding "go" after the index creation:
create index my_idx on #table1 (f1)
go
Now, "QUERY X" query plan shows the use of index "my_idx".
QUESTION: How do I mimic running the "create index" in a separate batch when it's inside the stored procedure? I can't insert a "go" there like I can in the ad-hoc copy above. Please note that I'm aware of the option of splitting "QUERY X" off into a separate stored procedure, and I am looking for a solution that avoids that.
P.S. If it matters, this is on Sybase 12 (ASE 12.5.4)
UPDATE:
I have seen several references to "schema bumping" during my Googling before posting the question, but that doesn't seem to happen in my case:
"You can create a table, populate it, create an index on it and select values from it in the same proc and have the optimizer fully cost it based on accurate information. This is called 'schema bumping' and has been in place since 11.5.1."
The Sybase documentation says that you can create and use an index on a temporary table within the same stored procedure:
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.dc20023_1251/html/optimizer/X26029.htm
I think to get around this you will need to split your stored procedure into at least two parts: one that creates and populates the table and then builds the index, and a second one that runs the select query. A sketch follows.
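A hedged sketch of that split, assuming ASE of this vintage resolves table names at CREATE PROCEDURE time (hence the dummy #table1 created and dropped around the inner procedure):
-- create a dummy #table1 so the inner procedure compiles
SELECT f1, f2, f3 INTO #table1 FROM main_table WHERE 1 = 2
go
create procedure test1_query as
SELECT f1, f2, f3 FROM #table1 WHERE f1 = 11
go
drop table #table1
go
create procedure test1 as
SELECT f1, f2, f3 INTO #table1 FROM main_table WHERE 1 = 2
-- insert rows into #table1
create index my_idx on #table1 (f1)
-- the inner procedure's plan is built on its first execution,
-- at which point my_idx already exists
exec test1_query
go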
I am not sure how you are getting this problem; it might occur in an older version of Sybase. With version 12.5.4 I tried executing the same thing you describe, and in my case the optimizer correctly chose the index created in the stored procedure. Usually we do not need to break the SQL inside a stored procedure into batches, or else we would also need a separate batch for the create table command.
If we try to create an index within the same batch (not in a stored procedure), we do get the error you describe, because we are trying to create an index on a table and then use it within the same batch: the Sybase server compiles the whole batch in one go, hence the problem. But as far as stored procedures are concerned, in Sybase 12.5.4 there should be no problem.

Does Firebird need manual reindexing?

I use both Firebird embedded and Firebird Server, and from time to time I need to reindex the tables using a procedure like the following:
CREATE PROCEDURE MAINTENANCE_SELECTIVITY
AS
DECLARE VARIABLE S VARCHAR(200);
BEGIN
FOR SELECT RDB$INDEX_NAME FROM RDB$INDICES INTO :S DO
BEGIN
S = 'SET STATISTICS INDEX ' || S || ';';
EXECUTE STATEMENT :S;
END
SUSPEND;
END
I guess this is normal using embedded, but is it really needed using a server? Is there a way to configure the server to do it automatically when required or periodically?
First, let me point out that I'm no Firebird expert, so I'm answering on the basis of how SQL Server works.
In that case, the answer is both yes, and no.
The indexes are of course updated on SQL Server, in the sense that if you insert a new row, all indexes for that table will contain that row, so it will be found. So basically, you don't need to keep reindexing the tables for that part to work. That's the "no" part.
The problem, however, is not with the index, but with the statistics. You're saying that you need to reindex the tables, but then you show code that manipulates statistics, and that's why I'm answering.
The short answer is that statistics slowly go out of whack as time goes by. They might not deteriorate to the point of being unusable, but they will drift away from the accurate state they were in when you recreated/recalculated them. That's the "yes" part.
The main problem with stale statistics is that if the distribution of the keys in the indexes changes drastically, the statistics might not pick that up right away, and thus the query optimizer will pick the wrong indexes, based on the old, stale, statistics data it has on hand.
For instance, let's say one of your indexes has statistics that says that the keys are clumped together in one end of the value space (for instance, int-column with lots of 0's and 1's). Then you insert lots and lots of rows with values that make this index contain values spread out over the entire spectrum.
If you now do a query that uses a join from another table, on a column with low selectivity (also lots of 0's and 1's) against the table with this index of yours, the query optimizer might deduce that this index is good, since it will fetch many rows that will be used at the same time (they're on the same data page).
However, since the data has changed, it'll jump all over the index to find the relevant pieces, and thus not be so good after all.
After recalculating the statistics, the query optimizer might see that this index is sub-optimal for this query, and pick another index instead, which is more suited.
Basically, you need to recalculate the statistics periodically if your data is in flux. If your data rarely changes, you probably don't need to do it very often, but I would still add a maintenance job with some regularity that does this.
As for whether it is possible to ask Firebird to do this on its own: again, I'm on thin ice, but I suspect there is a way. In SQL Server you can set up maintenance jobs that do this on a schedule, and at the very least you should be able to kick off a batch file from the Windows scheduler to do something similar.
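For the SQL Server side of that comparison, such a maintenance job typically just runs the built-in procedure:
-- recompute statistics for all tables in the current database
EXEC sp_updatestats;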
That does not reindex; it recomputes the index selectivity values that the optimizer uses to select the most suitable index. You don't need to do that unless the index size changes a lot, or you created the index before adding the data, in which case you need to run the recalculation once.
Embedded and Server should have exactly the same functionality apart from the process model.
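For a single index, the recalculation is one statement (the index name here is hypothetical):
SET STATISTICS INDEX IDX_MYTABLE_COL1;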
I wanted to update this answer for newer Firebird versions; here is the updated procedure.
SET TERM ^ ;
CREATE OR ALTER PROCEDURE NEW_PROCEDURE
AS
DECLARE VARIABLE S VARCHAR(300);
begin
FOR select 'SET statistics INDEX ' || RDB$INDEX_NAME || ';'
FROM RDB$INDICES
WHERE RDB$INDEX_NAME <> 'PRIMARY' INTO :S
DO BEGIN
EXECUTE STATEMENT :s;
END
end^
SET TERM ; ^
GRANT EXECUTE ON PROCEDURE NEW_PROCEDURE TO SYSDBA;
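Then recalculating everything is a single call from any client, e.g. isql:
EXECUTE PROCEDURE NEW_PROCEDURE;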