How to insert values in PostgreSQL faster than insert() value() functions? - perl

I have an optimization issue. At the moment I am using DBI in Perl to connect to Sybase IQ and load the values into a hash; I then connect to PostgreSQL and use that hash to do a row-by-row INSERT ... VALUES. This is very slow. Does anyone know a faster way to do this? My main problem is that there are two different DB servers, and I need to avoid the hash when inserting, so some type of bulk insert would be ideal; I'm just not sure how.

Populating a PostgreSQL database is documented here. The COPY command is your friend.
Use COPY
Remove Indexes
Remove Foreign Key Constraints
Increase maintenance_work_mem
Increase checkpoint_segments
Disable WAL Archival and Streaming Replication
Run ANALYZE Afterwards
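For the Perl/DBI scenario in the question, the biggest win is replacing row-by-row INSERT with a single COPY ... FROM STDIN: DBD::Pg exposes this via pg_putcopydata/pg_putcopyend, and the rows you stream must be in COPY's text format (tab-separated fields, \N for NULL, with backslash, tab, and newline escaped). A minimal sketch of that serialization step, shown in Python for brevity; copy_escape and rows_to_copy_text are illustrative helpers, not library APIs:

```python
def copy_escape(value):
    """Serialize one field for COPY's text format (NULL -> \\N;
    backslash, tab, newline, and carriage return are escaped)."""
    if value is None:
        return "\\N"
    return (str(value)
            .replace("\\", "\\\\")
            .replace("\t", "\\t")
            .replace("\n", "\\n")
            .replace("\r", "\\r"))

def rows_to_copy_text(rows):
    """Build a COPY ... FROM STDIN payload from an iterable of row tuples."""
    return "".join("\t".join(copy_escape(f) for f in row) + "\n"
                   for row in rows)

# Two rows, one containing a NULL:
print(rows_to_copy_text([(1, "foo"), (2, None)]), end="")
```

In Perl the same payload is fed line by line after `$dbh->do("COPY mytable FROM STDIN")` with `$dbh->pg_putcopydata($line)` for each row, finished with `$dbh->pg_putcopyend()`, so the hash stage can be skipped entirely: read a row from Sybase IQ, serialize it, and push it straight into the COPY stream.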

Related

Increase copy and index speed in RDS postgresql

I'm ingesting billions of rows into a postgresql table using a copy statement in a SQL script. Once the table is built I need to add a couple indexes. This is the only thing I'm doing with the database right now, so I would like to optimize it for copying/indexing. I heard it is best to adjust the maintenance_work_mem parameter value. But when I look at the value in RDS I see:
maintenance_work_mem = GREATEST({DBInstanceClassMemory*1024/63963136},65536)
The database I am using is a db.r6g.12xlarge so it has 384GB of memory. What do you think I should set the value to? Is adjusting the parameter in configuration > parameter group the right place?
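For reference, that formula takes DBInstanceClassMemory in bytes and yields maintenance_work_mem in KB, so you can evaluate it to see what the default already gives you. A quick check for 384 GiB, using the raw instance memory as an approximation (RDS's actual DBInstanceClassMemory is slightly lower):

```python
# RDS default: GREATEST(DBInstanceClassMemory*1024/63963136, 65536), in KB
mem_bytes = 384 * 1024**3          # db.r6g.12xlarge, approximate
default_kb = max(mem_bytes * 1024 // 63963136, 65536)
print(default_kb)                  # roughly 6.3 GB, expressed in KB
```

So the default already scales to roughly 6 GB on this instance class. Keep in mind that maintenance_work_mem is allocated per maintenance operation, so concurrent index builds each get this much; and yes, a custom parameter group under configuration > parameter group is where you change it in RDS.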

How to improve import speed on SQL Workbench/J

I tried the import below, but it is terribly slow, at about 3 rows/sec:
WbImport -file=c:/temp/_Cco_.txt
-table=myschema.table1
-filecolumns=warehouse_id,bin_id,cluster_name
-deleteTarget
-batchSize=10000
-commitBatch
WbImport can use the COPY API of the Postgres JDBC driver.
To use it, run:
WbImport -file=c:/temp/_Cco_.txt
-usePgCopy
-table=myschema.table1
-filecolumns=warehouse_id,bin_id,cluster_name
The options -batchSize and -commitBatch are ignored in that case, so you should remove them.
SQL Workbench/J will then essentially use the equivalent of a COPY ... FROM STDIN. That should be massively faster than regular INSERT statements.
This requires that the input file is formatted according to the requirements of the COPY command.
WbImport uses INSERT to load data. This is the worst way to load data into Redshift.
You should be using the COPY command for this as noted in the Redshift documentation:
"We strongly recommend using the COPY command to load large amounts of data. Using individual INSERT statements to populate a table might be prohibitively slow."

What is the fastest way to insert rows into a PostgreSQL Database with GeoKettle?

Let's say I have a .csv file with 100 million rows. I import that file into Pentaho Kettle and want to write all rows into a PostgreSQL database. What is the fastest insert transformation? I have tried the normal table output transformation and the PostgreSQL Bulk Loader (which is way faster than table output). But it is still too slow. Is there a faster way than the PostgreSQL Bulk Loader?
Considering that the PostgreSQL Bulk Loader runs COPY table_name FROM STDIN, there is nothing faster for loading data into Postgres. A multi-value INSERT will be slower, and single-row INSERTs will be slowest of all. So you can't make the load method itself any faster.
To speed up COPY you can:
set commit_delay to 100000;
set synchronous_commit to off;
and other server side tricks (like dropping indexes before loading).
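For comparison, the three load methods ranked above look like this (table and column names are illustrative):

```sql
-- Fastest: COPY streams all rows in one command
COPY items (id, name) FROM STDIN;

-- Slower: multi-value INSERT, one statement per batch
INSERT INTO items (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- Slowest: one INSERT per row
INSERT INTO items (id, name) VALUES (1, 'a');
```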
NB:
very old but still relevant depesz post
most probably won't work with Pentaho Kettle, but pgloader is worth checking
update
https://www.postgresql.org/docs/current/static/runtime-config-wal.html
synchronous_commit (enum)
Specifies whether transaction commit will wait for WAL records to be
written to disk before the command returns a “success” indication to
the client. Valid values are on, remote_apply, remote_write, local,
and off. The default, and safe, setting is on. When off, there can be
a delay between when success is reported to the client and when the
transaction is really guaranteed to be safe against a server crash.
(The maximum delay is three times wal_writer_delay.) Unlike fsync,
setting this parameter to off does not create any risk of database
inconsistency: an operating system or database crash might result in
some recent allegedly-committed transactions being lost, but the
database state will be just the same as if those transactions had been
aborted cleanly. So, turning synchronous_commit off can be a useful
alternative when performance is more important than exact certainty
about the durability of a transaction.
(emphasis mine)
Also note that I recommend using SET at the session level, so if GeoKettle does not allow running configuration commands before the load, you can use pgbouncer's connect_query for the specific user/database pair, or find some other trick. And if you can't set synchronous_commit per session and decide to change it per database or user (so that it applies to the GeoKettle connection), don't forget to set it back to on after the load is over.
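If you do end up scoping the setting per database or per role, the statements look like this (database and role names are placeholders):

```sql
-- Applies to every new connection to this database:
ALTER DATABASE etl_db SET synchronous_commit = off;
-- ... run the load ...
-- Restore the safe default afterwards:
ALTER DATABASE etl_db RESET synchronous_commit;

-- Or scope it to the loading role instead:
ALTER ROLE etl_user SET synchronous_commit = off;
```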

Perl: Programmatically drop PostgreSQL table index then re-create after COPY using DBD::Pg

I'm copying several tables (~1.5M records) from one data source to another, but it is taking a long time. I'm looking to speed up my use of DBD::Pg.
I'm currently using pg_getcopydata/pg_putcopydata, but I suppose that the indexes on the destination tables are slowing the process down.
I found that I can find some information on table's indexes using $dbh->statistics_info, but I'm curious if anyone has a programmatic way to dynamically drop/recreate indexes based on this information.
The programmatic way, I guess, is to submit the appropriate CREATE INDEX SQL statements via DBI that you would enter into psql.
Sometimes when copying a large table it's better to do it in this order:
create the table without indexes
copy data
add indexes
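A programmatic way to follow that order is to capture each index's definition from pg_catalog.pg_indexes (whose indexdef column holds a complete CREATE INDEX statement) before dropping, then replay the saved statements after the COPY. A sketch of the planning step; reindex_plan is a hypothetical helper, and in Perl the same catalog query works through DBI's selectall_arrayref:

```python
def reindex_plan(indexes):
    """Given (indexname, indexdef) rows from pg_catalog.pg_indexes,
    return the DROP statements to run before a bulk COPY and the
    CREATE statements to replay afterwards."""
    drops = ['DROP INDEX IF EXISTS "{}"'.format(name) for name, _ in indexes]
    creates = [indexdef for _, indexdef in indexes]
    return drops, creates

# indexdef comes back as a full statement, e.g.:
rows = [("items_name_idx",
         "CREATE INDEX items_name_idx ON public.items USING btree (name)")]
drops, creates = reindex_plan(rows)
print(drops[0])
print(creates[0])
```

The rows would come from `SELECT indexname, indexdef FROM pg_catalog.pg_indexes WHERE schemaname = 'public' AND tablename = 'items'`. One caveat: indexes backing primary-key or unique constraints must be removed with ALTER TABLE ... DROP CONSTRAINT instead, which this sketch ignores.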

Dump postgres data with indexes

I've got a Postgres 9.0 database that I frequently take data dumps of.
This database has a lot of indexes, and every time I restore a dump Postgres starts a background task, the vacuum cleaner (is that right?). That task consumes a lot of processing time and memory while the indexes of the restored dump are recreated.
My question is:
Is there a way to dump the database data and the indexes of that database?
If there is, is it worth the effort (i.e., will dumping the data together with the indexes perform better than the vacuum work after restore)?
Oracle has the "data pump" utility as a faster way to import and export. Does Postgres have something similar?
Thanks in advance,
Andre
If you use pg_dump twice, once with --schema-only and once with --data-only, you can cut the schema-only output into two parts: the first with the bare table definitions and the final part with the constraints and indexes.
Something similar can probably be done with pg_restore.
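On newer versions, pg_dump can produce exactly that split by itself via --section (available from 9.2; the 9.0 in the question predates it, hence the manual cut described above). A sketch, with database names as placeholders:

```shell
# Table definitions only (no indexes or constraints):
pg_dump --section=pre-data mydb > pre.sql
# The data itself:
pg_dump --section=data mydb > data.sql
# Indexes, constraints, triggers, created after the data is in:
pg_dump --section=post-data mydb > post.sql

# Restore in the same order:
psql newdb < pre.sql
psql newdb < data.sql
psql newdb < post.sql
```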
Best Practice is probably to
restore the schema without indexes
and possibly without constraints,
load the data,
then create the constraints,
and create the indexes.
If an index exists, a bulk load will make PostgreSQL write to the database and to the index. And a bulk load will make your table statistics useless. But if you load data first, then create the index, the stats are automatically up to date.
We store scripts that create indexes and scripts that create tables in different files under version control. This is why.
In your case, changing autovacuum settings might help you. You might also consider disabling autovacuum for some tables or for all tables, but that might be a little extreme.