What is the fastest way to insert rows into a PostgreSQL Database with GeoKettle? - postgresql

Let's say I have a .csv file with 100 million rows. I import that CSV file into Pentaho Kettle and want to write all rows into a PostgreSQL database. What is the fastest insert transformation? I have tried the normal Table Output transformation and the PostgreSQL Bulk Loader (which is way faster than Table Output). But it is still too slow. Is there a faster way than using the PostgreSQL Bulk Loader?

Considering that the PostgreSQL Bulk Loader runs COPY table_name FROM STDIN, there is nothing faster for loading data into Postgres. A multi-value INSERT will be slower, and plain single-row INSERTs will be slowest of all. So you can't make the load method itself faster.
To speed up COPY you can:
set commit_delay to 100000;
set synchronous_commit to off;
and use other server-side tricks (like dropping indexes before loading).
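As a rough sketch of what that can look like in a single psql session (table name, file path, and COPY options are placeholders for your own setup):
-- session-level tweaks; values are examples, tune for your hardware
-- (commit_delay usually requires superuser privileges)
SET commit_delay TO 100000;
SET synchronous_commit TO off;
-- server-side file load; needs file-read privileges on the server
COPY big_table FROM '/path/to/file.csv' WITH (FORMAT csv, HEADER true);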
NB:
very old but still relevant depesz post
most probably won't work with Pentaho Kettle, but pgloader is worth checking
Update
https://www.postgresql.org/docs/current/static/runtime-config-wal.html
synchronous_commit (enum)
Specifies whether transaction commit will wait for WAL records to be
written to disk before the command returns a “success” indication to
the client. Valid values are on, remote_apply, remote_write, local,
and off. The default, and safe, setting is on. When off, there can be
a delay between when success is reported to the client and when the
transaction is really guaranteed to be safe against a server crash.
(The maximum delay is three times wal_writer_delay.) Unlike fsync,
setting this parameter to off does not create any risk of database
inconsistency: an operating system or database crash might result in
some recent allegedly-committed transactions being lost, but the
database state will be just the same as if those transactions had been
aborted cleanly. So, turning synchronous_commit off can be a useful
alternative when performance is more important than exact certainty
about the durability of a transaction.
(emphasis mine)
Also note that I recommend using SET at the session level, so if GeoKettle does not allow setting config options before running commands on Postgres, you can use pgbouncer's connect_query for the specific user/database pair, or find some other trick. And if you can't set synchronous_commit per session and decide to change it per database or per user instead (so that it applies to the GeoKettle connection), don't forget to set it back to on after the load is over.
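For example (database and role names are made up here), the per-database / per-role variant and the reset afterwards would look roughly like this:
-- applies to every new connection to the database GeoKettle uses
ALTER DATABASE etl_db SET synchronous_commit TO off;
-- or, scoped to the loading role only
ALTER ROLE etl_user SET synchronous_commit TO off;
-- once the load is over, put it back
ALTER DATABASE etl_db RESET synchronous_commit;
ALTER ROLE etl_user RESET synchronous_commit;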

Related

Insert data into remote DB tables from multiple databases through trigger or replication or foreign data wrapper

I need some advice about the following scenario.
I have multiple embedded systems running a PostgreSQL database at different locations, and we have a server running CentOS at our premises.
Each system runs at a remote location and has multiple tables inside its database. These tables have the same names as the server's tables, but each system's table names differ from those of the other systems, e.g.:
system 1 has tables:
sys1_table1
sys1_table2
system 2 has tables:
sys2_table1
sys2_table2
I want to update the tables sys1_table1, sys1_table2, sys2_table1 and sys2_table2 on the server on every insert done on system 1 and system 2.
One solution is to write a trigger on each table, which runs on every insert into the systems' tables and inserts the same data into the server's tables. This trigger would also delete the records in the systems after inserting the data into the server. The problem with this solution is that if the connection with the server cannot be established due to a network issue, the trigger will not execute or the insert will be lost. I have looked at the following solution for this:
Trigger to insert rows in remote database after deletion
The second solution is to replicate tables from system 1 and system 2 to the server's tables. The problem with replication is that if we delete data from the systems, it will also delete the records on the server. I could add a trigger on the server's replicated tables that copies every change into a duplicate table, so the replicated table could be emptied without affecting the data, but that would make a very long table list if we have more than 200 systems.
The third solution is to create a foreign table using postgres_fdw or dblink and update the data inside the server's tables, but won't this also affect the data on the server when we delete data inside a system's table? And what will happen if there is no connectivity to the server?
The fourth solution is to write an application in Python inside each system which connects to the server's database and writes the data in real time; if there is no connectivity to the server, it stores the data inside sys1_table1 or sys2_table2 (or whatever table the data belongs to), and after reconnecting, the code sends the table data to the server's tables.
Which option is best for this scenario? I like the trigger solution best, but is there any way to avoid data loss in case the connection to the server is lost?
I'd go with the fourth solution, or perhaps with the third, as long as it is triggered from outside the database. That way you can easily survive connection loss.
The first solution with triggers has the problems you already detected. It is also a bad idea to start potentially long operations, like data replication across a network of uncertain quality, inside a database transaction. Long transactions mean long locks and inefficient autovacuum.
The second solution may actually also be an option if you have a recent PostgreSQL version that supports logical replication. You can use a publication WITH (publish = 'insert,update'), so that DELETE and TRUNCATE are not replicated. Replication can deal well with lost connectivity (for a while), but it is not an option if you want the data at the source to be deleted after it has been replicated.
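A rough sketch of that setup (publication, subscription, and connection details are invented; the systems need wal_level = logical and the server needs tables with matching names):
-- on system 1 (the publisher)
CREATE PUBLICATION sys1_pub FOR TABLE sys1_table1, sys1_table2
    WITH (publish = 'insert,update');
-- on the central server (the subscriber)
CREATE SUBSCRIPTION sys1_sub
    CONNECTION 'host=system1.example.com dbname=sysdb user=repl_user password=secret'
    PUBLICATION sys1_pub;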

Truncated a table in postgresql. How to recover it back?

Truncate table tablename;
How can I recover it in DBeaver?
You cannot. See doc: http://www.postgresql.org/docs/9.1/static/sql-truncate.html
It has the same effect as an unqualified DELETE on each table, but
since it does not actually scan the tables it is faster. Furthermore,
it reclaims disk space immediately, rather than requiring a subsequent
VACUUM operation. This is most useful on large tables.
The space is returned to the OS and may already be occupied by new data, etc.
TRUNCATE uses file-level operations to delete the data, and this has a number of implications:
On commit you cannot recover
This is not MVCC safe. In other words, other concurrent transactions see the table as empty despite transaction isolation levels that should let them see the old rows.
This means you basically have to recover from backup, and this is a great case for the reminder that backups are about more than hardware failure. They are also there in case of administrator error (and that is why replication is not a backup).
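One small side note on the "on commit" part: TRUNCATE in PostgreSQL is transactional, so until you commit you can still back out (table name taken from the question):
BEGIN;
TRUNCATE tablename;
-- the table looks empty inside this transaction
ROLLBACK;  -- rows are back; after COMMIT they are gone for good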

Set index statistics after insert in Firebird database

I would like to automate the process of setting index statistics in a Firebird database so that it doesn't require a database administrator to run the command, or a user to click a button.
Since the statistics only need to be recalculated after a large number of inserts or deletes, I am considering using an After Insert and After Delete trigger to keep track of how many inserts or deletes have taken place, and then run a procedure to set index statistics based on that value.
My question is whether there is anything to watch out for when setting the index statistics in this manner on a live database. To be clear, I am not rebuilding indexes, but recalculating index statistics only. It is quite possible that this would occur during a mass import or delete operation. Would calculating index statistics during a mass import or delete have the potential to cause any problems?
It is safe to recalculate index statistics on a live database while it is in use. It is also safe to do that in PSQL, e.g. in a stored procedure. For example, I run a nightly scheduled batch job which executes a stored procedure that recalculates statistics for all indexes.
I'm not sure it is wise to do that in a trigger, though, because triggers in Firebird fire per row, not per statement, so you would have to guard the call with some kind of conditional branch in your PSQL body.
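As a sketch of such a procedure in Firebird PSQL (the procedure name is made up; adapt it to your schema and Firebird version, and quote index names if they are case sensitive):
SET TERM ^ ;
CREATE OR ALTER PROCEDURE RECALC_INDEX_STATS
AS
  DECLARE VARIABLE idx VARCHAR(63);
BEGIN
  /* refresh selectivity for all non-system indexes */
  FOR SELECT TRIM(RDB$INDEX_NAME)
      FROM RDB$INDICES
      WHERE COALESCE(RDB$SYSTEM_FLAG, 0) = 0
      INTO :idx
  DO
    EXECUTE STATEMENT 'SET STATISTICS INDEX ' || idx;
END^
SET TERM ; ^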

How to rollback an update in PostgreSQL

While editing some records in my PostgreSQL database using SQL in the terminal (on Ubuntu Lucid), I made a wrong update.
Instead of -
update mytable set start_time='13:06:00' where id=123;
I typed -
update mytable set start_time='13:06:00';
So, all records now have the same start_time value.
Is there a way to undo this change? There are 500+ records in the table, and I do not know what the start_time value of each record was.
Is it lost forever?
I'm assuming it was a transaction that's already committed? If so, that's what "commit" means, you can't go back.
Some data may be recoverable if you're lucky. Stop the database NOW.
Here's an answer I wrote on the same topic earlier. I hope it's helpful.
This might help too: Recover deleted rows in postgresql.
Unless the data is absolutely critical, just restore from backups, it'll be lots easier and less painful. If you didn't have backups, consider yourself soundly thwacked.
If you catch the mistake and immediately bring down any applications using the database and take it offline, you can potentially use Point-in-Time Recovery (PITR) to replay your Write Ahead Log (WAL) files up to, but not including, the moment when the errant transaction was made. This would return the database to the state it was in prior, thus effectively 'undoing' that transaction.
As an approach for a production application database it has a number of obvious limitations, but there are circumstances in which PITR may be the best option available, especially when critical data loss has occurred. However, it is of no value if archiving was not already configured before the corruption event.
https://www.postgresql.org/docs/current/static/continuous-archiving.html
Similar capabilities exist with other relational database engines.
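As a very rough sketch of the recovery-target side (archive path and timestamp are placeholders; on PostgreSQL 12+ these settings go into postgresql.conf together with an empty recovery.signal file, on older versions into recovery.conf):
restore_command = 'cp /mnt/wal_archive/%f "%p"'
recovery_target_time = '2021-05-04 13:05:00'
recovery_target_inclusive = false    # stop just before, not after, the target time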

How to find out when data was inserted to Postgres?

I have inherited an existing Postgres database full of data. Most of the data has a 'created_date' column value. Some of the earlier data was inserted before this was being tracked.
Is there a Postgres metadata table hidden away somewhere that tracks when INSERT queries were done?
Postgres 9.5 or later
You can enable track_commit_timestamp in postgresql.conf (and restart) to start tracking commit timestamps. Then you can get a timestamp for your xmin. Related answer:
Atomically set SERIAL value when committing transaction
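A minimal sketch of both steps (generic table name tbl; the setting needs a server restart and only covers transactions committed while it is on):
-- enable commit-timestamp tracking (takes effect after a restart)
ALTER SYSTEM SET track_commit_timestamp = on;
-- later, read per-row commit times via the xmin system column
SELECT pg_xact_commit_timestamp(xmin) AS committed_at, *
FROM tbl
LIMIT 10;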
Postgres 9.4 or older
There is no such metadata in PostgreSQL unless you record it yourself.
You may be able to deduce some information from the row headers (HeapTupleHeaderData), in particular from the insert transaction id xmin. It holds the ID of the transaction in which the row was inserted (needed to decide visibility in PostgreSQL's MVCC model). Try (for any table):
SELECT xmin, * FROM tbl LIMIT 10;
Some limitations apply:
If the database was dumped and restored then, obviously, the information is gone - all rows are inserted in the same transaction.
If the database is huge / very old / very heavily written, then it may have gone through transaction ID wraparound, and the order of numbers in xmin is ambiguous.
But for most databases you should be able to derive:
the chronological order of INSERTs
which rows were inserted together
when there (probably) was a long period of time between inserts
No timestamp, though.
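For example (using the generic table tbl from above; the bigint cast is only meaningful if the database has not gone through xid wraparound), grouping on xmin shows which rows were inserted together and in roughly what order:
SELECT xmin::text::bigint AS insert_txid, count(*) AS rows_inserted_together
FROM tbl
GROUP BY 1
ORDER BY 1;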
Building on Erwin Brandstetter's answer, if you have PostgreSQL 9.5 or later, the timestamps of commits are being recorded in the write-ahead log all the time, even if track_commit_timestamp is off. They are recorded there to support point-in-time recovery, where you can roll the database to an exact past state that you can specify as a date and time.
What you get by turning track_commit_timestamp on is an easier way to retrieve that information, where you can simply query with
SELECT pg_xact_commit_timestamp(xid);
where xid is the xmin from the row you care about, and it gives you the timestamp.
That's convenient, but it only works if:
track_commit_timestamp is on
it was on when the transaction committed
the transaction ID is not far enough in the past to be 'frozen'.
(PostgreSQL controls the overhead of remembering transaction IDs forever, by eventually 'freezing' old ones. That also controls how far the track_commit_timestamp-dependent functions can look back. There is another setting, vacuum_freeze_max_age, for adjusting that.)
So what do you do if you need the timestamp for a transaction that happened before you turned on track_commit_timestamp?
As long as it happened in PG 9.5 or later, the timestamp is in the write-ahead log. If you have been keeping backups sufficient for point-in-time recovery, that gives you a crude way to find the answer: you can restore a base backup from before you think it happened, set a recovery 'pause' target timestamp near where you guess it happened, connect when it pauses and query to see if it happened yet. If not, set a slightly later target, let the recovery continue, and check again. This can all be done using the backups in another PostgreSQL instance, to avoid interfering with one running production.
That is a clumsy-enough procedure you might wish you could just go back in time and tell your former self to turn track_commit_timestamp on, so it would have been on when the transaction happened that you are interested in. You can turn on track_commit_timestamp before starting the server to recover from a backup, but that doesn't quite do the trick: if it was turned off at the time of the backup, it will only begin saving timestamps for new transactions, after the ones it recovers.
It turns out it is possible to fool PostgreSQL into thinking track_commit_timestamp was on, and then start the server in recovery, and that has much the desired effect: as it replays transactions from the write-ahead log, it does remember their timestamps, and you can then use pg_xact_commit_timestamp() to query them. It will not have timestamps for anything that was in the base backup, but only for the transactions that followed the base backup and were replayed from the WAL. Still, by choosing a base backup known to be earlier than the wanted transaction, this allows the timestamp to be recovered.
There is no official tool/option to 'retroactively' set track_commit_timestamp in this way, but the (fiddly and unsupported) proof-of-concept has been discussed on pgsql-hackers.
track_commit_timestamp (boolean)
Mostly used when setting up a replication server.
Records the commit time of transactions. This parameter can only be set in the postgresql.conf file or on the server command line. The default value is off.
Short answer: no.
If there were, everyone would complain that it wasted space on all the tables they didn't want to track.