No performance improvement using --jobs parameter for pg_restore - postgresql

I am using pg_restore to restore the Postgresql database backed up using pg_dump. (both the backup and restore is happening on PostgreSQL 12). The backup is taken using the custom format option --format=custom.
The restore for 10 GB backup is taking almost 30 mins. The time to restore increases significantly as the size of the backup increases. So I tried out the --jobs parameter for improving the restore times.
As per the documentation, concurrent connections will be used to restore the database objects. I verified the output of restore and I could verify that there were parallel threads started equal to the value of the --jobs parameter. However the time to restore has not improved with any value of the --jobs parameter.
I am aware that the performance depends on the hardware infrastructure. But the machine has 16 vcpus and 32GB RAM.
I have also tried tuning Postgres as mentioned in the blog with following configurations but still no improvements in restore times.
work_mem = 32MB
shared_buffers = 4GB
maintenance_work_mem = 2GB
full_page_writes = off
autovacuum = off
wal_buffers = -1
Is there anything that I have missed? How do I get an improvement in the restore time?

There are several things that could be the problem:
pg_restore parallelizes by running several COPY and CREATE INDEX commands concurrently.
Now if your database has one large table with a single large index, parallelization won't help you.
Perhaps your I/O system is at its limit. Then running processes in parallel won't improve the performance.
If you have lots of large objects, that is known to slow down processing, and I am not sure if parallelization helps or not.
Don't ever set full_page_writes or autovacuum to off unless you set them back to on after restore and are ready to dump the database cluster in the event of a crash. I doubt that the performance gain is worth it, particularly for full_page_writes.
One parameter that you forgot is max_wal_size. If you raise that, it will help write performance.
Apart from that, you have to find out where the bottleneck is before you can fix it.
How about using a different backup method like pg_basebackup that is normally faster?

Related

Postgres configuration for use in embedded?

I have the same 5000 key/value pairs being read/written continuously (every 150ms or so) on a Debian system equivalent to a Raspberry Pi 3.
I don't care about persisting this data, it's recreated whenever my application server is launched.
Initially I used SQLite for this, using an in-memory table. However, now I want to access the data from multiple processes (using a tmpfs didn't work out great) and even from a remote client, as well as add an HTTP API, use LISTEN/NOTIFY for change notifications, so I'd like to switch to PG which is more appropriate for these.
Given these circumstances:
small dataset that fits in RAM
no need for persistence
low power PC
running 24/7 forever
don't want to thrash the flash storage
...what would be a good approach to configuring PG?
I found this 10yo question and the last update was 5 years ago saying to use a 3rd party extension, which I'm not too excited about.
You should create few indexes apart from the primary keys and keep the fillfactor of all your tables low, perhaps around 50. That should get you HOT updates, which will reduce the need for VACUUM and the amount of data written.
You may want to reduce shared_buffers to conserve memory, but keep it big enough to contain the database.
Set synchronous_commit to off to have less disk I/O. If you are ready to ditch the database after an unclean shutdown or system crash, you can set fsync = off, but then you have to remove the cluster after each crash. If you take it that far, you could reduce the write load further by using unlogged tables.
Set checkpoint_timeout high for fewer writes.

PostgreSQL benchmarking over a RAMdisk?

I have been considering the idea of moving to a RAMdisk for a while. I know its risks, but just wanted to do a little benchmark. I just had two questions: (a) when reading the query plan, will it still differentiate between disk and buffers hits? If so, should I assume that both are equally expensive or should I assume that there is a difference between them?
(b) a RAM disk is not persistent, but if I want to export some results to persistent storage, are there some precautions I would need to take? Is it the same as usual e.g. COPY command?
I do not recommend using RAM disks in PostgreSQL for persistent storage. With careful tuning, you can get PostgreSQL not to use more disk I/O than what is required to make your data persistent.
I recommend doing this:
Have more RAM in your machine than the size of the database.
Define shared_buffers big enough to contain the database (on Linux, define memory hugepages to contain them).
Increase checkpoint_timeout and max_wal_size to get fewer checkpoints.
Set synchronous_commit = off to keep PostgreSQL from syncing WAL to disk on every commit.
If you are happy to lose all your data in the case of a crash, define your tables UNLOGGED. The data will survive a normal shutdown.
Anyway, to answer your questions:
(a) You should set seq_page_cost and random_page_cost way lower to tell PostgreSQL how fast your storage is.
(b) You could run backups with either pg_dump or pg_basebackup, they don't care what kind of storage you have got.
when reading the query plan, will it still differentiate between disk and buffers hits?
It never distinguished between them in the first place. It distinguishes between "hit" and "read", but the "read" can't tell which are truly from disk and which are from OS/FS cache.
PostgreSQL has no idea you are running on a RAM disk, so will continue to report those as it always has.
If so, should I assume that both are equally expensive or should I assume that there is a difference between them?
This is a question that should be answered through your benchmarking. On some systems, memory can be read-ahead from main memory into the faster caches, making sequential reads still faster than random reads. If you care, you will have to benchmark it on your own system.
Reading data from RAM into shared_buffers is still surprisingly expensive due to things like lock management. So as a rough starting point, maybe seq_page_cost=0.1 and random_page_cost=0.15.
a RAM disk is not persistent, but if I want to export some results to persistent storage, are there some precautions I would need to take?
The risk would be that your system crashes before the export has finished. But what precaution can you take against that?

Increase data insert speed of PostgreSQL

I am stuck in a problem that PostgreSQL data writes are very slow.
I developed my application in Java (using JDBC) to insert data into a PostgreSQL DB. It works well on our remote development server. However, after I deploy it to the production server, it causes a problem.
The insert speed of PostgreSQL on the production server is only ~150 records/s for 200000K records, while it is ~1000 records/s for the same data set on the development server.
Firstly, I tried to change the configuration in postgresql.conf as follows:
effective_cache_size = 4GB
max_wal_size = 2GB
work_mem = 128MB
shared buffers = 512MB
After I changed the configuration and restarted, it only affects the query speed, while the insert speed does not change (~150 records/s).
I have checked my server memory info, there is a lot of free memory ~4GB. The inserter only uses 0.5% of 8GB (~40MB).
So my questions are:
Is this a problem of a storage disk, such as SSD and HDD or virtual
and physical etc.? Why is the insert speed still very slow, although I have changed the configuration? Is there any way
for increasing the insert speed?
Note: the problem does not relate to the insert query structure.
I have used the same query in the same condition elsewhere (I set up an
environment in 2 servers in the same way). I do not know why the
DEVELOPMENT server (4GB) works better than the PRODUCTION server
(8GB).
The only one of your parameters that has an influence on INSERT performance is max_wal_size. High values prevent frequent checkpoints.
Use iostat -x 1 on the database server to see how busy your disks are. If they are quite busy, you are probably I/O bottlenecked. Maybe the I/O subsystem on your test server is better?
If you are running the INSERTs in many small transactions, you may be bottlenecked by fsync to the WAL. The symptom is a busy disk with not much I/O being performed.
In that case batch the INSERTs in larger transactions. The difference you observe could then be due to different configuration: Maybe you set synchronous_commit or (horribile dictu!) fsync to off on the test server.

Postgresql DB backup Ideal practices

• What are ideal practices for taking PostgreSQL logical backup using pg_dump?
• Is it ideal to take backup from a standby/slave node? If replication lag is less than 200ms
• Is it ideal to take backup from standby/slave node, and is there any specific configuration we need to change?
• Which method is a good way for taking backups logical backup or physical backup? where DB is getting updated frequently. As a backup is taken for disaster recovery which method is the faster and better backup and disaster recovery(restore).
updated
Our current database size is 5GB and replication is on hot standby mode.
We are running the Backup script on slave node but it takes remote backup from the master node every 30 minutes.
The reason I created this question is to understand when the backup is running some COPY statements takes 6 mins to complete, even though it will not affect other transactions on DB, is there any other issues occurs if a statement is taking more time.
I thought about what you wrote and here are some ideas for you:
If you need backup which will really be consistent to some point in time then you must use pg_basebackup or pg_barman (internally uses pg_basebackup) - explanation is in 1. link below. Latest pg_basebackup 10 streams WAL logs so you backup also all changes done during backup. Of course this backup takes only the whole PG instance. On the other hand it does not lock any table. And if you do it from remote instance then it causes only small CPU load on PG instance and disk IO is not as big as some texts suggests. See links 4 about my experiences. Restoration is quite simple - see link 5.
If you use pg_dump you must understand that you have no guarantee that your backup is really consistent to the point in time - again see link 1. There is a possibility to use snapshot of the database (see links 2 and 3) but even with it you cannot count on 100% consistency. We used pg_dump only on our analytical database which loads new only 1x per day (yesterdays partitions from production database). You can speed it with parallel option (works only for directory backup format). But downside is much higher load on PG instance - higher CPU usage, much higher disk IO. Even if you run pg_dump remotely - in such case you save only disk IO for saving of backup files. Plus pg_dump needs to place read lock on tables so it can collied either with new inserts or with replication (when taken on replica). But when your database reaches hundreds of GBs then even parallel dump can takes hours and in that moment you would need to switch to pg_basebackup anyway.
pg_barman is "comfortable version" of pg_basebackup + it allows you to prevent data loss even when your PG instance crashes very badly. Setting it to work requires more changes but it is definitely worth it. You will have to set WAL log archiving (see link 6) and if you PG is <10 you will have to set "max_wal_senders" and "max_replication_slots" (which you need for replication anyway) - everything is in pg-barman manual although description is not exactly great. pg_barman will stream and store WAL records even between backups so this way you can be sure that data loss in case of very bad crash will be almost none. But making it work can take many hours because descriptions are not exactly good. pg-barman does both backup and restoration with its commands.
Your database is 5GB big so any backup method will be quick. But you have to decide if you need point in time recovery and almost zero data loss or not - so if you will invest time to setting pg-barman or not.
Links:
PostgreSQL, Backups and everything you need to know
Review for Paper: 14-Serializable Snapshot Isolation in PostgreSQL - about snapshots
Parallel dumping of databases - example how to use snapshot
pg_basebackup experiencies
pg_basebackup - restore tar backup
Archiving WAL logs using script

Loading data to Postgres RDS is still slow after tuning parameters

We have created a RDS postgres instance (m4.xlarge) with 200GB storage (Provisioned IOPS). We are trying to upload data from company data mart to the 23 tables in RDS using DataStage. However the uploads are quite slow. It takes about 6 hours to load 400K records.
Then I started tuning the following parameters according to Best Practices for Working with PostgreSQL:
autovacuum 0
checkpoint_completion_target 0.9
checkpoint_timeout 3600
maintenance_work_mem {DBInstanceClassMemory/16384}
max_wal_size 3145728
synchronous_commit off
Other than these, I also turned off multi AZ and back-up. SSL is enabled though, not sure this will change anything. However, after all the changes, still not much improvement. DataStage is uploading data in parallel already ~12 threads. Write IOPS is around 40/sec. Is this value normal? Is there anything else I can do to speed up the data transfer?
In Postgresql, you're going to have to wait 1 full round trip (latency) for each insert statement written. This latency is the latency between the database all the way to the machine where the data is being loaded from.
In AWS you have many options to improve performance.
For starters, you can load your raw data onto an EC2 instance and start importing from there, however, you will likely not be able to use your dataStage tool unless it can be loaded directly on the ec2 instance.
You can configure dataStage to use batch processing where each insert statement actually contains many rows.. generally, the more, the faster.
disable data compression and make sure you've done everything you can to minimize latency between the two endpoints.