Loading data to Postgres RDS is still slow after tuning parameters

We have created an RDS Postgres instance (m4.xlarge) with 200 GB of storage (Provisioned IOPS). We are trying to upload data from our company data mart to the 23 tables in RDS using DataStage. However, the uploads are quite slow: it takes about 6 hours to load 400K records.
Then I started tuning the following parameters according to Best Practices for Working with PostgreSQL:
autovacuum 0
checkpoint_completion_target 0.9
checkpoint_timeout 3600
maintenance_work_mem {DBInstanceClassMemory/16384}
max_wal_size 3145728
synchronous_commit off
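For reference, the values that are actually in effect on the instance can be checked from a psql session (RDS applies them through the DB parameter group), for example:
SHOW autovacuum;
SHOW checkpoint_completion_target;
SHOW checkpoint_timeout;
SHOW maintenance_work_mem;
SHOW max_wal_size;
SHOW synchronous_commit;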
Other than these, I also turned off Multi-AZ and backups. SSL is still enabled, though I'm not sure that makes a difference. However, after all these changes, there is still not much improvement. DataStage is already uploading data in parallel with ~12 threads. Write IOPS is around 40/sec. Is this value normal? Is there anything else I can do to speed up the data transfer?

In PostgreSQL, you have to wait one full round trip (latency) for each INSERT statement. This is the network latency between the database and the machine the data is being loaded from.
In AWS you have many options to improve performance.
For starters, you can load your raw data onto an EC2 instance and import from there; however, you will likely not be able to use your DataStage tool unless it can be installed directly on that EC2 instance.
You can configure DataStage to use batch processing, where each INSERT statement actually contains many rows; generally, the more rows per statement, the faster the load.
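For example, a multi-row INSERT of this shape sends many rows per round trip (the table and column names here are just placeholders):
-- One statement, one round trip, many rows:
INSERT INTO target_table (id, col_a, col_b) VALUES
    (1, 'a1', 'b1'),
    (2, 'a2', 'b2'),
    (3, 'a3', 'b3');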
Disable data compression and make sure you've done everything you can to minimize latency between the two endpoints.

Postgres insert slow after snapshot restore but not after restart

My setup
Postgres 11 on an AWS EC2 t4g.xlarge instance (4 vCPU, 16 GB) running Amazon Linux.
Set up to take a nightly disk snapshot (my workload doesn't require high reliability).
Database has table xtc_table_1 with ~6.3 million rows, about 3.2GB.
Scenario
To test some new data processing code, I created a new test AWS instance from the nightly snapshot of my production instance.
I create a new UNLOGGED table, and populate it with INSERT INTO holding_table_1 SELECT * FROM xtc_table_1;
It takes around 2 min 24 sec for the CREATE statement to execute.
I truncate holding_table_1 and run the CREATE statement again, and it completes in 30 sec. The ~30 second timing is consistent for successive truncates and creates of the table.
I think this may be because of some caching of data. I tried restarting the Postgres service, then rebooting the AWS instance (after stopping Postgres with sudo service postgresql stop), and then stopping and starting the AWS instance. However, it still takes ~30 sec to create the table.
If I rebuild a new instance from the snapshot, the first time I run the CREATE statement it's back to the ~2m+ time.
Similar behavior for other tables xtc_table_2, xtc_table_3.
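Roughly, the test being repeated looks like this (the exact CREATE form is a sketch; timings are taken with psql's \timing and are the ones reported above):
\timing on
CREATE UNLOGGED TABLE holding_table_1 (LIKE xtc_table_1);
INSERT INTO holding_table_1 SELECT * FROM xtc_table_1;   -- ~2 min 24 sec first time after a restore
TRUNCATE holding_table_1;
INSERT INTO holding_table_1 SELECT * FROM xtc_table_1;   -- ~30 sec on every subsequent run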
Hypothesis
After researching and finding this answer, I wonder if what's happening is that the disk snapshot contains some WAL data that is being replayed the first time I do anything with xtc_table_n. And that subsequently, because Postgres was shut down "nicely" there is no WAL to playback.
Does this sound plausible?
I don't know enough about Postgres internals to be sure. I would have imagined that any WAL playback would happen on starting up postgres, but maybe it happens at the individual table level the first time a table is touched?
Knowing the reason is more than just theoretical; I'm using the test instance to do some tuning on some processing code, and need to be confident in having a consistent baseline to measure from.
Let me know if more information is needed about my setup or what I'm doing.
@jellycsc's suggestion was correct; adding more info here in case it's helpful to anyone else.
The problem I was encountering was not a postgres issue at all, but because of the way AWS handles volumes and snapshots.
From this page:
For volumes that were created from snapshots, the storage blocks must be pulled down from Amazon S3 and written to the volume before you can access them. This preliminary action takes time and can cause a significant increase in the latency of I/O operations the first time each block is accessed. Volume performance is achieved after all blocks have been downloaded and written to the volume.
I used the fio utility as described in the linked AWS page to initialize the restored volume, and first-time performance was consistent with subsequent query times.

Postgres configuration for use in embedded?

I have the same 5000 key/value pairs being read/written continuously (every 150ms or so) on a Debian system equivalent to a Raspberry Pi 3.
I don't care about persisting this data, it's recreated whenever my application server is launched.
Initially I used SQLite for this, with an in-memory table. However, I now want to access the data from multiple processes (using a tmpfs didn't work out great) and even from a remote client, add an HTTP API, and use LISTEN/NOTIFY for change notifications, so I'd like to switch to PG, which is a better fit for these needs.
Given these circumstances:
small dataset that fits in RAM
no need for persistence
low power PC
running 24/7 forever
don't want to thrash the flash storage
...what would be a good approach to configuring PG?
I found this 10-year-old question whose last update, from 5 years ago, says to use a third-party extension, which I'm not too excited about.
You should create few indexes apart from the primary keys and keep the fillfactor of all your tables low, perhaps around 50. That should get you HOT updates, which will reduce the need for VACUUM and the amount of data written.
You may want to reduce shared_buffers to conserve memory, but keep it big enough to contain the database.
Set synchronous_commit to off to have less disk I/O. If you are ready to ditch the database after an unclean shutdown or system crash, you can set fsync = off, but then you have to remove the cluster after each crash. If you take it that far, you could reduce the write load further by using unlogged tables.
Set checkpoint_timeout high for fewer writes.
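A minimal sketch of what that could look like; the table definition and the size values are illustrative assumptions, not measured recommendations:
-- Unlogged key/value table with a low fillfactor and no extra indexes,
-- so most updates can be HOT updates.
CREATE UNLOGGED TABLE kv_store (
    key   text PRIMARY KEY,
    value text
) WITH (fillfactor = 50);

-- Server settings: synchronous_commit and checkpoint_timeout take effect on
-- reload; shared_buffers needs a restart and should be sized to the data set.
ALTER SYSTEM SET synchronous_commit = off;
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET shared_buffers = '128MB';
-- Only if losing the cluster after a crash is acceptable:
-- ALTER SYSTEM SET fsync = off;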

Increase data insert speed of PostgreSQL

I am stuck on a problem: PostgreSQL data writes are very slow.
I developed my application in Java (using JDBC) to insert data into a PostgreSQL DB. It works well on our remote development server. However, after deploying it to the production server, I ran into a problem.
The insert speed of PostgreSQL on the production server is only ~150 records/s for 200000K records, while the same data set loads at ~1000 records/s on the development server.
Firstly, I tried to change the configuration in postgresql.conf as follows:
effective_cache_size = 4GB
max_wal_size = 2GB
work_mem = 128MB
shared_buffers = 512MB
After I changed the configuration and restarted, only the query speed improved; the insert speed did not change (~150 records/s).
I have checked the server's memory: there is plenty of free memory (~4 GB), and the inserter only uses 0.5% of the 8 GB (~40 MB).
So my questions are:
Is this a storage problem (SSD vs. HDD, virtual vs. physical disk, etc.)? Why is the insert speed still very slow even though I have changed the configuration? Is there any way to increase the insert speed?
Note: the problem is not related to the structure of the insert query. I have used the same query under the same conditions elsewhere (I set up the environment on both servers in the same way). I do not know why the DEVELOPMENT server (4 GB) performs better than the PRODUCTION server (8 GB).
The only one of your parameters that has an influence on INSERT performance is max_wal_size. High values prevent frequent checkpoints.
Use iostat -x 1 on the database server to see how busy your disks are. If they are quite busy, you are probably I/O bottlenecked. Maybe the I/O subsystem on your test server is better?
If you are running the INSERTs in many small transactions, you may be bottlenecked by fsync to the WAL. The symptom is a busy disk with not much I/O being performed.
In that case, batch the INSERTs into larger transactions. The difference you observe could then be due to different configuration: maybe you set synchronous_commit or (horribile dictu!) fsync to off on the test server.
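For example, committing a batch of rows in one transaction costs one WAL flush for the whole batch rather than one per row (the table name is a placeholder):
BEGIN;
INSERT INTO measurements (id, val) VALUES (1, 'a');
INSERT INTO measurements (id, val) VALUES (2, 'b');
-- ... many more INSERTs ...
COMMIT;   -- a single WAL flush (fsync) here instead of one per INSERT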

Postgres load balance with limited hardware resources

I've got a task to do and some limited hardware resources, as always.
I need to set up a Postgres server with a single database, containing a table of large objects (3 TB+) and a few small, heavily accessed tables (<10 GB).
I've got an old physical server with ~5 TB of hard disk space but limited CPU and RAM. I can also use a much faster (in CPU and RAM) virtual server, but its storage is limited.
I won't have many DELETE statements, and most SELECT statements will hit recent data. There will be a single connection doing all the work, from a client on one host only.
I see a few scenarios:
Postgres on virtual machine with remote storage (single instance)
Postgres on old hardware with local storage (single instance)
Postgres on both, with some kind of replication (high speed virtual machine for new data, low speed for older data on the old hardware)
Any other ideas?
Is it even possible to replicate just the most recent part of the postgres database?
90% of SELECT queries will hit the most recent ~5-10 GB of data, but I need seamless access to the remaining ~2.99 TB.
What should I do? (except buying appropriate hardware;)
It doesn't really matter as long as you have enough RAM to buffer the 10GB of heavily accessed data.
You'll need some additional RAM to read large objects without pushing the 10GB out of the cache, but that shouldn't be a problem on today's machines.
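If you want a rough check of whether the heavily accessed data actually stays in the buffer cache, the per-database hit ratio can be read from pg_stat_database, for example:
-- A ratio close to 1.0 means reads are mostly served from shared_buffers.
SELECT datname,
       blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio
  FROM pg_stat_database
 WHERE datname IS NOT NULL;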
If all your work is done on one connection, that sounds like there will be no high load on the database.
So I wouldn't really worry about scaling with requirements like that.
Your biggest worry should probably be how to backup 3TB of data in a reasonable time.
Edit: If you have much less memory, you should take the machine with the faster storage.
In the end I tested several different scenarios and decided not to keep the files/large objects in the database.
Postgres with its data directory mounted over NFS (v4) had some lag: it was faster, but it would choke for a few seconds periodically, so I decided to store plain files over NFS instead, which is significantly slower but more stable.
I'm sure there was a way to tune it, but this solution works fine too.
Postgres is used for the file index and keeps its data on a local hard disk.

PostgreSQL In Memory Database

I want to run my PostgreSQL database server from memory. The reason is that on my new server, I have 24 GB of memory, and hardly any of it is used.
I know I can run this command to make a ramdisk:
mdmfs -s 1024m md2 /mnt
And I could theoretically have PostgreSQL store its data there. But the problem with this is that if the server crashes or reboots, the data will be gone.
Basically, I want the database to be loaded in memory at all times so that it does not have to go to the hard disk drive to read every record, since I have TONS of memory and since memory is faster than hard disk drives.
Is there a way to do this while also having PostgreSQL write to disk so I don't lose any data in case the server goes down? Or is there a way to cache all data in memory?
I'm now using streaming replication, which is asynchronous. This means my MASTER could run entirely in memory, with a separate SLAVE instance using a traditional disk.
A machine restart would involve stopping the SLAVE, copying the PostgreSQL data back onto the ramdisk, and then restarting the MASTER followed by the SLAVE. This is an interesting possibility that compares well with something like Redis, but with the advantages of redundancy / hot standby / backups / SQL / a rich toolset, etc.
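For reference, whether the standby is attached and streaming asynchronously can be checked on the MASTER with a query like this:
-- sync_state should be 'async' for an asynchronous streaming standby.
SELECT client_addr, state, sync_state
  FROM pg_stat_replication;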
Have you seen the Server Configuration chapter of the manual? Check it out, then google "postgresql memory tuning".
I have to believe that Postgres is written in such a way as to take full advantage of available RAM in the server. As you may have guessed by now, there's no reliable way to do this outside of Postgres.
Within Postgres, transactions ensure that all operations are atomic, so if the power goes down while you are writing to a Postgres database, you will only lose that particular operation, and not the entire database.
The answer is caching. Look into adding memory to the server, then tuning PostgreSQL to maximize memory usage. The file system cache will also help, doing some of this automatically. You will get performance almost as if the data were in memory (except for the first hit), without having to manage it yourself, and you can have a database larger than physical memory.
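A rough sketch of that kind of tuning, assuming about 24 GB of RAM as in the question and using the contrib extension pg_prewarm to warm the cache after a restart; the sizes and the table name are placeholders:
-- Give PostgreSQL a generous buffer cache and tell the planner how much
-- OS cache it can expect (values are rough guesses for a 24 GB machine).
ALTER SYSTEM SET shared_buffers = '6GB';         -- takes effect after a restart
ALTER SYSTEM SET effective_cache_size = '18GB';  -- planner hint only

-- Optionally pre-load a hot table so even the "first hit" comes from memory.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('my_hot_table');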