PostgreSQL benchmarking over a RAMdisk?

I have been considering the idea of moving to a RAM disk for a while. I know its risks, but I just wanted to do a little benchmark. I have two questions: (a) when reading the query plan, will it still differentiate between disk and buffer hits? If so, should I assume that both are equally expensive, or should I assume that there is a difference between them?
(b) a RAM disk is not persistent, but if I want to export some results to persistent storage, are there some precautions I would need to take? Is it the same as usual, e.g. the COPY command?

I do not recommend using RAM disks for PostgreSQL storage. With careful tuning, you can get PostgreSQL to use no more disk I/O than is required to make your data persistent.
I recommend doing this:
Have more RAM in your machine than the size of the database.
Define shared_buffers big enough to contain the database (on Linux, configure huge pages to back it).
Increase checkpoint_timeout and max_wal_size to get fewer checkpoints.
Set synchronous_commit = off to keep PostgreSQL from syncing WAL to disk on every commit.
If you are happy to lose all your data in the case of a crash, define your tables UNLOGGED. The data will survive a normal shutdown.
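Putting those suggestions together, here is a minimal sketch of the relevant settings. The sizes are assumptions for a machine whose RAM exceeds the database size, and the table name is made up; adjust to your own setup.

# postgresql.conf -- sketch only, sizes are assumptions
shared_buffers = '8GB'          # big enough to hold the whole database
huge_pages = try                # on Linux, also reserve huge pages via vm.nr_hugepages
checkpoint_timeout = '30min'    # fewer time-triggered checkpoints
max_wal_size = '16GB'           # fewer WAL-volume-triggered checkpoints
synchronous_commit = off        # no WAL flush on every commit

-- only if losing the data after a crash is acceptable (it survives a clean shutdown)
CREATE UNLOGGED TABLE measurements (id bigint PRIMARY KEY, payload text);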
Anyway, to answer your questions:
(a) You should set seq_page_cost and random_page_cost way lower to tell PostgreSQL how fast your storage is.
(b) You could run backups with either pg_dump or pg_basebackup; they don't care what kind of storage you have.
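For example (paths, database name and backup targets here are placeholders, not taken from the question):

pg_dump --format=custom --file=/persistent/backups/mydb.dump mydb
pg_basebackup --pgdata=/persistent/backups/base --format=tar --gzip --progress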

when reading the query plan, will it still differentiate between disk and buffer hits?
It never distinguished between them in the first place. It distinguishes between "hit" and "read", but a "read" cannot tell you which pages truly came from disk and which came from the OS/filesystem cache.
PostgreSQL has no idea you are running on a RAM disk, so it will continue to report these counters as it always has.
If so, should I assume that both are equally expensive or should I assume that there is a difference between them?
This is a question your benchmark should answer. On some systems, data can be prefetched from main memory into the faster CPU caches, so sequential reads may still be faster than random reads. If you care, you will have to benchmark it on your own system.
Reading data from RAM into shared_buffers is still surprisingly expensive due to things like lock management. So as a rough starting point, maybe seq_page_cost=0.1 and random_page_cost=0.15.
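For instance, taking those numbers as a starting point (they are a guess to refine with your benchmark, not measured values):

ALTER SYSTEM SET seq_page_cost = 0.1;
ALTER SYSTEM SET random_page_cost = 0.15;
SELECT pg_reload_conf();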
a RAM disk is not persistent, but if I want to export some results to persistent storage, are there some precautions I would need to take?
The risk would be that your system crashes before the export has finished. But what precaution can you take against that?
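The export itself works the same as on any other storage. For example, assuming a hypothetical table results and a persistent file system mounted at /mnt/persistent:

COPY results TO '/mnt/persistent/results.csv' WITH (FORMAT csv, HEADER);
-- server-side COPY needs appropriate privileges; psql's \copy is the client-side equivalent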

Related

Postgres configuration for use in embedded?

I have the same 5000 key/value pairs being read/written continuously (every 150ms or so) on a Debian system equivalent to a Raspberry Pi 3.
I don't care about persisting this data; it's recreated whenever my application server is launched.
Initially I used SQLite for this, with an in-memory table. However, now I want to access the data from multiple processes (using a tmpfs didn't work out great) and even from a remote client, add an HTTP API, and use LISTEN/NOTIFY for change notifications, so I'd like to switch to PG, which is a better fit for all of this.
Given these circumstances:
small dataset that fits in RAM
no need for persistence
low power PC
running 24/7 forever
don't want to thrash the flash storage
...what would be a good approach to configuring PG?
I found this 10-year-old question, whose last update, from 5 years ago, suggests using a third-party extension, which I'm not too excited about.
You should create few indexes apart from the primary keys and keep the fillfactor of all your tables low, perhaps around 50. That should get you HOT updates, which will reduce the need for VACUUM and the amount of data written.
You may want to reduce shared_buffers to conserve memory, but keep it big enough to contain the database.
Set synchronous_commit to off to have less disk I/O. If you are ready to ditch the database after an unclean shutdown or system crash, you can set fsync = off, but then you have to remove the cluster after each crash. If you take it that far, you could reduce the write load further by using unlogged tables.
Set checkpoint_timeout high for fewer writes.
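A sketch of what that could look like; the table, its columns and the sizes are made-up examples for the small key/value workload described above:

-- low fillfactor leaves room on each page for HOT updates,
-- UNLOGGED skips WAL for data you do not need to keep
CREATE UNLOGGED TABLE kv (
    key   text PRIMARY KEY,
    value text
) WITH (fillfactor = 50);

# postgresql.conf
shared_buffers = '256MB'        # small, but big enough for the dataset
synchronous_commit = off
checkpoint_timeout = '1h'
max_wal_size = '2GB'
# fsync = off                   # only if you will recreate the cluster after any crash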

Postgres load balance with limited hardware resources

I've got a task to do and some limited hardware resources, as always.
I need to set up a Postgres server with a single database, with a table of large objects (3 TB+) and a few small, heavily accessed tables (<10 GB).
I've got an old physical server with ~5 TB of hard disk space but limited CPU and RAM; I can also use a much faster (in CPU and RAM) virtual server, which is however limited in storage.
I won't have many DELETE statements, and most SELECT statements will be for recent data. There will be a single connection doing all the work, from a client on one host only.
I see a few scenarios:
Postgres on virtual machine with remote storage (single instance)
Postgres on old hardware with local storage (single instance)
Postgres on both, with some kind of replication (high speed virtual machine for new data, low speed for older data on the old hardware)
Any other ideas?
Is it even possible to replicate just the most recent part of the postgres database?
90% of SELECT queries will be for the most recent ~5-10 GB of data, but I need seamless access to the remaining ~2,990 GB.
What should I do? (except buying appropriate hardware;)
It doesn't really matter as long as you have enough RAM to buffer the 10GB of heavily accessed data.
You'll need some additional RAM to read large objects without pushing the 10GB out of the cache, but that shouldn't be a problem on today's machines.
If all your work is done on one connection, that sounds like there will be no high load on the database.
So I wouldn't really worry about scaling with requirements like that.
Your biggest worry should probably be how to back up 3 TB of data in a reasonable time.
Edit: If you have much less memory, you should take the machine with the faster storage.
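As a rough sketch of the memory-related settings for the hot ~10 GB (the numbers are assumptions; pick them to match the RAM the chosen machine actually has):

shared_buffers = '12GB'          # comfortably holds the hot tables
effective_cache_size = '24GB'    # roughly what the OS cache can hold on top of that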
In the end I checked several different scenarios and decided not to keep the files/large objects in the database.
Postgres with the database location mounted over NFS (v4) had some lag: it was faster, but it choked for a few seconds periodically, so I decided to store plain files over NFS, which is significantly slower but more stable.
I'm sure there was a way to tune it, but this solution is fine too.
Postgres is used for the file index and keeps its own files on the local hard disk.

PostgreSQL Table in memory

I created a database containing a total of 3 tables for a specific purpose. The total size of all tables is about 850 MB - very lean... out of which one single table contains about 800 MB of data (including indexes) and 5 million records (with about 6,000 records added daily).
The system is PostgreSQL on a Windows 7 laptop with 8 GB RAM and an SSD.
I allocated 2048MB as shared_buffers, 256MB as temp_buffers and 128MB as work_mem.
I execute a single query multiple times against the single table - hoping that the table stays in RAM (hence the above parameters).
But although I see a spike in memory usage during execution (of about 200 MB), I do not see memory consumption stay at 500 MB or more (which would indicate the data staying in memory). All running postgres.exe processes show 2-6 MB in Task Manager. Hence, I suspect the LRU does not keep the data in memory.
The average query execution time is about 2 seconds (a very simple single-table query), but I need to get it down to about 10-20 ms or even less if possible, simply because the same query is going to be executed far too many times, and that can only be achieved by keeping the data in memory.
Any advice?
Regards,
Kapil
You should not expect postgres processes to show large memory use, even if the whole database is cached in RAM.
That is because PostgreSQL relies on buffered reads from the operating system buffer cache. In simplified terms, when PostgreSQL does a read(), the OS looks to see whether the requested blocks are cached in the "free" RAM that it uses for disk cache. If the block is in cache, the OS returns it almost instantly. If the block is not in cache the OS reads it from disk, adds it to the disk cache, and returns the block. Subsequent reads will fetch it from the cache unless it's displaced from the cache by other blocks.
That means that if you have enough free memory to fit the whole database in "free" operating system memory, you won't tend to hit the disk for reads.
Depending on the OS, behaviour for disk writes may differ. Linux will write-back cache "dirty" buffers, and will still return blocks from cache even if they've been written to. It'll write these back to the disk lazily, unless forced to write them immediately by an fsync() such as Pg issues at COMMIT time. When it does that, it marks the cached blocks clean but does not drop them from the cache. I don't know how Windows behaves here.
The point is that PostgreSQL can be running entirely out of RAM with a 1GB database, even though no PostgreSQL process seems to be using much RAM. Having shared_buffers too high just leads to double-caching and can reduce the amount of RAM available for the OS to cache blocks.
It isn't easy to see exactly what's cached in RAM because Pg relies on the OS cache. That's why I referred you to pg_fincore.
If you're on Windows and this won't work, you really just have to rely on observing disk activity. Does performance monitor show lots of uncached disk reads? Does operating system memory monitoring show lots of memory used for disk cache in the OS?
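On any platform you can at least see what is occupying shared_buffers (not the OS cache) with the contrib extension pg_buffercache; a sketch:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- which relations occupy the most shared_buffers pages
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;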
Make sure that effective_cache_size correctly reflects the RAM used for disk cache. It will help PostgreSQL choose appropriate query plans.
You are making the assumption, without apparent evidence, that the query performance you are experiencing is explained by disk read delays, and that it can be improved by in-memory caching. This may not be the case at all. You need to look at explain analyze output and system performance metrics to see what's going on.
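A good first step is to run the slow query with buffer statistics; the table name below is a placeholder for your real query:

EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM your_table;
-- look for lines like "Buffers: shared hit=... read=..." in the output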

Is this disk read speed to be expected (Amazon EBS)?

Our Amazon EBS-backed instance has slowed down considerably (maybe it shifted to a different physical host?).
I've checked the instance using top, and the CPU use is very low when the process is active (around 1%). Using iotop I have monitored the disk read speed of postgresql. When there is only one postgresql process running, it reports a read speed of about 5 MB/s. Is this rather slow, or is this within the range of usual disk read speeds?
Thanks
5 MB/s is more or less typical for a single hard drive - for sequential access, of course. If you have only one hard disk, then your low CPU usage is expected, since a single hard disk is probably not enough to stress the CPU. If you are not getting any more speed than that, even with constant queries, then your hard disk is the bottleneck.

PostgreSQL In Memory Database

I want to run my PostgreSQL database server from memory. The reason is that on my new server, I have 24 GB of memory, and hardly any of it is used.
I know I can run this command to make a ramdisk:
mdmfs -s 1024m md2 /mnt
And I could theoretically have PostgreSQL store its data there. But the problem with this is that if the server crashes or reboots, the data will be gone.
Basically, I want the database to be loaded in memory at all times so that it does not have to go to the hard disk drive to read every record, since I have TONS of memory and since memory is faster than hard disk drives.
Is there a way to do this while also having PostgreSQL write to disk so I don't lose any data in case the server goes down? Or is there a way to cache all data in memory?
I'm now using streaming replication, which is asynchronous. This means my MASTER could be running entirely in memory, with a separate SLAVE instance using traditional disk.
A machine restart would involve stopping the SLAVE, copying the PostgreSQL data back onto the ramdisk, and then restarting the MASTER followed by the SLAVE. This is an interesting possibility which compares well with something like Redis, but with the advantage of redundancy / hot standby / backups / SQL / a rich toolset, etc.
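A minimal sketch of that setup on a recent PostgreSQL (host, user and paths are placeholders; before version 12 the standby settings went into recovery.conf instead, and the replication role also needs a pg_hba.conf entry):

# on the primary (postgresql.conf)
wal_level = replica
max_wal_senders = 5

# on the standby host: clone the primary and write the standby configuration (-R)
pg_basebackup -h primary.example.com -U replicator -D /var/lib/postgresql/standby -R -X stream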
Have you seen the Server Configuration chapter of the manual? Check it out, then google "postgresql memory tuning".
I have to believe that Postgres is written in such a way as to take full advantage of available RAM in the server. As you may have guessed by now, there's no reliable way to do this outside of Postgres.
Within Postgres, transactions assure that all operations are atomic, so if the power goes down while you are writing to a Postgres database, you will only lose that particular operation, and not the entire database.
The answer is caching. Look into adding memory to the server, then tune PostgreSQL to maximize memory usage. The file system cache will also help, doing some of this automatically. You will get performance almost as if the data were in memory, except for the first hit, without having to manage it yourself, and you can have a database larger than physical memory.
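If that first cold hit matters, the contrib extension pg_prewarm can load a table into the cache right after a restart; a sketch, assuming a hypothetical table named big_table:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('big_table');   -- loads the table into shared_buffers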