Postgres connection should be always on? or connect before running each query? - postgresql

I am debating if I should keep my postgres connection always on, and check/re-connect before running query. Or I should connect it before run each query and close the connection as soon as it is done. Thanks!

As long as the Postgres server isn't totally jammed with connections (i.e. this is not an app that will be creating a gigantic number of perpetual connections), I don't think it's a problem to maintain the connection. I would also recommend checking the connection and handling reconnects prior to each query however. Many libraries offer ways to do this. For example, with MyBatis (Java), you can have it issue a test query each time, which can be specified. I use the lightweight SELECT 1 for this.
I would say the key thing to consider is to keep the connection idle in transaction for as little time as possible, as when that happens, it can have a variety of different impacts on performance (such as slowing down other queries, preventing high-turnover tables from being vacuumed in a timely manner, etc.). This is not to say that any time spent in idle in transaction is automatically bad, but it should be considered and minimized where possible. (e.g. if you have some calculations that are going several minutes to run, make sure to either commit or rollback prior to doing those (which one would depend on context).
If you're doing a bunch of SELECTs, and don't have anything you need to commit, I would recommend doing a rollback to help keep the idle in transaction states to a minimum.

I just realized the postgres connection string has a bunch of setting for the connection pooling, for example:
User ID=root;Password=myPassword;Host=localhost;Port=5432;Database=myDataBase;
Pooling=true;Min Pool Size=0;Max Pool Size=100;Connection Lifetime=0;
So in my code, I can just close the connection after the command finished execution. But behind the scene, the connection is actually still alive and stored in the connection pool to be used again.

Related

Multithreading with PostgreSQL JDBC

I'm still a student and not so experienced with multithreading and databases so I might have missed some obvious stuff - hoping for an answer at a more beginner level.
I'm busy creating a dummy Java application that allows users to submit subway station locations and then lookup the nearest station to their location. This is all happening over HTTP.
The backend for this application is PostgreSQL (with PostGis) and I connect to the database via the PostgreSQL JDBC.
I want my application to be as multithreaded as possible. Every time I receive a new HTTP connection, I spin up a new thread and service the users request. But I'm not sure how much point there is to this if reads and writes to the database themselves cannot be parallel.
According to this, PostgreSQL JDBC is not thread safe. But what does that mean exactly? Does that just mean that reads and writes within a single connection are not thread safe (i.e. in each instance of DriverManager.getConnection())? But what about if I made a new connection every time an HTTP request came in? Would that be safe to do in parallel? And would that affect performance badly?
Any other suggestions on broad approach to take?
JDBC in general is not thread-safe, not just the Postgres driver.
This means you can not run multiple statements created from the same Connection instance in multiple threads at the same time.
If you want to run two statements in parallel, you need to create two different physical connections to the database.
As creating connections isn't cost-free, the usual approach is to have a connection pool (e.g. through a ConnectionPoolDataSource) that keeps a set of connections open. The application then takes connections from the pool and puts them back when the query (or transaction) is done.

knexfile settings when using PgBouncer

We have a setup where multiple Node processes write into the same database (different tables), and as a result, when using Knex, we end up with more connections to the database than desirable. So, I was thinking of using PgBouncer as a middleware for the Knex processes to connect to, but I'm unsure of how Knex's attempts at connection pooling will work with PgBouncer, which will setup its own pool of connections.
Please assume the following:
A 2vCPU database server
10+ Node processes interacting with the database
PgBouncer running with a pool size of 5
Questions:
If I set min/max size as 1/5 in each Knex setup, will I run out of connections or will PgBouncer somehow be able to "fool" each Knex setup into believing that it has its own pool?
It doesn't feel like I can use a Knex pool in this scenario. Even using min/max pool sizes as 1/1 will leave me out of options if the first five Knex steups I launch claim a connection each.
Is there a way to make Knex drop pooling and open/close connections as needed? This is the ideal setup for me because now PgBouncer won't actually be opening/closing connections but returning them to the pool (unless I'm mistaken about this?).
What strategy should I use? What should my knexfile look like? And would I need to code differently for this? Any help or ideas are welcome!
While it would be ridiculous to allow 32000 connections, it is also ridiculous to allow only 5. I think the lesson from your link should be not that there is a precisely defined magic number of connections, but that you need to look at the waitevents of your performing database, or just do experiments, to see what is going on and whether you have too many connections.
While repeatedly connecting to pgbouncer (which reuses its internal connection to PostgreSQL) might be less expensive than repeatedly connecting all the way through to PostgreSQL, it will still be far more expensive than just re-using an existing connection from knex's internal connection pool. If your connection load is high enough to matter, then bypassing the internal connection pool to just use pgbouncer would be a mistake. Most likely using pgbouncer at all is a mistake, as it just introduces yet another moving piece for no good reason.
Using knex pooler with min:1 and max:5 with 10 different knex app servers and a limit of 5 connections in pgbouncer would mean that only 5 of your app servers could have a connection. The rest would be forced to wait, but it isn't clear what they would be waiting for. Presumably they would wait forever, or until they caught a timeout error, or until one of other app servers exited or shutdown its pool. Pgbouncer would fool them all right, but not in a helpful way. It might make more sense to use this a min:0 (which is now the recommended setting, but still not the default), as that way an app server would at least release its final connection after idleTimeoutMillis, allowing another app to use it.
Using min:1 max:1 could be useful if pgbouncer were not used or used with a large enough pool size, but it could also break entirely. For example, if an app needs at least 2 simultaneous connections to work correctly. That would probably be a poorly written app, but poorly written apps are the rule, not the exception.

Will PgBouncer reuse postgresql session sequence cache?

I want to use postgres sequence with cache CREATE SEQUENCE serial CACHE 100.
The goal is to improve performance of 3000 usages per second of SELECT nextval('serial'); by ~500 connection/application threads concurrently.
The issue is that I am doing intensive autoscaling and connections will be disconnected and reconnected occasionally leaving "holes" of unused ids in the sequence each time a connection is disconnected.
Well, the good news might be that I am using a PgBouncer heroku buildpack with transaction pool mode.
My question is: will the transaction pool mode solve the "holes" issues that I described, will it reuse the session in a way that the next application connection will take this session from the pool and continue using the cache of the sequence?
This depends on the setting of server_reset_query. If you have that set to DISCARD ALL, then sequence caches are discarded before a server connected is handed out to a client. But for transaction pooling, the recommended server_reset_query is empty, so you will be able to reuse sequence caches in that case. You can also use a different DISCARD command, depending on your needs.

Does it make sense to use pg_pconnect (php-fpm)

I have about 11000 hits a second on 10 servers with php-fpm. I'm migrating to postgres from mysql, so my question is Does it make sense to use pg_*p*connect?
It's better to use a dedicated connection pooler like PgBouncer.
Performance should be comparable to pg_pconnect, but PgBouncer will allow to perform a cleanup after an error in PHP code. pg_pconnect will not automatically clean open transactions, locks, prepared statements etc.
Establishing a connection to a PostgreSQL server is expected to be significantly more expensive than to a MySQL server. This is due to different design choices of these databases in how they handle resource allocation and privilege separation between independent connections.
Therefore, for a website, it totally makes sense to reuse connections to PostgreSQL whenever possible.
The way generally recommended is not to use pg_pconnect but rather an external connection pooler like pgBouncer or pgPoolII which are better suited for this task. When using PHP-FPM however, you already have a middleware that lets you control somehow the number of open connections through the fpm process manager options, so it may be good enough. You may consider setting pm.max_requests to a non-zero value to make sure that connections get cleaned up at a reasonable frequency and avoid keeping a pile of unused connections during off-peak hours.
Well, pg_pconnect will mean you have one connection per PHP backend, so it depends how many backends you have. With a traditional Apache mod-php setup it'd be a non-starter but you might get away with it.
The database server can handle hundreds of idle connections, but almost certainly grind to a halt if they all have queries being issued concurrently. I've seen a rule-of-thumb of no more than two connections per core - that's assuming I/O doesn't limit you first.
The common approach is to run a connection pooler like pgbouncer and have php connect per-request. That reduces your connection overhead while keeping concurrency plausible.

PostgreSQL consuming large amount of memory for persistent connection

I have a C++ application which is making use of PostgreSQL 8.3 on Windows. We use the libpq interface.
We have a multi-threaded app where each thread opens a connection and keeps using without PQFinish it.
We notice that for each query (especially the SELECT statements) postgres.exe memory consumption would go up. It goes up as high as 1.3 GB. Eventually, postgres.exe crashes and forces our program to create a new connection.
Has anyone experienced this problem before?
EDIT: shared_buffer is currently set to be 128MB in our conf. file.
EDIT2: a workaround that we have in place right now is to call PQfinish for every transaction. But then, this slows down our processing a bit since establishing a connection every time is quite slow.
In PostgreSQL, each connection has a dedicated backend. This backend not only holds connection and session state, but is also an execution engine. Backends aren't particularly cheap to leave lying around, and they cost both memory and synchronization overhead even when idle.
There's an optimum number of actively working backends for any given Pg server on any given workload, where adding more working backends slows things down rather than speeding it up. You want to find that point, and limit the number of backends to around that level. Unfortunately there's no magic recipe for this, it mostly involves benchmarking - on your hardware and with your workload.
If you need more connections than that, you should use a proxy or pooling system that allows you to separate "connection state" from "execution engine". Two popular choices are PgBouncer and PgPool-II . You can maintain light-weight connections from your app to the proxy/pooler, and let it schedule the workload to keep the database server working at its optimum load. If too many queries come in, some wait before being executed instead of competing for resources and slowing down all queries on the server.
See the postgresql wiki.
Note that if your workload is read-mostly, and especially if it has items that don't change often for which you can determine a reliable cache invalidation scheme, you can also potentially use memcached or Redis to reduce your database workload. This requires application changes. PostgreSQL's LISTEN and NOTIFY will help you do sane cache invalidation.
Many database engines have some separation of execution engine and connection state built in to the core database engine's design. Sybase ASE certainly does, and I think Oracle does too, but I'm not too sure about the latter. Unfortunately, because of PostgreSQL's one-process-per-connection model it's not easy for it to pass work around between backends, making it harder for PostgreSQL to do this natively, so most people use a proxy or pool.
I strongly recommend that you read PostgreSQL High Performance. I don't have any relationship/affiliation with Greg Smith or the publisher*, I just think it's great and will be very useful if you're concerned about your DB's performance.
* ... well, I didn't when I wrote this. I work for the same company now.
The memory usage is not necessarily a problem. PostgreSQL uses shared memory for some caching, and this memory does not count towards the size of the process memory usage until it's actually used. The more you use the process, the larger parts of the shared buffers will be active in it's address space.
If you have a large value for shared_buffers, this will happen. If you have it too large, the process can run out of address space and crash, yes.
The problem is probably that you don't close the transaction,
In PostgreSQL even if you do only selects without DML it runs in transaction which need to be rollback.
By adding rollback at the end of the transaction will reduce your memory problem