HikariCP connection pool - 'active' - how to debug? - postgresql

I am building an app using Spring-Boot/Hibernate with Postgres as the database. I am using Spring Boot 2.0, so Hikari is the default connection pool provider.
Currently, I am trying to load-test the application with a REST end-point that does an 'update if exists, insert if new' on an entity in the database. It's a fairly small entity with a 'BIGSERIAL' primary key and no constraints on any other field.
The default connection pool size is 10 and I haven't really tweaked any other parameters, for either HikariCP or Postgres.
Where I am stuck at the moment is debugging the connections in the 'active' state: what they are doing, or why they are stuck.
When I run '10 simultaneous users', it basically translates into 2 or 3 times that many queries. When I turn on the HikariCP debug logs, the pool hangs at something like this:
(total=10, active=10, idle=0, waiting=2). The 'active' connections are never released, and that is what I am trying to understand, because the queries are fairly simple and the table itself has just 4 fields (including the primary key).
The general best practice from the HikariCP folks is also that increasing the connection pool size is not the right first step towards scaling.
If I do increase the connection pool size to 20, things start working for 10 simultaneous/concurrent users, but then again, I believe it's not a fix for the root cause of the problem.
Is there any way I can log either Hibernate or Postgres messages that might help in knowing what these 'active' connections are waiting on and why the connection doesn't get released even after I increase the wait-time to a long time?
If it is a connection leak (as is reported when the leak-detection-threshold is reduced to a lower value, e.g. 30 seconds), then how can I tell if Hibernate is responsible for this connection leak or if it is something else?
If it is a lock/wait at the database level, how can I get the root of this?
UPDATE
After help from #brettw, I took a thread dump when the connections were exhausted, and it pointed in the direction of a connection leak. A thread on the HikariCP issues board - https://github.com/brettwooldridge/HikariCP/issues/1030#issuecomment-347632771 - points to Hibernate not closing connections, which in turn led me to https://jira.spring.io/browse/SPR-14548. That issue talks about setting Hibernate's connection handling mode, since the default mode holds the connection for too long. After setting spring.jpa.properties.hibernate.connection.handling_mode=DELAYED_ACQUISITION_AND_RELEASE_AFTER_TRANSACTION, the connection pool worked perfectly.
Also, the point made here - https://github.com/brettwooldridge/HikariCP/issues/612#issuecomment-209839908 is right - a connection leak should not be covered up by the pool.

It sounds like you could be hitting a true deadlock in the database. There should be a way to query PostgreSQL for current active queries, and current lock states. You'll have to google it.
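One possible way to do that lookup, sketched in Python. The query reads the pg_stat_activity view and calls pg_blocking_pids(), both available in PostgreSQL 9.6 and later. The helper names fetch_active_sessions and blocked_sessions are hypothetical, and the cursor is assumed to be any DB-API cursor (for example one obtained from psycopg2).

```python
# Sketch: list non-idle sessions and the backends that block them.
# Assumes PostgreSQL 9.6+ (wait_event columns and pg_blocking_pids()).
ACTIVE_SESSIONS_SQL = """
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       pg_blocking_pids(pid) AS blocked_by,
       now() - xact_start    AS xact_age,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND pid <> pg_backend_pid()
ORDER BY xact_start
"""


def fetch_active_sessions(cursor):
    """Run the query on a DB-API cursor and return one dict per session."""
    cursor.execute(ACTIVE_SESSIONS_SQL)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def blocked_sessions(sessions):
    """Keep only the sessions that are waiting on another backend."""
    return [s for s in sessions if s["blocked_by"]]
```

A session with a non-empty blocked_by list is waiting on a lock held by those PIDs. A connection that HikariCP counts as active but that shows up here as 'idle in transaction', or does not show up at all because it is 'idle', is being held by the application without running a query.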
Also, I would try a simple thread dump to see where all the threads are blocked. It could be a code-level synchronization deadlock.
If all of the threads are blocked on getConnection(), it is a leak.
If all of the threads are down in the driver, according to the stacktrace for each thread, it is a database deadlock.
If all of the threads are blocked waiting for a lock in your application code, then you have a synchronization deadlock -- likely two locks with inverted acquisition order in different parts of the code.
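The three cases above can be sketched as a small classifier over a thread dump. This is only an illustration in Python: it assumes the dump has already been split into a dict of thread name to stack-trace lines, it matches on package names (com.zaxxer.hikari, org.postgresql) and on the "waiting to lock" marker that jstack prints, and it reports the most common category rather than insisting that every thread agrees. The names classify_thread and diagnose are hypothetical.

```python
from collections import Counter

# What each label suggests, following the three cases described above.
MEANING = {
    "pool-wait": "threads blocked in getConnection(): likely a connection leak",
    "in-driver": "threads down in the JDBC driver: likely a database lock or deadlock",
    "app-lock": "threads waiting on a Java monitor: likely a synchronization deadlock",
    "other": "inconclusive",
}


def classify_thread(frames):
    """Label one thread from the text lines of its stack trace."""
    if any("com.zaxxer.hikari" in f and "getConnection" in f for f in frames):
        return "pool-wait"
    if any("org.postgresql." in f for f in frames):
        return "in-driver"
    if any("waiting to lock" in f for f in frames):
        return "app-lock"
    return "other"


def diagnose(threads):
    """Return (dominant label, counts per label) for a parsed thread dump."""
    counts = Counter(classify_thread(frames) for frames in threads.values())
    if not counts:
        return "other", {}
    dominant = counts.most_common(1)[0][0]
    return dominant, dict(counts)
```

Calling diagnose on the parsed dump and looking up MEANING[label] gives a first hint about where to dig; the counts show how uniform the picture is.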
The HikariCP leakDetectionThreshold could be useful, but it will only show where the connection was acquired, not where the thread is currently stuck. Still, it could provide a clue.

Related

knexfile settings when using PgBouncer

We have a setup where multiple Node processes write into the same database (different tables), and as a result, when using Knex, we end up with more connections to the database than desirable. So, I was thinking of using PgBouncer as a middleware for the Knex processes to connect to, but I'm unsure of how Knex's attempts at connection pooling will work with PgBouncer, which will set up its own pool of connections.
Please assume the following:
A 2vCPU database server
10+ Node processes interacting with the database
PgBouncer running with a pool size of 5
Questions:
If I set min/max size as 1/5 in each Knex setup, will I run out of connections or will PgBouncer somehow be able to "fool" each Knex setup into believing that it has its own pool?
It doesn't feel like I can use a Knex pool in this scenario. Even using min/max pool sizes as 1/1 will leave me out of options if the first five Knex setups I launch claim a connection each.
Is there a way to make Knex drop pooling and open/close connections as needed? This is the ideal setup for me because now PgBouncer won't actually be opening/closing connections but returning them to the pool (unless I'm mistaken about this?).
What strategy should I use? What should my knexfile look like? And would I need to code differently for this? Any help or ideas are welcome!
While it would be ridiculous to allow 32000 connections, it is also ridiculous to allow only 5. I think the lesson from your link should be not that there is a precisely defined magic number of connections, but that you need to look at the wait events of your database while it is under load, or just do experiments, to see what is going on and whether you have too many connections.
While repeatedly connecting to pgbouncer (which reuses its internal connection to PostgreSQL) might be less expensive than repeatedly connecting all the way through to PostgreSQL, it will still be far more expensive than just re-using an existing connection from knex's internal connection pool. If your connection load is high enough to matter, then bypassing the internal connection pool to just use pgbouncer would be a mistake. Most likely using pgbouncer at all is a mistake, as it just introduces yet another moving piece for no good reason.
Using the knex pooler with min:1 and max:5 with 10 different knex app servers and a limit of 5 connections in pgbouncer would mean that only 5 of your app servers could have a connection. The rest would be forced to wait, but it isn't clear what they would be waiting for. Presumably they would wait forever, or until they caught a timeout error, or until one of the other app servers exited or shut down its pool. Pgbouncer would fool them all right, but not in a helpful way. It might make more sense to use this with min:0 (which is now the recommended setting, but still not the default), as that way an app server would at least release its final connection after idleTimeoutMillis, allowing another app to use it.
Using min:1 max:1 could be useful if pgbouncer were not used or were used with a large enough pool size, but it could also break entirely, for example if an app needs at least 2 simultaneous connections to work correctly. That would probably be a poorly written app, but poorly written apps are the rule, not the exception.
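The arithmetic behind the min:1/max:5 scenario can be written out as a quick check. This is a rough sketch in Python that assumes every connection a Knex pool holds pins one PgBouncer server connection for as long as it is held (which is how session pooling behaves); the function name pool_pressure and its result keys are hypothetical.

```python
def pool_pressure(app_servers, knex_min, knex_max, bouncer_pool_size):
    """Compare what the Knex pools want with what PgBouncer can hand out."""
    always_held = app_servers * knex_min   # connections kept open even when idle
    peak_demand = app_servers * knex_max   # connections wanted under full load
    return {
        "always_held": always_held,
        "peak_demand": peak_demand,
        "unserved_when_idle": max(0, always_held - bouncer_pool_size),
        "unserved_at_peak": max(0, peak_demand - bouncer_pool_size),
    }


# 10 app servers, Knex min 1 / max 5, PgBouncer pool of 5:
print(pool_pressure(10, 1, 5, 5))
# Same setup with min 0, so idle app servers hold nothing:
print(pool_pressure(10, 0, 5, 5))
```

With min:1, ten connections are wanted even when every app server is idle, so five of them go unserved; with min:0 the idle demand drops to zero and only load decides who waits.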

Connection Pool Capabilities in DigitalOcean PostgreSQL Managed Databases

I have connection pools setup for my system to handle concurrent connections for my managed database clusters in DigitalOcean.
Overall, each client I have has their own DB, and I create a pool for that connection to avoid the error:
FATAL: remaining connection slots are reserved for non-replication superuser connections
Yesterday I ran into connection issues with a default database that my system also uses; for whatever dumb reason, I hadn't thought connection pooling was needed there. No worries: I started getting flooded with error emails and then fixed the system to use the correct pooling mechanism.
This is where my question comes in. With pooling on DigitalOcean, they give you a specific "size" depending on your subscription; my subscription has an available "size" of 97 for the clusters. As my clients grow I will be creating new pools and databases for them, so eventually I will run out of slots to assign to a pool. What does this "size" dictate?
For example 1 client I have has an allotted size of 10 to their connection pool. Speaking to support:
The connection pool with a size of 1 will only allow 1 connection at a time. As for how you can estimate the number of simultaneous users, this is something you'll need to look over as your user and application grow. We don't have a way to give you that estimate from our back end.
So with that client that has a size of 10 allotted to their pool, they have 88 staff users that use the system simultaneously throughout the day, then they have about 4,000 users that they manage that can all sign in theoretically at once.
This is a lot more than 10 connections, and I get no errors on connection size at least that I've seen so far.
Given that I have a limited amount, how do I determine the appropriate size to use, does anybody have experience with this in production?
For example, with the connections listed above, is 10 too much, too little, just right?
Update 2/14/23
I have tested the capabilities a bit because I was curious and can't get any semi-logical answer. When I use a connection pool size of 1 for my 4,000 user client (although all users would not hit their DB/pool at the same time), I get connection errors (specifically when running background tasks from django-celery and Celery in the middle of the night).
Here are those errors, overall just "connection already closed", raised from here:
File "/usr/local/lib/python3.11/site-packages/django/db/backends/postgresql/base.py", line 269, in create_cursor
    cursor = self.connection.cursor()
This issue happened concurrently on 2 nights, but never during the day during normal user activity.
Once I upped the connection pool for said 4,000 user client to 2 instead of 1 the connection already closed error never occurred again.

Will PgBouncer reuse postgresql session sequence cache?

I want to use a Postgres sequence with a cache: CREATE SEQUENCE serial CACHE 100.
The goal is to improve performance of 3000 usages per second of SELECT nextval('serial'); by ~500 connection/application threads concurrently.
The issue is that I am doing intensive autoscaling and connections will be disconnected and reconnected occasionally leaving "holes" of unused ids in the sequence each time a connection is disconnected.
Well, the good news might be that I am using a PgBouncer heroku buildpack with transaction pool mode.
My question is: will the transaction pool mode solve the "holes" issues that I described, will it reuse the session in a way that the next application connection will take this session from the pool and continue using the cache of the sequence?
This depends on the setting of server_reset_query. If you have that set to DISCARD ALL, then sequence caches are discarded before a server connection is handed out to a client. But for transaction pooling, the recommended server_reset_query is empty, so you will be able to reuse sequence caches in that case. You can also use a different DISCARD command, depending on your needs.
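To make the effect concrete, here is a toy model in Python of how a per-session sequence cache behaves. It is not PostgreSQL code: the classes Sequence and ServerSession are hypothetical, and discard() stands in for a reset query such as DISCARD ALL that throws away the session's cached values.

```python
class Sequence:
    """Toy sequence that hands out blocks of `cache` values at a time."""

    def __init__(self, cache):
        self.cache = cache
        self.last_allocated = 0

    def allocate_block(self):
        start = self.last_allocated + 1
        self.last_allocated += self.cache
        return start, self.last_allocated


class ServerSession:
    """Toy server connection that keeps its own block of cached values."""

    def __init__(self, seq):
        self.seq = seq
        self.next_value = None
        self.block_end = 0

    def nextval(self):
        # Fetch a fresh block when nothing is cached or the block is used up.
        if self.next_value is None or self.next_value > self.block_end:
            self.next_value, self.block_end = self.seq.allocate_block()
        value = self.next_value
        self.next_value += 1
        return value

    def discard(self):
        # Models a reset query: the rest of the cached block is lost.
        self.next_value = None


seq = Sequence(cache=100)
session = ServerSession(seq)
print(session.nextval(), session.nextval())  # served from the same cached block
session.discard()
print(session.nextval())                     # starts a new block, leaving a hole
```

In this model, a reused session without a reset keeps handing out consecutive values, while every discard (or every closed session) abandons whatever was left in its block.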

How do I manage connection pooling to PostgreSQL from sidekiq?

The problem: I have a Rails application that runs a few hundred sidekiq background processes. They all connect to a PostgreSQL database which is not exactly happy about providing 250 connections - it can, but if all sidekiq processes happen to send queries to the db at once, it crumbles.
Option 1: I have been thinking about adding pgBouncer in front of the db, however I cannot currently use its transaction mode, since I'm highly dependent upon setting the search_path at the beginning of each job to determine which "country" (PostgreSQL schema) to work on (apartment-gem). In this case, I would have to use the session-based connection pooling mode. This however would, as far as I know, require me to disconnect after each job to release the connections back into the pool, and that would be really costly performance-wise, wouldn't it? Am I missing something?
Option 2: application-layer connection pooling is of course also an option, however I'm not really sure how I would be able to do that for PostgreSQL with sidekiq?
Option 3: something I have not thought of?
Option 1: You're correct, sessions would require you to drop and reconnect, and that adds overhead. How costly it is depends on the access pattern, i.e. what fraction of the total work the connection/TCP handshake etc. makes up, and what sort of latency you need. Definitely worth benchmarking, but if the connections are short-lived then the overhead will be really noticeable.
Option 2/3: You could rate limit or throttle your sidekiq jobs. There are a few projects here tackling this...
Queue limits
Sidekiq Limit Fetch: Restrict number of workers which are able to run specified queues simultaneously. You can pause queues and resize queue distribution dynamically. Also tracks number of active workers per queue. Supports global mode (multiple sidekiq processes). There is an additional blocking queue mode.
Sidekiq Throttler: Sidekiq::Throttler is a middleware for Sidekiq that adds the ability to rate limit job execution on a per-worker basis.
sidekiq-rate-limiter: Redis backed, per worker rate limits for job processing.
Sidekiq::Throttled: Concurrency and threshold throttling.
I got the above from here
https://github.com/mperham/sidekiq/wiki/Related-Projects
If your application must have a connection per process and you're unable to break it up where more threads can use a connection then it's pgBouncer or Application based connection pooling. Connection pooling is in effect either going to throttle or limit your app in some way in order to save the DB.
Sidekiq should only require one connection for each worker thread. If you are setting your concurrency to a reasonable value, say 10-25, I don't think you should be using 250 simultaneous database connections. How many worker processes are you running, and what is their concurrency?
Also, you can see on that page that even if you have a high concurrency setting, you can still create a connection pool shared by the threads within that process.
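As a rough illustration of that last point, here is a minimal sketch in Python (not Sidekiq or Rails code) of a fixed-size pool shared by more worker threads than it has connections. The class name SharedPool is hypothetical, and connect is assumed to be any zero-argument function that returns a new connection.

```python
import queue
from contextlib import contextmanager


class SharedPool:
    """Fixed number of connections shared by all threads in one process."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the job raised.
            self._idle.put(conn)
```

With a pool of 5 and a concurrency of 25, each job wraps its database work in `with pool.connection() as conn:`. At most 5 jobs talk to the database at once and the rest wait briefly, so the process never holds more than 5 connections.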

Postgres connection should be always on? or connect before running each query?

I am debating whether I should keep my postgres connection always on and check/re-connect before running a query, or whether I should connect before running each query and close the connection as soon as it is done. Thanks!
As long as the Postgres server isn't totally jammed with connections (i.e. this is not an app that will be creating a gigantic number of perpetual connections), I don't think it's a problem to maintain the connection. I would also recommend checking the connection and handling reconnects prior to each query however. Many libraries offer ways to do this. For example, with MyBatis (Java), you can have it issue a test query each time, which can be specified. I use the lightweight SELECT 1 for this.
I would say the key thing to consider is to keep the connection idle in transaction for as little time as possible, as when that happens, it can have a variety of different impacts on performance (such as slowing down other queries, preventing high-turnover tables from being vacuumed in a timely manner, etc.). This is not to say that any time spent in idle in transaction is automatically bad, but it should be considered and minimized where possible (e.g. if you have some calculations that are going to take several minutes to run, make sure to either commit or rollback prior to doing those; which one would depend on context).
If you're doing a bunch of SELECTs, and don't have anything you need to commit, I would recommend doing a rollback to help keep the idle in transaction states to a minimum.
I just realized the postgres connection string has a bunch of settings for connection pooling, for example:
User ID=root;Password=myPassword;Host=localhost;Port=5432;Database=myDataBase;
Pooling=true;Min Pool Size=0;Max Pool Size=100;Connection Lifetime=0;
So in my code, I can just close the connection after the command has finished executing. But behind the scenes, the connection is actually still alive and stored in the connection pool to be used again.
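The idea can be sketched with a toy pool in Python. This is not how any particular driver implements it: TinyPool and PooledConnection are hypothetical names, connect is assumed to be a zero-argument function that opens a real connection, and a real pool would usually wait for a free connection rather than raise when it is exhausted.

```python
class PooledConnection:
    """Handle given to the application; close() returns it to the pool."""

    def __init__(self, raw, pool):
        self.raw = raw
        self._pool = pool
        self._closed = False

    def close(self):
        if not self._closed:          # ignore a second close()
            self._closed = True
            self._pool._release(self.raw)


class TinyPool:
    def __init__(self, connect, max_size):
        self._connect = connect
        self._max_size = max_size
        self._idle = []
        self._opened = 0

    def open(self):
        if self._idle:
            raw = self._idle.pop()    # reuse a physical connection
        elif self._opened < self._max_size:
            raw = self._connect()     # open a new physical connection
            self._opened += 1
        else:
            raise RuntimeError("pool exhausted")
        return PooledConnection(raw, self)

    def _release(self, raw):
        self._idle.append(raw)        # keep the physical connection alive
```

Opening, closing and opening again through such a pool hands back the same underlying connection, so the application can follow the simple "open, run, close" pattern without paying for a new physical connection each time.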