What is the ideal number of max connections for a postgres database?

I'm currently using the default connection pool in sequelize, which is as follows:
const defaultPoolingConfig = {
  max: 5,
  min: 0,
  idle: 10000,
  acquire: 10000,
  evict: 10000,
  handleDisconnects: true
};
Of late I've been getting ResourceRequest timed out errors, which are caused by the above DB configuration. According to some answers the max pool size should be kept at 5, but those who have run into this ResourceRequest timeout error suggest increasing the pool size to 30, along with increasing the acquire timeout.
I need to know what the optimum max pool size is for a web app.
Edit:
1. Let's say I have 200 concurrent users and 20 concurrent queries. What should the values be then?
2. My database is provided by GCP, with the following configuration:
vCPUs: 1
Memory: 3.75 GB
SSD storage: 10 GB
I'm adding some graphs for CPU utilization, read/write operations per second, and transactions per second.
My workload resources are as follows:
resources:
  limits:
    cpu: 500m
    memory: 600Mi
  requests:
    cpu: 200m
    memory: 500Mi

The number of concurrent connections should be large enough to cover the number of concurrently running queries or transactions you may have.
If you have a lower limit, then new queries/transactions will have to wait for an available connection.
You may want to monitor currently running queries (see pg_stat_activity for instance) to detect such issues.
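For instance, here is a minimal sketch of such a check from a Node.js app (it assumes an already-configured Sequelize instance named sequelize; adapt the 5-second threshold to your workload):
const { QueryTypes } = require('sequelize');

async function checkActivity(sequelize) {
  // How many connections are currently running a query vs. sitting idle?
  const byState = await sequelize.query(
    `SELECT state, count(*) AS connections
       FROM pg_stat_activity
      WHERE datname = current_database()
      GROUP BY state`,
    { type: QueryTypes.SELECT }
  );

  // Queries that have been running for more than 5 seconds.
  const slowQueries = await sequelize.query(
    `SELECT pid, now() - query_start AS runtime, state, query
       FROM pg_stat_activity
      WHERE state <> 'idle'
        AND now() - query_start > interval '5 seconds'
      ORDER BY runtime DESC`,
    { type: QueryTypes.SELECT }
  );

  console.log({ byState, slowQueries });
}
If idle connections pile up or the same queries keep showing up as slow, that points you at the next steps below.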
However, your database server must be able to handle the number of connections. If you are using a server provided by a third party, it may have set limits. If you are using your own server, then it needs to be configured properly.
Note that to handle more connections, your database server will need more processes and more RAM. Also, if you have long-running queries (as opposed to long-running transactions), you are most probably resource-constrained on the server (often I/O-bound), and adding more queries running at the same time usually won't help overall performance. You may want to look at the configuration of your DB server (buffers etc.), and of course, if you haven't already done so, optimise your queries (make sure they all use indexes). The other pg_stat_* views and EXPLAIN are your friends here.
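For example, once a suspect query shows up, you can look at its plan; the table and filter below are placeholders for one of your real queries:
// Run inside an async function, with a configured `sequelize` instance.
const [plan] = await sequelize.query(
  "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42"
);
console.log(plan.map(row => row['QUERY PLAN']).join('\n'));
A sequential scan over a large table in that output is usually the sign of a missing index.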
If you have long-running transactions with lots of idle time, then more concurrent connections may help, though you may have to wonder why you have such long-running transactions.
To summarise, your next steps should be to:
Check the immediate state of your database server using pg_stat_activity and friends.
If you don't already have that, set up monitoring of I/O, CPU, memory, swap, postgresql statistics over time. This will give you a clearer picture of what is going on on your server. If you don't have that, you're just running blind.
If you have long-running transactions, check that you always correctly release transactions/connections, including when errors occur. This is a pretty common issue with node.js-based web servers. Make sure you use try ... catch / finally blocks wherever needed (see the sketch after this list).
If there are any long-running queries, check that they are properly optimised (using indexes). If not, do your utmost to optimise them. This will be the single most useful step you can take if that's where the issue is.
If they are properly optimised and you have enough spare resources (RAM, I/O...), then you can consider raising the number of connections. Otherwise it's just pointless.
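On the point about releasing connections/transactions, here is a minimal sketch with sequelize (the handler name and payload are hypothetical):
// Always commit or roll back, even when an error occurs, so the connection
// goes back to the pool instead of being held by a dangling transaction.
async function updateSomething(sequelize, payload) {
  const t = await sequelize.transaction();
  try {
    // ... run your queries here, passing { transaction: t } ...
    await t.commit();
  } catch (err) {
    await t.rollback();
    throw err;
  }
}
The managed form, sequelize.transaction(async (t) => { ... }), commits or rolls back for you and is usually the safer choice.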
Edit
Since you are not operating the database yourself, you won't necessarily have all the visibility you could have on resource usage.
However, you can still:
Check pg_stat_activity. This alone will tell you a lot of things.
Check for connections/transactions that are kept around when they shouldn't be.
Check that queries are properly optimised.
GCP has a default maximum concurrent connections limit set to 100 for instances with 3.75 GiB of RAM. So you could indeed increase the size of your pool. But if any of the above issues are present, you are just delaying or moving the issue a bit further, so start by checking those and fixing them if relevant.
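As an illustration only (the numbers are not a recommendation, and only make sense once the points above are ruled out), a larger pool for this instance could look like:
const poolingConfig = {
  max: 30,          // more concurrent connections per app instance
  min: 0,
  idle: 10000,
  acquire: 30000,   // wait longer for a connection before ResourceRequest times out
  evict: 10000,
  handleDisconnects: true
};
Keep (max × number of app instances) comfortably below the 100-connection limit.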

Related

Maximising concurrent request handling with PostgreSQL / Npgsql client

I have a db and a client app that does reads and writes. I need to handle a lot of concurrent reads but be sure that writes get priority, while also respecting my db's connection limit.
Long version:
I have a single instance pgSQL database which allows 100 connections.
My .net microservice uses Npgsql to connect to the db. It has to do read queries that can take 20-2000ms and writes that can take about 500-2000ms. Right now there are 2 instances of the app, connecting with the same user credentials. I am trusting Npgsql to manage my connection pooling, and am preparing my read queries as there are basically just 2 or 3 variants with different parameter values.
As user requests increased, I started having problems with the database’s connection limit. Errors like ‘Too many connections’ from the db.
To deal with this I introduced a simple gate system in my repo class:
private static readonly SemaphoreSlim _writeGate = new(20, 20);
private static readonly SemaphoreSlim _readGate = new(25, 25);

public async Task<IEnumerable<SomeDataItem>> ReadData(string query, CancellationToken ct)
{
    await _readGate.WaitAsync(ct);
    try { /* run the read query and return the data */ }
    finally { _readGate.Release(); }
}

public async Task WriteData(IEnumerable<SomeDataItem> items, CancellationToken ct)
{
    await _writeGate.WaitAsync(ct);
    try { /* write the data */ }
    finally { _writeGate.Release(); }
}
I chose to have separate gates for read and write because I wanted to be confident that reads would not get completely blocked by concurrent writes.
The limits are hardcoded as above: a total limit of 45 on each of the 2 app instances, connecting to 1 db server instance.
It is more important that attempts to write data do not fail than attempts to read. I have some further safety here with a Polly retry pattern.
This was alright for a while, but as the concurrent read requests increase, I see that the response times start to degrade, as a backlog of read requests begins to accumulate.
So, for this question, assume my sql queries and db schema are optimized to the max, what can I do to improve my throughput?
I know that there are times when my _readGate is maxed out, but there is free capacity in the _writeGate. However I don’t dare reduce the hardcoded limits because at other times I need to support concurrent writes. So I need some kind of QoS solution that can allow more concurrent reads when possible, but will give priority to writes when needed.
Queue management is pretty complicated to me but is also quite well known to many, so is there a good nuget package that can help me out? (I’m not even sure what to google)
Is there a simple change to my code to improve on what I have above?
Would it help to have different conn strings / users for reads vs writes?
Anything else I can do with npgsql / connection string that can improve things?
I think PostgreSQL recommends limiting connections to 100; there's an SO thread on this here: How to increase the max connections in postgres?
There's always a limit to how many simultaneous queries you can run before the perf stops improving and eventually drops off.
However I can see in my azure telemetry that my db server is not coming close to fully using cpu, ram or disk IO (cpu doesn't exceed 70% and is often less, memory the same, and IOPS under 30% of its capacity) so I believe there is more to be squeezed out somewhere :)
Maybe there are other places to investigate, but for the sake of this question I'd just like to focus on how to better manage connections.
First, if you're getting "Too many connections" on the PostgreSQL side, that means that the total number of physical connections being opened by Npgsql exceeds the max_connections setting in PG. You need to make sure that the aggregate total of Npgsql's Max Pool Size across all app instances doesn't exceed that, so if your max_connections is 100 and you have two Npgsql instances, each needs to run with Max Pool Size=50.
Second, you can indeed have different connection pools for reads vs. writes, by having different connection strings (a good trick for that is to set the Application Name to different values). However, you may want to set up one or more read replicas (primary/secondary setup); this would allow all read workload to be directed to the read replica(s), while keeping the primary for write operation only. This is a good load balancing technique, and Npgsql 6.0 has introduced great support for it (https://www.npgsql.org/doc/failover-and-load-balancing.html).
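For instance, the split could look like this in the connection strings (host, credentials and pool sizes are placeholders; with 2 app instances this adds up to 80 physical connections, safely below max_connections = 100):
Host=db.internal;Database=appdb;Username=app;Password=...;Max Pool Size=20;Application Name=reader
Host=db.internal;Database=appdb;Username=app;Password=...;Max Pool Size=20;Application Name=writer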
Apart from that, you can definitely experiment with increasing max_connections on the PG side - and accordingly Max Pool Size on the clients' side - and load-test what this does to resource utilization.

Locust eats CPU after 2-3 hours running

I have a simple HTTP server that I was testing. This server interacts with other HTTP servers and Cassandra DB.
I was using 100 users with 1 request/s each, so in total 100 tps hit the server. What I noticed in the Docker stats was that the CPU usage grew higher and higher, and around 2-3 hours later it reached the 90% mark and beyond. After that I got a notice from Locust stating that the measurements may be inconsistent. But the latencies did not increase, so I don't know why this has been happening.
Can you please suggest possible cause(s) of the problem? I think 100 tps should be handled by one vCPU.
There's no way for us to know exactly what's wrong without at very least seeing some code, and even then other factors like the environment or data or server you're running it on or against could have additional factors we wouldn't know about.
It's possible you have a problem with the code for your Locust users, such as a memory leak, or they're simply doing too much for a single worker to handle that many users. For users only doing simple HTTP calls, a single CPU can typically handle upwards of thousands of requests per second. Do much more than that per user and you can expect the number a single worker can handle to drop. It's also possible you may just need a more powerful CPU (or more RAM or bandwidth) to do what you want at the scale you want.
Do some profiling to see if you can find any inefficiencies in your code. Run smaller tests to see if the same behavior is evident with smaller loads. Run the same load but with additional Locust workers on other CPUs.
It's also just as possible your DB can't handle the load. The increasing CPU usage could be due to how your code handles waiting on the connection from the DB. Perhaps the DB could sustain, say, 80 users at an acceptable rate, but any additional users make it fall further and further behind, and your Locust users then wait longer and longer for the requested data.
For more suggestions, check out the Locust FAQ https://github.com/locustio/locust/wiki/FAQ#increase-my-request-raterps

High CPU usage on Cloud SQL causing timeouts

We have a postgres database that has billions of records in it.
We have one client that uses our older API to query the database to fetch thousands of records once a day.
I would say close to the top end of the thousands.
The API is currently on a compute engine behind a load balancer and during the allotted time I spin up 6 instances of this to attempt to help handle the load.
What I have found is that the CPU usage on Cloud SQL is maxing out at 100%, while most of the other stats are fine; it's just the CPU.
This basically renders our API useless, as we can't accept connections and it just shits itself.
What can we do to help this?
Here are the charts for CPU utilisation, connections, read/writes, and memory usage (images omitted).
You can see that in most of the other charts the readings are well within what we'd expect as normal.
I don't really want to beef up the CPU if it isn't actually the underlying problem.
A further thing to note: we have developed a new endpoint specifically for this client to use; they haven't put it in place yet, and there is no guarantee that it will reduce the db load.
High CPU usage can most definitely cause dropped or ignored connections. The database engine and underlying OS are fighting for resources and aren't able to respond to the connection in time.
While you can add CPU, it looks like the CPU you have is (usually) enough, except during the periods where it sits at 100%. I'd suggest instead finding out why the query is eating so much CPU and optimizing it.
You might be interested in something like Cloud SQL Insights to help debug the query.
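If the pg_stat_statements extension is enabled on the instance, you can also see directly which statements consume the most CPU time. Here is a sketch with the Node.js pg driver purely for illustration (the same SQL works from psql or any other client; on PostgreSQL versions before 13 the columns are total_time / mean_time instead of total_exec_time / mean_exec_time):
const { Client } = require('pg');

async function topCpuQueries(connectionString) {
  const client = new Client({ connectionString });
  await client.connect();
  // Statements ranked by total execution time across all calls.
  const { rows } = await client.query(`
    SELECT query, calls, total_exec_time, mean_exec_time
      FROM pg_stat_statements
     ORDER BY total_exec_time DESC
     LIMIT 10`);
  await client.end();
  return rows;
}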

MongoDB Stops Responding During Background Flush

Mongodb Background Flushing blocks all the requests:
Server: Windows server 2008 R2
CPU Usage: 10 %
Memory: 64G, Used 7%, 250MB for Mongod
Disk % Read/Write Time: less than 5% (According to Perfmon)
Mongodb Version: 2.4.6
Mongostat Normally:
insert:509 query:608 update:331 delete:*0 command:852|0 flushes:0 mapped:63.1g vsize:127g faults:6449 locked db:Radius:12.0%
Mongostat Before(maybe while) Flushing:
insert:1 query:4 update:3 delete:*0 command:7|0 flushes:0 mapped:63.1g vsize:127g faults:313 locked db:local:0.0%
And Mongostat After Flushing:
insert:1572 query:1849 update:1028 delete:*0 command:2673|0 flushes:1 mapped:63.1g vsize:127g faults:21065 locked db:.:99.0%
As you can see, when a flush is happening the lock is at 99%, and just at that point mongod stops responding to any read/write operations (mongotop and mongostat also stop). The flush takes about 7 to 8 seconds to complete, which does not increase disk load by more than 10%.
Are there any suggestions?
Under Windows server 2008 R2 (and other versions of Windows I would suspect, although I don't know for sure), MongoDB's (2.4 and older) background flush process imposes a global lock, doing substantial blocking of reads and writes, and the length of the flush time tends to be proportional to the amount of memory MongoDB is using (both resident and system cache for memory-mapped files), even if very little actual write activity is going on. This is a phenomenon we ran into at our shop.
In one replica set where we were using MongoDB version 2.2.2, on a host with some 128 GBs of RAM, when most of the RAM was in use either as resident memory or as standby system cache, the flush time was reliably between 10 and 15 seconds under almost no load and could go as high as 30 to 40 seconds under load. This could cause Mongo to go into long pauses of unresponsiveness every minute. Our storage did not show signs of being stressed.
The basic problem, it seems, is that Windows handles flushing to memory-mapped files differently than Linux. Apparently, the process is synchronous under Windows and this has a number of side effects, although I don't understand the technical details well enough to comment.
MongoDB, Inc. is aware of this issue and is working on optimizations to address it. The problem is documented in a couple of tickets:
https://jira.mongodb.org/browse/SERVER-13444
https://jira.mongodb.org/browse/SERVER-12401
What to do?
The phenomenon is tied, to some degree, to the minimum latency of the disk subsystem as measured under low stress, so you might try experimenting with faster disks, if you can. Some improvements have been reported with this approach.
A strategy that worked for us in some limited degree is avoiding provisioning too much RAM. It happened that we really didn't need 128 GBs of RAM, so by dialing back on the RAM, we were able to reduce the flush time. Naturally, that wouldn't work for everyone.
The latest versions of MongoDB (2.6.0 and later) seem to handle the situation better, in that writes are still blocked during the long flush but reads are able to proceed.
If you are working with a sharded cluster, you could try dividing the RAM by putting multiple shards on the same host. We didn't try this ourselves, but it seems like it might have worked. On the other hand, careful design and testing would be highly recommended in any such scenario to avoid compromising performance and/or high availability.
We tried playing with syncdelay. Reducing it didn't help (the long flush times just happened more frequently). Increasing it helped a little (there was more time between flushes to get work done), but increasing it too much can exacerbate the problem severely. We boosted the syncdelay to five minutes (300 seconds), at one point, and were rewarded with a background flush of 20 minutes.
Some optimizations are in the works at MongoDB, Inc. These may be available soon.
In our case, to relieve the pressure on the primary host, we periodically rebooted one of the secondaries (clearing all memory) and then failed over to it. Naturally, there is some performance hit due to re-caching, and I think this only worked for us because our workload is write-heavy. Moreover, this technique is not in any sense a solution. But if high flush times are causing serious disruption, this may be one way to "reduce the fever", so to speak.
Consider running on Linux... :-)
Background flushing by default does not block reads/writes. mongod flushes every 60s, unless otherwise specified with the --syncdelay parameter. The flush uses the fsync() operation, which can be set to block writes while in-memory pages are flushed to disk. A blocked write can in turn block reads as well. Read more: http://docs.mongodb.org/manual/reference/command/fsync/
However, normally a flush should not take more than 1000ms (1 second). If it does, it is likely the amount of data flushing to disk is too large for your disk to handle.
Solution: upgrade to a faster disk like SSD, or decrease flush interval (try 30s, rather than the default 60s).
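For reference, both the flush timings and the flush interval can be inspected and changed from the mongo shell (the 30-second value is just an example, not a recommendation):
// How long have recent background flushes been taking?
db.serverStatus().backgroundFlushing
// fields include: flushes, total_ms, average_ms, last_ms, last_finished

// Current sync delay (default 60 seconds)...
db.adminCommand({ getParameter: 1, syncdelay: 1 })
// ...and lowering it so each flush has less dirty data to write:
db.adminCommand({ setParameter: 1, syncdelay: 30 })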

PostgreSQL consuming large amount of memory for persistent connection

I have a C++ application which is making use of PostgreSQL 8.3 on Windows. We use the libpq interface.
We have a multi-threaded app where each thread opens a connection and keeps using it without calling PQfinish.
We notice that for each query (especially the SELECT statements) postgres.exe memory consumption would go up. It goes up as high as 1.3 GB. Eventually, postgres.exe crashes and forces our program to create a new connection.
Has anyone experienced this problem before?
EDIT: shared_buffers is currently set to 128MB in our conf file.
EDIT2: a workaround that we have in place right now is to call PQfinish for every transaction. But then, this slows down our processing a bit since establishing a connection every time is quite slow.
In PostgreSQL, each connection has a dedicated backend. This backend not only holds connection and session state, but is also an execution engine. Backends aren't particularly cheap to leave lying around, and they cost both memory and synchronization overhead even when idle.
There's an optimum number of actively working backends for any given Pg server on any given workload, where adding more working backends slows things down rather than speeding it up. You want to find that point, and limit the number of backends to around that level. Unfortunately there's no magic recipe for this, it mostly involves benchmarking - on your hardware and with your workload.
If you need more connections than that, you should use a proxy or pooling system that allows you to separate "connection state" from "execution engine". Two popular choices are PgBouncer and PgPool-II . You can maintain light-weight connections from your app to the proxy/pooler, and let it schedule the workload to keep the database server working at its optimum load. If too many queries come in, some wait before being executed instead of competing for resources and slowing down all queries on the server.
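As an illustration, a minimal PgBouncer setup in transaction-pooling mode might look like the following pgbouncer.ini sketch (database name, paths and pool sizes are placeholders to adapt):
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; a backend is only held for the duration of a transaction
pool_mode = transaction
; lightweight client connections accepted from the app
max_client_conn = 200
; actual PostgreSQL backends per database/user pair
default_pool_size = 20
Your app then connects to port 6432 instead of 5432, keeping as many client connections as it likes while PgBouncer caps the number of real backends.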
See the postgresql wiki.
Note that if your workload is read-mostly, and especially if it has items that don't change often for which you can determine a reliable cache invalidation scheme, you can also potentially use memcached or Redis to reduce your database workload. This requires application changes. PostgreSQL's LISTEN and NOTIFY will help you do sane cache invalidation.
Many database engines have some separation of execution engine and connection state built in to the core database engine's design. Sybase ASE certainly does, and I think Oracle does too, but I'm not too sure about the latter. Unfortunately, because of PostgreSQL's one-process-per-connection model it's not easy for it to pass work around between backends, making it harder for PostgreSQL to do this natively, so most people use a proxy or pool.
I strongly recommend that you read PostgreSQL High Performance. I don't have any relationship/affiliation with Greg Smith or the publisher*, I just think it's great and will be very useful if you're concerned about your DB's performance.
* ... well, I didn't when I wrote this. I work for the same company now.
The memory usage is not necessarily a problem. PostgreSQL uses shared memory for some caching, and this memory does not count towards the process's memory usage until it's actually used. The more you use the process, the larger the portion of the shared buffers that will be active in its address space.
If you have a large value for shared_buffers, this will happen. If you have it too large, the process can run out of address space and crash, yes.
The problem is probably that you don't close the transaction.
In PostgreSQL, even if you only run SELECTs without any DML, they still run inside a transaction, which needs to be rolled back (or committed).
Adding a ROLLBACK at the end of each transaction should reduce your memory problem.