.NET Core connection pool exhausted (PostgreSQL) under heavy load spike, when new pod instances are created

I have an application which runs stably under heavy load, but only when the load increases gradually.
I run 3 or 4 pods at the same time, and it scales to 8 or 10 pods when necessary.
The standard load is about 4000 requests per minute (roughly 66 requests per second in total, i.e. about 16 requests per second per pod).
There is a certain scenario in which we receive a huge load spike (from 4k to 20k requests per minute). New pods are correctly created and then start to receive the new load.
The problem is that in about 10-20% of cases a newly created pod struggles to handle the initial load: DB requests take over 5000 ms, pile up, and finally result in an exception that the connection pool was exhausted: The connection pool has been exhausted, either raise MaxPoolSize (currently 200) or Timeout (currently 15 seconds)
From the New Relic screenshots I can see that other pods are doing well, and also that after the initial struggle all pods handle the load without any issue.
Here is what I did when attempting to fix it:
Got rid of non-async calls. I had a few lines of blocking code inside async methods. I've changed everything to async; I no longer have any blocking, non-async calls.
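For illustration, a change of that kind looks roughly like the sketch below (the context, entity and mapping names are only placeholders inside a service class with an injected DbContext, not the original code):

// Before: blocking on the task inside an async method (sync-over-async)
// ties up a thread-pool thread while the query runs.
public async Task<OrderDto> GetOrderAsync(int id)
{
    var order = _context.Orders.SingleAsync(o => o.Id == id).Result; // blocks
    return MapToDto(order);
}

// After: fully asynchronous - the thread is released while waiting on the DB.
public async Task<OrderDto> GetOrderAsync(int id)
{
    var order = await _context.Orders.SingleAsync(o => o.Id == id);
    return MapToDto(order);
}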
Removed long-running transactions. We had long-running transactions like this:
- beginTransactionAsync
- selectDataAsync
- saveDataAsync
- commitTransactionAsync
which I refactored to:
- selectDataAsync
- saveDataAsync // under the hood, a short-lived EF Core transaction
This helped a lot, but did not solve the problem completely.
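For illustration, the refactor looks roughly like the following sketch (OrderService, OrdersContext and the entity are placeholder names, not the original code):

using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class OrderService
{
    private readonly OrdersContext _context; // placeholder DbContext

    public OrderService(OrdersContext context) => _context = context;

    // Before: an explicit transaction keeps a pooled connection checked out
    // for the whole select + save round trip.
    public async Task ProcessOrderWithExplicitTransactionAsync(int orderId, CancellationToken ct)
    {
        await using var tx = await _context.Database.BeginTransactionAsync(ct);
        var order = await _context.Orders.SingleAsync(o => o.Id == orderId, ct);
        order.Status = "Processed";
        await _context.SaveChangesAsync(ct);
        await tx.CommitAsync(ct);
    }

    // After: no explicit transaction; SaveChangesAsync wraps the write in its
    // own short-lived transaction, so the connection is held only briefly.
    public async Task ProcessOrderAsync(int orderId, CancellationToken ct)
    {
        var order = await _context.Orders.SingleAsync(o => o.Id == orderId, ct);
        order.Status = "Processed";
        await _context.SaveChangesAsync(ct);
    }
}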
Ensured some connections are always open and ready. We added Minimum Pool Size=20 to the connection string, to always keep at least 20 connections open.
This also helped, but pods still sometimes struggle.
Our pods start receiving traffic only after the readiness probe returns success. The readiness probe checks the connection to the DB using a standard .NET Core health check.
Our connection string has the MaxPoolSize=100;Timeout=15; settings.
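Putting the pool settings mentioned above together, the Npgsql connection string looks roughly like this (host, database and credentials are placeholders; exact keyword spelling may vary by driver version):

// Placeholder connection string combining the pool settings discussed above.
// "Timeout" is the number of seconds to wait for a connection before the
// "pool has been exhausted" exception is thrown.
var connectionString =
    "Host=my-db-host;Database=my-db;Username=app;Password=secret;" +
    "Minimum Pool Size=20;Maximum Pool Size=100;Timeout=15";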
I am of course expecting that a new pod instance will initially need some spin-up time during which it operates below its optimum, but I do not expect a pod to suffocate and throw errors for 90% of requests this often.
Important note here:
I have 2 different DbContexts sharing the same connection string (and thus the same connection pool). Each of these DbContexts accesses a different schema in the DB. This was done to have a modular architecture. The DbContexts never communicate with each other and are never used together in the same request.
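The registration is roughly like the sketch below (the context names and configuration key are placeholders); because both contexts get the same connection string, Npgsql gives them a single shared pool per pod:

// Placeholder Startup.ConfigureServices registration: two scoped DbContexts,
// one per schema, both built on the same connection string and therefore
// sharing the same Npgsql connection pool.
public void ConfigureServices(IServiceCollection services)
{
    var connectionString = Configuration.GetConnectionString("AppDb");

    services.AddDbContext<OrdersContext>(options =>
        options.UseNpgsql(connectionString));   // e.g. schema "orders"

    services.AddDbContext<BillingContext>(options =>
        options.UseNpgsql(connectionString));   // e.g. schema "billing"
}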
My current guess is that when a pod is freshly created and immediately receives a huge load, it tries to open all 100 connections (this is visible in the DB open-sessions chart), which is too much at the very beginning. What other reason could there be? How can I make sure that a pod operates at its optimal performance from the very beginning?
Final notes:
- The DB CPU is not at its max (about 30-40% usage under the heaviest load).
- Most of the SQL queries and commands are relatively simple SELECTs and INSERTs.
- After the initial struggle, queries take no more than 10-20 ms each.
- I don't want to patch the problem by increasing the number of connections in the pool beyond 100, because after the initial struggle pods operate properly with around 50 connections in use.
- I most likely do not have a connection leak, because in that case exceptions would also start appearing under normal load after some time.
- I use scoped DbContexts, which are disposed at the end of each request (thus the connection is released back to the pool).
EDIT 25-Nov-2020
My current guess is that the newly created pod is not receiving enough of either network bandwidth or CPU. This reasoning is supported by the fact that even requests which do NOT query the DB were struggling.
Question: is it possible that a newly created pod is granted insufficient resources (CPU or network bandwidth) at the beginning?
EDIT 2 26-Nov-2020 (answering Arca Artem)
The app runs on a k8s cluster on AWS.
The app connects to the DB using the standard ADO.NET connection pool (max 100 connections per pod).
I'm monitoring DB connections and DB CPU (all within reasonable limits). Host CPU utilization is also around 20-25%.
I assumed that when a pod starts and the /health endpoint responds successfully (it checks the DB connection with a simple SELECT probe), and the pod's max capacity is e.g. 200 rps, then the pod can handle that traffic from the very first moment after the /health probe succeeds. However, from my logs I see that after the /health probe succeeds 4 times in a row in under 20 ms and traffic starts coming in, the first few seconds of handling it take more than 5 s per request (sometimes even 40 s per request).
I'm NOT monitoring the hosts' network.

At this point it's just speculation on my part without knowing more about the code and architecture, but it's worth mentioning one thing that jumps out at me: the health check might not be using the normal code path that your other endpoints use, potentially leading to a false positive. If you have the option, using a profiler could help you pinpoint exactly when and how this happens. If not, we can take educated guesses as to where the problem might be. There could be a number of things at play here, and you may already be familiar with these, but I'm covering them for completeness' sake:
First of all, it's worth bearing in mind that connections in Postgres are very expensive (to put it simply, because each connection forks a new process on the database server), and your pods are consequently creating them in bulk when you scale your app all at once. Setting each one up takes a relatively considerable time, and if you're doing it in bulk, it adds up (how long depends on configuration, available resources, etc.).
Assuming you're using ASP.NET Core (because you mentioned DbContext), the initial request(s) will take the penalty of initialising the whole stack (creating the minimum required connections in the pool, initialising the ASP.NET stack, dependencies, etc.). Again, this all depends on how you structure your code and what your app actually does during initialisation. If your health endpoint connects to the DB directly (without utilising the connection pool), it skips the costly pool initialisation, leaving your initial requests to take the burden.
You're not observing the same behaviour when your load increases gradually, possibly because these things are usually an interplay between different components and generally a non-linear function of available resources, code behaviour, etc. Specifically, if just one new pod spins up, it requires far fewer connections than, say, 5 new pods spinning up, and Postgres can satisfy it much more quickly. Postgres is the shared resource here - creating 1 new connection is significantly faster than creating 100 new connections (5 pods x 20 minimum connections per pool) for all the pods waiting on a new connection.
There are a few things you can do to speed this process up (config changes, using an external connection pooler like PgBouncer, etc.), but they won't be effective unless your health endpoint represents the actual state of your pods.
Again, it's all based on assumptions, but if you're not doing so already, try using the DbContext in your health endpoint to ensure the pool is initialised and ready to serve connections. As someone mentioned in the comments, it's worth looking at other types of probes that might be better suited to implementing this pattern.
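As a rough sketch of that idea (the context name is a placeholder; AddDbContextCheck comes from the Microsoft.Extensions.Diagnostics.HealthChecks.EntityFrameworkCore package), the readiness probe can be wired to go through EF Core, and therefore through the same Npgsql pool the real traffic will use:

// Placeholder registration: the /health endpoint now exercises the same
// DbContext (and connection pool) as real requests, instead of opening a
// separate, direct DB connection.
public void ConfigureServices(IServiceCollection services)
{
    services.AddDbContext<OrdersContext>(options =>
        options.UseNpgsql(Configuration.GetConnectionString("AppDb")));

    services.AddHealthChecks()
        .AddDbContextCheck<OrdersContext>("db"); // by default calls CanConnectAsync through the pool
}

public void Configure(IApplicationBuilder app)
{
    app.UseRouting();
    app.UseEndpoints(endpoints => endpoints.MapHealthChecks("/health"));
}

Note that the default check only proves the pooled code path works; it does not necessarily pre-open all Minimum Pool Size connections, so a warm-up step that runs a few real queries in parallel before reporting ready may still be worth considering.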

I found the ultimate reason for the above issue: insufficient CPU resources assigned to the pod.
The problem was hard to isolate because New Relic APM CPU usage charts are calculated in a different way than expected (please refer to the New Relic docs). The real pod CPU usage vs. the CPU limit can be seen only in the New Relic Kubernetes cluster viewer (which probably uses a different algorithm to chart CPU usage).
What is more, when a pod starts up, it needs a little more CPU at the beginning. On top of that, the pods were starting precisely because of the high traffic - and there simply was not enough CPU to handle those requests.

Related

Locust eats CPU after 2-3 hours running

I have a simple HTTP server that I was testing. This server interacts with other HTTP servers and Cassandra DB.
I was using 100 users at 1 request/s each, so in total 100 tps hit the server. What I noticed in the Docker stats was that the CPU usage grew higher and higher, and ~2-3 hours later it reached the 90% mark and beyond. After that I got a notice from Locust stating that the measurements may be inconsistent. But latencies did not increase, so I do not know why this has been happening.
Can you please suggest possible cause(s) of the problem? I think 100 tps should be handled by one vCPU.
There's no way for us to know exactly what's wrong without at the very least seeing some code, and even then things like the environment, the data, or the server you're running it on (or against) could introduce additional factors we wouldn't know about.
It's possible there is a problem with the code for your Locust users, such as a memory leak, or they're simply doing too much for a single worker to handle that many users. For users that only make simple HTTP calls, a single CPU can typically handle upwards of thousands of requests per second. Do anything more than that and you should expect what a single worker can handle to drop. It's also possible you just need a more powerful CPU (or more RAM or bandwidth) to do what you want at the scale you want.
Do some profiling to see if you can find any inefficiencies in your code. Run smaller tests to see if the same behavior is evident with smaller loads. Run the same load but with additional Locust workers on other CPUs.
It's also just as possible your DB can't handle the load. The increasing CPU usage could be due to how your code handles waiting on the connection from the DB. Perhaps the DB could sustain, say, 80 users at an acceptable rate, but any additional users make it fall further and further behind, and your Locust users then wait longer and longer for the requested data.
For more suggestions, check out the Locust FAQ https://github.com/locustio/locust/wiki/FAQ#increase-my-request-raterps

Increased latency when using @Transactional(readOnly=true)

I am working with a backend service (Spring Boot 2.2.13.RELEASE + Hikari + JOOQ) that uses an AWS Aurora PostgreSQL DB cluster configured with a Writer (primary) node and a Reader (Read Replica) node. The reader node has just been sitting there idle/warm waiting to be promoted to primary in case of fail-over.
Recently, we decided to start serving queries exclusively from the Reader node for some of our GET endpoints. To achieve this we used a "flavor" of RoutingDataSource so that whenever a service is annotated with @Transactional(readOnly=true), the queries are performed against the reader datasource.
Up to this point everything was going smoothly. However, after applying this solution I noticed a latency increase of up to 3x compared with the primary datasource.
After drilling down on this I found out that each transaction was doing a couple of extra round trips to the DB to set the session characteristics:
SET SESSION CHARACTERISTICS AS TRANSACTION READ ONLY
<actual query/queries>
SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE
To improve this I tried to play with the readOnlyMode setting that was introduced in pg-jdbc 42.2.10. This setting allows controlling the driver's behavior when a connection is set to read-only (readOnly=true).
https://jdbc.postgresql.org/documentation/head/connect.html
In my first attempt I used readOnly=true and readOnlyMode=always. Even though I stopped seeing the SET SESSION CHARACTERISTICS statements, the latency remained unchanged. Finally I tried readOnly=false and readOnlyMode=ignore. This last option caused the latency to decrease; however, it is still worse than it was before.
Does anyone else have experience with this kind of setup? What is the optimal configuration?
I don't have a need to flag the transaction as read-only (other than to tell the routing datasource to use the read replica), so I would like to figure out whether it's possible to do anything else so that the latency remains the same between the Writer and Reader nodes.
Note: at the moment the reader node is serving just 1% of all traffic (+- 20 req/s).

Handle sudden increase in traffic size (multiple orders of magnitude) with GKE

If a website has a door crasher sale where many people (~50K) are waiting for the countdown to finish and enter the page, how would one tackle this with GKE in a cost efficient way?
That seems to be the reason GKE exists; the solution could be that, with the cluster autoscaler and HPA, GKE can handle the traffic. In practice, however, it is a different story: when the autoscaler tries to create nodes and pull the container images, it can take some time (perhaps up to a minute or two in some cases). During that time users see 5XX errors, which is not ideal.
To tackle that, over-provisioning with paused pods comes to mind. However, considering the servers are generally very small (they only need to handle ~100 requests on a normal day) and then all of a sudden 50K in a second, how would this be a feasible solution? Paused pods only seem to ensure that the autoscaler doesn't remove nodes that aren't doing real work, so in that case 50 nodes would always have to be occupied by paused pods, and I assume those running hours are still billable in GKE (the nodes are there, just not doing anything).
What would be a feasible solution to serve 100 requests per day with n1-standard-1 but also be able to scale to ~50K in less than 10 seconds?
Not as fast as 10 seconds. That's reachable only if you go serverless.
Pod autoscaling at best takes 20-30 seconds (depending on your readiness probes, the load balancer's probes, image caching, etc.). But you still have to have a pool of nodes to fit that capacity, which costs the same money - you're right.
Node + pod autoscaling takes around 5 minutes.
If you go serverless, make sure you know (and perhaps increase) your account limits. Because it scales so fast and is billed per invocation, it is very easy to accidentally blow up your bill. That's why all providers limit the default number of concurrent function executions; e.g. AWS allows 1000 per account by default: https://aws.amazon.com/about-aws/whats-new/2017/05/aws-lambda-raises-default-concurrent-execution-limit/. This can be increased through support.
I recall this post for AWS: https://aws.amazon.com/blogs/startups/from-0-to-100-k-in-seconds-instant-scale-with-aws-lambda/. Unfortunately I haven't seen similar write-ups for Google Cloud Functions, but I'm sure they have very similar capabilities.

Multiple node pools vs single pool with many machines vs big machines

We're moving all of our infrastructure to Google Kubernetes Engine (GKE) - we currently have 50+ AWS machines with lots of APIs, Services, Webapps, Database servers and more.
As we have already dockerized everything, it's time to start moving everything to GKE.
I have a question that may sound too basic, but I've been searching the Internet for a week and did not find any reasonable post about this.
Straight to the point, which of the following approaches is better and why:
Having multiple node pools with multiple machine types and always specifying in which pool each deployment should go; or
Having a single pool with lots of machines and letting the Kubernetes scheduler do the job without worrying about where my deployments end up; or
Having BIG machines (in multiple zones to improve the cluster's availability and resilience) and letting Kubernetes deploy everything there.
A list of considerations to be taken merely as hints; I do not pretend to describe best practice.
Each node you add brings some overhead, but you gain flexibility and availability, making node failures and maintenance less impactful on production.
Nodes that are too small cause a big waste of resources, since sometimes it will not be possible to schedule a pod even though the total amount of free RAM or CPU across the nodes would be enough; you can think of this issue as similar to memory fragmentation.
I guess that the sizes of your pods and their memory and CPU requests are not uniform, but I do not see this as a big issue in principle, nor as a reason to go for 1). I do not see why a big pod should run only on big machines and a small one should be scheduled on small nodes. I would rather use 1) if you need different memory-GB/CPU-core ratios to support different workloads.
I would advise you to run some tests in the initial phase to understand the size of the biggest pod and the average size of the workload in order to properly choose the machine types. Consider that having one pod that exactly fits a node dedicated to it is not the right way to proceed (virtual machines exist for that kind of scenario), since fragmentation of resources would easily make such a large pod impossible to schedule.
Consider that workload sizes will likely increase in the future, and scaling vertically is not always immediate - you need to switch off machines and terminate pods. I would oversize a bit to account for this, and because scaling horizontally is much easier.
Regarding the machine type, you could go for a machine 5x the size of the biggest pod you have (or 3x? or 10x?). Also oversize the number of nodes in the cluster a bit to account for overhead and fragmentation and to still have free resources.
Remember that you have a hard limit of 100 pods per node and 5000 nodes per cluster.
Remember that in GCP the network egress throughput cap is dependent on the number of vCPUs that a virtual machine instance has. Each vCPU has a 2 Gbps egress cap for peak performance. However each additional vCPU increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine.
Regarding virtual machine prices, note that there is no difference in price between buying two machines of size x and one of size 2x. Avoid customising machine sizes, because it is rarely convenient; if you feel your workload needs more CPU or memory, go for a highmem or highcpu machine type.
P.S. Since you are going to build a pretty big cluster, check the sizing of the cluster DNS.
I will add any further considerations that come to mind; consider updating your question in the future with a description of the path you chose and the issues you faced.
1) makes a lot of sense: if you want, you can still let Kubernetes deployments treat it as one large pool (by not adding nodeSelector/nodeAffinity), but you can have machines of different sizes, you can think about having a pool of spot instances, and so on. And, after all, you can have pools that are tainted and therefore excluded from normal scheduling and available only to a particular set of workloads. In my opinion it is preferable to gain some proficiency with this approach from the very beginning, yet with many provisioners it should be easy to migrate from 2) to 1) anyway.
2) As explained above, this is effectively a subset of 1), so it's better to build up experience with the 1) approach from day one; but if you ensure your provisioning solution supports easy extension to the 1) model, you can get away with starting with this simplified approach.
3) Big is nice, but "big" is relative. It depends on the requirements and the amount of your workloads. Remember that while you need to plan for the loss of a whole AZ anyway, losing single nodes (reboots, decommissioning of underlying hardware, updates, etc.) will be much more frequent, so the more hosts you have, the smaller the impact of losing one. The bottom line is that you need to find your own balance that makes sense for your particular scale. Maybe 50 nodes is too many; would 15 cut it? Who knows but you :)

In Oracle RAC, will an application be faster, if there is a subset of the code using a separate Oracle service to the same database?

For example, I have an application that does a lot of audit trail writing. A lot. It slows things down. If I create a separate service on my Oracle RAC just for audit CRUD, would that help speed up my application?
In other words, I point most of the application to the main service listening on my RAC via SCAN. I take the subset of my application that does the audit trail data manipulation and point it to a separate service that listens on the same RAC and points at the same schema as the main service.
As with anything else, it depends. You'd need to be a lot more specific about your application, what services you'd define, your workloads, your goals, etc. Realistically, you'd need to test it in your environment to know for sure.
A separate service could allow you to segregate the workload of one application (the one writing the audit trail) from the workload of other applications by having different sets of nodes in the cluster running each service (under normal operation). That can help ensure that the higher priority application (presumably not writing the audit trail) has a set amount of hardware to handle its workload even if the lower priority thread is running at full throttle. Of course, since all the nodes are sharing the same disk, if the bottleneck is disk I/O, that segregation of workload may not accomplish much.
Separating the services on different sets of nodes can also impact how frequently a particular service is getting blocks from the local node's buffer cache rather than requesting them from the other node and waiting for them to be shipped over the interconnect. It's quite possible that an application that is constantly writing to log tables might end up spending quite a bit of time waiting for a small number of hot blocks (such as the right-most block in the primary key index for the log table) to get shipped back and forth between different nodes. If all the audit records are being written on just one node (or on a smaller number of nodes), that hot block will always be available in the local buffer cache. On the other hand, if writing the audit trail involves querying the database to get information about a change, separating the workload may mean that blocks that were in the local cache (because they were just changed) are now getting shipped across the interconnect, and you could end up hurting performance.
Separating the services even if they're running on the same set of nodes may also be useful if you plan on managing them differently. For example, you can configure Oracle Resource Manager rules to give priority to sessions that use one service over another. That can be a more fine-grained way to allocate resources to different workloads than running the services on different nodes. But it can also add more overhead.