"Hibernating" service fabric application - entity-framework-core

I have a fairly small Service Fabric application that I'm building, and ever since I converted it to Service Fabric I have been annoyed by the slow startup time. It happens not only after a release but also after 10-15 minutes of inactivity.
I have added a project whose sole purpose is to go to each service and make a small DB request every 10s, thinking that would keep the application and EF running. This saved me from getting timeouts, and now the first requests are in the 5-15s range. After some warming up the requests are usually in the 300ms range, so they are quite simple requests, and there isn't much communication between the services (4 services in total).
After a lot of searching I found a profiler that seems to work, as most don't (like the one in Visual Studio). Unfortunately it didn't really say much, except that the application waits for threads a lot and that the waiting doesn't seem to be in my code. All my external requests use async/await. Also, when following a request it seemed like there was information missing...
At first I thought that the slowness might come from EF generating the search query, so I migrated that part to use Dapper instead (the full request still uses some EF), but that didn't really change anything.
The application uses the latest Service Fabric, .NET Core, EF Core, and Application Insights packages. All services except for the one validating tokens are stateless. And of course it is built in release mode.
At this point I'm lost, as I cannot find the reason it's so slow. In the old days this was usually because IIS was shutting down or recycling the application, but now that IIS isn't there, what can it be?

A similar issue happened to us. We use a DI container, and until the first call to our service none of the dependencies are resolved, so it takes time to create these instances, for example a singleton class. Another one was the EF DbContext. To overcome that, we have a process to "warm" the services first.
Hope that helps
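A minimal sketch of such a warm-up pass, written in Python for brevity rather than in the .NET stack the question uses. The `warm_up` helper and any endpoint URLs passed to it are hypothetical; the idea is only to hit every service once, on a route that touches the database, so that lazily created singletons and the EF context exist before real traffic arrives.

```python
import time
import urllib.request
from typing import Dict, Iterable


def warm_up(urls: Iterable[str], timeout: float = 30.0) -> Dict[str, float]:
    """Request each URL once and return the elapsed seconds per URL.

    A failed request is recorded as -1.0 so that one cold or broken
    service does not stop the remaining services from being warmed.
    """
    timings: Dict[str, float] = {}
    for url in urls:
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                response.read()  # drain the body so the full pipeline runs
            timings[url] = time.perf_counter() - start
        except OSError:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            timings[url] = -1.0
    return timings
```

Calling `warm_up` with one URL per service right after a deployment, and logging the returned timings, also gives a rough picture of which service is the slowest to wake up.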

This might be a shot in the dark: Are your services communicating using the Service Fabric remoting options or using HTTP? In the case of HTTP, might the hibernation and warmup time be caused by HttpSys/Kestrel?
Regarding your slow responses (300ms), that does seem a bit odd: we have multiple stateless services (using HTTP and Kestrel) with EF in the back, and have sub-50ms response times.

Related

Error using the connection to database when RDS scales out

We have a .NET API hosted in ECS that queries data from a serverless v1 cluster using Entity Framework. Under normal load this service performs very well, but when there's a large spike in traffic that requires the RDS cluster to scale out to more ACUs, we see a lot of connection errors in our API.
An error occurred using the connection to database '\"ourdatabasename\"' on server '\"tcp://ourcluster.region.rds.amazonaws.com:5432\"'.
The high-level overview of the infrastructure looks like this:
CloudFront >> Load Balancer >> ECS Fargate >> RDS Aurora PostgreSQL Serverless v1
Stack information:
.NET 6 API compiled for Linux
Entity Framework Core 6.x
Npgsql.EntityFrameworkCore.PostgreSQL 6.x
PostgreSQL 10.18
We did open AWS support cases about this issue in the past year, but those basically always result in the answer that this is an implementation issue and not an infrastructure issue.
We can easily reproduce the issue by running a k6 stress test on our API (bypassing the CloudFront caching layer of course) to generate a spike high enough to trigger scaling of the RDS cluster.
For the past year we have worked around this issue by configuring RDS at a capacity at which it basically never needs to scale out. This of course wastes money and defeats the purpose of serverless, so we would like to find the underlying root cause and solve it.
Some things we have tried already:
We have experimented with serverless v2, which should scale in a completely different fashion, as it's just the same VM consuming more resources from the host machine. But our preliminary conclusion is that this was even worse. We do not yet understand why, but it appears to trigger the same effect, only sooner and more often, as v2 scales a lot faster and further. With v1 we get into trouble around 400 requests per second; with v2 it was at 150 rps.
EnableRetryOnFailure seemed to help a tiny bit, but not a lot. We have left it at the default configuration as implemented by Npgsql for now.
We have experimented with the Maximum Pool Size connection string parameter. At 300 it appears to be a bit better, but it does not solve the issue.
Changing the scaling behaviour of ECS/the ALB or even just prescaling that to handle peak load did not change anything.
We have not tried:
RDS Proxy, it's supposed to solve all your connection pooling issues. But we're not sure it's even a pooling issue. We're not keen on relying on yet another black box service to solve the issues our first black box service (Aurora Serverless) has. And it's not really cheap. If all of SO now convinces us this is the holy grail, then surely we'll try it out.
Data API for RDS, you can't have connection management issues if you're not making them right? It's a huge investment to rewrite all EF code to Data API requests and I'm not sure what it says about the service if it's still not out for serverless v2. So, not for now I think.
The first purpose of this question here on SO is to find someone who could help us understand what is even going on: what the error means and where it comes from. We understand that you cannot expect ECS+RDS to magically handle all the load you throw at it. But if we do not fully understand how it breaks, we cannot come up with potential failover mechanisms or ways to make the system fail more gracefully.
If someone knows the magic setting but not the why that's also great of course :) We can then maybe figure out the why ourselves and share that back with the community ;)
Feel free to ask more questions where needed.

How to make sessions persistent in Scalatra?

I have a webapp using the Scala-based Scalatra web framework. The problem is, anytime the application is re-deployed, or anytime the app-server is rebooted, all session data is lost. This means (to name one downside) users must re-login every time we make an update to the site.
Some research reveals there are, apparently, "container-specific" ways to make sessions persist across app and server reboots (e.g., in the case of Tomcat), but this has two shortcomings:
If the app is not always deployed in the same container (and in the case of Scalatra, an embedded Jetty is used for dev purposes) then I'll need separate configuration for each container.
Using a server-local configuration file is much more fickle -- it's likely to get lost in server migrations, and it won't be automatically available to each instance (e.g., to each developer) of the app, whereas something stored with the core application code is much easier to test, retain, and generally keep track of.
So, to sum up...
Is there a generic, container-neutral way to make sessions persistent? Even if only by overriding appropriate methods in the Java/Servlet stack and storing the session data manually?
Barring that, is there a way to store relevant configuration for multiple containers (e.g., for both Jetty and Tomcat) in my application code (web.xml or similar)?
Thanks -- any insights appreciated!

What are the limitations of the flask built-in web server

I'm a newbie in web server administration. I've read multiple times that the Flask built-in web server is not designed for "production", and must be used only for tests and debugging...
But what if my app reaches only a thousand users who occasionally send data to the server?
If it works, when will I have to bother with the configuration of a more sophisticated web server? (I am looking for approximate metrics.)
In a nutshell, I would love to find out what the built-in web server can do (with approximate thresholds) and what it cannot.
Thanks a lot!
There isn't one right answer to this question, but here are some things to keep in mind:
With the right amount of horizontal scaling, it is quite possible you could keep scaling out use of the debug server forever. When exactly you would need to start scaling (or switch to using a "real" web server) would also depend on the environment you are hosting in, the expectations of the users, etc.
The main issue you would probably run into is that the server is single-threaded. It handles requests one at a time, serially, so if you are trying to serve more than one request (including favicons and static items like images, CSS and JavaScript files) the requests will take longer. If any given request happens to take a long time (say, 20 seconds), then your entire application is unresponsive for those 20 seconds. This is only the default, of course: you could bump the thread counts (or have requests be handled in other processes), which might alleviate some issues. But once again, it can still be slow under a "high" load. What is considered a "high" load will depend on your application and on the maximum response time you consider acceptable.
Another issue is security: if you are concerned at ALL about security (and not just the security of the data in the application itself, but the security of the box that will be running it as well) then you should not use the development server. It is not ready to withstand any sort of attack.
Finally, the development server could just fail outright. It is not designed to be used as a long-running process (days, weeks, months), and so it has not been well tested to work in this capacity.
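To make the single-threaded point above concrete, here is a small sketch that uses only the Python standard library's `http.server` as a stand-in (it is not Flask's development server, and `SlowHandler` and `time_two_requests` are names made up for this illustration). It fires two concurrent requests at a handler that blocks for 0.3 seconds, once against a serial server and once against a threaded one. Recent Flask releases run the development server threaded by default, so which case applies to you depends on your version and settings.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer, ThreadingHTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Simulates a request that needs 0.3 seconds of blocking work."""

    def do_GET(self):
        time.sleep(0.3)
        body = b"done"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def time_two_requests(server_class) -> float:
    """Start a server, send two concurrent requests, return total seconds."""
    server = server_class(("127.0.0.1", 0), SlowHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/"

    def fetch():
        with urllib.request.urlopen(url, timeout=10) as response:
            response.read()

    clients = [threading.Thread(target=fetch) for _ in range(2)]
    start = time.perf_counter()
    for client in clients:
        client.start()
    for client in clients:
        client.join()
    elapsed = time.perf_counter() - start

    server.shutdown()
    server.server_close()
    return elapsed
```

Comparing `time_two_requests(HTTPServer)` with `time_two_requests(ThreadingHTTPServer)` shows the serial server taking roughly the sum of both request times, while the threaded one overlaps them.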
So, yes, it has limitations. Yes, you could still conceivably use it in production. And yes, I would still recommend using a "real" web server. If you don't like the idea of needing to install something like Apache or Nginx, you can still go with a solution that is still as easy as "run a python script" by using some of the WSGI Standalone servers, which can run a server that is designed to be in production with something just as simple as running python run_app.py in the command line. You typically just need to create a 4-5 line python script to import and create the server object, point it to your Flask app, and run it.
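As a rough sketch of what such a short script can look like: the example below assumes the third-party `waitress` package as the standalone WSGI server (any comparable server would do), and it uses a tiny hand-written WSGI callable named `app` as a stand-in so the snippet does not depend on a real project. With Flask you would import your own `app` object instead; the module name in the comment is hypothetical.

```python
def app(environ, start_response):
    """Stand-in WSGI callable; a Flask `app` object is passed the same way."""
    body = b"Hello from a WSGI app\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def main():
    # Assumption: waitress is installed (pip install waitress).
    from waitress import serve

    # With a real project: from myproject import app  (hypothetical module)
    serve(app, host="127.0.0.1", port=8080)

# Add a call to main() at the bottom of run_app.py to start serving.
```

Saved as `run_app.py` with the final `main()` call added, this starts with `python run_app.py` and blocks until interrupted.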
gunicorn could be run with only the following on the command line, no extra script needed:
gunicorn myproject:app
...where "myproject" is the Python package that contains the app Flask object. Keep in mind that one of the developers of gunicorn would probably recommend against this approach. See https://serverfault.com/questions/331256/why-do-i-need-nginx-and-something-like-gunicorn.
The OP has long since moved on, but for those who encounter this question in the future I would just add that setting up an Apache server, even on a laptop, is free and pretty easy. It can be readily configured for as few or as many features as you want just by uncommenting or commenting out lines in the config file. There might be an even easier GUI method for doing that nowadays, but just editing the configs is simple.

How Do I Optimize Zend Framework

I have an application built on Zend Framework that I am trying to optimize.
I did some Xdebug profiling, and although I can't say I understand every detail of the results, some things were quite obvious.
For instance, the file Bootstrap.php seems to be the one consuming most of the time, taking 4,553 ms, which accounts for 92.49% of the total time.
Digging further, I could see that Zend_Application_Bootstrap_Bootstrap->run takes the bulk of the time. Checking this again, I found that Zend_Controller_Front->dispatch might actually be the function called from Bootstrap.php that takes the time to execute.
The question is: given these figures, how best can I go about optimizing the application? If the answer is caching, how do I go about applying caching to this situation?
Thanks
From the look of the callgrinds, on the login page the app is spending most of its time in curl_exec, which is to be expected if you're doing a remote login. But it is doing 10 separate curl_execs, which seems excessive. I'm not familiar with the LinkedIn login auth, but is it possible your app is running the remote login code multiple times?
On the standard page request the app is spending most of its time connecting to MySQL, and it seems to be doing this twice. Are you using a remote DB server, and do you need two separate DB connections?
Assuming you are using a remote DB server and it is on the same network as your web server, there seems to be some networking issue there. I'd check the latency to that server if you can, and try connecting to the IP address instead of a hostname to see if that makes any difference (if doing this is much faster this would suggest an issue with the DNS setup on your web server).
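The hostname-versus-IP check can be done with a couple of timing helpers. This is a sketch in Python rather than PHP, and the helper names are invented for the illustration; it separates the time spent resolving a name from the time spent opening a TCP connection, so a slow DNS setup shows up on its own.

```python
import socket
import time
from typing import Tuple


def time_dns_lookup(hostname: str, port: int) -> Tuple[float, str]:
    """Return (seconds spent resolving, first resolved IP address)."""
    start = time.perf_counter()
    results = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    elapsed = time.perf_counter() - start
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return elapsed, results[0][4][0]


def time_tcp_connect(address: str, port: int, timeout: float = 5.0) -> float:
    """Return seconds needed to open (and immediately close) a TCP connection."""
    start = time.perf_counter()
    with socket.create_connection((address, port), timeout=timeout):
        pass
    return time.perf_counter() - start
```

Run from the web server, `time_dns_lookup` on the database hostname (with the port your MySQL server listens on, commonly 3306) followed by `time_tcp_connect` on the returned IP gives two numbers to compare: if the lookup dominates, the DNS setup is the likely culprit; if the connect itself is slow, look at the network path.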

Upgrading an app running on Lift Framework?

I've recently discovered the Lift framework and have read that it's stateful.
Therefore, if I had a high-traffic site running on Lift - say something that was running a chat application that required users to be logged in - and I wanted to upgrade my app, would doing so kick everyone out of chat and make them have to log in again?
None of the previous answers are correct. Many of the artefacts held within the LiftSession are non-serializable, so they can't be stuffed into a database. You have two options for doing rolling upgrades of stateful applications:
1) Session bleeding. Basically you wean sessions away from one of the deployments until they have ended or X duration passes, and then you remove that app from production whilst automatically rerouting traffic to another instance of Lift. Google around for rolling upgrades using HAProxy, as this should help you from the cluster perspective.
2) If your state is fairly trivial (mostly primitive-style types: ints, strings etc.) then you could think about using ContainerVar/MigratableSession and clustering the state using Terracotta or similar. This comes with a range of limits though, because it then uses the HTTPSession rather than LiftSession.
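The decision logic behind option 1 can be sketched in a few lines. This is an illustration in Python, not Lift or HAProxy code, and the `DrainingInstance` class is hypothetical: in a real setup the load balancer keeps this bookkeeping. It shows the two rules involved: a draining instance keeps serving sessions it already owns but takes no new ones, and it can be removed once its sessions are gone or the drain window has elapsed.

```python
import time
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class DrainingInstance:
    """Tracks one old deployment while its sessions bleed off."""

    max_drain_seconds: float
    active_sessions: Set[str] = field(default_factory=set)
    drain_started_at: Optional[float] = None

    def start_draining(self, now: Optional[float] = None) -> None:
        self.drain_started_at = time.monotonic() if now is None else now

    def accepts(self, session_id: str) -> bool:
        """Existing sessions stay sticky; new ones go elsewhere once draining."""
        if self.drain_started_at is None:
            return True
        return session_id in self.active_sessions

    def end_session(self, session_id: str) -> None:
        self.active_sessions.discard(session_id)

    def can_remove(self, now: Optional[float] = None) -> bool:
        """True once all sessions have ended or the drain window has passed."""
        if self.drain_started_at is None:
            return False
        now = time.monotonic() if now is None else now
        timed_out = now - self.drain_started_at >= self.max_drain_seconds
        return not self.active_sessions or timed_out
```

The `now` parameters exist only so the behaviour can be checked with fixed timestamps; in normal use they are left out and the monotonic clock is used.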
You might want to check out chapter 15 of Lift in Action, which covers the second option in a fair amount of detail.
If you keep your state in memory and redeploy the web application, that state will be lost. You could save it to a database or a file before redeploying though and read it back from there.