For production mode, is it always recommended to use a minimum of 2 servers?
Because 1 server could crash. The load is always within the green range.
Thanks
This depends on the requirements. How much downtime is allowed? In many cases it is OK to wait a few seconds until a new container is started. In other cases you can't even allow a few seconds and have to run a hot standby in parallel.
Put 2 or more instances and a load balancer in front of them, or 3 or more containers. That will prevent downtime. As #lexicore pointed out, it depends on your requirements. Maybe a few minutes out of service is not a big deal, maybe it is.
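Not from the original answers, but to make the idea concrete: with two or more instances, whatever sits in front (a load balancer, or even the client) can skip an instance that is down. A minimal Python sketch, assuming two hypothetical backend hosts (app-1/app-2 are made-up names; in practice nginx, HAProxy or a cloud load balancer does this health checking for you):

```python
import urllib.request
import urllib.error

# Hypothetical backend instances behind one logical service.
BACKENDS = ["http://app-1.internal:8080", "http://app-2.internal:8080"]

def fetch_with_failover(path: str, timeout: float = 2.0) -> bytes:
    """Try each backend in turn; one crashed instance does not become an outage."""
    last_error = None
    for base in BACKENDS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc  # instance down or unreachable, try the next one
    raise RuntimeError(f"all backends failed: {last_error}")

if __name__ == "__main__":
    print(fetch_with_failover("/health"))
```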
Related
We’re using a standard 3-node Atlas replica set in a dedicated cluster (M10, Mongo 6.0.3, AWS) and have configured an alert that fires if the ‘Restarts in last hour is’ rule exceeds 0 for any node.
https://www.mongodb.com/docs/atlas/reference/alert-conditions/#mongodb-alert-Restarts-in-Last-Hour-is
We’re seeing this alert fire every now and then and we’re wondering what this means for a node in a dedicated cluster and whether it is something to be concerned about, since I don’t think we have any control over it. Should we disable this rule or increase the restart threshold?
Thanks in advance for any advice.
(Note I've asked this over at the Mongo community support site also, but haven't received any traction yet so asking here too)
I got an excellent response to my question at the Mongo community support site:
A node restarting is not necessarily a cause for concern. However, you should investigate the cause of the restart itself to better determine if this is an issue or not. You should take a look at your Project Activity Feed to see if you can determine why the nodes are restarting. I understand you have noted this is an M10 cluster, so you should have access to the MongoDB logs; you can also check those to try to determine the cause of the node restart. If you do not have access to the logs, you can consider working with Atlas in-app chat support to diagnose the issue.
It’s always good to keep the alerts active, as they can indicate a potential problem as soon as they occur. You can consider increasing the restart threshold to reduce alert noise after concluding whether the restarts are expected or not.
In my case, having checked the activity feed, I was able to match up all the alerts we were seeing to Mongo version auto-updates on the nodes. We still wanted to keep the alert, so we've increased the threshold to fire on >1 restart per hour rather than >0, on the assumption that auto-updates won't be applied multiple times in the same hour.
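For reference (not part of the original answer): instead of clicking through the Activity Feed, the same events can be pulled programmatically. A rough Python sketch, assuming the Atlas Administration API v1.0 project events endpoint and a programmatic API key with digest auth; the endpoint path, field names and event-type filter below are assumptions to check against the current Atlas API docs:

```python
import requests
from requests.auth import HTTPDigestAuth

# Placeholders: your Atlas project (group) ID and programmatic API keys.
GROUP_ID = "<atlas-project-id>"
PUBLIC_KEY = "<public-api-key>"
PRIVATE_KEY = "<private-api-key>"

# Assumed v1.0 events endpoint backing the project activity feed.
URL = f"https://cloud.mongodb.com/api/atlas/v1.0/groups/{GROUP_ID}/events"

resp = requests.get(
    URL,
    auth=HTTPDigestAuth(PUBLIC_KEY, PRIVATE_KEY),
    params={"itemsPerPage": 100},
    timeout=30,
)
resp.raise_for_status()

# Print anything that looks restart- or version-related so it can be
# matched against the alert timestamps.
for event in resp.json().get("results", []):
    name = event.get("eventTypeName", "")
    if "RESTART" in name or "VERSION" in name:
        print(event.get("created"), name)
```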
I'm doing some load testing on a microservice application. I collected the percentile statistics and plotted them. The application is running in a shared K8s cluster. The thing I am not quite understanding is why there is a latency spike at the start. Is this an issue with a cold boot?
[Figure: Locust plot showing RT over time]
Is this an issue with a cold boot?
Yes, this is the most likely explanation. There's no way of knowing without digging into your application and its logs though.
Most applications, especially ones that do automatic scaling, perform very poorly when suddenly hit with a large amount of load. If your actual expected user load does not have this behaviour, then maybe a slower ramp-up is more appropriate.
If you haven't already read this, then maybe have a look at https://github.com/locustio/locust/wiki/FAQ#increase-my-request-raterps
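To make the ramp-up idea concrete, Locust lets you define a custom load shape instead of spawning all users at once. A minimal sketch (the class, stages and endpoint are illustrative, not from the original post):

```python
from locust import HttpUser, LoadTestShape, between, task


class ApiUser(HttpUser):
    wait_time = between(1, 3)

    @task
    def get_home(self):
        # Hypothetical endpoint; replace with the paths you are testing.
        self.client.get("/")


class GradualRampUp(LoadTestShape):
    """Ramp from 10 to 100 users in steps so cold caches and
    autoscaling have time to warm up before peak load."""

    # (end_time_seconds, user_count, spawn_rate)
    stages = [
        (60, 10, 2),
        (180, 50, 5),
        (600, 100, 10),
    ]

    def tick(self):
        run_time = self.get_run_time()
        for end_time, users, spawn_rate in self.stages:
            if run_time < end_time:
                return users, spawn_rate
        return None  # stop the test after the last stage
```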
It's been almost 3 months since I switched my platform to Google Cloud (Compute Engine + Cloud SQL + Cloud Storage).
I am very happy with it, but from time to time I notice high latency on the Cloud SQL server. My VMs in Compute Engine and my Cloud SQL instance are all in the same location (us-1) datacenter.
Since my Java backend makes a lot of SQL queries to generate a server response, the response times may vary from 250-300ms (normal) up to 2s!
In the console, I notice absolutely nothing: no CPU peaks, no read/write peaks, no backup running, nothing. No alerts. Last time it happened, it lasted for a few days and then the response times suddenly became better than ever.
I am pretty sure Google works on the infrastructure behind the scenes... but I have no way to confirm that.
So here are my questions:
Has anybody else ever noticed the same kind of problem?
It is really annoying for me because my web pages get very slow and I have absolutely no control over it. Plus, I lose a lot of time because I generally never suspect a hardware problem / maintenance first, but instead something that we introduced in our app. Is this normal, or do I have a problem on my SQL instance?
Is there anywhere I can get visibility into what Google is doing on the hardware? I know there are maintenance alerts, but for my zone the list always seems empty when this happens.
The only option I have for now is to wait and that is really not acceptable.
I suspect that Google does some sort of IO throttling and their algorithm is not very sophisticated. We have a build server which slows down to a crawl if we do more than two builds within an hour. A build that normally takes 15 minutes will run for more than an hour, and we usually terminate it and re-run it manually later. This question describes a similar problem, and the recommended solution is to use larger volumes as they come with more IO allowance.
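One way to tell an infrastructure-level slowdown apart from a regression in your own code (not from the original answers) is a tiny probe that runs a trivial query in a loop and reports percentiles: if even SELECT 1 is slow, the problem is not your application. A rough Python sketch, assuming a MySQL-flavoured Cloud SQL instance reachable over its private IP and the PyMySQL driver (connection details are placeholders; swap in a Postgres driver if that is what you run):

```python
import statistics
import time

import pymysql  # assumption: MySQL-flavoured Cloud SQL; use a Postgres driver otherwise

# Placeholder connection details; point this at your Cloud SQL instance.
conn = pymysql.connect(host="10.0.0.5", user="probe", password="...", database="probe_db")

samples = []
for _ in range(200):
    start = time.perf_counter()
    with conn.cursor() as cur:
        cur.execute("SELECT 1")  # trivial query: measures round trip + server overhead only
        cur.fetchall()
    samples.append((time.perf_counter() - start) * 1000)  # milliseconds
    time.sleep(0.5)

samples.sort()
print(f"p50={statistics.median(samples):.1f} ms  "
      f"p95={samples[int(len(samples) * 0.95)]:.1f} ms  "
      f"max={samples[-1]:.1f} ms")
```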
I have an application that I am 'clusterising' to run across multiple JBoss AS 7.2.0 nodes in standalone-full-ha mode with mod_cluster in front (using sticky sessions).
The cluster works... kind of. I am constantly running into ConcurrentModificationExceptions and serialization errors from Infinispan as it constantly replicates state between nodes. The application does a lot of long-running processing work, so there are a lot of arrays stored in memory at any given time, which I assume is one of the causes.
I have spent a fair amount of time trying to work around these issues, but I think I am fighting a losing battle.
Since I do not really need 'High Availability', is it possible to configure JBoss (or the Infinispan subsystem) to only replicate my Session/EJB state on demand?
E.g. the only time I really need it is when I am taking a node down and want to move its state to another node, so I would like to be able to trigger it from within my application.
...and if this is not possible, how can I disable Session/EJB replication entirely?
I ask because I had an app working perfectly in staging, but now that it is in production, Fiddler tells me the response is a 502 error when I request the page in a browser. Does anyone have any idea what might cause this? I could simply leave it in staging, it's academic work so it's not a big deal, but it is annoying. I've waited at least 30 minutes and still get the same result, so I doubt it is going to change.
I believe there's no difference except DNS addressing - and you should be able to swap Staging and Production rather than uploading straight to Production.
One big difference for me is that staging has a random GUID as part of the domain, so I can't easily reference it in my testing. I also can't easily set it as a target for my web deployments.
So, when I'm in dev mode, I tend to just have two separate web role projects, both with non-GUID names, and use one as staging and one as relatively stable production for other team members.
That said, for real production, I do use staging because I can test the VIP swap, which is a much better way to go live than waiting 20 minutes to see if it worked (and what if it didn't?).
My 2 cents.
Resolved. I guess I had to wait 32 minutes. GAE FTW