How to Recover from Fail State in MongoDB Ops Manager

I upgraded MongoDB Ops Manager from 4.2.4 to 4.4.6. It then told me that the MongoDB Agent and MongoDB Tools were out of date. After I pressed the upgrade button, the upgrade process got stuck.
The Agent log has the following error:
Interface conversion: interface {} is nil, not float64
Now Ops Manager is stuck at the "deploying" stage. How can I revert to the previous state or fix the problem?
$ kubectl exec -it <pod> is useless, as the pods are running under a security context that uses user "2000".
https://docs.mongodb.com/kubernetes-operator/master/reference/troubleshooting/
is almost useless, as we don't have permission to change the pod state. So what is the right way to fix deployment issues when using Ops Manager?

Related

Retrieve timestamp of k8s upgrades

kubectl version does not show when the cluster was upgraded to the current version. Is there a way to get the timestamp of the cluster upgrade, or better, the timestamps of all previous upgrades?
We can see a list of all running and completed operations in the cluster using the following command:
gcloud beta container operations list
Each operation is assigned an operation ID, operation type, start and end times, target cluster, and status.
To get more information about a specific operation, specify the operation ID in the following command:
gcloud beta container operations describe OPERATION_ID --zone ZONE
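For example, a hedged invocation might look like this (the zone, filter expression, and operation ID below are illustrative placeholders, not values from a real cluster):

# List only master-upgrade operations for a zonal cluster
gcloud beta container operations list --zone us-central1-a --filter="operationType=UPGRADE_MASTER"

# Inspect a single operation to read its start and end timestamps
gcloud beta container operations describe operation-1612345678901-abcdef12 --zone us-central1-a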

Liquibase causing PostgreSQL DB lock - Microservices running as pods in AKS

We have multiple microservices (with Liquibase) running as pods on an Azure AKS cluster.
We frequently notice DB locks, and the pods crash because they fail their health checks.
Is there a way to overcome this, as it is having a big impact? We have to manually unlock the DB table so that the pod can start.
In one of the logs, I noticed the error below.
I believe it needs to be handled from the application (Spring Boot).
You can write a piece of code that executes at application startup and releases the lock if one is found. Then the database connection won't fail.
We are currently using the same approach in our environment.
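As a rough illustration (not the asker's actual code), a Spring configuration along these lines clears a stale lock in Liquibase's default DATABASECHANGELOGLOCK table before handing control to Liquibase; the changelog path is a placeholder:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;
import liquibase.integration.spring.SpringLiquibase;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class LiquibaseUnlockConfig {

    // Defining SpringLiquibase ourselves lets us clear a stale lock before migrations run.
    @Bean
    public SpringLiquibase liquibase(DataSource dataSource) {
        releaseStaleLock(dataSource);
        SpringLiquibase liquibase = new SpringLiquibase();
        liquibase.setDataSource(dataSource);
        liquibase.setChangeLog("classpath:db/changelog/db.changelog-master.xml"); // placeholder path
        return liquibase;
    }

    // Clears a lock left behind by a pod that crashed mid-migration.
    private void releaseStaleLock(DataSource dataSource) {
        try (Connection conn = dataSource.getConnection();
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(
                "UPDATE DATABASECHANGELOGLOCK SET LOCKED = FALSE, LOCKGRANTED = NULL, LOCKEDBY = NULL WHERE ID = 1");
        } catch (SQLException e) {
            // On the very first start the lock table may not exist yet, so there is nothing to release.
        }
    }
}

A safer variant checks the LOCKGRANTED timestamp and only releases locks older than a threshold, so a pod that is legitimately mid-migration is not interrupted.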

Kubernetes - Upgrading Kubernetes cluster version through Terraform

I assume there are no stupid questions, so here is one that I could not find a direct answer to.
The situation
I currently have a Kubernetes cluster running 1.15.x on AKS, deployed and managed through Terraform. Azure recently announced that they would retire Kubernetes 1.15 on AKS, and I need to upgrade the cluster to 1.16 or later. Now, as I understand the situation, upgrading the cluster directly in Azure would have no consequences for the content of the cluster, i.e. nodes, pods, secrets, and everything else currently on there, but I cannot find any proper answer to what would happen if I upgraded the cluster through Terraform.
Potential problems
So what could go wrong? In my mind, the worst outcome would be that the entire cluster would be destroyed, and a new one would be created. No pods, no secrets, nothing. Since there is so little information out there, I am asking here to see if there is anyone with more experience with Terraform and Kubernetes who could potentially help me out.
To summarize:
Terraform versions
Terraform v0.12.17
+ provider.azuread v0.7.0
+ provider.azurerm v1.37.0
+ provider.random v2.2.1
What I'm doing
$ terraform init
// running terraform plan with the new Kubernetes version declared for AKS
$ terraform plan
// Terraform announces the following changes:
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
~ update in-place
Terraform will perform the following actions:
#module.mycluster.azurerm_kubernetes_cluster.default will be updated in-place...
...
~ kubernetes_version = "1.15.5" -> "1.16.13"
...
Plan: 0 to add, 1 to change, 0 to destroy.
What I want to happen
I want Terraform to tell Azure to upgrade the existing AKS service, not to destroy it before creating a new one. I assume this will happen, as Terraform announces that it will "update in-place" instead of adding new and/or destroying existing clusters.
I found this question today and thought I'd add my experience as well. I made the following changes:
Changed the kubernetes_version under azurerm_kubernetes_cluster from "1.16.15" -> "1.17.16"
Changed the orchestrator_version under default_node_pool from "1.16.15" -> "1.17.16"
Increased the node_count under default_node_pool from 1 -> 2
A terraform plan showed that it was going to update in-place. I then performed a terraform apply which completed successfully. kubectl get nodes showed that an additional node was created, but both nodes in the pool were still on the old version. After further inspection in Azure Portal it was found that only the k8s cluster version was upgraded and not the version of the node pool. I then executed terraform plan again and again it showed that the orchestrator_version under default_node_pool was going to be updated in-place. I then executed terraform apply which then proceeded to upgrade the version of the node pool. It did that whole thing where it creates an additional node in the pool (with the new version) and sets the status to NodeSchedulable while setting the existing node in the pool to NodeNotSchedulable. The NodeNotSchedulable node is then replaced by a new node with the new k8s version and eventually set to NodeSchedulable. It did this for both nodes. Afterwards all nodes were upgraded without any noticeable downtime.
I'd say this shows that the Terraform method is non-destructive, even though there have at times been oversights in the upgrade process (still non-destructive in this example): https://github.com/terraform-providers/terraform-provider-azurerm/issues/5541
If you need higher confidence for this change, you could alternatively consider using the Azure-based upgrade method, refreshing the changes back into your state, and tweaking the code until a plan generation doesn't show anything intolerable. The two azurerm_kubernetes_cluster arguments dealing with version might be all you need to tweak.
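As a minimal sketch of where those two arguments live (resource names, region, VM size, and the identity block are placeholders, and the exact block layout depends on your azurerm provider version):

resource "azurerm_kubernetes_cluster" "default" {
  name                = "example-aks"
  location            = "westeurope"
  resource_group_name = "example-rg"
  dns_prefix          = "example"

  kubernetes_version = "1.17.16" # control-plane version

  default_node_pool {
    name                 = "default"
    node_count           = 2
    vm_size              = "Standard_DS2_v2"
    orchestrator_version = "1.17.16" # node-pool version; may need the second apply described above
  }

  identity {
    type = "SystemAssigned"
  }
}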

Service Fabric fails to roll back application when deployment fails

I have a 3-node Service Fabric cluster where the deployment has been stuck for 10 hours on the third node. Looking at Service Fabric Explorer, we saw that wrong SQL credentials are being passed, hence the deployment is stuck.
1) Why is SF reporting this as a "warning" rather than an "error"?
2) Why is it stuck and not rolling back?
3) Is there an extra setting I need so that it rolls back automatically sooner?
Generally, it rolls back when the deployment fails, but this depends on the parameters you pass for the upgrade, like FailureAction, UpgradeMode, and the timeouts.
UpgradeMode values can be:
Monitored: Indicates that the upgrade mode is monitored. After the cmdlet finishes an upgrade for an upgrade domain, if the health of the upgrade domain and the cluster meet the health policies that you define, Service Fabric upgrades the next upgrade domain. If the upgrade domain or cluster fails to meet health policies, the upgrade fails and Service Fabric rolls back the upgrade for the upgrade domain or reverts to manual mode per the policy specified on FailureAction. This is the recommended mode for application upgrades in a production environment.
Unmonitored Auto: Indicates that the upgrade mode is unmonitored automatic. After Service Fabric upgrades an upgrade domain, Service Fabric upgrades the next upgrade domain irrespective of the application health state. This mode is not recommended for production, and is only useful during development of an application.
Unmonitored Manual: Indicates that the upgrade mode is unmonitored manual. After Service Fabric upgrades an upgrade domain, it waits for you to upgrade the next upgrade domain by using the Resume-ServiceFabricApplicationUpgrade cmdlet.
FailureAction is the compensating action to perform when a Monitored upgrade encounters monitoring policy or health policy violations. The values can be:
Rollback specifies that the upgrade will automatically roll back to the pre-upgrade version.
Manual indicates that the upgrade will switch to the UnmonitoredManual upgrade mode.
Invalid indicates that the failure action is invalid and does nothing.
Given that, if the upgrade is not set to Monitored for UpgradeMode and Rollback for FailureAction, the upgrade will wait for a manual action from the operator (user).
If the upgrade is already set to these values, the problem can be either:
The health checks and retries take too long, preventing the upgrade from failing quickly; an example is when your HealthCheckDuration is too long or there is too much delay between checks.
The old version is also failing.
The following docs give all details: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-application-upgrade-parameters
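For reference, a monitored upgrade with automatic rollback can be started roughly like this (the connection endpoint, application name, version, and timeout values are placeholders):

# Connect to the cluster first
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster:19000"

# Monitored upgrade that rolls back automatically on health-policy violations
Start-ServiceFabricApplicationUpgrade `
    -ApplicationName "fabric:/MyApp" `
    -ApplicationTypeVersion "2.0.0" `
    -Monitored `
    -FailureAction Rollback `
    -HealthCheckStableDurationSec 60 `
    -HealthCheckRetryTimeoutSec 300 `
    -UpgradeDomainTimeoutSec 1200 `
    -UpgradeTimeoutSec 3600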

Kubernetes: active job is erroneously marked as completed after cluster upgrade

I have a working Kubernetes cluster (v1.4.6) with an active job that has a single failing pod (i.e., it is constantly restarted); this is a test, and the job should never reach completion.
If I restart the same cluster (e.g. reboot the node), the job is properly re-scheduled and continues to be restarted.
If I upgrade the cluster to v1.5.3, then the job is marked as completed once the cluster is up. The upgrade is basically the same as a restart: both use the same etcd cluster.
Is this the expected behavior when going to v1.5.x? If not, what can be done to have the job continue running?
I should provide a little background on my problem: the job is ultimately meant to become a driver in the update process, and it is important to have it running (even in the face of cluster restarts) until it achieves a certain goal. Is this possible using a job?
In v1.5.0 extensions/v1beta1.Jobs was deprecated in favor of batch/v1.Job, so simply upgrading the cluster without updating the job definition is expected to cause side effects.
See the Kubernetes CHANGELOG for a complete list of changes and deprecations in v1.5.0.
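For illustration, a Job re-declared against the batch/v1 API looks roughly like this (the name and image are placeholders, not taken from the question):

apiVersion: batch/v1
kind: Job
metadata:
  name: upgrade-driver
spec:
  template:
    spec:
      # Keep restarting the failing pod in place rather than creating new pods for each failure
      restartPolicy: OnFailure
      containers:
      - name: driver
        image: example/upgrade-driver:latest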