Compute Engine unhealthy instance down 50% of the time - kubernetes

I started to use google cloud 3 days ago or so, so I am completely new to it.
I have 4 pods deployed to Google Kubernetes Engine:
Frontend: react app,
Redis,
Backend: made up of 2 containers, a nodejs server and a cloudsql-proxy,
Nginx-ingress-controller
** And also have an sql instance running for my postgresql database, hence the cloudsql-proxy container
This setup works well 50% of the time, but every now and then all the pods crash or/and the containers are recreated.
I tried to check all the relevant logs, but I really don't know which are actually relevant. But there is one thing that I found which correlates with my issue, I have 2 VM instances running, and one of them might be the faulty one:
When I hover the loading spin, it says Instance is being verified, and it seems to be in this state 80% of the time, when it is not there is a yellow warning beside the name of the instance, saying The resource is not ready.
Here is the cpu usage of the instance (the trend is the same for all the hardware), I checked in the logs of my frontend and backend containers, here is
the last logs that correspond to a cpu drop:
2019-03-13 01:45:23.533 CET - 🚀 Server ready
2019-03-13 01:45:33.477 CET - 2019/03/13 00:45:33 Client closed local connection on 127.0.0.1:5432
2019-03-13 01:54:07.270 CET - yarn run v1.10.1
As you can see here, all the pods are being recreated...
I think that it might come from the fact that the faulty instance is unhealthy:
Instance gke-*****-production-default-pool-0de6d459-qlxk is unhealthy for ...
...the health check is proceeding and recreating/restarting the instance again and again. Tell me if I am wrong.
So, how can I discover what is making this instance unhealthy?

Related

Airflow KubernetesPodOperator Losing Connection to Worker Pod

Experiencing an odd issue with KubernetesPodOperator on Airflow 1.1.14.
Essentially for some jobs Airflow is losing contact with the pod it creates.
[2021-02-10 07:30:13,657] {taskinstance.py:1150} ERROR - ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
When I check logs in kubernetes with kubectl logs I can see that the job carried on past the connection broken error.
The connection broken error seems to happen exactly 1 hour after the last logs that Airflow pulls from the pod (we do have a 1 hour config on connections), but the pod keeps running happily in the background.
I've seen this behaviour repeatedly, and it tends to happen with longer running jobs with a gap in the log output, but I have no other leads. Happy to update the question if certain specifics are misssing.
As I have mentioned in comments section I think you can try to set operators get_logs parameter to False - default value is True .
Take a look: airflow-connection-broken, airflow-connection-issue .

ImageStoreService IdleSecondary replica

I have a Service Fabric cluster with 5 nodes deployed to Azure. A few days ago one of the nodes in the cluster is failing to start the ImageStoreService and I'm seeing the following error:
Unhealthy event: SourceId='System.RAP', Property='IReplicator.BuildReplica(131347642528671415)Duration', HealthState='Warning', ConsiderWarningAsError=false.
The api IReplicator.BuildReplica(131347642528671415) on node _backend_0 is stuck. Start Time (UTC): 2018-07-19 12:47:20.764.
I have tried rebooting the virtual machine but it still remains in this state. It looks like it's currently in the InBuild replica state but I don't know what to look for or how to resolve it.
Edit
I logged into the virtual machine that is failing and looked at the Event Log for Service Fabric. I see tons of warning messages in the logs that look like this:
CopyFile: no new token is found. current token count: 2
and a lot that have this:
ImpersonateAndCopyFile for SourcePath:\10.10.10.1\StoreShare__node_3\131347742554673412\ApplicationName\ServiceName\Code\131347741584517844_1047972020224_30.File.dll, DestinationPath:C:\ProgramData\Microsoft\SF_App__FabricSystem_App4294967295\work\Store\131347742554673413\ApplicationName\ServiceName\Code\131347741584517844_1047972020224_30.File.dll failed: E_ACCESSDENIED. Have tried all access tokens.
Seems like it is failing to either read or write the files. I'm not sure why.

How to use the Python Kubernetes client in a way resilient to GKE Kubernetes Master disruptions?

We sometimes use Python scripts to spin up and monitor Kubernetes Pods running on Google Kubernetes Engine using the Official Python client library for kubernetes. We also enable auto-scaling on several of our node pools.
According to this, "Master VM is automatically scaled, upgraded, backed up and secured". The post also seems to indicate that some automatic scaling of the control plane / Master VM occurs when the node count increases from 0-5 to 6+ and potentially at other times when more nodes are added.
It seems like the control plane can go down at times like this, when many nodes have been brought up. In and around when this happens, our Python scripts that monitor pods via the control plane often crash, seemingly unable to find the KubeApi/Control Plane endpoint triggering some of the following exceptions:
ApiException, urllib3.exceptions.NewConnectionError, urllib3.exceptions.MaxRetryError.
What's the best way to handle this situation? Are there any properties of the autoscaling events that might be helpful?
To clarify what we're doing with the Python client is that we are in a loop reading the status of the pod of interest via read_namespaced_pod every few minutes, and catching exceptions similar to the provided example (in addition we've tried also catching exceptions for the underlying urllib calls). We have also added retrying with exponential back-off, but things are unable to recover and fail after a specified max number of retries, even if that number is high (e.g. keep retrying for >5 minutes).
One thing we haven't tried is recreating the kubernetes.client.CoreV1Api object on each retry. Would that make much of a difference?
When a nodepool size changes, depending on the size, this can initiate a change in the size of the master. Here are the nodepool sizes mapped with the master sizes. In the case where the nodepool size requires a larger master, automatic scaling of the master is initiated on GCP. During this process, the master will be unavailable for approximately 1-5 minutes. Please note that these events are not available in Stackdriver Logging.
At this point all API calls to the master will fail, including the ones from the Python API client and kubectl. However after 1-5 minutes the master should be available and calls from both the client and kubectl should work. I was able to test this by scaling my cluster from 3 node to 20 nodes and for 1-5 minutes the master wasn't available .
I obtained the following errors from the Python API client:
Max retries exceeded with url: /api/v1/pods?watch=False (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at>: Failed to establish a new connection: [Errno 111] Connection refused',))
With kubectl I had :
“Unable to connect to the server: dial tcp”
After 1-5 minutes the master was available and the calls were successful. There was no need to recreate kubernetes.client.CoreV1Api object as this is just an API endpoint.
According to your description, your master wasn't accessible after 5 minutes which signals a potential issue with your master or setup of the Python script. To troubleshoot this further on side while your Python script runs, you can check for availability of master by running any kubectl command.

Is it possible to control Service Fabric hosted service restart behaviour?

I can't find much documentation on the action that Service Fabric takes when a service it is hosting fails. I have performed some experimentation (using a stateless service in a local cluster), the results of which are below. My question is: is it possible to change this behaviour?
There are two distinct scenarios that I tested.
An exception thrown from the RunAsync() method.
The hosted service is restarted immediately on another cluster node. If no other node is available then it is restarted on the same node. There does not appear to be any limit to the number of times the restart will be attempted or any kind of back-off in terms of the interval between attempts.
The hosted service fails to start (e.g. an exception is thrown before RunAsync() is called).
The hosted service is restarted on the same node. In my test environment there appears to be a fixed interval between restart attempts (15 seconds) but no limit to the number of attempts.
I can see in the cluster configuration that there are some parameters in the Hosting section that look like they might be relevant (ActivationMaxRetryInterval, ActivationRetryBackoffInterval, ActivationMaxFailureCount) and I am guessing that these cover scenario (2) above (assuming that Activation == service start). These affect the entire cluster by the looks of it.

AWS Mongo QuickStart never completes

Problem
I am trying to complete the MongoDB on AWS quickstart to create a simple MongoDB cluster. Unfortunately it never completes the rollout, cancelling after one last installation part (PrimaryReplicaNodeXYWaitForNodeInstallGP2) has not been completed within an hour.
Background
My Settings were the following:
AvailabilityZone0 eu-central-1a
AvailabilityZone1 eu-central-1b
AvailabilityZone2 eu-central-1b
BuildBucket quickstart-reference/mongodb/latest
ClusterReplicaSetCount 0
ClusterShardCount 1
ConfigServerInstanceType t2.micro
Iops 100
KeyName my_definitely_working_keypair
MongoDBVersion 3.2
NATInstanceType t2.small
NodeInstanceType m3.medium
PrimaryReplicaSubnet 10.0.2.0/24
PublicSubnet 10.0.1.0/24
RemoteAccessCIDR XXX.XXX.0.0/16
SecondaryReplicaSubnet0 10.0.3.0/24
SecondaryReplicaSubnet1 10.0.4.0/24
ShardsPerNode 0
VolumeSize 40
VolumeType gp2
VPCCIDR 10.0.0.0/16
Which caused a rollback in the same behaviour, as named in the AWS forum:
In "Ressources", all but one subtask never gets completed and stays on
forever as "PrimaryReplicaNode0WaitForNodeInstallGP2 -
PrimaryReplicaNode0WaitForNodeInstallWaitHandle - Created in Progress
- Ressource creation initiated"
So, I was further researching on the issue. The post referred to another forum thread, where users with the problem should try to delete their DynamoDB entries and set ClusterReplicaSetCount to 3.
Problem here: In DynamoDB there are no entries and changing ClusterReplicaSetCount to 3 also causes a rollback with a similar error:
ConfigServer2WaitForNodeInstall WaitCondition timed out. Received 0
conditions when expecting 1
and later
MONGODBSTACK1 The following resource(s) failed to create:
[ConfigServer1WaitForNodeInstall,
PrimaryReplicaNode00WaitForNodeInstallGP2,
ConfigServer0WaitForNodeInstall,
SecondaryReplicaNode00WaitForNodeInstallGP2,
SecondaryReplicaNode01WaitForNodeInstallGP2,
ConfigServer2WaitForNodeInstall].
Summary
In both cases there is a fail on PrimaryReplicaNodeXYWaitForNodeInstallGP2 (where XY is the number of the node) while all other parts of the installation completed successfully. I am totally in the dark.
Anyone got around this? The quick start is from 2016 and I think there must be people, who have successfully created this mongo stack!?
After days and days of hard struggle and no solution there was an update (since over a year, feels as if my prayers were heard) on the manual and the template:
https://docs.aws.amazon.com/quickstart/latest/mongodb/welcome.html
So this also comes with a completely revised infrastructure and a more sophisticated setup form, changes are described as:
Upgraded MongoDB to version 3.4; removed sharding configuration;
updated security groups and added database security; updated
parameters
Following the tutorial is quite similar to the former versions, so no struggle here.
Everything went out fine and I got my stack completed now consisting of a
mongoDB
mongoDB Replica
bastion stack
vpc stack
So this part is basically done. If something else comes up, I will open a new question for that.
I noticed this after tearing down a dev cluster and attempting to stand up a new one with the same name.
The torn down cluster orphaned a dynamodb table with the name the new stack was trying to publish the worker nodes status onto. I deleted this dynamo table manually and reattempted to spin up the stack with the same name a third time and had success.