Unhealthy event: SourceId='System.RAP' - azure-service-fabric

I have an Azure Service Fabric application that runs successfully both locally and in a 3-node environment, but in a 5-node environment two services report a warning (for one of their 5 replicas):
Unhealthy event: SourceId='System.RAP', Property='IStatelessServiceInstance.OpenDuration', HealthState='Warning', ConsiderWarningAsError=false.
As a result, we sometimes get a 503 Service Unavailable error.

The issue was fixed by redeploying the Azure Service Fabric cluster.
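Short of a full redeploy, a lighter-weight option is to locate the instance whose OpenAsync is taking too long and remove it so Service Fabric recreates it. This is a sketch, assuming PowerShell cluster access; the endpoint, service name, node name, and instance id are placeholders:

```powershell
# Sketch: recycle the stuck stateless instance instead of redeploying.
# All names and ids below are placeholders.
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.westus.cloudapp.azure.com:19000"

# Find the partition and the instance reporting the OpenDuration warning.
$service = "fabric:/MyApp/MyStatelessService"
$partition = Get-ServiceFabricPartition -ServiceName $service
Get-ServiceFabricReplica -PartitionId $partition.PartitionId |
    Format-Table NodeName, ReplicaOrInstanceId, ReplicaStatus, HealthState

# Remove the unhealthy instance; Service Fabric recreates it, which can clear
# a stuck OpenAsync without touching the rest of the cluster.
Remove-ServiceFabricReplica -NodeName "_Node_3" `
    -PartitionId $partition.PartitionId `
    -ReplicaOrInstanceId 132000000000000000
```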

Related

How to debug GKE node pool nodes being re-created

GKE Cloud Logging and Monitoring are not leading me to a root cause. Over some period, every single node was replaced (I could verify this by node age with kubectl), leading to a short complete outage (several minutes) for all services, as detected by external monitoring.
Nodes are not preemptible
gcloud container operations list does not show any node-upgrade operations
In the Cloud Logging node event logs, there were multiple entries like these:
Node <...> status is now: NodeNotReady
Deleting node <...> because it does not exist in the cloud provider
Node <...> event: Removing Node <...> from Controller
node-problem-detector logs has a bunch of these:
"2022/05/26 20:35:10 Failed to export to Stackdriver: rpc error: code = DeadlineExceeded desc = One or more TimeSeries could not be written: Deadline expired before operation could complete.: gce_instance{zone:europe-west2-a,instance_id:<...>} timeSeries[0-199]: kubernetes.io/internal/node/guest/net/rx_multicast{instance_name<...>,interface_namegkeb5dd8ca7306}"
Cluster autoscaler shows only a few nodes added and removed, but spread out over several hours.
During the period building up to this, one service in the cluster was under a DDoS attack, so network pressure was high, but there was no CPU throttling and there were no OOM kills.
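To narrow down what recreated the nodes, it can help to query Cloud Logging and the GKE operations list directly. A sketch, where the project id, zone, and filters are illustrative; node auto-repair is one hypothesis worth checking, since repair operations are listed separately from upgrades:

```shell
# Check node age to confirm which nodes were replaced and when.
kubectl get nodes -o wide --sort-by=.metadata.creationTimestamp

# Pull node lifecycle events from Cloud Logging for the suspect window.
gcloud logging read '
  resource.type="k8s_node"
  jsonPayload.message:"NodeNotReady"' \
  --project my-project --freshness 7d --limit 50

# See whether GKE auto-repair recreated the nodes; repair operations show up
# here even when "gcloud container operations list" shows no upgrades.
gcloud container operations list \
  --filter 'operationType=AUTO_REPAIR_NODES' --zone europe-west2-a
```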

Mutating webhooks in EKS aren't working when Calico is used as the CNI

I want to replace the aws-node CNI with Calico. I've removed the aws-node DaemonSet and installed Calico. Networking between pods works great, but when I use mutating webhooks, kube-apiserver can't connect to the target service because there are no routes from it to the pods:
E0304 15:41:02.131212 1 dispatcher.go:71] failed calling webhook "secrets.vault.admission.banzaicloud.com": Post https://vault-secrets-webhook.vault.svc:443/secrets?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
The service has endpoints and is reachable from pods. If I use the default CNI, the connection from kube-apiserver to the webhook's service works, because the main VPC route table has the necessary routes.
Is it possible to solve this problem?
I hope you are following the docs mentioned here: https://docs.aws.amazon.com/eks/latest/userguide/calico.html
That Calico deployment runs alongside the aws-cni, i.e. you still need aws-node.
If you want to replace aws-cni with stock Calico, it is still possible, but that setup isn't tested and you will lose the EKS features that depend on aws-node.
So if you are just looking for better security on EKS, install Calico for network policy on the existing EKS cluster; that is officially supported.
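A quick way to verify the supported setup (Calico for policy on top of the AWS VPC CNI) is to check that both DaemonSets are running and that the webhook service has endpoints. A sketch; the namespaces and service names are taken from the question and may differ in your cluster:

```shell
# Supported setup: Calico for policy, aws-node still doing pod networking.
kubectl get daemonset aws-node -n kube-system      # must still be present
kubectl get daemonset calico-node -n kube-system   # Calico policy engine

# Confirm the webhook service actually has endpoints behind it.
kubectl get endpoints vault-secrets-webhook -n vault

# If you do run a non-VPC pod network, a common workaround is to run the
# webhook pod with hostNetwork: true so the API server can reach it via the
# node IP, which is always routable from the control plane.
```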

fabric:/System/InfrastructureService is not healthy on Service Fabric Cluster

I deployed a fresh Service Fabric Cluster with a durability level of Silver and the fabric:/System/InfrastructureService/FE service is unhealthy with the following error:
Unhealthy event: SourceId='System.InfrastructureService',
Property='CoordinatorStatus', HealthState='Warning',
ConsiderWarningAsError=false. Failed to create infrastructure
coordinator: System.Reflection.TargetInvocationException: Exception
has been thrown by the target of an invocation. --->
System.Fabric.InfrastructureService.ManagementException: Unable to get
tenant policy agent endpoint from registry; verify that tenant
settings match InfrastructureService configuration
The durability level needs to be specified in two places: the VMSS resource and the Service Fabric resource in the ARM template.
My template had it set to Bronze in the VMSS resource and Silver in the Service Fabric resource; once I made them match, it worked.
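For reference, a trimmed ARM template sketch showing the two places that must agree (resource names, API versions, and most properties omitted; NodeType0 and Silver are placeholders):

```json
{
  "resources": [
    {
      "type": "Microsoft.Compute/virtualMachineScaleSets",
      "properties": {
        "virtualMachineProfile": {
          "extensionProfile": {
            "extensions": [
              {
                "name": "ServiceFabricNode",
                "properties": {
                  "settings": {
                    "durabilityLevel": "Silver"
                  }
                }
              }
            ]
          }
        }
      }
    },
    {
      "type": "Microsoft.ServiceFabric/clusters",
      "properties": {
        "nodeTypes": [
          {
            "name": "NodeType0",
            "durabilityLevel": "Silver"
          }
        ]
      }
    }
  ]
}
```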

Service Fabric Cluster Node gone down suddenly

I have created a 5-node Azure Service Fabric cluster and deployed applications to it many times; however, one of the nodes on my cluster has suddenly gone down and is in an error state, while the remaining nodes are active. I've tried manually activating the node through Service Fabric Explorer, but nothing happens. I've also tried restarting the node; neither has helped.
Is there any way to force this node back online, or am I going to have to delete the cluster and start again?
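A commonly used recovery path for a single Down node is to inspect it and then either re-enable it or, if the underlying VM was rebuilt, remove its stale state so a replacement can rejoin. A sketch, assuming PowerShell cluster access; the endpoint and node name are placeholders, and which step applies depends on why the node failed:

```powershell
# Placeholders: cluster endpoint and node name.
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.westus.cloudapp.azure.com:19000"

# Inspect why the node is in Error.
Get-ServiceFabricNode -NodeName "_Node_2"
Get-ServiceFabricNodeHealth -NodeName "_Node_2"

# If the VM is healthy but the node was deactivated, re-enable it.
Enable-ServiceFabricNode -NodeName "_Node_2"

# If the underlying VM is gone or was rebuilt, tell Service Fabric to forget
# the old node state so the replacement can rejoin (only when the node is Down).
Remove-ServiceFabricNodeState -NodeName "_Node_2"
```

If neither works, restarting the underlying VM scale set instance from the Azure side is worth trying before deleting the cluster.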

Service Fabric Cluster Upgrade While it is Running

I set up a cluster in my production environment with 3 VMs and deployed some applications there. Now when I open the explorer, I get a warning in the cluster:
Unhealthy event: SourceId='System.UpgradeOrchestrationService',
Property='ClusterVersionSupport', HealthState='Warning',
ConsiderWarningAsError=false. The current cluster version 5.4.164.9494
support ends 6/10/2017 12:00:00 AM. Please view available upgrades
using Get-ServiceFabricRegisteredClusterCodeVersion and upgrade using
Start-ServiceFabricClusterUpgrade.
My application is still running there.
I haven't updated the cluster so far. My question is: can we update the cluster directly without unregistering the applications? Will it have any impact on the already-running applications while we update the cluster configuration?
Thanks,
Divya
As #LodeRunner28 stated, you can upgrade the cluster version while it is running, and in Azure you can even configure your cluster to be upgraded automatically as new versions of Service Fabric become available.
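The steps the warning itself refers to can be sketched as follows; a monitored rolling upgrade proceeds one upgrade domain at a time, so applications keep running. The endpoint and target version below are placeholders:

```powershell
# Placeholders: cluster endpoint and target code version.
Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.westus.cloudapp.azure.com:19000"

# List the fabric code versions available to this cluster.
Get-ServiceFabricRegisteredClusterCodeVersion

# Start a monitored rolling code upgrade; on health-check failure it rolls back.
Start-ServiceFabricClusterUpgrade -Code -CodeVersion "<target-version>" `
    -Monitored -FailureAction Rollback
```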