fabric:/System/InfrastructureService is not healthy on Service Fabric Cluster - azure-service-fabric

I deployed a fresh Service Fabric Cluster with a durability level of Silver and the fabric:/System/InfrastructureService/FE service is unhealthy with the following error:
Unhealthy event: SourceId='System.InfrastructureService',
Property='CoordinatorStatus', HealthState='Warning',
ConsiderWarningAsError=false. Failed to create infrastructure
coordinator: System.Reflection.TargetInvocationException: Exception
has been thrown by the target of an invocation. --->
System.Fabric.InfrastructureService.ManagementException: Unable to get
tenant policy agent endpoint from registry; verify that tenant
settings match InfrastructureService configuration

The durability level needs to be specified in two places: the VMSS resource and the Service Fabric resource in the ARM template.
My template had it set to Bronze in the VMSS resource and Silver in the Service Fabric resource; once I made them match, it worked.
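For reference, the two settings look roughly like this in a trimmed ARM template sketch. Only the durability-related properties are shown; the resource and extension names are placeholders, not the exact values from the original template:

{
  "type": "Microsoft.Compute/virtualMachineScaleSets",
  "name": "nt1vm",
  "properties": {
    "virtualMachineProfile": {
      "extensionProfile": {
        "extensions": [
          {
            "name": "ServiceFabricNodeVmExt",
            "properties": {
              "publisher": "Microsoft.Azure.ServiceFabric",
              "type": "ServiceFabricNode",
              "settings": {
                "durabilityLevel": "Silver"
              }
            }
          }
        ]
      }
    }
  }
}

{
  "type": "Microsoft.ServiceFabric/clusters",
  "name": "myCluster",
  "properties": {
    "nodeTypes": [
      {
        "name": "nt1vm",
        "durabilityLevel": "Silver"
      }
    ]
  }
}

The durabilityLevel under the ServiceFabricNode extension settings and the one on the matching nodeType must be identical.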

Related

GKE metadata server pod issues with new nodes during autoscaling

We have an issue with GKE Workload Identity. We enabled Workload Identity for our prod clusters. When Workload Identity is enabled it creates a DaemonSet (the GKE metadata server). Whenever an application needs service account credentials to access a GCP resource, it makes an API call to this metadata server to obtain them.
Issue:
In the morning some nodes autoscale, and when a new node comes up our application pods get scheduled first and only then the DaemonSet pods. At that point the application pods get restarted, because on startup our application calls the metadata server to initialize some beans for the BigQuery client.
I read some docs that suggest using a priority class, but these GKE DaemonSet pods already have a priority class and it doesn't help in this use case.
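One mitigation that is often suggested for this kind of race (rather than relying on priority classes) is to make the application pod itself wait for the node's metadata server before the main container starts, for example with an initContainer along these lines. The image, container names, and timings are placeholders:

spec:
  initContainers:
  - name: wait-for-metadata-server
    image: curlimages/curl:8.5.0   # placeholder; any image with sh and curl works
    command:
    - sh
    - -c
    - |
      # Block until the GKE metadata server on this node answers, so the
      # application container never starts before Workload Identity is usable.
      until curl -sf -H "Metadata-Flavor: Google" \
        http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token \
        -o /dev/null; do
        echo "waiting for gke-metadata-server..."
        sleep 2
      done
  containers:
  - name: app
    image: my-app:latest           # placeholder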

Architecture Question - User Driven Resource Allocation

I am working on a SaaS application built on Azure AKS. Users will connect to a web frontend, and depending on their selection, will deploy containers on demand for their respective Organization. There will be a backend API layer that will connect to the Kubernetes API for the deployment of different YAML configurations.
Users will select a predefined container (NodeJs container app), and behind the scenes that container will be created from a template and a URL provided to the user to consume that REST API resource via common HTTP verbs.
I read the following blurb on the Kubernetes docs:
You'll rarely create individual Pods directly in Kubernetes—even singleton Pods. This is because Pods are designed as relatively ephemeral, disposable entities. When a Pod gets created (directly by you, or indirectly by a controller), the new Pod is scheduled to run on a Node in your cluster. The Pod remains on that node until the Pod finishes execution, the Pod object is deleted, the Pod is evicted for lack of resources, or the node fails.
I am thinking that each "organization account" in my application should deploy containers that are allocated a shared context constrained to a Pod, with multiple containers spun up for each "resource" request. This is because, arguably, an Organization would prefer that their "services" were unique to their Organization and not shared within the scope of others. Assume that namespace, service, or pod name is not a concern, as each will be named on the backend with a GUID or similar unique identifier.
Questions:
Is this an appropriate use of Pods and Services in Kubernetes?
Will scaling out mean that I add nodes to the cluster to support the maximum constraint of 110 Pods per node?
Should I isolate these data services / pods from the front-end to its own dedicated cluster, then add a cluster when (if) maximum Node count of 15,000 is reached?
I guess you should have a look at Deployments:
A container runs in a Pod.
A Pod is managed by a Deployment.
A Service exposes the Deployment.
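To make that concrete, a minimal sketch of that layering for one per-organization NodeJS instance might look like the following. The names, image, and ports are placeholders; you would generate the GUID-style names from your backend API:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: org-3f2a9c1d-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: org-3f2a9c1d-api
  template:
    metadata:
      labels:
        app: org-3f2a9c1d-api
    spec:
      containers:
      - name: nodejs-app
        image: myregistry.azurecr.io/nodejs-template:1.0   # placeholder template image
        ports:
        - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: org-3f2a9c1d-api
spec:
  selector:
    app: org-3f2a9c1d-api
  ports:
  - port: 80
    targetPort: 3000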

What is the concept of Service Affinity in OpenShift?

Situation
When a deployment fails on our OpenShift 3.11 instance because of a Failed Scheduling error event, a message comparable to the following is shown:
Failed Scheduling 0/11 nodes are available: 10 CheckServiceAffinity, 2 ExistingPodsAntiAffinityRulesNotMatch, 2 MatchInterPodAffinity, 5 MatchNodeSelector.
In the above error message, the term CheckServiceAffinity is used. While it's easy to find articles on Pod Affinity or Anti-Affinity, I couldn't find a detailed description of Service Affinity.
Question
What is Service Affinity?
Is it a concept of Kubernetes or is it exclusive to OpenShift?
ServiceAffinity places pods on nodes based on the services those pods belong to. Placing pods of the same service on the same or co-located nodes can lead to higher efficiency.
It is documented as one of OpenShift's configurable scheduler predicates rather than in the upstream open source Kubernetes documentation.
https://docs.openshift.com/container-platform/3.9/admin_guide/scheduling/scheduler.html#configurable-predicates
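For reference, in an OpenShift 3.x scheduler policy file (typically /etc/origin/master/scheduler.json) the predicate is configured roughly like this. The predicate name and label are examples only; check the linked documentation for the exact schema:

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {
      "name": "ZoneAffinity",
      "argument": {
        "serviceAffinity": {
          "labels": ["zone"]
        }
      }
    }
  ]
}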

Connect to On Premises Service Fabric Cluster

I've followed the steps from Microsoft to create a Multi-Node On-Premises Service Fabric cluster. I've deployed a stateless app to the cluster and it seems to be working fine. When I have been connecting to the cluster I have used the IP Address of one of the nodes. Doing that, I can connect via Powershell using Connect-ServiceFabricCluster nodename:19000 and I can connect to the Service Fabric Explorer website (http://nodename:19080/explorer/index.html).
The examples online suggest that if I hosted it in Azure I could connect to http://mycluster.eastus.cloudapp.azure.com:19000 and it would resolve, but I can't work out what the equivalent is for my local setup. I tried connecting to my sample cluster: Connect-ServiceFabricCluster sampleCluster.domain.local:19000 but that returns:
WARNING: Failed to contact Naming Service. Attempting to contact Failover Manager Service...
WARNING: Failed to contact Failover Manager Service, Attempting to contact FMM...
False
WARNING: No such host is known
Connect-ServiceFabricCluster : No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue.
Am I missing something in my setup? Should there be a central DNS entry somewhere that allows me to connect to the cluster? Or am I trying to do something that isn't supported On-Premises?
Yup, you're missing a load balancer.
This is the best resource I could find to help; I'll paste the relevant contents in case it becomes unavailable.
Reverse Proxy: When you provision a Service Fabric cluster, you have the option of installing the Reverse Proxy on each of the nodes in the cluster. It performs service resolution on the client's behalf and forwards the request to the correct node that contains the application. In the majority of cases, services running on Service Fabric run only on a subset of the nodes. Since the load balancer does not know which nodes contain the requested service, client libraries would have to wrap requests in a retry loop to resolve service endpoints. Using the Reverse Proxy addresses this issue, since it runs on each node and knows exactly which nodes the service is running on. Clients outside the cluster can reach services running inside the cluster via the Reverse Proxy without any additional configuration.
Source: Azure Service Fabric is amazing
I have an Azure Service Fabric resource running, but the same rules apply. As the article states, you'll need a reverse proxy/load balancer to resolve not only which nodes are running the API, but also to balance the load between the nodes running that API. Health probes are also necessary so that the load balancer knows which nodes are viable targets for traffic.
As an example, Azure creates 2 rules off the bat:
1. LBHttpRule on TCP/19080 with a TCP probe on port 19080 every 5 seconds with a 2 count error threshold.
2. LBRule on TCP/19000 with a TCP probe on port 19000 every 5 seconds with a 2 count error threshold.
What you need to add to make this forward-facing is a rule that forwards port 80 to your service's HTTP port. The health probe can then be an HTTP probe that hits a path and checks for a 200 response.
Once you get into the cluster, you can resolve the services normally and SF will take care of availability.
In Azure-land, this is abstracted again by using something like API Management to further reverse proxy it over SSL. What a mess, but it works.
Once your load balancer is set up, you'll have a single IP to hit for management, publishing, and regular traffic.
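So for the on-premises cluster in the question, the missing piece is a DNS record (e.g. sampleCluster.domain.local) that resolves to the load balancer's virtual IP instead of an individual node; with that in place the original commands should work unchanged:

# Management endpoint, now resolved through the load balancer VIP
Connect-ServiceFabricCluster -ConnectionEndpoint sampleCluster.domain.local:19000
# Service Fabric Explorer, likewise reached via the load balancer
Start-Process http://sampleCluster.domain.local:19080/explorer/index.html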

Unhealthy event: SourceId='System.RAP'

I have an application (Azure Service Fabric) that works successfully locally and in an environment with 3 nodes, but in an environment with 5 nodes two services show a warning (for one of the 5 replicas):
Unhealthy event: SourceId='System.RAP', Property='IStatelessServiceInstance.OpenDuration', HealthState='Warning', ConsiderWarningAsError=false.
As a result, we sometimes get a 503 Service Unavailable error.
The issue was fixed by redeploying the Azure Service Fabric cluster.