Azure fabric setup issues - azure-service-fabric

I am fairly new to Azure services. I am trying to install Azure Service Fabric and run it locally on my dev machine.
I created an application using the templates provided in VS 2017. I get the following error when I try to install the local fabric cluster and run it.
I tried this SO link and this one, and I am still unable to get it to work.
Can someone please help?
Using Cluster Data Root: C:\SfDevCluster\Data
Using Cluster Log Root: C:\SfDevCluster\Log
The generated json path is C:\Users\User\AppData\Local\Temp\tmpEE68.tmp.json
Processing and validating cluster config.
Create node configuration succeeded
Starting service FabricHostSvc. This may take a few minutes...
Waiting for Service Fabric Cluster to be ready. This may take a few minutes...
Local Cluster ready status: 4% completed.
Local Cluster ready status: 8% completed.
...
Local Cluster ready status: 96% completed.
Local Cluster ready status: 100% completed.
WARNING: Service Fabric Cluster is taking longer than expected to connect.
Waiting for Naming Service to be ready. This may take a few minutes...
No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue.
Connect-ServiceFabricCluster : No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS
issue.
At C:\Program Files\Microsoft SDKs\Service Fabric\Tools\Scripts\ClusterSetupUtilities.psm1:620 char:12
+ [void](Connect-ServiceFabricCluster @connParams)
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (:) [Connect-ServiceFabricCluster], FabricException
+ FullyQualifiedErrorId : TestClusterConnectionErrorId,Microsoft.ServiceFabric.Powershell.ConnectCluster

We had the same issue on two machines and found a workaround here.
It involves adding FabricContainerAppsEnabled to the cluster manifest. Note that we had to reboot our machines to get the local cluster manager to honor the new setting, though I suppose a service restart might have worked as well.
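For what it's worth, here is a sketch of what that change can look like, assuming the default SDK install paths and a one-node dev cluster. The manifest location, the "Hosting" section name and the value "false" are assumptions based on the common "Docker is not installed/running" variant of this problem, so adapt them to your setup:

# Edit the manifest template used by the one-node dev cluster
notepad "C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\NonSecure\OneNode\ClusterManifestTemplate.json"
# ...and add this parameter to the "Hosting" section under "fabricSettings":
#   { "name": "FabricContainerAppsEnabled", "value": "false" }

# Then clean and recreate the local cluster (or reboot, as noted above)
& "C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\CleanCluster.ps1"
& "C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\DevClusterSetup.ps1"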

Related

gitlab: unable to access git repository: Operation timed out

Our registered GitLab runner (on Kubernetes) was working fine, but after upgrading GitLab it can't clone the projects anymore! Does anyone have any idea about this issue?
Here is the log of the issue:
Running with gitlab-runner 14.9.0 (d1f69508)
on gitlab-runner-dev K5KVWdx-
Preparing the "kubernetes" executor
30:00
Using Kubernetes namespace: cicd
Using Kubernetes executor with image <docker-registry>:kuber_development ...
Using attach strategy to execute scripts...
Preparing environment
30:07
Waiting for pod cicd/runner-k5kvwdx--project-1227-concurrent-02kqgq to be running, status is Pending
Waiting for pod cicd/runner-k5kvwdx--project-1227-concurrent-02kqgq to be running, status is Pending
ContainersNotReady: "containers with unready status: [build helper]"
ContainersNotReady: "containers with unready status: [build helper]"
Running on runner-k5kvwdx--project-1227-concurrent-02kqgq via gitlab-runner-85776bd9c6-rkdvl...
Getting source from Git repository
32:13
Fetching changes with git depth set to 50...
Initialized empty Git repository in /builds/bigdata/search/query-processing-module/.git/
Created fresh repository.
fatal: unable to access '<git-repository>': Failed to connect to <gitlab-url> port 443 after 130010 ms: Operation timed out
Cleaning up project directory and file based variables
30:01
ERROR: Job failed: command terminated with exit code 1
Here is how I would debug this issue:
Make sure there are no NetworkPolicies present that restrict the egress of the pod.
If you are on a recent enough Kubernetes version, you can run an ephemeral debug container inside the pod to examine the networking situation (Docs), substituting your runner pod and container name for ephemeral-demo:
kubectl debug -it ephemeral-demo --image=busybox:1.28 --target=ephemeral-demo
If not, you can try to get a shell inside your container and examine the situation from there, or you can start a pod on the same node and try to connect from there.
As soon as you have a shell inside a container that doesn't work, try to answer the following questions (see the command sketch below):
Can you connect to some other server?
Can you resolve the hostname?
Is the IP a private one that overlaps with some internal Kubernetes IPs?
Can you ping the IP? If yes:
Can you curl the IP? If no:
If you open another port on the target machine, can you connect to that port? => if yes, probably a firewall problem somewhere.
If no (you can't ping) => it can be either firewall-related or IP-routing-related.
I cannot say for sure what is wrong, but try the steps above and hopefully you will get some insight into where the problem is.
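For concreteness, here is a rough sketch of those checks as shell commands. The cicd namespace and busybox image come from this job's setup, <gitlab-url> stands for your GitLab host, and exact flag support varies between busybox and full distro images, so treat this as a starting point rather than an exact recipe:

# Any NetworkPolicy that could be restricting egress from the runner namespace?
kubectl get networkpolicy -n cicd

# From the debug shell inside the failing pod:
nslookup <gitlab-url>                       # does the hostname resolve at all?
ping -c 3 <gitlab-url>                      # is the resolved IP reachable via ICMP?
nc -w 5 <gitlab-url> 443 </dev/null && echo "tcp/443 reachable"
nc -w 5 <gitlab-url> 80  </dev/null && echo "tcp/80 reachable"   # does another port behave differently?
wget -q -O- -T 10 https://<gitlab-url>/     # HTTP-level check; note busybox wget may lack TLS support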

EKS node moves to NodeNotReady state when running a batch jobs

I am running a batch job in my EKS cluster that trains an ML model, and the training goes on for 8-10 hours. However, it seems like the node on which the job runs is killed and the job is restarted on a new node. I am monitoring the node in Prometheus and it seems like there was no CPU or OOM issue.
My next bet was to look into the EKS CloudTrail logs, and right when the node is removed I see the events below:
kube-controller-manager log
controller_utils.go:179] Recording status change NodeNotReady event message for node XXX
controller_utils.go:121] Update ready status of pods on node [XXX]
event.go:274] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"XXX", UID:"1bf33ec8-41cd-434a-8579-3ba4b8cdd5f1", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node XXX status is now: NodeNotReady
node_lifecycle_controller.go:917] Node XXX is unresponsive as of 2021-06-09 01:00:48.962450508 +0000 UTC m=+5151508.967069635. Adding it to the Taint queue.
node_lifecycle_controller.go:180] deleting node since it is no longer present in cloud provider: XXX
kube-scheduler log
node_tree.go:113] Removed node "XXX" in group "us-east-2:\x00:us-east-2b" from NodeTree
I checked the kubelet logs but they do not contain any message about moving the node to NotReady status. I was expecting to at least see this message in the kubelet log - https://github.com/kubernetes/kubernetes/blob/e9de1b0221dd8687aba527e682fafc7c33370c09/pkg/kubelet/kubelet_node_status.go#L682
This makes me wonder whether the kubelet died, the node became unreachable, or the connection from the kube-apiserver to the kubelet on that node was lost.
I have been working on this for days to debug the issue, but with no success.
Note: the batch job does eventually run successfully after the restart. Also, this issue is sporadic, i.e. sometimes the restart happens and sometimes it does not and the job finishes in the first run.
Are you using spot instance nodes? That might be one reason why the node gets terminated, based on spot/bid price changes. Try on-demand (non-spot) instances.
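If you're not sure whether the nodes are spot, one way to check (assuming EKS managed node groups, which label their nodes with eks.amazonaws.com/capacityType) is:

# Shows ON_DEMAND or SPOT per node for managed node groups
kubectl get nodes -L eks.amazonaws.com/capacityType

# Recent node-level events can also hint at why a node disappeared
kubectl get events --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp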

pixielabs deploy stuck Wait for PEMs/Kelvin

After installing Pixie with the bash installer and deploying with px deploy, the deployment got stuck (for over 30 min) at:
Wait for PEMs/Kelvin
After aborting, I was left with a new namespace pl with many pods Pending or stuck in Init,
but no working Pixie.
Check whether the etcd pod in the pl namespace is stuck in the Pending state.
The Pixie Command Module is deployed in the K8s cluster to isolate data storage, so you'll need a persistent volume available in your cluster.
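A few standard kubectl checks can confirm this (the pl namespace comes from the question; the pod name is a placeholder):

kubectl get pods -n pl
kubectl describe pod <etcd-pod-name> -n pl   # Events usually show something like "pod has unbound PersistentVolumeClaims" when no volume can be bound
kubectl get pvc -n pl                        # is a claim stuck in Pending?
kubectl get storageclass                     # is there a (default) StorageClass that can provision volumes?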

Failed to pull image "velero/velero-plugin-for-gcp:v1.1.0" while installing Velero in GKE Cluster

I'm trying to install and configure Velero for Kubernetes backups. I have followed the link to configure it in my GKE cluster. The installation went fine, but Velero is not working.
I am using Google Cloud Shell to run all my commands (I have installed and configured the Velero client in my Cloud Shell).
On further inspection of the Velero deployment and pods, I found that it is not able to pull the image from the Docker repository.
kubectl get pods -n velero
NAME READY STATUS RESTARTS AGE
velero-5489b955f6-kqb7z 0/1 Init:ErrImagePull 0 20s
Error from velero pod (kubectl describe pod) (output redacted for readability - only relevant info shown below)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 38s default-scheduler Successfully assigned velero/velero-5489b955f6-kqb7z to gke-gke-cluster1-default-pool-a354fba3-8674
Warning Failed 22s kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Failed to pull image "velero/velero-plugin-for-gcp:v1.1.0": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Failed 22s kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Error: ErrImagePull
Normal BackOff 21s kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Back-off pulling image "velero/velero-plugin-for-gcp:v1.1.0"
Warning Failed 21s kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Error: ImagePullBackOff
Normal Pulling 8s (x2 over 37s) kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Pulling image "velero/velero-plugin-for-gcp:v1.1.0"
Command used to install velero: (some of the values are given as variables)
velero install \
--provider gcp \
--plugins velero/velero-plugin-for-gcp:v1.1.0 \
--bucket $storagebucket \
--secret-file ~/velero-backup-storage-sa-key.json
Velero Version
velero version
Client:
Version: v1.4.2
Git commit: 56a08a4d695d893f0863f697c2f926e27d70c0c5
<error getting server version: timed out waiting for server status request to be processed>
GKE version
v1.15.12-gke.2
Isn't this a Private Cluster? – mario
@mario this is a private cluster, but I can deploy other services without any issues (e.g. I have deployed nginx successfully) – Sreesan
Well, this is a known limitation of GKE Private Clusters. As you can read in the documentation:
Can't pull image from public Docker Hub
Symptoms
A Pod running in your cluster displays a warning in kubectl describe such as Failed to pull image: rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Potential causes
Nodes in a private cluster do not have outbound access to the public internet. They have limited access to Google APIs and services, including Container Registry.
Resolution
You cannot fetch images directly from Docker Hub. Instead, use images hosted on Container Registry. Note that while Container Registry's Docker Hub mirror is accessible from a private cluster, it should not be exclusively relied upon. The mirror is only a cache, so images are periodically removed, and a private cluster is not able to fall back to Docker Hub.
You can also compare it with this answer.
You can easily verify this yourself with a simple experiment (see the commands below). Try to run two different nginx deployments: the first based on the image nginx (which is equivalent to nginx:latest) and the second based on nginx:1.14.2.
While the first scenario is perfectly feasible, because the nginx:latest image can be pulled from Container Registry's Docker Hub mirror (which is accessible from a private cluster), any attempt to pull nginx:1.14.2 will fail, and you'll see this in the Pod events. It happens because the kubelet cannot find that version of the image in GCR and tries to pull it from the public Docker registry (https://registry-1.docker.io/v2/), which is not possible in Private Clusters. "The mirror is only a cache, so images are periodically removed, and a private cluster is not able to fall back to Docker Hub." - as you can read in the docs.
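A minimal way to run that experiment (the deployment names here are just examples):

# Should come up fine: nginx:latest is served from Container Registry's Docker Hub mirror
kubectl create deployment nginx-latest --image=nginx

# Should end up in ErrImagePull / ImagePullBackOff on a private cluster
kubectl create deployment nginx-old --image=nginx:1.14.2

# Compare the two
kubectl get pods
kubectl describe pod -l app=nginx-old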
If you still have doubts, just ssh into your node and try to run following commands:
curl https://cloud.google.com/container-registry/
curl https://registry-1.docker.io/v2/
While the first one works perfectly, the second one will eventually fail:
curl: (7) Failed to connect to registry-1.docker.io port 443: Connection timed out
Reason? - "Nodes in a private cluster do not have outbound access to the public internet."
Solution?
You can search what is currently available in GCR here.
In many cases you should be able to get the required image if you don't specify its exact version (by default the latest tag is used). While this can help with nginx, unfortunately no version of velero/velero-plugin-for-gcp is currently available in Google Container Registry's Docker Hub mirror.
Granting private nodes outbound internet access by using Cloud NAT seems to be the only reasonable solution that can be applied in your case.
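For reference, a Cloud NAT setup boils down to a Cloud Router plus a NAT configuration on it. A rough sketch with placeholder names; the network and region are assumptions you'd replace with your cluster's VPC and region:

gcloud compute routers create nat-router \
    --network <your-cluster-network> \
    --region us-central1

gcloud compute routers nats create nat-config \
    --router nat-router \
    --router-region us-central1 \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges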
I solved this problem by realizing that the versioning of velero/velero-plugin-for-gcp does not follow the versioning of velero/velero.
For example, the latest versions at the time of writing are:
velero/velero:v1.9.1 and velero/velero-plugin-for-gcp:v1.5.0
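So pick the plugin tag that matches your Velero release rather than reusing the server version. For example, something along these lines (mirroring the install command from the question; bucket and key file are placeholders):

velero install \
  --provider gcp \
  --plugins velero/velero-plugin-for-gcp:v1.5.0 \
  --bucket $storagebucket \
  --secret-file ~/velero-backup-storage-sa-key.json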

All Kubernetes Pods go down simultaneously periodically

I've been running a Kubernetes cluster for a while now, but I haven't been able to keep it stable.
My cluster consists of four nodes, two masters and two workers. All nodes run on the same physical server, which in turn runs VMware vSphere 6.5. Each node runs CoreOS stable (1353.7.0), and I'm running Kubernetes/Hyperkube v1.6.4, using Calico for networking. I've followed the steps in this guide.
What happens is that for a few hours/days, the cluster will run without a hitch. Then, all of a sudden (for no discernible reason as far as I can tell) all my pods go to status "Pending" and stay that way. Any hosted services are then no longer reachable.
After a while (usually 5 to 10 minutes), it seems to restore itself, after which it starts recreating all my pods, and trying (but failing) to shut down all my running pods. Some of the newly created pods come up, but will initially have no connection to the internet.
For a couple of weeks now I've had this issue intermittently, and it's been preventing me from using Kubernetes in production. I'd really like to figure out what's been causing this!
Weirdly enough, when I try to diagnose the problem by inspecting the logs,
I've noticed that on both of my worker nodes, the journald logs will have become corrupted! On the master nodes, the log is still readable, but not very informative.
Even when running, kubelet is constantly emitting errors in its logs. On all the nodes, this is what's posted about once a minute:
May 26 09:37:14 kube-master1 kubelet-wrapper[24228]: E0526 09:37:14.012890 24228 cni.go:275] Error deleting network: open /var/lib/cni/flannel/3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233: no such file or directory
May 26 09:37:14 kube-master1 kubelet-wrapper[24228]: E0526 09:37:14.014762 24228 remote_runtime.go:109] StopPodSandbox "3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "logstash-s3498_default" network: open /var/lib/cni/flannel/3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233: no such file or directory
May 26 09:37:14 kube-master1 kubelet-wrapper[24228]: E0526 09:37:14.014818 24228 kuberuntime_gc.go:138] Failed to stop sandbox "3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "logstash-s3498_default" network: open /var/lib/cni/flannel/3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233: no such file or directory
May 26 09:38:07 kube-master1 kubelet-wrapper[24228]: I0526 09:38:07.422341 24228 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/9a378211-3597-11e7-a7ec-000c2958a0d7-default-token-0p3gf" (spec.Name: "default-token-0p3gf") pod "9a378211-3597-11e7-a7ec-000c2958a0d7" (UID: "9a378211-3597-11e7-a7ec-000c2958a0d7").
May 26 09:38:14 kube-master1 kubelet-wrapper[24228]: W0526 09:38:14.037553 24228 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "logstash-s3498_default": Unexpected command output nsenter: cannot open : No such file or directory
May 26 09:38:14 kube-master1 kubelet-wrapper[24228]: with error: exit status 1
I've googled this error and came across this issue, but it has been closed, and people indicate that using v1.6.0 or later should resolve it - which it definitely hasn't in my case...
Can anybody point me in the right direction?!
Thanks!
Seen this as well. The problem seems to go away if you downgrade CoreOS to an older version that ships Docker 1.12.3.
Docker is a nightmare, with regressions in every single version they release :(
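If you want to double-check what the nodes are currently running before downgrading, a quick sketch (run the first two on the node itself):

docker version --format '{{.Server.Version}}'   # Docker engine version on this node
grep VERSION /etc/os-release                    # CoreOS release
kubectl get nodes -o wide                       # OS image / kernel per node as seen by Kubernetes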