Attempted to install: JFrog Artifactory HA
Platform: GCE kubernetes cluster on CoreOS; 1 master, 2 workers
Installation method: Helm chart
Helm steps taken:
Add jFrog repo to local helm: helm repo add jfrog https://charts.jfrog.io
Install license as kubernetes secret in cluster: kubectl create secret generic artifactory-cluster-license --from-file=./art.lic
Install via helm:
helm install --name artifactory-ha jfrog/artifactory-ha \
  --set artifactory.masterKey=,artifactory.license.secret=artifactory-cluster-license,artifactory.license.dataKey=art.lic
Result:
Helm installation went without complaint. Checked services, seemed to be fine, LoadBalancer was pending and came online.
Checked PVs and PVCs, seemed to be fine and bound:
NAME STATUS
artifactory-ha-postgresql Bound
volume-artifactory-ha-artifactory-ha-member-0 Bound
volume-artifactory-ha-artifactory-ha-primary-0 Bound
Checked the pods and only postgres was ready:
NAME READY STATUS RESTARTS AGE
artifactory-ha-artifactory-ha-member-0 0/1 Running 0 3m
artifactory-ha-artifactory-ha-primary-0 0/1 Running 0 3m
artifactory-ha-nginx-697844f76-jt24s 0/1 Init:0/1 0 3m
artifactory-ha-postgresql-676999df46-bchq9 1/1 Running 0 3m
Waited for a few minutes, no change. Waited 2 hours, still at the same state as above. Checked logs of the artifactory-ha-artifactory-ha-primary-0 pod (it's quite long, but I can post if that will help anybody determine the problem), but noted this error:
SEVERE: One or more listeners failed to start. Full details will be found in the appropriate container log file.
I couldn't think of where else to check for logs. Services were running, and the other pods seemed to be waiting on this primary pod.
The log continues with SEVERE: Context [/artifactory] startup failed due to previous errors and then starts spewing Java stack dumps after the "ACCESS" ASCII art, with messages like:
WARNING: The web application [artifactory] appears to have started a thread named [Thread-5] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
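Presumably the "appropriate container log file" Tomcat mentions is Artifactory's own log inside the container; assuming the chart's default Artifactory home of /var/opt/jfrog/artifactory (an assumption on my part), something like this should read it:
# stdout of the container (what I was already reading)
kubectl logs artifactory-ha-artifactory-ha-primary-0
# Artifactory's own log file inside the container (path is the chart default, as far as I can tell)
kubectl exec artifactory-ha-artifactory-ha-primary-0 -- \
  tail -n 100 /var/opt/jfrog/artifactory/logs/artifactory.log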
I ended up leaving the cluster up over night, and now, about 12 hours later, I'm very surprised to see that the "primary" pod did actually come online:
NAME READY STATUS RESTARTS AGE
artifactory-ha-artifactory-ha-member-0 1/1 Terminating 0 19m
artifactory-ha-artifactory-ha-member-1 0/1 Terminating 0 17m
artifactory-ha-artifactory-ha-primary-0 1/1 Running 0 3h
artifactory-ha-nginx-697844f76-vsmzq 0/1 Running 38 3h
artifactory-ha-postgresql-676999df46-gzbpm 1/1 Running 0 3h
The nginx pod, though, did not. Its init container command (the until nc -z -w 2 artifactory-ha 8081 && echo artifactory ok; do ... loop) eventually succeeded, but the pod cannot pass its readiness probe:
Warning Unhealthy 1m (x428 over 3h) kubelet, spczufvthh-worker-1 Readiness probe failed: Get http://10.2.2.45:80/artifactory/webapp/#/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
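To see whether Artifactory itself was answering on that port, it can be hit from inside the cluster (service name and port are taken from the init container command above; the ping endpoint is Artifactory's standard health check, as far as I know):
# one-off pod that curls the Artifactory service and is removed afterwards
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sv http://artifactory-ha:8081/artifactory/api/system/ping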
Perhaps I missed some required step in the setup or some Helm installation switches? This is my first attempt at setting up JFrog Artifactory HA, and most of the instructions I found seem to be written for bare-metal clusters, so perhaps I confused something.
Any help is appreciated!
It turned out we had messed up a couple of things and had a few misunderstandings about how the install process works. Maybe this will be of some help to people in the future.
1) The masterKey value needs to be at least 16 characters long. We had initially tried a key that was too short. We tried installing again, writing the new masterKey to a secret instead, but...
2) The values in the secrets seem to be read once, on the initial install attempt; they are then written to the persistent volume, and updating the secret after that seems to have no effect.
3) We also didn't understand the license key format and constraints. You need a license for every node that will run Artifactory, and all the licenses go into a single file, with each license separated by two newlines (i.e. a blank line).
The error logs were pretty unhelpful for diagnosing any of this. We eventually wiped out the install, including the PVs, reinstalled, and finally everything went fine.
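For reference, a sketch of roughly what the working install looked like in the end (the openssl line is my reconstruction of how a long-enough key can be generated, not a copy of our actual history):
# masterKey must be at least 16 characters; 32 random hex characters works
MASTER_KEY=$(openssl rand -hex 16)

# art.lic holds one license per Artifactory node, separated by a blank line
kubectl create secret generic artifactory-cluster-license --from-file=./art.lic

helm install --name artifactory-ha jfrog/artifactory-ha \
  --set artifactory.masterKey=$MASTER_KEY \
  --set artifactory.license.secret=artifactory-cluster-license \
  --set artifactory.license.dataKey=art.lic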
Related
I have a 3-node ubuntu microk8s installation and it seems to be working ok. All 3 nodes are management nodes.
On only one of the nodes, I get an error message and associated delay whenever I use a kubectl command. It looks like this:
$ time kubectl get pods
I0324 03:49:44.270996 514696 request.go:665] Waited for 1.156689289s due to client-side throttling, not priority and fairness, request: GET:https://127.0.0.1:16443/apis/authentication.k8s.io/v1?timeout=32s
NAME READY STATUS RESTARTS AGE
sbnweb-5f9d9b977f-lw7t9 1/1 Running 1 (10h ago) 3d3h
shell-6cfccdbd47-zd2tn 1/1 Running 0 6h39m
real 0m6.558s
user 0m0.414s
sys 0m0.170s
The error message always shows a different URL each time. I tried looking up the error code (I0324) and haven't found anything useful.
The other two nodes don't show this behavior: no error message, and the request completes in less than a second.
I'm new to k8s so I am not sure how to diagnose this kind of problem. Any hints on what to look for would be greatly appreciated.
Here's a good write-up about the issue: the message comes from kubectl's client-side throttling of API discovery requests. In some cases removing the discovery cache with rm -rf ~/.kube/cache makes the problem go away.
I had the same error with kubectl on Windows. Deleting the "http-cache" folder in ".kube" fixed the problem: c:\Users\****\.kube\http-cache\
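Putting both answers together, clearing the kubectl cache comes down to deleting these folders (paths assume the default .kube location in your home directory):
# Linux / macOS
rm -rf ~/.kube/cache ~/.kube/http-cache
# on Windows, delete the same folders under %USERPROFILE%\.kube\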
In some cases, after a node reboot, a Kubernetes cluster managed by MicroK8s cannot start pods.
Running a describe on a pod that was failing to become ready, I could see that it was stuck in a "Pulling image" state for several minutes without any other event, as follows:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 50s default-scheduler Successfully assigned default/my-service-abc to my-desktop
Normal Pulling 8m kubelet Pulling image "127.0.0.1:32000/some-service"
Pulling from the node with docker pull 127.0.0.1:32000/some-service works perfectly, so it doesn't seem to be a problem with Docker.
I have upgraded Docker just in case.
I seem to be running the latest version of microk8s.
Running sudo microk8s inspect gives no errors or warnings; everything seemed to be working perfectly.
Since a Docker pull did work locally, it's really the kubelet, which talks to Docker, that seemed stuck.
Even a sudo service docker stop && sudo service docker start did not help.
Even a rollout did not suffice to get the pods out of the Pulling state after the Docker restart.
Worst of all, a reboot of the server did not change anything: the pods that were already up kept working, but all the other pods (about 70%) were down and stuck in ContainerCreating.
Checking the status systemctl status snap.microk8s.daemon-kubelet did not report any error.
The only thing that seemed to work was this:
sudo systemctl reboot snap.microk8s.daemon-kubelet
However, it also rebooted the whole node (systemctl reboot reboots the machine regardless of the unit name passed to it), so that's something to keep as a last resort, same as a plain node reboot.
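Before resorting to a full reboot, it may be worth trying to restart only the kubelet service, or MicroK8s as a whole (the snap unit name below matches the one from the status check above; newer MicroK8s releases bundle kubelet into snap.microk8s.daemon-kubelite, so the exact unit name may differ):
# restart just the kubelet daemon
sudo systemctl restart snap.microk8s.daemon-kubelet
# or stop and start MicroK8s entirely
sudo microk8s stop && sudo microk8s start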
I'm trying to install and configure Velero for Kubernetes backup. I have followed the link to configure it in my GKE cluster. The installation went fine, but Velero is not working.
I am using Google Cloud Shell for running all my commands (I have installed and configured the Velero client in my Cloud Shell).
On further inspection of the Velero deployment and pods, I found that it is not able to pull the image from the Docker repository.
kubectl get pods -n velero
NAME READY STATUS RESTARTS AGE
velero-5489b955f6-kqb7z 0/1 Init:ErrImagePull 0 20s
Error from velero pod (kubectl describe pod) (output redacted for readability - only relevant info shown below)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 38s default-scheduler Successfully assigned velero/velero-5489b955f6-kqb7z to gke-gke-cluster1-default-pool-a354fba3-8674
Warning Failed 22s kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Failed to pull image "velero/velero-plugin-for-gcp:v1.1.0": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Failed 22s kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Error: ErrImagePull
Normal BackOff 21s kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Back-off pulling image "velero/velero-plugin-for-gcp:v1.1.0"
Warning Failed 21s kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Error: ImagePullBackOff
Normal Pulling 8s (x2 over 37s) kubelet, gke-gke-cluster1-default-pool-a354fba3-8674 Pulling image "velero/velero-plugin-for-gcp:v1.1.0"
Command used to install velero: (some of the values are given as variables)
velero install \
--provider gcp \
--plugins velero/velero-plugin-for-gcp:v1.1.0 \
--bucket $storagebucket \
--secret-file ~/velero-backup-storage-sa-key.json
Velero Version
velero version
Client:
Version: v1.4.2
Git commit: 56a08a4d695d893f0863f697c2f926e27d70c0c5
<error getting server version: timed out waiting for server status request to be processed>
GKE version
v1.15.12-gke.2
Isn't this a Private Cluster? – mario
@mario this is a private cluster, but I can deploy other services without any issues (for example, I have deployed nginx successfully) – Sreesan
Well, this is a known limitation of GKE Private Clusters. As you can read in the documentation:
Can't pull image from public Docker Hub
Symptoms
A Pod running in your cluster displays a warning in kubectl describe such as Failed to pull image: rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Potential causes
Nodes in a private cluster do not have outbound access to the public internet. They have limited access to Google APIs and services, including Container Registry.
Resolution
You cannot fetch images directly from Docker Hub. Instead, use images hosted on Container Registry. Note that while Container Registry's Docker Hub mirror is accessible from a private cluster, it should not be exclusively relied upon. The mirror is only a cache, so images are periodically removed, and a private cluster is not able to fall back to Docker Hub.
You can also compare it with this answer.
It can be easily verified on your own with a simple experiment. Try to run two different nginx deployments: the first based on the image nginx (which is equivalent to nginx:latest) and the second based on nginx:1.14.2.
While the first scenario is perfectly feasible, because the nginx:latest image can be pulled from Container Registry's Docker Hub mirror, which is accessible from a private cluster, any attempt to pull nginx:1.14.2 will fail, as you'll see in the Pod events. This happens because the kubelet cannot find this version of the image in GCR, so it tries to pull it from the public Docker registry (https://registry-1.docker.io/v2/), which is not possible in Private Clusters. "The mirror is only a cache, so images are periodically removed, and a private cluster is not able to fall back to Docker Hub." - as you can read in the docs.
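A quick way to run that experiment yourself (the deployment names here are arbitrary):
# pullable: nginx:latest is served by Container Registry's Docker Hub mirror
kubectl create deployment nginx-latest --image=nginx
# fails in a private cluster: this tag has to come from Docker Hub itself
kubectl create deployment nginx-pinned --image=nginx:1.14.2
# the second deployment's pod should end up in ErrImagePull / ImagePullBackOff
kubectl get pods -w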
If you still have doubts, just SSH into your node and try to run the following commands:
curl https://cloud.google.com/container-registry/
curl https://registry-1.docker.io/v2/
While the first one works perfectly, the second one will eventually fail:
curl: (7) Failed to connect to registry-1.docker.io port 443: Connection timed out
The reason? "Nodes in a private cluster do not have outbound access to the public internet."
The solution?
You can search what is currently available in GCR.
In many cases you should be able to get the required image if you don't specify its exact version (by default the latest tag is used). While that can help with nginx, unfortunately no version of velero/velero-plugin-for-gcp is currently available in Google Container Registry's Docker Hub mirror.
Granting private nodes outbound internet access by using Cloud NAT seems to be the only reasonable solution that can be applied in your case.
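A rough sketch of that Cloud NAT setup (the network, region, and resource names are placeholders to replace with your cluster's values):
# a Cloud Router in the cluster's VPC network and region
gcloud compute routers create nat-router \
  --network=my-vpc --region=us-central1
# a Cloud NAT config on that router, giving private nodes outbound internet access
gcloud compute routers nats create nat-config \
  --router=nat-router --region=us-central1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges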
I solved this problem by realizing that the version of velero/velero-plugin-for-gcp does not track the version of velero/velero.
For example, at the time of writing the latest versions are velero/velero:v1.9.1 and velero/velero-plugin-for-gcp:v1.5.0.
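So an install pinning a matching plugin version would look something like this (bucket and secret file reused from the question above):
velero install \
  --provider gcp \
  --plugins velero/velero-plugin-for-gcp:v1.5.0 \
  --bucket $storagebucket \
  --secret-file ~/velero-backup-storage-sa-key.json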
I am working on two servers, each with a number of pods. The first server is the validation environment and I can use Kubernetes commands there, but the second server is the production environment, where I do not have full rights. Getting full rights on it is out of the question.
I am putting together platform-stability statistics and I need info about the last restart of the pods. I can see the "Age" column, but I cannot use a screenshot in my statistics, so I need a command that outputs every pod's age or last restart time.
P.S. Every night at 00:00 the pods' logs are saved and archived in a separate folder.
kubectl get pods already gives you that info:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-7cdbd8cdc9-8pnzq 1/1 Running 0 36s
$ kubectl delete po nginx-7cdbd8cdc9-8pnzq
pod "nginx-7cdbd8cdc9-8pnzq" deleted
$ kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-7cdbd8cdc9-67l8l 1/1 Running 0 4s
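If you need the values in a script-friendly form rather than the human-readable AGE column, a custom-columns query along these lines should work (a sketch; the restart-count column only looks at each pod's first container):
# pod name, container start time, and restart count of the first container
kubectl get pods -o custom-columns=NAME:.metadata.name,STARTED:.status.startTime,RESTARTS:.status.containerStatuses[0].restartCount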
I found a solution:
command:
zgrep "All subsystems started successfully" 201911??/*ota*
response:
23:23:37,429 [INFO ] main c.o.c.a.StartUp - All subsystems started successfully
P.S. "ota" is my pod's name.
I am currently working on a monitoring service that will monitor Kubernetes' deployments and their pods. I want to notify users when a deployment is not running the expected amount of replicas and also when pods' containers restart unexpectedly. This may not be the right things to monitor and I would greatly appreciate some feedback on what I should be monitoring.
Anyway, the main question is about the differences between the various pod statuses. And when I say statuses, I mean the STATUS column shown when running kubectl get pods. The statuses in question are:
- ContainerCreating
- ImagePullBackOff
- Pending
- CrashLoopBackOff
- Error
- Running
What causes pod/containers to go into these states?
For the first four Statuses, are these states recoverable without user interaction?
What is the threshold for a CrashLoopBackOff?
Is Running the only status that has a Ready Condition of True?
Any feedback would be greatly appreciated!
Also, would it be bad practice to use kubectl in an automated script for monitoring purposes? For example, every minute log the results of kubectl get pods to Elasticsearch?
You can see the pod lifecycle details in the Kubernetes documentation.
The recommended way of monitoring a Kubernetes cluster and its applications is with Prometheus.
I will try to explain what lies behind each of these terms:
ContainerCreating
Shown while the image is being downloaded and the container is being created by Docker or another container runtime.
ImagePullBackOff
Shown when there is a problem downloading the image from a registry, for example wrong credentials for logging in to Docker Hub.
Pending
The container is starting (if startup takes time), or it has started but the readinessProbe is failing.
CrashLoopBackOff
Shown when container restarts occur too often. For example, a process tries to read a file that does not exist and crashes; the container is then recreated by Kubernetes, and the cycle repeats.
Error
This one is pretty clear: there was some error running the container.
Running
All is good: the container is running and the livenessProbe is OK.
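As for the kubectl-in-a-script idea from the question: it works for a quick check, though Prometheus is the more robust long-term option. A rough sketch of such a check:
# pods whose phase is not Running or Succeeded (note: a pod in CrashLoopBackOff still has phase Running)
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
# desired vs. ready replica counts per deployment
kubectl get deployments --all-namespaces \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,DESIRED:.spec.replicas,READY:.status.readyReplicas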