kubernetes 1.2 to 1.3 upgrade on CoreOS

There seem to be several issues when going from 1.2 to 1.3 that make an in-place upgrade impossible.
Is this correct?
When I upgrade one worker node to 1.3.4 while the rest of the cluster is still running 1.2.2, the node never becomes Ready.
I get lots of 415 errors (unsupported media type?) from the kubelet, which seems to indicate an incompatible request format.
kubelet[2927]: E0804 01:55:13.794921 2927 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:".146777d057f9b62b", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Node", Namespace:"", Name:"198.245.63.87", UID:"xxxxxxxx", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeHasSufficientDisk", Message:"Node xxxxxxxx status is now: NodeHasSufficientDisk", Source:api.EventSource{Component:"kubelet", Host:"xxxxxxxxxx"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63605872340, nsec:72642091, loc:(*time.Location)(0x45be3e0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63605872513, nsec:790683013, loc:(*time.Location)(0x45be3e0)}}, Count:29, Type:"Normal"}': 'the server responded with the status code 415 but did not return more information (post events)' (will not retry!)
I'd like to understand whether this is a setup issue or a real breaking change that prevents an in-place upgrade.
Thanks

I've done an upgrade of 1.2.2 to 1.3.3 on CoreOS. This involved first upgrading the master server, then doing the nodes.
All went surprisingly smoothly.
Basically I followed:
https://coreos.com/kubernetes/docs/latest/kubernetes-upgrade.html
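That order matters: the kubelet should not be newer than the API server it talks to, which would be consistent with the 415s above (a 1.3.4 kubelet posting events to a 1.2.2 apiserver). As a minimal sanity check before touching the workers (assuming kubectl is already pointed at the cluster):
# The server version reported here should already be v1.3.x before any kubelet is upgraded.
$ kubectl version
# Each node's kubelet version is recorded in its status; none should be newer than the master.
$ kubectl get nodes -o yaml | grep kubeletVersion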

Related

Installing Couchbase with helm chart fails with Readiness probe

I've just migrated to an M1 MacBook and tried to deploy Couchbase on Kubernetes using the Couchbase Helm chart: https://docs.couchbase.com/operator/current/helm-setup-guide.html
But the Couchbase Server pod fails with the message below:
Readiness probe failed: dial tcp 172.17.0.7:8091: connect: connection refused
Pod uses image: couchbase/server:7.0.2
Error from log file:
Starting Couchbase Server -- Web UI available at http://<ip>:8091
and logs available in /opt/couchbase/var/lib/couchbase/logs
runtime: failed to create new OS thread (have 2 already; errno=22)
fatal error: newosproc
runtime stack:
runtime.throw(0x4d8d66, 0x9)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/panic.go:596 +0x95
runtime.newosproc(0xc420028000, 0xc420038000)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/os_linux.go:163 +0x18c
runtime.newm(0x4df870, 0x0)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/proc.go:1628 +0x137
runtime.main.func1()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/proc.go:126 +0x36
runtime.systemstack(0x552700)
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/asm_amd64.s:327 +0x79
runtime.mstart()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/proc.go:1132
goroutine 1 [running]:
runtime.systemstack_switch()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/asm_amd64.s:281 fp=0xc420024788 sp=0xc420024780
runtime.main()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/proc.go:127 +0x6c fp=0xc4200247e0 sp=0xc420024788
runtime.goexit()
/home/couchbase/.cbdepscache/exploded/x86_64/go-1.8.5/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc4200247e8 sp=0xc4200247e0
{"init terminating in do_boot",{{badmatch,{error,{{shutdown,{failed_to_start_child,encryption_service,{port_terminated,normal}}},{ns_babysitter,start,[normal,[]]}}}},[{ns_babysitter_bootstrap,start,0,[{file,"src/ns_babysitter_bootstrap.erl"},{line,23}]},{init,start_em,1,[]},{init,do_boot,3,[]}]}}
init terminating in do_boot ({{badmatch,{error,{{_},{_}}}},[{ns_babysitter_bootstrap,start,0,[{_},{_}]},{init,start_em,1,[]},{init,do_boot,3,[]}]})
Any help would be appreciated.
It seems an ARM64 version of Couchbase Server for macOS has been available since Couchbase Server 7.1.1.
So I ran the command below to install Couchbase:
helm install couchbasev1 --values myvalues.yaml couchbase/couchbase-operator
myvalues.yaml:
cluster:
  image: couchbase/server:7.1.1
 
And it worked.
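If you want to double-check up front that a given tag actually ships an arm64 image (a quick sketch, assuming a reasonably recent Docker CLI is available locally), inspecting the manifest is one way to do it:
# List the platforms published for the tag; 7.1.1 and later should include linux/arm64.
$ docker manifest inspect couchbase/server:7.1.1 | grep architecture
# After installing, confirm the server pod now passes its readiness probe.
$ kubectl get pods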

pod deployment fails with no clear message in logs

I have a Kubernetes cluster deployed locally to a node prepped by kubeadm.
I am experimenting with one of the pods. This pod fails to deploy, but I can't locate the cause. I have guesses as to what the problem is, but I'd like to see something related in the Kubernetes logs.
Here's what I have tried:
$ kubectl logs nmnode-0-0 -c hadoop -n test
Error from server (NotFound): pods "nmnode-0-0" not found
$ kubectl get event -n test | grep nmnode
(empty results here)
$ journalctl -m |grep nmnode
and I get a bunch of repeated entries like the following. It talks about killing the pod but gives no reason whatsoever for it:
Aug 08 23:10:15 jeff-u16-3 kubelet[146562]: E0808 23:10:15.901051 146562 event.go:240] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"nmnode-0-0.15b92c3ff860aed6", GenerateName:"", Namespace:"test", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"nmnode-0-0", UID:"743d2876-69cf-43bc-9227-aca603590147", APIVersion:"v1", ResourceVersion:"38152", FieldPath:"spec.containers{hadoop}"}, Reason:"Killing", Message:"Stopping container hadoop", Source:v1.EventSource{Component:"kubelet", Host:"jeff-u16-3"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xbf4b616dacae12d6, ext:2812562895486, loc:(*time.Location)(0x781e740)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbf4b616dacae12d6, ext:2812562895486, loc:(*time.Location)(0x781e740)}}, Count:1, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "nmnode-0-0.15b92c3ff860aed6" is forbidden: unable to create new content in namespace test because it is being terminated' (will not retry!)
The shortened version of the above message is this:
Reason:"Killing", Message:"Stopping container hadoop",
The cluster is still running. Do you know how I can get to the bottom of this?
Try executing the command below:
$ kubectl get pods --all-namespaces
Check whether your pod was created in a different namespace.
The most common reasons for pod failures are:
1. The container was never created because the image pull failed.
2. The container never existed in the runtime, and the error reason is not in the "special error list", so the containerStatus was never set and stayed at "no state".
3. The container was then treated as "Unknown" and the pod was reported as Pending without any reason.
Because the containerStatus stayed at "no state" after each syncPod(), the status manager could never delete the pod even though the DeletionTimestamp was set.
Useful article: pod-failure.
Try this command to get some hints:
kubectl describe pod nmnode-0-0 -n test
Also share the output of:
kubectl get po -n test
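The event you pasted also says the pod was rejected because namespace test "is being terminated", so it is worth looking at the namespace itself; a minimal check, assuming you have access to the cluster:
# A namespace stuck in Terminating (or holding leftover finalizers) rejects new pods and events.
$ kubectl get namespace test -o yaml
# Namespace-wide events sorted by time sometimes surface the reason for the kill.
$ kubectl get events -n test --sort-by=.metadata.creationTimestamp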

Puppet Server Can't restart

I am trying to restart Puppet Server after using it for some time, and below is the error.
Any idea how to overcome this error?
[root@chakriin56 user]# free -m
              total        used        free      shared  buff/cache   available
Mem:           7820        2889        4080          86         851        4579
Swap:          7812           0        7812
[root@chakriin56 user]# puppet status
{
  "is_alive": true,
  "version": "4.5.3"
}
[root@chakriin56 user]# systemctl status puppetserver.service
Unit puppetserver.service could not be found.
[root@chakriin56 user]# service puppetserver status
Redirecting to /bin/systemctl status puppetserver.service
Unit puppetserver.service could not be found.
[root@chakriin56 user]# systemctl restart puppetserver
Failed to restart puppetserver.service: Unit not found.
[root@chakriin56 user]#
You can enumerate the available services via the command puppet resource service. In that list, find the correct name for the Puppetserver service as it is installed on your machine, and use that to restart it.
If no such service is listed, then your Puppet Server installation is damaged or non-standard. Supposing the former to be more likely, you could try reinstalling. You might or might not be able to do that without first killing the running Puppet Server instance. Be sure to back up your configuration, data, and manifest set before attempting a reinstallation.
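For example (a rough sketch; the exact unit name depends on the package and OS), something like this should show what the service is actually called and let you restart it:
# List services as Puppet sees them and look for the server unit
# (it may be named puppetserver, puppet-server, or similar depending on the package).
[root@chakriin56 user]# puppet resource service | grep -i -A 2 puppet
# Then restart it under whatever name was listed, e.g.:
[root@chakriin56 user]# systemctl restart <name-from-the-list-above>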

Kubernetes Replication Controller Integration Test Failure

I am seeing the following Kubernetes integration tests fail pretty consistently, about 90% of the time, on RHEL 7.2, Fedora 24, and CentOS 7.1:
test/integration/garbagecollector
test/integration/replicationcontroller
They seem to be due to an etcd failure, and my online queries lead me to believe this may also involve an apiserver issue. My setup is simple: I install and start Docker, install Go, clone the kubernetes repo from GitHub, use hack/install-etcd.sh from the repo and add it to my PATH, get ginkgo, gomega, and go-bindata, then run 'make test-integration'. I don't manually change anything or add any custom files/configs. Has anyone run into these issues and found a solution? The only mention of this issue I have seen online was deemed a flake and has no listed solution, but I run into it on almost every test run. Pieces of the error are below; I can give more if needed:
Garbage Collector:
*many lines from garbagecollector.go that look good*
I0920 14:42:39.725768 11823 garbagecollector.go:479] create storage for resource { v1 secrets}
I0920 14:42:39.725786 11823 garbagecollector.go:479] create storage for resource { v1 serviceaccounts}
I0920 14:42:39.725803 11823 garbagecollector.go:479] create storage for resource { v1 services}
I0920 14:43:09.565529 11823 trace.go:61] Trace "List *rbac.ClusterRoleList" (started 2016-09-20 14:42:39.565113203 -0400 EDT):
[2.564µs] [2.564µs] About to list etcd node
[30.000353492s] [30.000350928s] Etcd node listed
[30.000361771s] [8.279µs] END
E0920 14:43:09.566770 11823 cacher.go:258] unexpected ListAndWatch error: pkg/storage/cacher.go:198: Failed to list *rbac.RoleBinding: client: etcd cluster is unavailable or misconfigured
*repeats over and over with a different resource failing to list*
Replication Controller:
I0920 14:35:16.907283 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907293 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907298 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907303 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907307 10482 replication_controller.go:481] replication controller worker shutting down
E0920 14:35:16.948417 10482 util.go:45] Metric for replication_controller already registered
--- FAIL: TestUpdateLabelToBeAdopted (30.07s)
replicationcontroller_test.go:270: Failed to create replication controller rc: Timeout: request did not complete within allowed duration
E0920 14:44:06.820506 12053 storage_rbac.go:116] unable to initialize clusterroles: client: etcd cluster is unavailable or misconfigured
There are no files in /var/log that even start with kube.
Thanks in advance!
I increased the limits on the number of file descriptors and haven't seen this issue since, so I'm going to go ahead and call this solved.
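For anyone hitting the same thing, one way to raise the limit before rerunning the tests (the exact value is arbitrary, just well above the typical default of 1024):
# Raise the limit for the current shell only, then rerun the tests.
$ ulimit -n 65536
$ make test-integration
# Or make it persistent for your user via /etc/security/limits.conf:
#   <user>  soft  nofile  65536
#   <user>  hard  nofile  65536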

Cause of "apiserver received an error that is not an unversioned" errors from kube-apiserver

On upgrading Kubernetes from 1.0.6 to 1.1.3, I now see a bunch of the errors below during a rolling upgrade whenever any of my kube master or etcd hosts are down. We currently have a single master and two etcd hosts.
2015-12-11T19:30:19.061+00:00 kube-master1 [err] [kube-apiserver] E1211 19:30:18.726490 26551 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 3871210 (3871628)
2015-12-11T19:30:19.075+00:00 kube-master1 [err] [kube-apiserver] E1211 19:30:18.733331 26551 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 3871156 (3871628)
2015-12-11T19:30:19.081+00:00 kube-master1 [err] [kube-apiserver] E1211 19:30:18.736569 26551 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 3871623 (3871628)
2015-12-11T19:30:19.095+00:00 kube-master1 [err] [kube-apiserver] E1211 19:30:18.740328 26551 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 3871622 (3871628)
2015-12-11T19:30:19.110+00:00 kube-master1 [err] [kube-apiserver] E1211 19:30:18.742972 26551 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 3871210 (3871628)
I believe these errors are caused by a new feature in 1.1: the --watch-cache option is now enabled by default. The errors cease at the end of the rolling upgrade.
I would like to know how to interpret these errors, whether they can be safely ignored, and how to change the system to avoid them in the future (as a longer-term solution).
Yes - as you suggested, those errors are related to the new feature of serving watch from an in-memory cache in the apiserver.
So, if I understand correctly, what happened here is:
- you upgraded (or, in general, restarted) the apiserver
- this caused all the existing watch connections to terminate
- once the apiserver started successfully, it regenerated its internal in-memory cache
- since watch can have some delay, it's possible that clients (which were renewing their watch connections) were slightly behind
- this caused these errors to be generated, and forced the clients to relist and start watching from the new point
IIUC, those errors were present only during the upgrade and disappeared afterwards - so that's good.
In other words, such errors may appear on upgrade (or, in general, immediately after any restart of the apiserver). In such situations they may be safely ignored.
In fact, those probably shouldn't be errors - we can probably change them to warnings.
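For completeness, if you would rather not see them at all, the watch cache can be disabled on the apiserver at the cost of serving watches straight from etcd; a sketch, assuming you manage the kube-apiserver flags directly:
# Turning the watch cache off avoids the post-restart "too old resource version" churn,
# but puts the watch load back on etcd.
kube-apiserver --watch-cache=false ...remaining flags unchanged...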