Enable custom kubernetes scheduler for a namespace - kubernetes

I have a k8 job that brings up multiple pods. This job is used for load testing so all the pods need to come up at the same time. Job shouldn't be started until nodes are available for all pods to be scheduled.
I came across kube-batch https://github.com/kubernetes-sigs/kube-batch to do this scheduling. I have couple of questions:
1. How to enable kube-batch for only one namespace in a cluster?
2. Installed kube-batch by following the tutorial. But pods are failing on startup with below error. How to resolve this error?
I1204 20:07:55.911393 1 allocate.go:96] Queue <default> is overused, ignore it.
I1204 20:07:55.911399 1 allocate.go:194] Leaving Allocate ...
I1204 20:07:55.911407 1 backfill.go:41] Enter Backfill ...
I1204 20:07:55.911413 1 backfill.go:71] Leaving Backfill ...
E1204 20:07:55.911521 1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:522
/usr/local/go/src/runtime/panic.go:513
/usr/local/go/src/runtime/panic.go:82
/usr/local/go/src/runtime/signal_unix.go:390
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/framework/session.go:368
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/plugins/gang/gang.go:154
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/framework/framework.go:58
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:102
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:85
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/home/root1/servicecomb/go/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1333
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x148 pc=0x10ab979]

Not sure what you are trying to achive is doable. In my opinion what you can do is to modify the pods dockerfile to include Supervisord . Then in supervisord specify the commands you want to run when the pods come in running state using priority for supervisord.
Example
[program:api]
directory=/usr/local
command=go main.go
priority=100
autostart=true
autorestart=true
stderr_logfile=/var/log
stdout_logfile=/var/log

Related

Cannot allocate GPU in Slurm

I've got a problem to allocate gpu resourese at Slurm cluster.
specify 1 GPU and run as shown below, it says that gres resources cannot be allocated. The same result If more than one.
$ srun --gres=gpu:1 --pty bash
srun: error: Unable to create step for job 73: Invalid generic resource (gres) specification
compute node's gres information seems to come out correctly as below
$ sinfo -o "%20N %10c %10m %25f %10G "
NODELIST CPUS MEMORY AVAIL_FEATURES GRES
gpu_svr[1-4 72 515484 (null) gpu:8
The Node configuration in the slurm.conf as below
/etc/slurm/slurm.conf
GresTypes=gpu
NodeName=gpu_svr1 NodeAddr=x.x.x.1 CPUs=72 RealMemory=515484 Sockets=2 CoresPerSocket=18
ThreadsPerCore=2 Gres=gpu:8 State=UNKNOWN
NodeName=gpu_svr2 NodeAddr=x.x.x.2 CPUs=72 RealMemory=515484 Sockets=2 CoresPerSocket=18
ThreadsPerCore=2 Gres=gpu:8 State=UNKNOWN
NodeName=gpu_svr3 NodeAddr=x.x.x.3 CPUs=72 RealMemory=515484 Sockets=2 CoresPerSocket=18
ThreadsPerCore=2 Gres=gpu:8 State=UNKNOWN
NodeName=gpu_svr4 NodeAddr=x.x.x.4 CPUs=72 RealMemory=515484 Sockets=2 CoresPerSocket=18
ThreadsPerCore=2 Gres=gpu:8 State=UNKNOWN
PartitionName=v100 Nodes=ALL Default=YES MaxTime=INFINITE State=UP
here is gres.conf on Compute nodes
gres.conf
NodeName=gpu_svr[1-4] Name=gpu File=/dev/nvidia[0-7]
Solved.
The following options should be stated in the slurm.conf
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
JobAcctGatherType=jobacct_gather/cgroup

Error instantiating chaincode in Hyperledger Fabric 1.4 over AKS kubernetes

I am trying to instantiate "sacc" chaincode (which comes with fabric samples) in an hyperledger fabric network deployed over kubernetes in AKS. After hours trying different adjustments, I've not been able to finish the task. I'm always getting the error:
Error: could not assemble transaction, err proposal response was not
successful, error code 500, msg timeout expired while starting
chaincode sacc:0.1 for transaction
Please note that there is no transaccion ID in the error (I've googled some similar cases, but in all of them, there was an ID for the transaction. Not my case, eventhough the error is the same)
The message in the orderer:
2019-07-23 12:40:13.649 UTC [orderer.common.broadcast] Handle -> WARN
047 Error reading from 10.1.0.45:52550: rpc error: code = Canceled
desc = context canceled 2019-07-23 12:40:13.649 UTC [comm.grpc.server]
1 -> INFO 048 streaming call completed {"grpc.start_time":
"2019-07-23T12:34:13.591Z", "grpc.service": "orderer.AtomicBroadcast",
"grpc.method": "Broadcast", "grpc.peer_address": "10.1.0.45:52550",
"error": "rpc error: code = Canceled desc = context canceled",
"grpc.code": "Canceled", "grpc.call_duration": "6m0.057953469s"}
I am calling for instantiation from inside a cli peer, defining variables CORE_PEER_LOCALMSPID, CORE_PEER_TLS_ROOTCERT_FILE, CORE_PEER_MSPCONFIGPATH, CORE_PEER_ADDRESS and ORDERER_CA with appropriate values before issuing the instantiation call:
peer chaincode instantiate -o -n sacc -v 0.1 -c
'{"Args":["init","hi","1"]}' -C mychannelname --tls 'true' --cafile
$ORDERER_CA
All the peers have declared a dockersocket volume pointing to
/run/docker.sock
All the peers have declared the variable CORE_VM_ENDPOINT to unix:///host/var/run/docker.sock
All the orgs were joined to the channel
All the peers have the chaincode installed
I can't see any further message/error in the cli, nor in the orderer, nor in the peers involved in the channel.
Any ideas on what can be going wrong? Or how could I continue troubleshooting the problem? Is it possible to see logs from the docker container that is being created by the peers? how?
Thanks

RabbitMQ Generic server rabbit_connection_tracking terminating

I have RabbitMQ cluster running on Kubernetes.
I found this error in one of RabbitMQ pods:
[error] <0.13899.4> ** Generic server rabbit_connection_tracking terminating
** Last message in was {'$gen_cast',{connection_closed,[{name,<<"10.182.60.4:52930 -> 10.182.60.93:5672">>},{pid,<0.13926.4>},{node,'rabbit#10.182.60.93'},{client_properties,[]}]}}
** When Server state == nostate
** Reason for termination ==
** {{aborted,{no_exists,['tracked_connection_on_node_rabbit#10.182.60.93',{'rabbit#10.182.60.93',<<"10.182.60.4:52930 -> 10.182.60.93:5672">>}]}},[{mnesia,abort,1,[{file,"mnesia.erl"},{line,355}]},{rabbit_connection_tracking,unregister_connection,1,[{file,"src/rabbit_connection_tracking.erl"},{line,282}]},{rabbit_connection_tracking,handle_cast,2,[{file,"src/rabbit_connection_tracking.erl"},{line,101}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
It's occurs in this pod only.
Where:
Kubernetes nodes count: 2
Rabbitmq pods count: 3
10.182.60.4: is one of K8s nodes
10.182.60.93: is the RabitMQ pods that has the error.
the error doesn't affect the flow of messages. but I can't understand what does it mean!
and why it occurs in this pod only!

Zalenium Readiness probe failed: HTTP probe failed with statuscode: 502

I am trying to deploy the zalenium helm chart in my newly deployed aks Kuberbetes (1.9.6) cluster in Azure. But I don't get it to work. The pod is giving the log below:
[bram#xforce zalenium]$ kubectl logs -f zalenium-zalenium-hub-6bbd86ff78-m25t2 Kubernetes service account found. Copying files for Dashboard... cp: cannot create regular file '/home/seluser/videos/index.html': Permission denied cp: cannot create directory '/home/seluser/videos/css': Permission denied cp: cannot create directory '/home/seluser/videos/js': Permission denied Starting Nginx reverse proxy... Starting Selenium Hub... ..........08:49:14.052 [main] INFO o.o.grid.selenium.GridLauncherV3 - Selenium build info: version: '3.12.0', revision: 'unknown' 08:49:14.120 [main] INFO o.o.grid.selenium.GridLauncherV3 - Launching Selenium Grid hub on port 4445 ...08:49:15.125 [main] INFO d.z.e.z.c.k.KubernetesContainerClient - Initialising Kubernetes support ..08:49:15.650 [main] WARN d.z.e.z.c.k.KubernetesContainerClient - Error initialising Kubernetes support. io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [zalenium-zalenium-hub-6bbd86ff78-m25t2] in namespace: [default] failed. at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62) at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:206) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162) at de.zalando.ep.zalenium.container.kubernetes.KubernetesContainerClient.(KubernetesContainerClient.java:87) at de.zalando.ep.zalenium.container.ContainerFactory.createKubernetesContainerClient(ContainerFactory.java:35) at de.zalando.ep.zalenium.container.ContainerFactory.getContainerClient(ContainerFactory.java:22) at de.zalando.ep.zalenium.proxy.DockeredSeleniumStarter.(DockeredSeleniumStarter.java:59) at de.zalando.ep.zalenium.registry.ZaleniumRegistry.(ZaleniumRegistry.java:74) at de.zalando.ep.zalenium.registry.ZaleniumRegistry.(ZaleniumRegistry.java:62) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.lang.Class.newInstance(Class.java:442) at org.openqa.grid.web.Hub.(Hub.java:93) at org.openqa.grid.selenium.GridLauncherV3$2.launch(GridLauncherV3.java:291) at org.openqa.grid.selenium.GridLauncherV3.launch(GridLauncherV3.java:122) at org.openqa.grid.selenium.GridLauncherV3.main(GridLauncherV3.java:82) Caused by: javax.net.ssl.SSLPeerUnverifiedException: Hostname kubernetes.default.svc not verified: certificate: sha256/OyzkRILuc6LAX4YnMAIGrRKLmVnDgLRvCasxGXDhSoc= DN: CN=client, O=system:masters subjectAltNames: [10.0.0.1] at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:308) at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:268) at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:160) at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:256) at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:134) at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:113) at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:125) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:56) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:107) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200) at okhttp3.RealCall.execute(RealCall.java:77) at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379) at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:344) at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:313) at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:296) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:770) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:195) ... 16 common frames omitted 08:49:15.651 [main] INFO d.z.e.z.c.k.KubernetesContainerClient - About to clean up any left over selenium pods created by Zalenium Usage: [options] Options: --debug, -debug : enables LogLevel.FINE. Default: false --version, -version Displays the version and exits. Default: false -browserTimeout in seconds : number of seconds a browser session is allowed to hang while a WebDriver command is running (example: driver.get(url)). If the timeout is reached while a WebDriver command is still processing, the session will quit. Minimum value is 60. An unspecified, zero, or negative value means wait indefinitely. -matcher, -capabilityMatcher class name : a class implementing the CapabilityMatcher interface. Specifies the logic the hub will follow to define whether a request can be assigned to a node. For example, if you want to have the matching process use regular expressions instead of exact match when specifying browser version. ALL nodes of a grid ecosystem would then use the same capabilityMatcher, as defined here. -cleanUpCycle in ms : specifies how often the hub will poll running proxies for timed-out (i.e. hung) threads. Must also specify "timeout" option -custom : comma separated key=value pairs for custom grid extensions. NOT RECOMMENDED -- may be deprecated in a future revision. Example: -custom myParamA=Value1,myParamB=Value2 -host IP or hostname : usually determined automatically. Most commonly useful in exotic network configurations (e.g. network with VPN) Default: 0.0.0.0 -hubConfig filename: a JSON file (following grid2 format), which defines the hub properties -jettyThreads, -jettyMaxThreads : max number of threads for Jetty. An unspecified, zero, or negative value means the Jetty default value (200) will be used. -log filename : the filename to use for logging. If omitted, will log to STDOUT -maxSession max number of tests that can run at the same time on the node, irrespective of the browser used -newSessionWaitTimeout in ms : The time after which a new test waiting for a node to become available will time out. When that happens, the test will throw an exception before attempting to start a browser. An unspecified, zero, or negative value means wait indefinitely. Default: 600000 -port : the port number the server will use. Default: 4445 -prioritizer class name : a class implementing the Prioritizer interface. Specify a custom Prioritizer if you want to sort the order in which new session requests are processed when there is a queue. Default to null ( no priority = FIFO ) -registry class name : a class implementing the GridRegistry interface. Specifies the registry the hub will use. Default: de.zalando.ep.zalenium.registry.ZaleniumRegistry -role options are [hub], [node], or [standalone]. Default: hub -servlet, -servlets : list of extra servlets the grid (hub or node) will make available. Specify multiple on the command line: -servlet tld.company.ServletA -servlet tld.company.ServletB. The servlet must exist in the path: /grid/admin/ServletA /grid/admin/ServletB -timeout, -sessionTimeout in seconds : Specifies the timeout before the server automatically kills a session that hasn't had any activity in the last X seconds. The test slot will then be released for another test to use. This is typically used to take care of client crashes. For grid hub/node roles, cleanUpCycle must also be set. -throwOnCapabilityNotPresent true or false : If true, the hub will reject all test requests if no compatible proxy is currently registered. If set to false, the request will queue until a node supporting the capability is registered with the grid. -withoutServlet, -withoutServlets : list of default (hub or node) servlets to disable. Advanced use cases only. Not all default servlets can be disabled. Specify multiple on the command line: -withoutServlet tld.company.ServletA -withoutServlet tld.company.ServletB org.openqa.grid.common.exception.GridConfigurationException: Error creating class with de.zalando.ep.zalenium.registry.ZaleniumRegistry : null at org.openqa.grid.web.Hub.(Hub.java:97) at org.openqa.grid.selenium.GridLauncherV3$2.launch(GridLauncherV3.java:291) at org.openqa.grid.selenium.GridLauncherV3.launch(GridLauncherV3.java:122) at org.openqa.grid.selenium.GridLauncherV3.main(GridLauncherV3.java:82) Caused by: java.lang.ExceptionInInitializerError at de.zalando.ep.zalenium.registry.ZaleniumRegistry.(ZaleniumRegistry.java:74) at de.zalando.ep.zalenium.registry.ZaleniumRegistry.(ZaleniumRegistry.java:62) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.lang.Class.newInstance(Class.java:442) at org.openqa.grid.web.Hub.(Hub.java:93) ... 3 more Caused by: java.lang.NullPointerException at java.util.TreeMap.putAll(TreeMap.java:313) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.withLabels(BaseOperation.java:411) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.withLabels(BaseOperation.java:48) at de.zalando.ep.zalenium.container.kubernetes.KubernetesContainerClient.deleteSeleniumPods(KubernetesContainerClient.java:393) at de.zalando.ep.zalenium.container.kubernetes.KubernetesContainerClient.initialiseContainerEnvironment(KubernetesContainerClient.java:339) at de.zalando.ep.zalenium.container.ContainerFactory.createKubernetesContainerClient(ContainerFactory.java:38) at de.zalando.ep.zalenium.container.ContainerFactory.getContainerClient(ContainerFactory.java:22) at de.zalando.ep.zalenium.proxy.DockeredSeleniumStarter.(DockeredSeleniumStarter.java:59) ... 11 more ...........................................................................................................................................................................................GridLauncher failed to start after 1 minute, failing... % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 182 100 182 0 0 36103 0 --:--:-- --:--:-- --:--:-- 45500
A describe pod gives:
Warning Unhealthy 4m (x12 over 6m) kubelet, aks-agentpool-93668098-0 Readiness probe failed: HTTP probe failed with statuscode: 502
Zalenium Image Version(s):
dosel/zalenium:3
If using Kubernetes, specify your environment, and if relevant your manifests:
I use the templates as is from https://github.com/zalando/zalenium/tree/master/docs/k8s/helm
I guess it has to do something with rbac because of this part
"Error initialising Kubernetes support. io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [zalenium-zalenium-hub-6bbd86ff78-m25t2] in namespace: [default] failed. at "
I created a clusterrole and clusterrolebinding for the service account zalenium-zalenium that is automatically created by the Helm chart.
kubectl create clusterrole zalenium --verb=get,list,watch,update,delete,create,patch --resource=pods,deployments,secrets
kubectl create clusterrolebinding zalenium --clusterrole=zalnium --serviceaccount=zalenium-zalenium --namespace=default
Issue had to do with Azure's AKS and Kubernetes. It has been fixed.
See github issue 399
If the typo, mentioned by Ignacio, is not the case why don't you just do
kubectl create clusterrolebinding zalenium --clusterrole=cluster-admin serviceaccount=zalenium-zalenium
Note: no need to specify a namespace if creating cluster role binding, as it is cluster wide (role binding goes with namespaces).

Kubernetes Replication Controller Integration Test Failure

I am seeing the following kubernetes integration tests fail pretty consistently, about 90% of the time on RHEL 7.2, Fedora 24, and CentOS7.1:
test/integration/garbagecollector
test/integration/replicationcontroller
They seem to be due to an etcd failure. My online queries lead me to believe this may also encompass an apiserver issue. My setup is simple, I install/start docker, install go, clone the kubernetes repo from github, use hack/install-etcd.sh from the repo and add it to path, get ginkgo, gomega and go-bindata, then run 'make test-integration'. I don't manually change anything or add any custom files/configs. Has anyone run into these issues and know a solution? The only mention of this issue I have seen online has been deemed a flake and has no listed solution, but I run into this issue almost every single test run. Pieces of the error are below, I can give more if needed:
Garbage Collector:
\*many lines from garbagecollector.go that look good*
I0920 14:42:39.725768 11823 garbagecollector.go:479] create storage for resource { v1 secrets}
I0920 14:42:39.725786 11823 garbagecollector.go:479] create storage for resource { v1 serviceaccounts}
I0920 14:42:39.725803 11823 garbagecollector.go:479] create storage for resource { v1 services}
I0920 14:43:09.565529 11823 trace.go:61] Trace "List *rbac.ClusterRoleList" (started 2016-09-20 14:42:39.565113203 -0400 EDT):
[2.564µs] [2.564µs] About to list etcd node
[30.000353492s] [30.000350928s] Etcd node listed
[30.000361771s] [8.279µs] END
E0920 14:43:09.566770 11823 cacher.go:258] unexpected ListAndWatch error: pkg/storage/cacher.go:198: Failed to list *rbac.RoleBinding: client: etcd cluster is unavailable or misconfigured
\*repeats over and over with different thing failed to list*
Replication Controller:
I0920 14:35:16.907283 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907293 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907298 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907303 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907307 10482 replication_controller.go:481] replication controller worker shutting down
E0920 14:35:16.948417 10482 util.go:45] Metric for replication_controller already registered
--- FAIL: TestUpdateLabelToBeAdopted (30.07s)
replicationcontroller_test.go:270: Failed to create replication controller rc: Timeout: request did not complete within allowed duration
E0920 14:44:06.820506 12053 storage_rbac.go:116] unable to initialize clusterroles: client: etcd cluster is unavailable or misconfigured
There are no files in /var/log that even start with kube.
Thanks in advance!
I increased the limits on the number of file descriptors and haven't seen this issue since. So, gonna go ahead and call this solved