Cloudformation stack stuck in UPDATE_IN_PROGRESS - aws-cloudformation

My cloudformation stack that has been normally getting updated in a couple minutes keeps getting stuck. ECS seems to get stuck sometimes waiting for a service to be healthy, but the service came up and was stable.
13:59:20 UTC-0500 UPDATE_COMPLETE AWS::ECS::Service MyService
13:57:19 UTC-0500 UPDATE_IN_PROGRESS AWS::ECS::Service MyService
13:57:14 UTC-0500 UPDATE_COMPLETE AWS::Lambda::Function MyFunction
13:57:13 UTC-0500 UPDATE_COMPLETE AWS::ECS::TaskDefinition MyTaskDefinition
13:57:13 UTC-0500 UPDATE_IN_PROGRESS AWS::ECS::TaskDefinition MyTaskDefinition Resource creation Initiated
13:57:13 UTC-0500 UPDATE_IN_PROGRESS AWS::ECS::TaskDefinition MyTaskDefinition Requested update requires the creation of a new physical resource; hence creating one.
13:57:10 UTC-0500 UPDATE_IN_PROGRESS AWS::Lambda::Function MyFunction
13:57:10 UTC-0500 CREATE_IN_PROGRESS AWS::ApiGateway::DomainName MyDomainName
13:57:05 UTC-0500 UPDATE_IN_PROGRESS AWS::CloudFormation::Stack my-stack User Initiated

For people looking at this and not being able to progress. I was able to delete the stack by canceling the update with the cli.
aws --region REGION cloudformation cancel-update-stack --stack-name STACKNAME
then I was able to delete the the stack.

The issue appears to be that a DomainName entity was being created with a cert in cloudfront, which takes up to 40 minutes. Was able to see it initializing in APIGW/Custom Domain Names.

Related

Consul agent on kubernetes, on node or pod?

I deployed an aws eks cluster via terraform. I also deployed Consul following hasicorp’s tutorial and I see the nodes in consul’s UI.
Now I’m wondering how al the consul agents will know about the pods I deploy? I deploy something and it’s not shown anywhere on consul.
I can’t find any documentation as to how to register pods (services) on consul via the node’s consul agent, do I need to configure that somewhere? Should I not use the node’s agent and register the service straight from the pod? Hashicorp discourages this since it may increase resource utilization depending on how many pods one deploy on a given node. But then how does the node’s agent know about my services deployed on that node?
Moreover, when I deploy a pod in a node and ssh into the node, and install consul, consul’s agent can’t find the consul server (as opposed from the node, which can find it)
EDIT:
Bottom line is I can't find WHERE to add the configuration. If I execute ON THE POD:
consul members
It works properly and I get:
Node Address Status Type Build Protocol DC Segment
consul-consul-server-0 10.0.103.23:8301 alive server 1.10.0 2 full <all>
consul-consul-server-1 10.0.101.151:8301 alive server 1.10.0 2 full <all>
consul-consul-server-2 10.0.102.112:8301 alive server 1.10.0 2 full <all>
ip-10-0-101-129.ec2.internal 10.0.101.70:8301 alive client 1.10.0 2 full <default>
ip-10-0-102-175.ec2.internal 10.0.102.244:8301 alive client 1.10.0 2 full <default>
ip-10-0-103-240.ec2.internal 10.0.103.245:8301 alive client 1.10.0 2 full <default>
ip-10-0-3-223.ec2.internal 10.0.3.249:8301 alive client 1.10.0 2 full <default>
But if i execute:
# consul agent -datacenter=voip-full -config-dir=/etc/consul.d/ -log-file=log-file -advertise=$(wget -q -O - http://169.254.169.254/latest/meta-data/local-ipv4)
I get the following error:
==> Starting Consul agent...
Version: '1.10.1'
Node ID: 'f10070e7-9910-06c7-0e12-6edb6cc4c9b9'
Node name: 'ip-10-0-3-223.ec2.internal'
Datacenter: 'voip-full' (Segment: '')
Server: false (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
Cluster Addr: 10.0.3.223 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:
2021-08-16T18:23:06.936Z [WARN] agent: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
2021-08-16T18:23:06.936Z [WARN] agent: Node name "ip-10-0-3-223.ec2.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-08-16T18:23:06.946Z [WARN] agent.auto_config: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
2021-08-16T18:23:06.947Z [WARN] agent.auto_config: Node name "ip-10-0-3-223.ec2.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-08-16T18:23:06.948Z [INFO] agent.client.serf.lan: serf: EventMemberJoin: ip-10-0-3-223.ec2.internal 10.0.3.223
2021-08-16T18:23:06.948Z [INFO] agent.router: Initializing LAN area manager
2021-08-16T18:23:06.950Z [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=udp
2021-08-16T18:23:06.950Z [WARN] agent.client.serf.lan: serf: Failed to re-join any previously known node
2021-08-16T18:23:06.950Z [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=tcp
2021-08-16T18:23:06.951Z [INFO] agent: Starting server: address=127.0.0.1:8500 network=tcp protocol=http
2021-08-16T18:23:06.951Z [WARN] agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2021-08-16T18:23:06.953Z [INFO] agent: started state syncer
2021-08-16T18:23:06.953Z [INFO] agent: Consul agent running!
2021-08-16T18:23:06.953Z [WARN] agent.router.manager: No servers available
2021-08-16T18:23:06.954Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2021-08-16T18:23:34.169Z [WARN] agent.router.manager: No servers available
2021-08-16T18:23:34.169Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
So where to add the config?
I also tried adding a service in k8s pointing to the pod, but the service doesn't come up on consul's UI...
What do you guys recommend?
Thanks
Consul knows where these services are located because each service
registers with its local Consul client. Operators can register
services manually, configuration management tools can register
services when they are deployed, or container orchestration platforms
can register services automatically via integrations.
if you planning to use manual option you have to register the service into the consul.
Something like
echo '{
"service": {
"name": "web",
"tags": [
"rails"
],
"port": 80
}
}' > ./consul.d/web.json
You can find the good example at : https://thenewstack.io/implementing-service-discovery-of-microservices-with-consul/
Also this is a very nice document for having detailed configuration of the health check and service discovery : https://cloud.spring.io/spring-cloud-consul/multi/multi_spring-cloud-consul-discovery.html
Official document : https://learn.hashicorp.com/tutorials/consul/get-started-service-discovery
BTW, I was finally able to figure out the issue.
consul-dns is not deployed by default, i had to manually deploy it, then forward all .consul requests from coredns to consul-dns.
All is working now. Thanks!

Keycloak cluster fails on Amazon ECS (org.infinispan.commons.CacheException: Initial state transfer timed out for cache)

I am trying to deploy a cluster of 2 Keycloak docker images (6.0.1) on Amazon ECS (Fargate) using the built-in ECS Service Discovery mecanism (using DNS_PING).
Environment:
JGROUPS_DISCOVERY_PROTOCOL=dns.DNS_PING
JGROUPS_DISCOVERY_PROPERTIES=dns_query=my.services.internal,dns_record_type=A
JGROUPS_TRANSPORT_STACK=tcp <---(also tried udp)
The instances IP are correctly resolved from Route53 private namespace and they discover each other without any problem (x.x.x.138 is started first, then x.x.x.76).
Second instance:
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: entries collected from DNS (in 3 ms): [x.x.x.76:0, x.x.x.138:0]
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending discovery requests to hosts [x.x.x.76:0, x.x.x.138:0] on ports [55200 .. 55200]
[org.jgroups.protocols.pbcast.GMS] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending JOIN(ip-x-x-x-76) to ip-x-x-x-138
And on the first instance:
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN000094: Received new cluster view for channel ejb: [ip-x-x-x-138|1] (2) [ip-x-x-x-138, ip-172-x-x-x-76]
[org.infinispan.remoting.transport.jgroups.JGroupsTransport] (thread-8,ejb,ip-x-x-x-138) Joined: [ip-x-x-x-76], Left: []
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN100000: Node ip-x-x-x-76 joined the cluster
[org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger-12,ejb,ip-x-x-x-76) ip-x-x-x-76: pingable_mbrs=[ip-x-x-x-138, ip-x-x-x-76], ping_dest=ip-x-x-x-138
So it seems we have a working cluster. Unfortunately, the second instance ends up failing with the following exception:
Caused by: org.infinispan.commons.CacheException: Initial state transfer timed out for cache work on ip-x-x-x-76
Before this occurs, I am seeing a bunch of failure discovery task suspecting/unsuspecting the opposite instance:
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) haven't received a heartbeat from ip-x-x-x-76 for 60016 ms, adding it to suspect list
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) ip-x-x-x-138: suspecting [ip-x-x-x-76]
[org.jgroups.protocols.FD_ALL] (thread-9,ejb,ip-x-x-x-138) Unsuspecting ip-x-x-x-76
[org.jgroups.protocols.FD_SOCK] (thread-9,ejb,ip-x-x-x-138) ip-x-x-x-138: broadcasting unsuspect(ip-x-x-x-76)
On the Infinispan side (cache), everything seems to occur correctly but I am not sure. Every cache is "rebalanced" and each "rebalance" seems to end up with, for example:
[org.infinispan.statetransfer.StateConsumerImpl] (transport-thread--p24-t2) Finished receiving of segments for cache offlineSessions for topology 2.
It feels like its a connectivity issue, but all the ports are wide open between these 2 instances, both for TCP and UDP.
Any idea ? Anyone successfull at configuring this on ECS (fargate) ?
UPDATE 1
The second instance was initially shutting down not because of the "Initial state transfer timed out .." error but because the health check was taking longer than the configured grace period. Nonetheless, with 2 healthy instances, I receive "404 - Not Found" once every 2 queries, telling me that there is indeed a cache problem.
In current keycloak docker image (6.0.1), the default stack is UDP. According to this, version 7.0.0 will default to TCP and will also introduce a variable to toggle the stack (JGROUPS_TRANSPORT_STACK).
Using the UDP stack in Amazon ECS will "partially" work, meaning the discovery will work, the cluster will form, but the Infinispan cache won't be able to sync between instances, which will produce erratic errors. There is probably a way to make it work "as-is", but I dont see anything blocked between the instances when checking the VPC Flow logs.
A workaround is to switch to TCP by modifying the JGroups stack directly in the image in file /opt/jboss/keycloak/standalone/configuration/standalone-ha.xml:
<subsystem xmlns="urn:jboss:domain:jgroups:6.0">
<channels default="ee">
<channel name="ee" stack="tcp" cluster="ejb"/> <-- set stack to tcp
</channels>
Then commit the new image:
docker commit -m="TCP cluster stack" CONTAINER_ID jboss/keycloak:6.0.1-tcp-cluster
Tag/Push the image to Amazon ECR and make sure the port 7600 is accepted in your security group between your Amazon ECS tasks.

Zalenium Readiness probe failed: HTTP probe failed with statuscode: 502

I am trying to deploy the zalenium helm chart in my newly deployed aks Kuberbetes (1.9.6) cluster in Azure. But I don't get it to work. The pod is giving the log below:
[bram#xforce zalenium]$ kubectl logs -f zalenium-zalenium-hub-6bbd86ff78-m25t2 Kubernetes service account found. Copying files for Dashboard... cp: cannot create regular file '/home/seluser/videos/index.html': Permission denied cp: cannot create directory '/home/seluser/videos/css': Permission denied cp: cannot create directory '/home/seluser/videos/js': Permission denied Starting Nginx reverse proxy... Starting Selenium Hub... ..........08:49:14.052 [main] INFO o.o.grid.selenium.GridLauncherV3 - Selenium build info: version: '3.12.0', revision: 'unknown' 08:49:14.120 [main] INFO o.o.grid.selenium.GridLauncherV3 - Launching Selenium Grid hub on port 4445 ...08:49:15.125 [main] INFO d.z.e.z.c.k.KubernetesContainerClient - Initialising Kubernetes support ..08:49:15.650 [main] WARN d.z.e.z.c.k.KubernetesContainerClient - Error initialising Kubernetes support. io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [zalenium-zalenium-hub-6bbd86ff78-m25t2] in namespace: [default] failed. at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62) at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:206) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162) at de.zalando.ep.zalenium.container.kubernetes.KubernetesContainerClient.(KubernetesContainerClient.java:87) at de.zalando.ep.zalenium.container.ContainerFactory.createKubernetesContainerClient(ContainerFactory.java:35) at de.zalando.ep.zalenium.container.ContainerFactory.getContainerClient(ContainerFactory.java:22) at de.zalando.ep.zalenium.proxy.DockeredSeleniumStarter.(DockeredSeleniumStarter.java:59) at de.zalando.ep.zalenium.registry.ZaleniumRegistry.(ZaleniumRegistry.java:74) at de.zalando.ep.zalenium.registry.ZaleniumRegistry.(ZaleniumRegistry.java:62) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.lang.Class.newInstance(Class.java:442) at org.openqa.grid.web.Hub.(Hub.java:93) at org.openqa.grid.selenium.GridLauncherV3$2.launch(GridLauncherV3.java:291) at org.openqa.grid.selenium.GridLauncherV3.launch(GridLauncherV3.java:122) at org.openqa.grid.selenium.GridLauncherV3.main(GridLauncherV3.java:82) Caused by: javax.net.ssl.SSLPeerUnverifiedException: Hostname kubernetes.default.svc not verified: certificate: sha256/OyzkRILuc6LAX4YnMAIGrRKLmVnDgLRvCasxGXDhSoc= DN: CN=client, O=system:masters subjectAltNames: [10.0.0.1] at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:308) at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:268) at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:160) at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:256) at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:134) at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:113) at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:125) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:56) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:107) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200) at okhttp3.RealCall.execute(RealCall.java:77) at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379) at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:344) at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:313) at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:296) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:770) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:195) ... 16 common frames omitted 08:49:15.651 [main] INFO d.z.e.z.c.k.KubernetesContainerClient - About to clean up any left over selenium pods created by Zalenium Usage: [options] Options: --debug, -debug : enables LogLevel.FINE. Default: false --version, -version Displays the version and exits. Default: false -browserTimeout in seconds : number of seconds a browser session is allowed to hang while a WebDriver command is running (example: driver.get(url)). If the timeout is reached while a WebDriver command is still processing, the session will quit. Minimum value is 60. An unspecified, zero, or negative value means wait indefinitely. -matcher, -capabilityMatcher class name : a class implementing the CapabilityMatcher interface. Specifies the logic the hub will follow to define whether a request can be assigned to a node. For example, if you want to have the matching process use regular expressions instead of exact match when specifying browser version. ALL nodes of a grid ecosystem would then use the same capabilityMatcher, as defined here. -cleanUpCycle in ms : specifies how often the hub will poll running proxies for timed-out (i.e. hung) threads. Must also specify "timeout" option -custom : comma separated key=value pairs for custom grid extensions. NOT RECOMMENDED -- may be deprecated in a future revision. Example: -custom myParamA=Value1,myParamB=Value2 -host IP or hostname : usually determined automatically. Most commonly useful in exotic network configurations (e.g. network with VPN) Default: 0.0.0.0 -hubConfig filename: a JSON file (following grid2 format), which defines the hub properties -jettyThreads, -jettyMaxThreads : max number of threads for Jetty. An unspecified, zero, or negative value means the Jetty default value (200) will be used. -log filename : the filename to use for logging. If omitted, will log to STDOUT -maxSession max number of tests that can run at the same time on the node, irrespective of the browser used -newSessionWaitTimeout in ms : The time after which a new test waiting for a node to become available will time out. When that happens, the test will throw an exception before attempting to start a browser. An unspecified, zero, or negative value means wait indefinitely. Default: 600000 -port : the port number the server will use. Default: 4445 -prioritizer class name : a class implementing the Prioritizer interface. Specify a custom Prioritizer if you want to sort the order in which new session requests are processed when there is a queue. Default to null ( no priority = FIFO ) -registry class name : a class implementing the GridRegistry interface. Specifies the registry the hub will use. Default: de.zalando.ep.zalenium.registry.ZaleniumRegistry -role options are [hub], [node], or [standalone]. Default: hub -servlet, -servlets : list of extra servlets the grid (hub or node) will make available. Specify multiple on the command line: -servlet tld.company.ServletA -servlet tld.company.ServletB. The servlet must exist in the path: /grid/admin/ServletA /grid/admin/ServletB -timeout, -sessionTimeout in seconds : Specifies the timeout before the server automatically kills a session that hasn't had any activity in the last X seconds. The test slot will then be released for another test to use. This is typically used to take care of client crashes. For grid hub/node roles, cleanUpCycle must also be set. -throwOnCapabilityNotPresent true or false : If true, the hub will reject all test requests if no compatible proxy is currently registered. If set to false, the request will queue until a node supporting the capability is registered with the grid. -withoutServlet, -withoutServlets : list of default (hub or node) servlets to disable. Advanced use cases only. Not all default servlets can be disabled. Specify multiple on the command line: -withoutServlet tld.company.ServletA -withoutServlet tld.company.ServletB org.openqa.grid.common.exception.GridConfigurationException: Error creating class with de.zalando.ep.zalenium.registry.ZaleniumRegistry : null at org.openqa.grid.web.Hub.(Hub.java:97) at org.openqa.grid.selenium.GridLauncherV3$2.launch(GridLauncherV3.java:291) at org.openqa.grid.selenium.GridLauncherV3.launch(GridLauncherV3.java:122) at org.openqa.grid.selenium.GridLauncherV3.main(GridLauncherV3.java:82) Caused by: java.lang.ExceptionInInitializerError at de.zalando.ep.zalenium.registry.ZaleniumRegistry.(ZaleniumRegistry.java:74) at de.zalando.ep.zalenium.registry.ZaleniumRegistry.(ZaleniumRegistry.java:62) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at java.lang.Class.newInstance(Class.java:442) at org.openqa.grid.web.Hub.(Hub.java:93) ... 3 more Caused by: java.lang.NullPointerException at java.util.TreeMap.putAll(TreeMap.java:313) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.withLabels(BaseOperation.java:411) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.withLabels(BaseOperation.java:48) at de.zalando.ep.zalenium.container.kubernetes.KubernetesContainerClient.deleteSeleniumPods(KubernetesContainerClient.java:393) at de.zalando.ep.zalenium.container.kubernetes.KubernetesContainerClient.initialiseContainerEnvironment(KubernetesContainerClient.java:339) at de.zalando.ep.zalenium.container.ContainerFactory.createKubernetesContainerClient(ContainerFactory.java:38) at de.zalando.ep.zalenium.container.ContainerFactory.getContainerClient(ContainerFactory.java:22) at de.zalando.ep.zalenium.proxy.DockeredSeleniumStarter.(DockeredSeleniumStarter.java:59) ... 11 more ...........................................................................................................................................................................................GridLauncher failed to start after 1 minute, failing... % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 182 100 182 0 0 36103 0 --:--:-- --:--:-- --:--:-- 45500
A describe pod gives:
Warning Unhealthy 4m (x12 over 6m) kubelet, aks-agentpool-93668098-0 Readiness probe failed: HTTP probe failed with statuscode: 502
Zalenium Image Version(s):
dosel/zalenium:3
If using Kubernetes, specify your environment, and if relevant your manifests:
I use the templates as is from https://github.com/zalando/zalenium/tree/master/docs/k8s/helm
I guess it has to do something with rbac because of this part
"Error initialising Kubernetes support. io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [zalenium-zalenium-hub-6bbd86ff78-m25t2] in namespace: [default] failed. at "
I created a clusterrole and clusterrolebinding for the service account zalenium-zalenium that is automatically created by the Helm chart.
kubectl create clusterrole zalenium --verb=get,list,watch,update,delete,create,patch --resource=pods,deployments,secrets
kubectl create clusterrolebinding zalenium --clusterrole=zalnium --serviceaccount=zalenium-zalenium --namespace=default
Issue had to do with Azure's AKS and Kubernetes. It has been fixed.
See github issue 399
If the typo, mentioned by Ignacio, is not the case why don't you just do
kubectl create clusterrolebinding zalenium --clusterrole=cluster-admin serviceaccount=zalenium-zalenium
Note: no need to specify a namespace if creating cluster role binding, as it is cluster wide (role binding goes with namespaces).

Kubernetes Replication Controller Integration Test Failure

I am seeing the following kubernetes integration tests fail pretty consistently, about 90% of the time on RHEL 7.2, Fedora 24, and CentOS7.1:
test/integration/garbagecollector
test/integration/replicationcontroller
They seem to be due to an etcd failure. My online queries lead me to believe this may also encompass an apiserver issue. My setup is simple, I install/start docker, install go, clone the kubernetes repo from github, use hack/install-etcd.sh from the repo and add it to path, get ginkgo, gomega and go-bindata, then run 'make test-integration'. I don't manually change anything or add any custom files/configs. Has anyone run into these issues and know a solution? The only mention of this issue I have seen online has been deemed a flake and has no listed solution, but I run into this issue almost every single test run. Pieces of the error are below, I can give more if needed:
Garbage Collector:
\*many lines from garbagecollector.go that look good*
I0920 14:42:39.725768 11823 garbagecollector.go:479] create storage for resource { v1 secrets}
I0920 14:42:39.725786 11823 garbagecollector.go:479] create storage for resource { v1 serviceaccounts}
I0920 14:42:39.725803 11823 garbagecollector.go:479] create storage for resource { v1 services}
I0920 14:43:09.565529 11823 trace.go:61] Trace "List *rbac.ClusterRoleList" (started 2016-09-20 14:42:39.565113203 -0400 EDT):
[2.564µs] [2.564µs] About to list etcd node
[30.000353492s] [30.000350928s] Etcd node listed
[30.000361771s] [8.279µs] END
E0920 14:43:09.566770 11823 cacher.go:258] unexpected ListAndWatch error: pkg/storage/cacher.go:198: Failed to list *rbac.RoleBinding: client: etcd cluster is unavailable or misconfigured
\*repeats over and over with different thing failed to list*
Replication Controller:
I0920 14:35:16.907283 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907293 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907298 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907303 10482 replication_controller.go:481] replication controller worker shutting down
I0920 14:35:16.907307 10482 replication_controller.go:481] replication controller worker shutting down
E0920 14:35:16.948417 10482 util.go:45] Metric for replication_controller already registered
--- FAIL: TestUpdateLabelToBeAdopted (30.07s)
replicationcontroller_test.go:270: Failed to create replication controller rc: Timeout: request did not complete within allowed duration
E0920 14:44:06.820506 12053 storage_rbac.go:116] unable to initialize clusterroles: client: etcd cluster is unavailable or misconfigured
There are no files in /var/log that even start with kube.
Thanks in advance!
I increased the limits on the number of file descriptors and haven't seen this issue since. So, gonna go ahead and call this solved

Cause of "apiserver received an error that is not an unversioned" errors from kube-apiserver

On upgrading kubernetes from 1.0.6 to 1.1.3, I now see a bunch of the below errors during a rolling upgrade when any of my kube master or etcd hosts are down. We currently have a single master, with two etcd hosts.
2015-12-11T19:30:19.061+00:00 kube-master1 [err] [kube-apiserver] E1211 19:30:18.726490 26551 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 3871210 (3871628)
2015-12-11T19:30:19.075+00:00 kube-master1 [err] [kube-apiserver] E1211 19:30:18.733331 26551 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 3871156 (3871628)
2015-12-11T19:30:19.081+00:00 kube-master1 [err] [kube-apiserver] E1211 19:30:18.736569 26551 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 3871623 (3871628)
2015-12-11T19:30:19.095+00:00 kube-master1 [err] [kube-apiserver] E1211 19:30:18.740328 26551 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 3871622 (3871628)
2015-12-11T19:30:19.110+00:00 kube-master1 [err] [kube-apiserver] E1211 19:30:18.742972 26551 errors.go:62] apiserver received an error that is not an unversioned.Status: too old resource version: 3871210 (3871628)
I believe these errors are caused by this new feature in 1.1, the adding of the --watch-cache option by default. The errors cease at the end of the rolling upgrade.
I would like to know how to explain these errors, if they can be safely ignored, and how to change the system to avoid them in the future (for a longer term solution).
Yes - as you suggested, those errors are related to the new feature of serving watch from in-memory cache in apiserver.
So, if I understand correctly, what happened here is that:
- you upgraded (or in general restarted) apiserver
- this cause all the existing watch connections to terminate
- once apiserver started successfully, it regenerated its internal in-memory cache
- since watch can have some delays, it's possible that clients (that were renewing their watch connections) were slightly behind
- this caused generating such error, and forced clients to relist and start watching from the new point
IIUC, those errors were present only during upgrade and disappeared after - so that's good.
In other words - such errors may appear on update (or in general immediately after any restart of apiserver). In such situations they may be safely ignore.
In fact, those shouldn't probably be errors - we can probably change them to warnings.