Tableau-cluster-linux CloudFormation template failing while creating bastion autoscaling group - aws-cloudformation

The following resource(s) failed to create: [BastionAutoScalingGroup].
BastionAutoScalingGroup CREATE_FAILED Received 1 FAILURE signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement
Log from the bastion host:
grep -ni 'error\|failure' $(sudo find /var/log -name cfn* -or -name cloud-init*)
/var/log/cloud-init.log:458:Apr 22 08:19:08 cloud-init[3012]: main.py[DEBUG]: Ran 13 modules with 0 failures
/var/log/cloud-init.log:560:Apr 22 08:19:19 cloud-init[3208]: main.py[DEBUG]: Ran 10 modules with 0 failures
/var/log/cloud-init.log:594:ProcessExecutionError: Unexpected error while running command.
/var/log/cloud-init.log:615:RuntimeError: Runparts: 1 failures in 1 attempted commands
/var/log/cloud-init.log:651:Apr 22 08:20:09 cloud-init[3415]: main.py[DEBUG]: Ran 9 modules with 1 failures
/var/log/cloud-init-output.log:212:--> Processing Dependency: perl(Error) for package: perl-Git-2.23.1-1.amzn2.0.2.noarch
/var/log/cloud-init-output.log:215:---> Package perl-Error.noarch 1:0.17020-2.amzn2 will be installed
/var/log/cloud-init-output.log:230: perl-Error noarch 1:0.17020-2.amzn2 amzn2-core 32 k
/var/log/cloud-init-output.log:249: Installing : 1:perl-Error-0.17020-2.amzn2.noarch 3/8
/var/log/cloud-init-output.log:261: Verifying : 1:perl-Error-0.17020-2.amzn2.noarch 7/8
/var/log/cloud-init-output.log:272: perl-Error.noarch 1:0.17020-2.amzn2
/var/log/cloud-init-output.log:842:Error occurred during build: Command b-bootstrap failed
/var/log/cfn-init.log:20:2020-04-22 08:20:08,653 [ERROR] Command b-bootstrap (./bastion_bootstrap.sh --banner https://aws-quickstart.s3.us-east-1.amazonaws.com/quickstart-linux-bastion/scripts/banner_message.txt --enable false --tcp-forwarding false --x11-forwarding false) failed
/var/log/cfn-init.log:35:2020-04-22 08:20:08,654 [ERROR] Error encountered during build of config: Command b-bootstrap failed
/var/log/cfn-init.log:42: raise ToolError(u"Command %s failed" % name)
/var/log/cfn-init.log:43:ToolError: Command b-bootstrap failed
/var/log/cfn-init.log:44:2020-04-22 08:20:08,656 [ERROR] -----------------------BUILD FAILED!------------------------
/var/log/cfn-init.log:45:2020-04-22 08:20:08,660 [ERROR] Unhandled exception during build: Command b-bootstrap failed
/var/log/cfn-init.log:58: raise ToolError(u"Command %s failed" % name)
/var/log/cfn-init.log:59:ToolError: Command b-bootstrap failed
/var/log/cfn-init.log:61:2020-04-22 08:20:08,882 [DEBUG] Signaling resource BastionAutoScalingGroup in stack Tableau-ubuntu-BastionHost-1FMT6T80AVICN with unique ID i-0ba02ad1171cdc569 and status FAILURE
/var/log/cfn-wire.log:3:2020-04-22 08:20:08,938 [DEBUG] Response: 200 https://cloudformation.us-east-2.amazonaws.com/?Status=FAILURE&ContentType=JSON&StackName=Tableau-ubuntu-BastionHost-1FMT6T80AVICN&Version=2010-05-15&UniqueId=i-0ba02ad1171cdc569&Action=SignalResource&LogicalResourceId=BastionAutoScalingGroup [headers: {'x-amzn-requestid': '9a5ea98d-c6bb-49c9-a6d1-0b663848c9f3', 'date': 'Wed, 22 Apr 2020 08:20:08 GMT', 'content-length': '100', 'content-type': 'application/json'}]
/var/log/cfn-init-cmd.log:23:2020-04-22 08:20:08,653 P12339 [ERROR] Exited with error code 6
Logs from the primary node:
grep -ni 'error\|failure' $(sudo find /var/log -name cfn* -or -name cloud-init*)
/var/log/cloud-init-output.log:2099:There are 1 topology validation errors/warnings.
/var/log/cloud-init-output.log:2113:There are 1 topology validation errors/warnings.
/var/log/cloud-init-output.log:2127:There are 1 topology validation errors/warnings.
/var/log/cloud-init-output.log:2141:There are 1 topology validation errors/warnings.
/var/log/cloud-init-output.log:2155:There are 1 topology validation errors/warnings.
/var/log/cloud-init-output.log:2169:There are 1 topology validation errors/warnings.
/var/log/cloud-init-output.log:2183:There are 1 topology validation errors/warnings.
/var/log/cloud-init-output.log:2197:There are 1 topology validation errors/warnings.
/var/log/cloud-init-output.log:2211:There are 1 topology validation errors/warnings.
/var/log/cloud-init-output.log:2225:There are 1 topology validation errors/warnings.
/var/log/cloud-init-output.log:2239:There are 1 topology validation errors/warnings.
/var/log/cloud-init-output.log:2254:There are 1 topology validation errors/warnings.
/var/log/cloud-init-output.log:2268:There are 1 topology validation errors/warnings.
/var/log/cloud-init-output.log:2281:There are 2 topology validation errors/warnings.
/var/log/cloud-init-output.log:2305:There are 3 topology validation errors/warnings.
/var/log/cloud-init-output.log:2475:There are 3 topology validation errors/warnings.
/var/log/cloud-init-output.log:2485:Error: This configuration is invalid. Configuring Backgrounder node roles for
/var/log/cloud-init-output.log:2496:Error: This configuration is invalid. Configuring Backgrounder node roles for
/var/log/cloud-init.log:856:2020-04-22 08:18:10,148 - main.py[DEBUG]: Ran 15 modules with 0 failures
/var/log/cloud-init.log:981:2020-04-22 08:18:23,213 - main.py[DEBUG]: Ran 14 modules with 0 failures
/var/log/cloud-init.log:1078:cloudinit.util.ProcessExecutionError: Unexpected error while running command.
/var/log/cloud-init.log:1105:RuntimeError: Runparts: 1 failures in 1 attempted commands
/var/log/cloud-init.log:1141:2020-04-22 08:39:47,298 - main.py[DEBUG]: Ran 20 modules with 1 failures
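The bastion logs show cfn-init failing because the b-bootstrap command exited with code 6, so a common next step is to keep the failed instance around (for example by creating the stack with rollback disabled) and re-run that step by hand to see the bootstrap script's full output. A minimal sketch, assuming the layout implied by the logs above; the logical resource name BastionLaunchConfiguration is a placeholder, and the exact cfn-init arguments and script location are recorded in your template's UserData and in /var/log/cfn-init-cmd.log:
# Re-run the failed config set on the bastion instance (stack name taken from the
# logs above; replace the resource name with the one from your template):
sudo /opt/aws/bin/cfn-init -v \
    --stack Tableau-ubuntu-BastionHost-1FMT6T80AVICN \
    --resource BastionLaunchConfiguration \
    --region us-east-2
# Or re-run the bootstrap script exactly as cfn-init invoked it, from the directory
# where cfn-init downloaded it, and watch where it fails:
sudo bash -x ./bastion_bootstrap.sh \
    --banner https://aws-quickstart.s3.us-east-1.amazonaws.com/quickstart-linux-bastion/scripts/banner_message.txt \
    --enable false --tcp-forwarding false --x11-forwarding false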

Related

gitlab job pod exits unexpectedly

I currently have GitLab Runner deployed in my Kubernetes cluster with 2 replicas.
When I run a job in GitLab, the runners successfully spawn pods that run the pipeline. But in some cases, after the pipeline runs the job, I suddenly get the error
Running after_script
00:00
Uploading artifacts for failed job
00:00
Cleaning up project directory and file based variables
00:00
ERROR: Job failed (system failure): pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found
When I have a look at the runner logs, all I see is
gitlab-runners-exchange-587cdbf898-pkgt2 | grep "runner-hzfiusrx-project-37057717-concurrent-21gs8vm"
WARNING: Error streaming logs exchange/runner-hzfiusrx-project-37057717-concurrent-21gs8vm/helper:/logs-37057717-2986450184/output.log: command terminated with exit code 137. Retrying... job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Error streaming logs exchange/runner-hzfiusrx-project-37057717-concurrent-21gs8vm/helper:/logs-37057717-2986450184/output.log: pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found. Retrying... job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Error while executing file based variables removal script error=couldn't get pod details: pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found job=2986450184 project=37057717 runner=hzFiusRx
ERROR: Job failed (system failure): pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found duration_s=2067.525269137 job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Failed to process runner builds=32 error=pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found executor=kubernetes runner=hzFiusRx
I'm trying to understand the issue here.
My kubernetes runner config is
[runners.kubernetes]
host = ""
bearer_token_overwrite_allowed = true
image = "ubuntu:20.04"
namespace = "exchange"
namespace_overwrite_allowed = ""
privileged = true
cpu_request = "5"
memory_request = "25Gi"
The nodes on which the job pods get scheduled have the following capacity
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 8
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32523380Ki
pods: 58
So what exactly might be the issue here? The CPU and memory dimensioning for the nodes seem correct.
Looking at the utilization, everything seems good too.
So what might be the issue here? Is it that Kubernetes/GitLab is not gracefully killing the job pod? Or does it need more memory?
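For what it's worth, the exit code 137 in the helper log is the classic sign that the container was killed rather than exiting on its own (most often the kernel OOM killer or a node-pressure eviction), so a reasonable first check is the pod and node events rather than the runner itself. A sketch of that check, using the exchange namespace and the pod name from the logs above:
# Look for OOM kills or evictions around the time the job pod disappeared:
kubectl get events -n exchange --sort-by=.lastTimestamp | grep -iE 'oom|evict|kill'
# If the failed pod still exists, its status carries the termination reason:
kubectl describe pod -n exchange runner-hzfiusrx-project-37057717-concurrent-21gs8vm
# Check whether the node the job ran on reported memory pressure:
kubectl describe node <node-name> | grep -A 6 'Conditions:'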

zookeeper-server-start throws exception: ERROR Invalid config, exiting abnormally

I just installed Kafka using brew install kafka and it was successful. Now when I try to start ZooKeeper I get the following error:
 ~/ zookeeper-server-start /opt/homebrew/etc/kafka/server.properties
[2022-08-17 12:03:46,961] INFO Reading configuration from: /opt/homebrew/etc/kafka/server.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
[2022-08-17 12:03:46,964] ERROR Invalid config, exiting abnormally (org.apache.zookeeper.server.quorum.QuorumPeerMain)
org.apache.zookeeper.server.quorum.QuorumPeerConfig$ConfigException: Error processing /opt/homebrew/etc/kafka/server.properties
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:198)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:124)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
Caused by: java.lang.IllegalArgumentException: dataDir is not set
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parseProperties(QuorumPeerConfig.java:444)
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:194)
... 2 more
Invalid config, exiting abnormally
[2022-08-17 12:03:46,965] INFO ZooKeeper audit is disabled. (org.apache.zookeeper.audit.ZKAuditProvider)
[2022-08-17 12:03:46,966] ERROR Exiting JVM with code 2 (org.apache.zookeeper.util.ServiceUtils)
 ~/
The server.properties file exists:
-rw-r--r-- 1 atael admin 6912 Aug 17 11:51 /opt/homebrew/etc/kafka/server.properties
My environment is an M1 Mac, if that matters.
Has anyone seen this before and can help?
Thanks
Andy
Found the problem. The right command should be zookeeper-server-start /opt/homebrew/etc/kafka/zookeeper.properties
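In other words, zookeeper-server-start expects the ZooKeeper properties file (the one that defines dataDir), while server.properties configures the Kafka broker itself. Assuming the Homebrew paths from the question, the usual startup order looks like this:
# Terminal 1: start ZooKeeper with its own properties file (this one sets dataDir):
zookeeper-server-start /opt/homebrew/etc/kafka/zookeeper.properties
# Terminal 2: start the Kafka broker with server.properties:
kafka-server-start /opt/homebrew/etc/kafka/server.properties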

Ceph got timeout

I upgraded my Proxmox cluster from 6 to 7. After upgrading, Ceph services are not responding. Any command in the console, for example ceph -s, hangs and does not return any result, and in the web interface I see Error got timeout (500).
In the log I saw the error cluster [WRN] overall HEALTH_WARN noout flag(s) set; 1/4 mons down, quorum server-1, server-2, server-3
But when I try to execute ceph osd unset noout I see this error
2021-12-03T18:59:17.068+0300 7fac8c30d700 0 monclient(hunting): authenticate timed out after 300
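An "authenticate timed out" message from monclient (hunting) generally means the client cannot reach a healthy monitor at all, which matches the 1/4 mons down warning. A speculative debugging sketch, assuming the standard ceph-mon@<hostname> service naming on each Proxmox node:
# On each node, check whether its monitor daemon survived the upgrade:
systemctl status ceph-mon@$(hostname)
# Restart the monitor on the node that dropped out of quorum:
systemctl restart ceph-mon@$(hostname)
# Query the cluster with an explicit timeout so a dead mon does not hang the shell:
ceph -s --connect-timeout 15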

How to use the Akka sample cluster kubernetes with Scala and minikube?

I am trying to run the akka-sample-cluster-kubernetes-scala sample, as recommended, to deploy an Akka cluster onto minikube using akka-management-cluster-bootstrap. After running every step recommended in the README file I can see the pods running in my kubectl output:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
appka-8b88f7bdd-485nx 1/1 Running 0 48m
appka-8b88f7bdd-4blrv 1/1 Running 0 48m
appka-8b88f7bdd-7qlc9 1/1 Running 0 48m
When I execute the ./scripts/test.sh it seems to fail on the last step:
"No 3 MemberUp log events found"
And I cannot connect to the address given in the README file. The error:
$ curl http://127.0.0.1:8558/cluster/members/
curl: (7) Failed to connect to 127.0.0.1 port 8558: Connection refused
From here on I describe how I tried to find the cause of the error mentioned above. I supposed I had to execute sbt run, even though it is not mentioned in the sample project, and then I got the following error with respect to the ${REQUIRED_CONTACT_POINT_NR} variable in application.conf:
[error] Exception in thread "main"
com.typesafe.config.ConfigException$UnresolvedSubstitution:
application.conf #
jar:file:/home/felipe/workspace-idea/akka-sample-cluster-kubernetes-scala/target/bg-jobs/sbt_12a05599/job-1/target/d9ddd12d/64fe375d/akka-sample-cluster-kubernetes_2.13-0.0.0-70-49d6a855-20210104-1057.jar!/application.conf:
19: Could not resolve substitution to a value:
${REQUIRED_CONTACT_POINT_NR}
#management-config
akka.management {
  cluster.bootstrap {
    contact-point-discovery {
      # pick the discovery method you'd like to use:
      discovery-method = kubernetes-api
      required-contact-point-nr = ${REQUIRED_CONTACT_POINT_NR}
    }
  }
}
#management-config
So I suppose it is not picking up the REQUIRED_CONTACT_POINT_NR environment variable defined in the kubernetes/akka-cluster.yml file. Changing it to required-contact-point-nr = 3 or 4, I get the error:
[error] SLF4J: A number (4) of logging calls during the initialization phase have been intercepted and are
[error] SLF4J: now being replayed. These are subject to the filtering rules of the underlying logging system.
[error] SLF4J: See also http://www.slf4j.org/codes.html#replay
...
[info] [2021-01-04 11:00:57,373] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [] [appka-akka.actor.default-dispatcher-3] - Shutting down remote daemon. MDC: {akkaAddress=akka://appka#127.0.0.1:25520, sourceThread=appka-akka.remote.default-remote-dispatcher-9, akkaSource=akka://appka#127.0.0.1:25520/system/remoting-terminator, sourceActorSystem=appka, akkaTimestamp=10:00:57.373UTC}
[info] [2021-01-04 11:00:57,376] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [] [appka-akka.actor.default-dispatcher-3] - Remote daemon shut down; proceeding with flushing remote transports. MDC: {akkaAddress=akka://appka#127.0.0.1:25520, sourceThread=appka-akka.remote.default-remote-dispatcher-9, akkaSource=akka://appka#127.0.0.1:25520/system/remoting-terminator, sourceActorSystem=appka, akkaTimestamp=10:00:57.375UTC}
[info] [2021-01-04 11:00:57,414] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [] [appka-akka.actor.default-dispatcher-3] - Remoting shut down. MDC: {akkaAddress=akka://appka#127.0.0.1:25520, sourceThread=appka-akka.remote.default-remote-dispatcher-9, akkaSource=akka://appka#127.0.0.1:25520/system/remoting-terminator, sourceActorSystem=appka, akkaTimestamp=10:00:57.414UTC}
[error] Nonzero exit code returned from runner: 255
[error] (Compile / run) Nonzero exit code returned from runner: 255
[error] Total time: 6 s, completed Jan 4, 2021 11:00:57 AM
You are getting your contact point error because you are trying to use sbt run. sbt run will run a single instance outside of minikube, which isn't what you want. And since it's running outside of minikube, it won't pick up the environment variables being set in the container spec. The scripts do the build/deploy, and you should not need to run sbt manually.
Also, the main error is not the connection to 8558; I don't believe the configuration exposes that admin port outside of minikube.
The fact that all three containers report a status Running indicates that you may actually have a running cluster and the test script may just be missing the messages in the logs. As others have said in comments, the logs from one of the containers would be helpful in determining whether you have a working cluster, and diagnosing any problems in cluster formation.
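Following up on the suggestion above, one way to confirm whether the cluster actually formed is to grep a pod's log for the membership events the test script looks for. A small sketch, using a pod name from the kubectl output in the question (the deployment name appka is assumed from the pod names):
# Look for cluster-formation events in one replica's log:
kubectl logs appka-8b88f7bdd-485nx | grep -i 'MemberUp'
# Or via the deployment (streams the log of one of its pods):
kubectl logs deployment/appka --tail=200 | grep -iE 'MemberUp|Welcome|Leader'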
The answer and the comments on my question are right. I don't need to run sbt run to see the output in the web browser. What I was missing was port-forwarding the port of one of the cluster pods to my local machine. This is not specified in the akka-sample-cluster-kubernetes-scala README, but I believe that is because they run directly on the Google Kubernetes Engine platform and not inside minikube first.
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
appka-1 appka-8b88f7bdd-97zfl 1/1 Running 0 9m10s
appka-1 appka-8b88f7bdd-bhv44 1/1 Running 0 9m10s
appka-1 appka-8b88f7bdd-ff76s 1/1 Running 0 9m10s
$ #### THIS COMMAND IS NOT IN THE README FILE OF THE DEMO ####
$ kubectl port-forward appka-8b88f7bdd-ff76s 8080
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
now I can see the output:
$ http GET http://127.0.0.1:8080/
HTTP/1.1 200 OK
Content-Length: 11
Content-Type: text/plain; charset=UTF-8
Date: Wed, 06 Jan 2021 10:59:40 GMT
Server: akka-http/10.2.1
Hello world
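The same port-forward trick should also work for the Akka Management endpoint that the earlier curl to port 8558 could not reach, assuming the sample leaves the management port at its default:
# Forward the management port of one pod (pod name from the kubectl output above):
kubectl port-forward appka-8b88f7bdd-ff76s 8558
# In a second terminal, query cluster membership locally:
curl http://127.0.0.1:8558/cluster/members/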

Simple Container Toolchain - failed to deploy : Exceeded maximum number of retries

I'm trying to run the deploy pipeline created by the Simple Container Toolchain example. The output of the deployment log is:
...
2017-07-03 15:49:43 UTC : creating group: /tmp/extension_content/cf ic group create --name hello-containers-XXXX_2 --publish 80 --desired 2 --min 1 --max 6 registry.ng.bluemix.net/chstest/hello-containers-XXXX:1
OK
The creation of the container group was requested.
The container group "hello-containers-XXXX_2" (ID: YYYY) was created.
Minimum container instances: 1
Maximum container instances: 6
Desired container instances: 2
2017-07-03 15:49:49 UTC : hello-containers-XXXX_2 is 'CREATE_IN_PROGRESS'
2017-07-03 15:49:53 UTC : hello-containers-XXXX_2 is 'CREATE_IN_PROGRESS'
...
... CREATE_IN_PROGRESS message repeated about 150 times
...
2017-07-03 16:02:51 UTC : hello-containers-XXXX_2 is 'CREATE_IN_PROGRESS'
2017-07-03 16:02:55 UTC : hello-containers-XXXX_2 is 'CREATE_IN_PROGRESS'
2017-07-03 16:02:58 UTC : Create group is not completed and stays in status 'CREATE_IN_PROGRESS'
2017-07-03 16:02:58 UTC : Failed to deploy group
To send notifications, set SLACK_WEBHOOK_PATH or HIP_CHAT_TOKEN in the environment
Finished: FAILED
If I navigate to the container dashboard in Bluemix, I see the following error log:
Group failed
Resource CREATE failed: ResourceInError: resources.asg.resources.i2jghszuv3br.resources.server: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance ZZZZ. Last exception: [u'Traceback (most recent call last): \n', u' File "/opt/bbc/openstack-12.1.90/nova/local/lib/python2.7/site-packages, Code: 500"
How can I debug this further?
Please open a Bluemix support ticket against the Containers SRE Team. Do this by logging into your Bluemix account; in the upper right hand corner there is an option for "Support" -> "Add Ticket".