Error submitting COMPSs application to a Cluster


I am getting an error when submitting a COMPSs application to a cluster.
I execute this:
$ enqueue_compss --lang=python /home/bsc21/bsc21863/simple.py 3
And receive this error:
/gpfs/apps/MN3/COMPSs/1.4/Runtime/scripts/user/../queues/lsf/submit.sh:
line 1: #!/bin/bash: No such file or directory
Queue: default
Reservation disabled
Num Nodes: 2
Num Switches: 0
Job dependency: None
Exec-Time: 00:10
Network: ethernet
Node memory: disabled
Tasks per Node: 16
Tasks in Master: 0
Master Port: 43306
Master WD: .
Worker WD: scratch
Library Path: .
Classpath: .
COMM: integratedtoolkit.nio.master.NIOAdaptor
To COMPSs: --lang=python /home/bsc21/bsc21863/simple.py 3

The submission seems correct. You can ignore the message line 1: #!/bin/bash: No such file or directory. I do not know why it appears, but it does not stop the submission. Have you checked whether a job was submitted after running the command, or whether any .out or .err files appeared afterwards?


How to fix Unsupported Config Type "" error in Hyperledger Fabric on Kubernetes?

I am trying to follow this tutorial on deploying Hyperledger Fabric on Kubernetes. But instead of IBM Cloud, I'm doing it with Google Cloud. I encountered this same issue (see my logs below) and tried:
changing docker image to docker:18.09-dind in docker.yaml.
setting FABRIC_CFG_PATH=$PWD/configFiles instead of FABRIC_CFG_PATH=$PWD in create_channel.yaml according to another StackOverflow answer.
However, these workarounds did not work for me and I still encounter the error.
How do I fix this to be able to successfully deploy the network?
> ./setup_blockchainNetwork.sh
peersDeployment.yaml file was configured to use Docker in a container.
Creating Docker deployment
persistentvolume/docker-pv created
persistentvolumeclaim/docker-pvc created
service/docker created
deployment.apps/docker-dind created
Creating volume
The Persistant Volume does not seem to exist or is not bound
Creating Persistant Volume
Running: kubectl create -f /home/me/blockchain-network-on-kubernetes/configFiles/createVolume.yaml
persistentvolume/shared-pv created
persistentvolumeclaim/shared-pvc created
Success creating Persistant Volume
Creating Copy artifacts job.
Running: kubectl create -f /home/me/blockchain-network-on-kubernetes/configFiles/copyArtifactsJob.yaml
job.batch/copyartifacts created
Wating for container of copy artifact pod to run. Current status of copyartifacts-dcg4m is Pending
copyartifacts-dcg4m is now Running
Starting to copy artifacts in persistent volume.
Waiting for 10 more seconds for copying artifacts to avoid any network delay
Waiting for copyartifacts job to complete
Copy artifacts job completed
Generating the required artifacts for Blockchain network
Running: kubectl create -f /home/me/blockchain-network-on-kubernetes/configFiles/generateArtifactsJob.yaml
job.batch/utils created
Waiting for generateArtifacts job to complete
Waiting for generateArtifacts job to complete
Creating Services for blockchain network
Running: kubectl create -f /home/me/blockchain-network-on-kubernetes/configFiles/blockchain-services.yaml
service/blockchain-ca created
service/blockchain-orderer created
service/blockchain-org1peer1 created
service/blockchain-org2peer1 created
service/blockchain-org3peer1 created
service/blockchain-org4peer1 created
Creating new Deployment to create four peers in network
Running: kubectl create -f /home/me/blockchain-network-on-kubernetes/configFiles/peersDeployment.yaml
deployment.apps/blockchain-orderer created
deployment.apps/blockchain-ca created
deployment.apps/blockchain-org1peer1 created
deployment.apps/blockchain-org2peer1 created
deployment.apps/blockchain-org3peer1 created
deployment.apps/blockchain-org4peer1 created
Checking if all deployments are ready
Waiting for 15 seconds for peers and orderer to settle
Creating channel transaction artifact and a channel
Running: kubectl create -f /home/me/blockchain-network-on-kubernetes/configFiles/create_channel.yaml
job.batch/createchannel created
Waiting for createchannel job to be completed
Waiting for createchannel job to be completed
Create Channel Failed
> kubectl get pods
NAME READY STATUS RESTARTS AGE
blockchain-ca-58b4bbbcc7-dqmnw 1/1 Running 0 30s
blockchain-orderer-ddc9466d-2sqt8 1/1 Running 0 30s
blockchain-org1peer1-ffbf698bb-fd6nf 1/1 Running 0 29s
blockchain-org2peer1-98f7fb5f9-mb5m7 1/1 Running 0 29s
blockchain-org3peer1-75d6b8bf5c-bxd24 1/1 Running 0 29s
blockchain-org4peer1-675669ffff-b4dxj 1/1 Running 0 29s
copyartifacts-dcg4m 0/1 Completed 0 60s
createchannel-9wt54 1/2 Error 0 12s
docker-dind-54767c54c5-crk7b 0/1 CrashLoopBackOff 3 73s
utils-wbpcz 0/2 Completed 0 37s
> kubectl logs createchannel-9wt54 -c createchanneltx
/shared
systemd-private-3cbb0a492497473087eda0bb66fbd738-systemd-networkd.service-QHqKfL
systemd-private-3cbb0a492497473087eda0bb66fbd738-systemd-resolved.service-NuNfWF
systemd-private-3cbb0a492497473087eda0bb66fbd738-systemd-timesyncd.service-SzE37R
2021-02-03 08:49:16.970 UTC [common.tools.configtxgen] main -> INFO 001 Loading configuration
2021-02-03 08:49:16.970 UTC [common.tools.configtxgen.localconfig] Load -> PANI 002 Error reading configuration: Unsupported Config Type ""
2021-02-03 08:49:16.970 UTC [common.tools.configtxgen] func1 -> PANI 003 Error reading configuration: Unsupported Config Type ""
panic: Error reading configuration: Unsupported Config Type "" [recovered]
panic: Error reading configuration: Unsupported Config Type ""
...

The FABRIC_CFG_PATH setting is wrong.
This error occurs when configtx.yaml has a syntax problem or when the file cannot be found at the expected path.
configtxgen reads the configtx.yaml file under FABRIC_CFG_PATH.
In the tutorial you linked, configtx.yaml is not in the configFiles directory; it is in the artifacts directory.
Here are two of the simplest fixes.
Move artifacts/configtx.yaml to configFiles/configtx.yaml:
mv ./artifacts/configtx.yaml configFiles/configtx.yaml
Or set FABRIC_CFG_PATH to the artifacts directory:
export FABRIC_CFG_PATH=${PWD}/artifacts
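Before rerunning the job, it may help to confirm which directory actually contains configtx.yaml. The following is a minimal Python sketch, not part of the tutorial: the helper name find_configtx is hypothetical, and the candidate directory names are simply the two mentioned above.

```python
import os
from typing import Iterable, Optional


def find_configtx(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate directory that contains configtx.yaml, else None.

    Hypothetical helper for illustration; it only checks that the file exists,
    not that its YAML syntax is valid.
    """
    for directory in candidates:
        if os.path.isfile(os.path.join(directory, "configtx.yaml")):
            return directory
    return None


# Example usage (directory names assumed from the discussion above):
# base = os.getcwd()
# print(find_configtx([os.path.join(base, "configFiles"),
#                      os.path.join(base, "artifacts")]))
```

Whatever directory this returns is the value FABRIC_CFG_PATH should point to; a None result suggests the file is missing from both places.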

Kubernetes pod marked as `Completed` despite the exit code `255`

Situation:
I've got a CronJob that often fails (this is expected at the moment). Because the container performing the job has a sidecar, the dependencies between the containers are expressed through bash scripts and a shared emptyDir mount at /etc/liveness:
spec:
containers:
- args:
- -c
- set -x;
...
./process; # execute the main process
rc=$?;
rm /etc/liveness; # clean-up
exit $rc;
command:
- /bin/bash
Problem:
In the scenarios where the job fails, I see the following in the logs:
+ rc=255
+ rm /etc/liveness
+ exit 255
With restartPolicy set to Never, the failed pod shows the Completed status, which is misleading:
scheduler-1594015200-wl9xc 0/2 Completed 0 24m

According to the official documentation,
A Job creates one or more Pods and ensures that a specified number of
them successfully terminate.
And a container enters the Terminated state when
it has successfully completed execution or when it has failed for some
reason.
So if you set restartPolicy to Never, this is what will happen.

A Pod's status field is a PodStatus object, which has a phase field.
Ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase
Status and phase are not the same. What happens above is that my pods end up with status Completed and phase Failed.
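The distinction can be checked on the Pod object itself. Below is a minimal Python sketch, assuming the Pod has been exported with kubectl get pod <name> -o json and loaded into a dict. The field names (status.phase, status.containerStatuses[].state.terminated) come from the Pod API; the function name summarize_pod and the sample values are made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def summarize_pod(pod: Dict[str, Any]) -> Tuple[str, List[Tuple[str, str, int]]]:
    """Return (phase, [(container name, terminated reason, exit code), ...]).

    Only containers that are in the terminated state are listed.
    """
    status = pod.get("status", {})
    phase = status.get("phase", "Unknown")
    terminated = []
    for cs in status.get("containerStatuses", []):
        term = cs.get("state", {}).get("terminated")
        if term is not None:
            terminated.append((cs["name"], term.get("reason", ""), term["exitCode"]))
    return phase, terminated


# Hypothetical sample resembling a two-container job pod (values are illustrative).
sample_pod = {
    "status": {
        "phase": "Failed",
        "containerStatuses": [
            {"name": "main", "state": {"terminated": {"exitCode": 255, "reason": "Error"}}},
            {"name": "sidecar", "state": {"terminated": {"exitCode": 0, "reason": "Completed"}}},
        ],
    }
}
print(summarize_pod(sample_pod))
```

Reading the phase and the per-container exit codes this way avoids relying on the single STATUS column printed by kubectl get pods.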

Kubernetes 1.15.5 and romana 2.0.2 getting network errors when ANY pods added or removed

I have encountered some mysterious network errors in our Kubernetes cluster. Although I originally encountered these errors using ingress, there are even more errors when I bypass our load balancer, kube-proxy and nginx-ingress. Most errors occur when going directly to services and straight to the pod IPs. I believe this is because the load balancer and nginx have better error handling than the raw iptables routing.
To test the error I use Apache Benchmark (ab) from a VM on the same subnet, at any concurrency level, with no keep-alive, connecting to the pod IP and using a request count high enough to give me time to scale a deployment up or down. The odd thing is that it doesn't matter which deployment I modify: it always causes the same set of errors, even when it is unrelated to the pod I am load testing. ANY addition or removal of pods triggers Apache Benchmark errors. Manual deletions, scaling up/down and auto-scaling all trigger errors. If there are no pod changes while the ab test is running, no errors are reported. Note that keep-alive seems to greatly reduce, if not eliminate, the errors, but I only tested that a handful of times and never saw an error.
Other than some bizarre iptables conflict, I really don't see how deleting pod A can affect the network connections of pod B. Since the errors are brief and go away within seconds, it looks more like a brief network outage.
Sample ab test: ab -n 5000 -c 2 https://10.112.0.24/
Errors when using HTTPS:
SSL handshake failed (5).
SSL read failed (5) - closing connection
Errors when using HTTP:
apr_socket_recv: Connection reset by peer (104)
apr_socket_recv: Connection refused (111)
Example ab output. I pressed Ctrl-C after encountering the first errors:
$ ab -n 5000 -c 2 https://10.112.0.24/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 10.112.0.24 (be patient)
Completed 500 requests
Completed 1000 requests
SSL read failed (5) - closing connection
Completed 1500 requests
^C
Server Software: nginx
Server Hostname: 10.112.0.24
Server Port: 443
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
Document Path: /
Document Length: 2575 bytes
Concurrency Level: 2
Time taken for tests: 21.670 seconds
Complete requests: 1824
Failed requests: 2
(Connect: 0, Receive: 0, Length: 1, Exceptions: 1)
Total transferred: 5142683 bytes
HTML transferred: 4694225 bytes
Requests per second: 84.17 [#/sec] (mean)
Time per request: 23.761 [ms] (mean)
Time per request: 11.881 [ms] (mean, across all concurrent requests)
Transfer rate: 231.75 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 5 15 9.8 12 82
Processing: 1 9 9.0 6 130
Waiting: 0 8 8.9 6 129
Total: 7 23 14.4 19 142
Percentage of the requests served within a certain time (ms)
50% 19
66% 24
75% 28
80% 30
90% 40
95% 54
98% 66
99% 79
100% 142 (longest request)
Current sysctl settings that may be relevant:
net.netfilter.nf_conntrack_tcp_be_liberal = 1
net.nf_conntrack_max = 131072
net.netfilter.nf_conntrack_buckets = 65536
net.netfilter.nf_conntrack_count = 1280
net.ipv4.ip_local_port_range = 27050 65500
I didn't see any conntrack "full" errors. Best I could tell there isn't packet loss. We recently upgraded from 1.14 and didn't notice the issue but I can't say for certain it wasn't there. I believe we will be forced to migrate away from romana soon since it doesn't seem to be maintained anymore and as we upgrade to kube 1.16.x we are encountering problems with it starting up.
I have searched the internet all day today looking for similar problems, and the closest match to ours is https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02 but I have no idea how to implement the iptables masquerade --random-fully option given that we use romana, and I read (https://github.com/kubernetes/kubernetes/pull/78547#issuecomment-527578153) that random-fully is the default for Linux kernel 5, which we are using. Any ideas?
kubernetes 1.15.5
romana 2.0.2
centos7
Linux kube-master01 5.0.7-1.el7.elrepo.x86_64 #1 SMP Fri Apr 5 18:07:52 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux
====== Update Nov 5, 2019 ======
It has been suggested to test an alternate CNI. I chose Calico since we used it in an older Debian-based kube cluster. I rebuilt a VM with our most basic CentOS 7 template (vSphere), so there is little baggage coming from our customizations. I can't list everything we customized in our template, but the most notable change is the kernel 5 upgrade: yum --enablerepo=elrepo-kernel -y install kernel-ml.
After starting up the VM these are the minimal steps to install kubernetes and run the test:
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum -y install docker-ce-3:18.09.6-3.el7.x86_64
systemctl start docker
cat <<EOF > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
# Set SELinux in permissive mode (effectively disabling it)
setenforce 0
sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
echo '1' > /proc/sys/net/bridge/bridge-nf-call-iptables
yum install -y kubeadm-1.15.5-0 kubelet-1.15.5-0 kubectl-1.15.5-0
systemctl enable --now kubelet
kubeadm init --pod-network-cidr=192.168.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml
cat <<EOF > /tmp/test-deploy.yml
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
name: test
spec:
selector:
matchLabels:
app: test
replicas: 1
template:
metadata:
labels:
app: test
spec:
containers:
- name: nginx
image: nginxdemos/hello
ports:
- containerPort: 80
EOF
# wait for control plane to become healthy
kubectl apply -f /tmp/test-deploy.yml
Now the setup is ready and this is the ab test:
$ docker run --rm jordi/ab -n 100 -c 1 http://192.168.4.4/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.4.4 (be patient)...apr_pollset_poll: The timeout specified has expired (70007)
Total of 11 requests completed
The ab test gives up after this error. If I decrease the number of requests to avoid the timeout, this is what you would see:
$ docker run --rm jordi/ab -n 10 -c 1 http://192.168.4.4/
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.4.4 (be patient).....done
Server Software: nginx/1.13.8
Server Hostname: 192.168.4.4
Server Port: 80
Document Path: /
Document Length: 7227 bytes
Concurrency Level: 1
Time taken for tests: 0.029 seconds
Complete requests: 10
Failed requests: 0
Total transferred: 74140 bytes
HTML transferred: 72270 bytes
Requests per second: 342.18 [#/sec] (mean)
Time per request: 2.922 [ms] (mean)
Time per request: 2.922 [ms] (mean, across all concurrent requests)
Transfer rate: 2477.50 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.8 1 3
Processing: 1 2 1.2 1 4
Waiting: 0 1 1.3 0 4
Total: 1 3 1.4 3 5
Percentage of the requests served within a certain time (ms)
50% 3
66% 3
75% 4
80% 5
90% 5
95% 5
98% 5
99% 5
100% 5 (longest request)
This issue is technically different from the original issue I reported, but this is a different CNI and there are still network issues. It does have the timeout error in common with the kube/romana cluster when I run the same test there, that is, the ab test on the same node as the pod. Both encountered the same timeout error, but with romana I could get a few thousand requests to finish before hitting the timeout. Calico hits the timeout error before reaching a dozen requests.
Other variants or notes:
- net.netfilter.nf_conntrack_tcp_be_liberal=0/1 doesn't seem to make a difference
- higher -n values sometimes work but it is largely random.
- running the 'ab' test at low -n values several times in a row can sometimes trigger the timeout
At this point I am pretty sure it is some issue with our centos installation but I can't even guess what it could be. Are there any other limits, sysctl or other configs that could cause this?
====== Update Nov 6, 2019 ======
I observed that we had an older kernel installed, so I upgraded my kube/calico test VM with the same newer kernel 5.3.8-1.el7.elrepo.x86_64. After the update and a few reboots I can no longer reproduce the "apr_pollset_poll: The timeout specified has expired (70007)" timeout errors.
Now that the timeout error is gone, I was able to repeat the original test where I load test pod A and kill pod B on my vSphere VMs. In the romana environments the problem still existed, but only when the load test runs on a different host than the one where pod A is located. If I run the test on the same host, there are no errors at all. Using Calico instead of romana, there are no load test errors on either host, so the problem is gone. There may still be some setting to tweak that can help romana, but I think this is "strike 3" for romana, so I will start transitioning a full environment to Calico and do some acceptance testing there to ensure there are no hidden gotchas.

You mentioned that if there are no pod changes while the ab test is running, no errors are reported. So the errors occur when you add or delete a pod.
This is normal behaviour: when a pod gets deleted, it takes time for the iptables rule changes to propagate. It may happen that the container has been removed but the iptables rules have not been updated yet, and packets are still being forwarded to the nonexistent container, which causes errors (a sort of race condition).
The first thing you can do is always create a readiness probe, as it makes sure that traffic is not forwarded to the container until it is ready to handle requests.
The second thing to do is to handle container deletion properly. This is a bit harder because it can be handled at many levels, but the easiest thing you can do is add a PreStop hook to your container like this:
lifecycle:
preStop:
exec:
command:
- sh
- -c
- "sleep 5"
The PreStop hook gets executed at the moment of the pod deletion request. From this moment, k8s starts changing the iptables rules and should stop forwarding new traffic to the container that is about to be deleted. The sleep gives k8s some time to propagate the iptables changes in the cluster without interrupting existing connections. After the PreStop handler exits, the container receives the SIGTERM signal.
My suggestion would be to apply both of these mechanisms together and check whether it helps.
You also mentioned that bypassing ingress causes more errors. I would assume this is because ingress implements a retry mechanism: if it is unable to open a connection to a container, it will try several times and will hopefully reach a container that can handle the request.
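The retry behaviour assumed for the ingress above can be illustrated from the client side. This is only a sketch of the general idea in Python, not how nginx-ingress is implemented; the function name call_with_retries and its parameters are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(request: Callable[[], T], attempts: int = 3, delay: float = 0.1) -> T:
    """Call request(), retrying on connection errors up to `attempts` times in total.

    ConnectionError covers ConnectionResetError and ConnectionRefusedError, the
    Python counterparts of the 104 and 111 errors ab reports for plain HTTP.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[ConnectionError] = None
    for attempt in range(attempts):
        try:
            return request()
        except ConnectionError as err:
            last_error = err
            if attempt < attempts - 1:
                time.sleep(delay)  # brief pause before trying again
    assert last_error is not None
    raise last_error
```

A client without such a retry, like ab connecting straight to a pod IP, surfaces every transient reset or refusal, which would be consistent with seeing more errors when the ingress is bypassed.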

mesos-master crash with zookeeper cluster

I am deploying a ZooKeeper cluster with 3 nodes, which I use to keep my Mesos masters highly available. I downloaded the zookeeper-3.4.6.tar.gz tarball, extracted it to /opt, renamed it to /opt/zookeeper, entered the directory, edited conf/zoo.cfg (pasted below), created a myid file in dataDir (which is set to /var/lib/zookeeper in zoo.cfg), and started ZooKeeper using ./bin/zkServer.sh start, and it went well. I started all 3 nodes one by one and they all seem fine. I can connect to the server with ./bin/zkCli.sh without problems.
But when I start Mesos (3 masters and 3 slaves; each node runs a master and a slave), the masters soon crash, one by one, and on the web page http://mesos_master:5050, the Slaves tab shows no slaves. When I run only one ZooKeeper instance, everything is fine. So I think the problem is with the ZooKeeper cluster.
I have 3 PV hosts on my Ubuntu server. They are all running Ubuntu 14.04 LTS:
node-01, node-02, node-03,
I have /etc/hosts in all three nodes like this:
172.16.2.70 node-01
172.16.2.81 node-02
172.16.2.80 node-03
I installed ZooKeeper and Mesos on all three nodes. The ZooKeeper configuration file looks like this (on all three nodes):
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=node-01:2888:3888
server.2=node-02:2888:3888
server.3=node-03:2888:3888
They start normally and run well. Then I start the mesos-master service using the command line ./bin/mesos-master.sh --zk=zk://172.16.2.70:2181,172.16.2.81:2181,172.16.2.80:2181/mesos --work_dir=/var/lib/mesos --quorum=2, and after a few seconds it gives me errors like this:
F0817 15:09:19.995256 2250 master.cpp:1253] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
# 0x7fa2b8be71a2 google::LogMessage::Fail()
# 0x7fa2b8be70ee google::LogMessage::SendToLog()
# 0x7fa2b8be6af0 google::LogMessage::Flush()
# 0x7fa2b8be9a04 google::LogMessageFatal::~LogMessageFatal()
# 0x7fa2b81a899a mesos::internal::master::fail()
# 0x7fa2b8262f8f _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
# 0x7fa2b823fba7 _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
# 0x7fa2b820f9f3 _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
# 0x7fa2b826305c _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
# 0x4a44e7 std::function<>::operator()()
# 0x49f3a7 _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
# 0x499480 process::Future<>::fail()
# 0x7fa2b806b4b4 process::Promise<>::fail()
# 0x7fa2b826011b process::internal::thenf<>()
# 0x7fa2b82a0757 _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
# 0x7fa2b82962d9 std::_Bind<>::operator()<>()
# 0x7fa2b827ee89 std::_Function_handler<>::_M_invoke()
I0817 15:09:20.098639 2248 http.cpp:283] HTTP GET for /master/state.json from 172.16.2.84:54542 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'
# 0x7fa2b8296507 std::function<>::operator()()
# 0x7fa2b827efaf _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
# 0x7fa2b82a07fe _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
# 0x7fa2b8296507 std::function<>::operator()()
# 0x7fa2b82e4419 process::internal::run<>()
# 0x7fa2b82da22a process::Future<>::fail()
# 0x7fa2b83136b5 std::_Mem_fn<>::operator()<>()
# 0x7fa2b830efdf _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
# 0x7fa2b8307d7f _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
# 0x7fa2b82fe431 _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
# 0x7fa2b830f065 _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
# 0x4a44e7 std::function<>::operator()()
# 0x49f3a7 _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
# 0x7fa2b82da202 process::Future<>::fail()
# 0x7fa2b82d2d82 process::Promise<>::fail()
Aborted
Sometimes the warning looks like this, and then it crashes with the same output as above:
0817 15:09:49.745750 2104 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
I want to know whether ZooKeeper is deployed and running correctly in my case, and how I can locate the problem. Any answers and suggestions are welcome. Thanks.

Actually, in my case, it was because I hadn't opened firewall port 5050 to allow the three servers to communicate with each other. After updating the firewall rule, it started to work as expected.

I ran into the same issue. I tried different approaches and options, and finally the --ip option worked for me. Initially I had used the --hostname option.
mesos-master --ip=192.168.0.13 --quorum=2 --zk=zk://m1:2181,m2:2181,m3:2181/mesos --work_dir=/opt/mm1 --log_dir=/opt/mm1/logs

You need to check that all Mesos/ZooKeeper master nodes can communicate correctly. For that, you need:
ZooKeeper ports open: TCP 2181, 2888, 3888
Mesos port open: TCP 5050
ping available (ICMP types 0 and 8, echo reply and echo request)
If you use FQDNs instead of IPs in your config, check that DNS resolution works correctly as well.
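The TCP part of this checklist can be scripted. Below is a minimal Python sketch using only the standard socket module; the function names port_open and check_ports are hypothetical, and the port list is the one given above.

```python
import socket
from typing import Dict, Iterable

# Ports taken from the checklist above: ZooKeeper (2181, 2888, 3888) and Mesos (5050).
REQUIRED_PORTS = (2181, 2888, 3888, 5050)


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name resolution.
        return False


def check_ports(host: str, ports: Iterable[int] = REQUIRED_PORTS) -> Dict[int, bool]:
    """Map each port to whether it accepted a TCP connection on the given host."""
    return {port: port_open(host, port) for port in ports}


# Example usage, run from each master against the other two (host names from the question):
# print(check_ports("node-02"))
```

A False result only shows that nothing was reachable on that port from the machine running the script; it does not distinguish a firewall rule from a service that is not listening, and it says nothing about ICMP.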

Give each Mesos master its own work_dir; do not use a shared work_dir for all masters, because of ZooKeeper.

log file shows intermittent success and failure

I will try a brief version first; then I can add more information as requested.
I have a client machine with the following configuration:
------------------------------------------------------------
Connected to puppet-client-10 as root
Debian 7.8 wheezy (amd64)
------------------------------------------------------------
FQDN : puppet-client-10.mydomain
IP : 161.148.1.10
PuppetMaster: puppet-master.mydomain
Puppet : 3.7.5
Facter : 2.2.0
------------------------------------------------------------
Connecting to the below puppetmaster:
------------------------------------------------------------
Connected to puppet-master as root
Debian 7.8 wheezy (amd64)
------------------------------------------------------------
FQDN : puppet-master.mydomain
IP : 161.148.1.1
Puppet : 3.7.5
Facter : 2.4.3
------------------------------------------------------------
Now, back to the client.
I used to have the agent disabled and checked for updates via cron once a day:
6 22 * * * root /usr/bin/puppet agent --test --logdest syslog
This worked flawlessly.
Two days ago I commented out the cron job and enabled the agent to check for updates every hour.
Then the logs started showing this line every 2 minutes:
<27>1 2015-05-20T08:20:30.651767-03:00 puppet-client-10 puppet-agent 8072 - - Could not request certificate: getaddrinfo: Name or service not known
<27>1 2015-05-20T08:22:30.668988-03:00 puppet-client-10 puppet-agent 8072 - - Could not request certificate: getaddrinfo: Name or service not known
The log also shows that the client is correctly checking the master for updates:
<28>1 2015-05-20T08:23:44.927447-03:00 puppet-client-10 puppet-agent 31500 - - Loading class elasticsearch
<28>1 2015-05-20T08:23:45.406158-03:00 puppet-client-10 puppet-agent 31500 - - Loading class logstash
<28>1 2015-05-20T08:23:45.776948-03:00 puppet-client-10 puppet-agent 31500 - - Loading class logrotate
<28>1 2015-05-20T08:23:46.204161-03:00 puppet-client-10 puppet-agent 31500 - - Loading class puppet
And then it goes back to the getaddrinfo error every 2 minutes:
<27>1 2015-05-20T08:24:30.676307-03:00 puppet-client-10 puppet-agent 8072 - - Could not request certificate: getaddrinfo: Name or service not known
<27>1 2015-05-20T08:26:30.683570-03:00 puppet-client-10 puppet-agent 8072 - - Could not request certificate: getaddrinfo: Name or service not known
It keeps alternating between the error (every 2 minutes) and success (every hour) messages.
Executing the command puppet agent --test works, as expected.
The problem seems to be on the agent.
Any hints?
I would guess it is because your puppet master isn't named "puppet". Also, I'd check which user the puppet agent you now have running is running as; probably not root, I'd guess. – Vorsprung
It is named puppet-master, also puppet-master.mydomain, with the alt names below:
# puppet cert list puppet-master.mydomain
+ "puppet-master.mydomain" (SHA256) F2:54:03:9C
(alt names: "DNS:puppet", "DNS:puppet.mydomain", "DNS:puppet-master.mydomain")
It is running as root
# ps aux | grep puppet
root 1763 0.0 0.2 133776 45236 ? Ssl Mai19 0:07 /usr/bin/ruby /usr/bin/puppet agent
root 8072 0.0 0.2 194580 40144 ? Ssl Mai19 0:02 /usr/bin/ruby /usr/bin/puppet agent
Right now, 8072 above is the process spamming the error line.
Should I really have 2 processes running?

The error indicates an issue resolving a hostname to an IP, but given that it succeeds every hour and also succeeds manually, I don't think you have any configuration issues with your name resolution.
You should only have a single puppet-agent process running. I would stop the puppet-agent service, ensure that all of the processes have been killed, restart the puppet-agent service, and ensure that only one process is running.
My bet is on one of those processes doing something silly.
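The "only one process" check can be automated by parsing ps aux output. A minimal Python sketch follows, assuming the usual 11-column ps aux layout (USER, PID, ..., COMMAND); the function name puppet_agent_pids is hypothetical, and the sample text is the output quoted in the question.

```python
from typing import List


def puppet_agent_pids(ps_output: str) -> List[int]:
    """Return the PIDs of `ps aux` lines whose command contains 'puppet agent'."""
    pids = []
    for line in ps_output.splitlines():
        # ps aux prints 10 fixed columns followed by the full command line.
        fields = line.split(None, 10)
        if len(fields) == 11 and fields[1].isdigit() and "puppet agent" in fields[10]:
            pids.append(int(fields[1]))
    return pids


# Output quoted in the question above.
sample = """\
root 1763 0.0 0.2 133776 45236 ? Ssl Mai19 0:07 /usr/bin/ruby /usr/bin/puppet agent
root 8072 0.0 0.2 194580 40144 ? Ssl Mai19 0:02 /usr/bin/ruby /usr/bin/puppet agent
"""
pids = puppet_agent_pids(sample)
print(pids, "-> more than one agent running" if len(pids) > 1 else "-> ok")
```

After restarting the service, the same check on fresh ps aux output should report a single PID.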
