I am trying to deploy a Ceph cluster with 4 nodes:
admin, node1, node2, client
I am using Oracle VM VirtualBox to deploy CentOS 8 nodes.
I am using Ceph Octopus.
I am using ansible-playbook to deploy the cluster.
I followed the steps listed in this blog to deploy my cluster:
https://computingforgeeks.com/install-and-configure-ceph-storage-cluster-on-centos-linux/?amp#ex1
[mons]
admin monitor_address=192.168.1.75
[mdss]
admin
[osds]
node1
node2
[clients]
node3
[mgrs]
admin
[grafana-server]
node2
but every time the process halts and I receive the following error:
TASK [ceph-osd : use ceph-volume lvm batch to create bluestore osds] ***********
Wednesday 21 July 2021 11:29:51 +0530 (0:00:00.085) 0:02:06.609 ***********
fatal: [node2]: FAILED! => changed=true
cmd:
- ceph-volume
- --cluster
- ceph
- lvm
- batch
- --bluestore
- --yes
- --osds-per-device
- '4'
- /dev/sdb
delta: '0:50:01.301186'
end: '2021-06-07 16:19:53.554904'
msg: non-zero return code
rc: 1
start: '2021-06-07 15:29:52.253718'
stderr: |-
--> DEPRECATION NOTICE
--> You are using the legacy automatic disk sorting behavior
--> The Pacific release will change the default to --no-auto
--> passed data devices: 1 physical, 0 LVM
--> relative data size: 0.25
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 4f1d0cc7-0363-4d06-84e8-c45be914159b
stderr: 2021-06-07T15:34:53.415+0530 7f018bcde700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T15:39:53.418+0530 7f018bcde700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T15:44:53.420+0530 7f018bcde700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T15:49:53.422+0530 7f018bcde700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T15:54:53.423+0530 7f018bcde700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T15:59:53.425+0530 7f018bcde700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T16:04:53.427+0530 7f018bcde700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T16:09:53.428+0530 7f018bcde700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T16:14:53.430+0530 7f018bcde700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T16:19:53.432+0530 7f018bcde700 0 monclient(hunting): authenticate timed out after 300
stderr: [errno 110] RADOS timed out (error connecting to the cluster)
Traceback (most recent call last):
File "/sbin/ceph-volume", line 11, in <module>
load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 40, in __init__
self.main(self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 152, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 42, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 415, in main
self._execute(plan)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 434, in _execute
c.create(argparse.Namespace(**args))
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/create.py", line 26, in create
prepare_step.safe_prepare(args)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/prepare.py", line 252, in safe_prepare
self.prepare()
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/prepare.py", line 292, in prepare
self.osd_id = prepare_utils.create_id(osd_fsid, json.dumps(secrets), osd_id=self.args.osd_id)
File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 173, in create_id
raise RuntimeError('Unable to create a new OSD id')
RuntimeError: Unable to create a new OSD id
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
fatal: [node1]: FAILED! => changed=true
cmd:
- ceph-volume
- --cluster
- ceph
- lvm
- batch
- --bluestore
- --yes
- --osds-per-device
- '4'
- /dev/sdb
delta: '0:50:02.352739'
end: '2021-06-07 16:19:54.591371'
msg: non-zero return code
rc: 1
start: '2021-06-07 15:29:52.238632'
stderr: |-
--> DEPRECATION NOTICE
--> You are using the legacy automatic disk sorting behavior
--> The Pacific release will change the default to --no-auto
--> passed data devices: 1 physical, 0 LVM
--> relative data size: 0.25
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 8b324b40-b7d4-4e97-9178-e0f987fd3b67
stderr: 2021-06-07T15:34:53.504+0530 7fbccc9f2700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T15:39:53.531+0530 7fbccc9f2700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T15:44:53.560+0530 7fbccc9f2700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T15:49:53.775+0530 7fbccc9f2700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T15:54:53.799+0530 7fbccc9f2700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T15:59:53.820+0530 7fbccc9f2700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T16:04:53.846+0530 7fbccc9f2700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T16:09:53.849+0530 7fbccc9f2700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T16:14:54.327+0530 7fbccc9f2700 0 monclient(hunting): authenticate timed out after 300
stderr: 2021-06-07T16:19:54.476+0530 7fbccc9f2700 0 monclient(hunting): authenticate timed out after 300
stderr: [errno 110] RADOS timed out (error connecting to the cluster)
Traceback (most recent call last):
File "/sbin/ceph-volume", line 11, in <module>
load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 40, in __init__
self.main(self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 152, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 42, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 415, in main
self._execute(plan)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/batch.py", line 434, in _execute
c.create(argparse.Namespace(**args))
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/create.py", line 26, in create
prepare_step.safe_prepare(args)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/prepare.py", line 252, in safe_prepare
self.prepare()
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/prepare.py", line 292, in prepare
self.osd_id = prepare_utils.create_id(osd_fsid, json.dumps(secrets), osd_id=self.args.osd_id)
File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 173, in create_id
raise RuntimeError('Unable to create a new OSD id')
RuntimeError: Unable to create a new OSD id
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
Please help me with this. I was successful only 2 out of the 10 times I tried, using the same configuration every time.
Change the OSD nodes to 3 and try again (because of quorum), or you can activate test mode from this config file:
group_vars/all.yml ---> change ceph_test: false to ceph_test: true
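For illustration, assuming a third storage host with its own data disk is available (reusing node3 here is purely a placeholder; in the inventory above it is currently the client), the [osds] group would become:

```
[osds]
node1
node2
node3
```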
Related
I am trying to create a Perl script with Net::FTPSSL that moves files from the FTP server to the actual server.
The script will run every X minutes/seconds to check whether customers uploaded something new to the FTP server, and move it.
#!/usr/bin/perl -w
use strict;
use warnings;
use DateTime;
use Net::FTPSSL;
my $ftp;
sub connect_ftp {
$ftp = Net::FTPSSL->new('sftp.domain.com',
Encryption => EXP_CRYPT,
Debug => 1, DebugLogFile => "/usr/logs/ftpssl.log");
$ftp->login('username', 'password');
$ftp->binary();
}
sub copy_files {
my @folder_list;
my $folder;
@folder_list = $ftp->nlst();
foreach $folder (@folder_list) {
if ($folder ne "globa_username") {
$ftp->cwd($folder);
my @file_list;
my $video;
@file_list = $ftp->nlst();
foreach $video (@file_list) {
my $file_status;
my $result;
if ($video =~ /\.mp4$/i) {
if (!file_exist($video, $folder)) {
$file_status = $ftp->is_file($video);
if ($file_status > 0) {
$result = $ftp->get($video, "/usr/media/$folder/$video");
if (defined $result) {
#$ftp->delete($video);
}
}
}
}
}
}
$ftp->cdup();
}
#$ftp->quit();
}
sub file_exist {
my ($video, $folder) = @_;
return -f "/usr/media/$folder/$video";
}
connect_ftp();
for (;;) {
copy_files();
#sleep 60;
#sleep 30;
}
The script works fine until it returns: **555 TLSv12: SSL connect attempt failed because of handshake problems**.
Net-FTPSSL Version: 0.42
IO-Socket-SSL Version: 1.94
Net-SSLeay Version: 1.55
IO-Socket-INET Version: 1.33
IO-Socket-INET6 might not be installed.
IO-Socket-IP Version: 0.21
IO Version: 1.25_06
Socket Version: 2.030
No IPv6 support available. You must 1st upgrade IO::Socket::SSL to support it!
Perl: 5.016003 [5.16.3], OS: linux
***** IPv6 not yet supported in Net::FTPSSL! *****
Server (port): sftp.domain.com (21)
Keys: (Debug), (Encryption), (DebugLogFile)
Values: (1), (E), (/usr/logs/ftpssl.log)
SKT <<< 220 (vsFTPd 3.0.2)
SKT >>> AUTH TLS
SKT <<< 234 Proceed with negotiation.
Object HASH Details ... (SSL:arguments - E)
SSL_hostname ==> sftp.domain.com
SSL_verify_mode ==> 0
SSL_version ==> TLSv12
Timeout ==> 120
Object Net::FTPSSL Details ... (sftp.domain.com:21 - E)
_FTPSSL_arguments ==> HASH(0x272cee8)
-- Croak ===> (undef)
-- Crypt ===> E
-- FixGetTs ===> 0
-- FixPutTs ===> 0
-- Host ===> sftp.domain.com
-- Pret ===> 0
-- Timeout ===> 120
-- buf_size ===> 10240
-- data_prot ===> P
-- dcsc_mode ===> 1
-- debug ===> 2
-- debug_extra ===> 0
-- debug_no_help ===> 0
-- ftpssl_filehandle ===> GLOB(0x2a6d8c8)
-- last_ftp_msg ===> 234 Proceed with negotiation.
-- myContext ===> HASH(0x2747d18)
-- SSL_ca_file ----> /etc/pki/tls/certs/ca-bundle.crt
-- SSL_reuse_ctx ----> IO::Socket::SSL::SSL_Context=HASH(0x2a8f1f0)
-- context ++++> 44543648
-- mySocketOpts ===> HASH(0x2746ec0)
-- PeerAddr ----> sftp.domain.com
-- PeerPort ----> 21
-- Proto ----> tcp
-- Timeout ----> 120
-- start_SSL_opts ===> HASH(0x272c840)
-- SSL_hostname ----> sftp.domain.com
-- SSL_verify_mode ----> 0
-- SSL_version ----> TLSv12
-- Timeout ----> 120
-- trace ===> 0
-- type ===> A
_SSL_arguments ==> HASH(0x2a8e770)
-- PeerAddr ===> 33.44.55.66
-- PeerPort ===> 21
-- Proto ===> tcp
-- SSL_ca_file ===> /etc/pki/tls/certs/ca-bundle.crt
-- SSL_cert_file ===> certs/client-cert.pem
-- SSL_check_crl ===> 0
-- SSL_honor_cipher_order ===> 0
-- SSL_hostname ===> sftp.domain.com
-- SSL_key_file ===> certs/client-key.pem
-- SSL_server ===> 0
-- SSL_use_cert ===> 0
-- SSL_verify_mode ===> 0
-- SSL_version ===> TLSv12
_SSL_ctx ==> IO::Socket::SSL::SSL_Context=HASH(0x2a8f1f0)
-- context ===> 44543648
_SSL_fileno ==> 4
_SSL_ioclass_upgraded ==> IO::Socket::INET
_SSL_last_err ==> SSL wants a read first
_SSL_object ==> 44657040
_SSL_opened ==> 1
io_socket_domain ==> 2
io_socket_proto ==> 6
io_socket_timeout ==> 120
io_socket_type ==> 1
************************************************************
>>> USER +++++++
<<< 331 Please specify the password.
>>> PASS *******
<<< 230 Login successful.
>>> HELP
<<< 214-The following commands are recognized.
<<< ABOR ACCT ALLO APPE CDUP CWD DELE EPRT EPSV FEAT HELP LIST MDTM MKD
<<< MODE NLST NOOP OPTS PASS PASV PORT PWD QUIT REIN REST RETR RMD RNFR
<<< RNTO SITE SIZE SMNT STAT STOR STOU STRU SYST TYPE USER XCUP XCWD XMKD
<<< XPWD XRMD
<<< 214 Help OK.
>>> FEAT
<<< 211-Features:
<<< AUTH SSL
<<< AUTH TLS
<<< EPRT
<<< EPSV
<<< MDTM
<<< PASV
<<< PBSZ
<<< PROT
<<< REST STREAM
<<< SIZE
<<< TVFS
<<< UTF8
<<< 211 End
<<+ 111 Auto-adding OPTS Command!
>>> HELP SITE
<<< 214-The following commands are recognized.
<<< ABOR ACCT ALLO APPE CDUP CWD DELE EPRT EPSV FEAT HELP LIST MDTM MKD
<<< MODE NLST NOOP OPTS PASS PASV PORT PWD QUIT REIN REST RETR RMD RNFR
<<< RNTO SITE SIZE SMNT STAT STOR STOU STRU SYST TYPE USER XCUP XCWD XMKD
<<< XPWD XRMD
<<< 214 Help OK.
>>> SITE HELP
<<< 214 CHMOD UMASK HELP
<<+ 214 The HELP command is supported.
>>> TYPE I
<<< 200 Switching to Binary mode.
<<< 200 Switching to Binary mode.
>>> PBSZ 0
<<< 200 PBSZ set to 0.
>>> PROT P
<<< 200 PROT now Private.
>>> PASV
<<< 227 Entering Passive Mode (33,44,55,66,20,207).
--- Host (33.44.55.66) Port (5327)
>>> NLST
<<< 150 Here comes the directory listing.
<<< 226 Directory send OK.
>>> CWD demo3
<<< 250 Directory successfully changed.
>>> PBSZ 0
<<< 200 PBSZ set to 0.
>>> PROT P
<<< 200 PROT now Private.
>>> PASV
<<< 227 Entering Passive Mode (33,44,55,66,20,144).
--- Host (33.44.55.66) Port (5264)
>>> NLST
<<< 150 Here comes the directory listing.
<<< 226 Directory send OK.
>>> CDUP
<<< 250 Directory successfully changed.
>>> CDUP
<<< 250 Directory successfully changed.
>>> CWD demo2
<<< 250 Directory successfully changed.
>>> PBSZ 0
<<< 200 PBSZ set to 0.
>>> PROT P
<<< 200 PROT now Private.
>>> PASV
<<< 227 Entering Passive Mode (33,44,55,66,20,179).
--- Host (33.44.55.66) Port (5299)
>>> NLST
<<< 150 Here comes the directory listing.
<<< 226 Directory send OK.
>>> CDUP
<<< 250 Directory successfully changed.
>>> CWD demo
<<< 250 Directory successfully changed.
>>> PBSZ 0
<<< 200 PBSZ set to 0.
>>> PROT P
<<< 200 PROT now Private.
>>> PASV
...
<<< 227 Entering Passive Mode (33,44,55,66,19,196).
--- Host (33.44.55.66) Port (5060)
>>> NLST
<<< 150 Here comes the directory listing.
<<+ 555 TLSv12: SSL connect attempt failed because of handshake problems
>>> PBSZ 0
<<+ 555 Unexpected EOF on Command Channel [0] (0, 1) ()
>>> PROT P
<<+ 555 Can't write command on socket:
UPDATE:
Tried downgrading SSL_version to SSLv3 but it still doesn't work.
$ftp = Net::FTPSSL->new('sftp.domain.com',
Encryption => EXP_CRYPT,
Debug => 1, DebugLogFile => "/usr/logs/ftpssl.log", SSL_version => "SSLv3");
Net::FTP
I created another script using Net::FTP; it works fine until:
Net::FTP>>> Net::FTP(3.11)
Net::FTP>>> Exporter(5.68)
Net::FTP>>> Net::Cmd(3.11)
Net::FTP>>> IO::Socket::SSL(2.072)
Net::FTP>>> IO::Socket::INET(1.33)
Net::FTP>>> IO::Socket(1.34)
Net::FTP>>> IO::Handle(1.33)
Net::FTP=GLOB(0x11d7400)<<< 220 (vsFTPd 3.0.2)
Net::FTP=GLOB(0x11d7400)>>> AUTH TLS
Net::FTP=GLOB(0x11d7400)<<< 234 Proceed with negotiation.
Net::FTP=GLOB(0x11d7400)>>> PBSZ 0
Net::FTP=GLOB(0x11d7400)<<< 200 PBSZ set to 0.
Net::FTP=GLOB(0x11d7400)>>> PROT P
Net::FTP=GLOB(0x11d7400)<<< 200 PROT now Private.
Net::FTP=GLOB(0x11d7400)>>> USER username
Net::FTP=GLOB(0x11d7400)<<< 331 Please specify the password.
Net::FTP=GLOB(0x11d7400)>>> PASS ....
Net::FTP=GLOB(0x11d7400)<<< 230 Login successful.
Net::FTP=GLOB(0x11d7400)>>> TYPE I
Net::FTP=GLOB(0x11d7400)<<< 200 Switching to Binary mode.
Net::FTP=GLOB(0x11d7400)>>> PASV
Net::FTP=GLOB(0x11d7400)<<< 227 Entering Passive Mode (33,44,55,66,21,107).
Net::FTP=GLOB(0x11d7400)>>> NLST
Net::FTP=GLOB(0x11d7400)<<< 150 Here comes the directory listing.
Net::FTP=GLOB(0x11d7400)<<< 226 Directory send OK.
Net::FTP=GLOB(0x11d7400)>>> CWD demo
Net::FTP=GLOB(0x11d7400)<<< 250 Directory successfully changed.
Net::FTP=GLOB(0x11d7400)>>> PASV
...
Net::FTP=GLOB(0x13e2400)<<< 150 Here comes the directory listing.
Net::FTP=GLOB(0x13e2400)<<< 226 Directory send OK.
Net::FTP=GLOB(0x13e2400)>>> CWD demo3
Net::FTP=GLOB(0x13e2400)<<< 250 Directory successfully changed.
Net::FTP=GLOB(0x13e2400)>>> PASV
Net::FTP=GLOB(0x13e2400)<<< 227 Entering Passive Mode (33,44,55,66,19,196).
Net::FTP=GLOB(0x13e2400)>>> NLST
failed to ssl upgrade dataconn: SSL wants a read first at ./upload_ftp.pl line 40.
Net::FTP=GLOB(0x13e2400)<<< 150 Here comes the directory listing.
Net::FTP=GLOB(0x13e2400)>>> CDUP
Net::FTP: Net::Cmd::getline(): unexpected EOF on command channel: at ./upload_ftp.pl line 56.
Net::FTP: Net::Cmd::_is_closed(): unexpected EOF on command channel: at ./upload_ftp.pl line 56.
I have the following simplified design: a mongodb container and a "python-client" docker container that is linked to the former. This is my simplified docker-compose.yml file:
mongodb:
build: "mongodb"
dockerfile: "Dockerfile"
hostname: "mongodb.local"
ports:
- "27017:27017"
client:
build: "client"
dockerfile: "Dockerfile"
hostname: "client.local"
links:
- "mongodb:mongodb"
environment:
- "MONGODB_URL=mongodb://admin:admin@mongodb:27017/admin"
- "MONGODB_DB=historictraffic"
I'm able to establish a successful connection using pymongo from my host using the mongodb://admin:admin@localhost:27017/admin connection string (pay attention to localhost):
$ ipython
from pymongo import MongoClient
mongo = MongoClient('mongodb://admin:admin@localhost:27017/admin')
db = mongo.test
col = db.test
col.insert_one({'x': 1})
# This works
But I can't connect from the client container. Apparently the link is correct:
/ # cat /etc/hosts
172.17.0.27 client.local client
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.26 historictraffic_mongodb_1 mongodb
172.17.0.26 mongodb mongodb historictraffic_mongodb_1
172.17.0.26 mongodb_1 mongodb historictraffic_mongodb_1
But when I do the same test, it fails:
/ # ipython
from pymongo import MongoClient
mongo = MongoClient('mongodb://admin:admin@mongodb:27017/admin')
db = mongo.test
col = db.test
col.insert_one({'x': 2})
---------------------------------------------------------------------------
ServerSelectionTimeoutError Traceback (most recent call last)
<ipython-input-5-c5d62e5590d5> in <module>()
----> 1 col.insert_one({'x': 2})
/usr/lib/python2.7/site-packages/pymongo/collection.pyc in insert_one(self, document)
464 if "_id" not in document:
465 document["_id"] = ObjectId()
--> 466 with self._socket_for_writes() as sock_info:
467 return InsertOneResult(self._insert(sock_info, document),
468 self.write_concern.acknowledged)
/usr/lib/python2.7/contextlib.pyc in __enter__(self)
15 def __enter__(self):
16 try:
---> 17 return self.gen.next()
18 except StopIteration:
19 raise RuntimeError("generator didn't yield")
/usr/lib/python2.7/site-packages/pymongo/mongo_client.pyc in _get_socket(self, selector)
661 @contextlib.contextmanager
662 def _get_socket(self, selector):
--> 663 server = self._get_topology().select_server(selector)
664 try:
665 with server.get_socket(self.__all_credentials) as sock_info:
/usr/lib/python2.7/site-packages/pymongo/topology.pyc in select_server(self, selector, server_selection_timeout, address)
119 return random.choice(self.select_servers(selector,
120 server_selection_timeout,
--> 121 address))
122
123 def select_server_by_address(self, address,
/usr/lib/python2.7/site-packages/pymongo/topology.pyc in select_servers(self, selector, server_selection_timeout, address)
95 if server_timeout == 0 or now > end_time:
96 raise ServerSelectionTimeoutError(
---> 97 self._error_message(selector))
98
99 self._ensure_opened()
ServerSelectionTimeoutError: mongodb:27017: [Errno 113] Host is unreachable
Does anyone know how to solve it? Thank you.
This failure is probably because MongoDB hasn't started up yet. You can retry the connection with a short delay between retries; it should work after one or two attempts.
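A minimal retry wrapper could look like the sketch below (the commented pymongo usage assumes the connection URL from the compose file above; `serverSelectionTimeoutMS` just keeps each attempt short):

```python
import time

def connect_with_retry(connect, retries=10, delay=1.0):
    """Call `connect` until it succeeds; sleep `delay` seconds between attempts."""
    for attempt in range(1, retries + 1):
        try:
            return connect()
        except Exception:
            if attempt == retries:
                raise          # give up after the last attempt
            time.sleep(delay)

# Hypothetical usage with pymongo:
# from pymongo import MongoClient
# def _connect():
#     client = MongoClient('mongodb://admin:admin@mongodb:27017/admin',
#                          serverSelectionTimeoutMS=2000)
#     client.admin.command('ping')   # forces a real round-trip
#     return client
# client = connect_with_retry(_connect)
```

Note that `MongoClient()` alone does not connect, so the command that forces a round-trip has to be inside the retried callable.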
I was using IPython notebook to run PySpark with just adding the following to the notebook:
import os
os.chdir('../data_files')
import sys
import pandas as pd
%pylab inline
from IPython.display import Image
os.environ['SPARK_HOME']="spark-1.3.1-bin-hadoop2.6"
sys.path.append( os.path.join(os.environ['SPARK_HOME'], 'python') )
sys.path.append( os.path.join(os.environ['SPARK_HOME'], 'bin') )
sys.path.append( os.path.join(os.environ['SPARK_HOME'], 'python/lib/py4j-0.8.2.1-src.zip') )
from pyspark import SparkContext
sc = SparkContext('local')
This worked fine for one project, but on my second project, after running a couple of lines (not the same ones every time), I get the following error:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.py", line 425, in start
self.socket.connect((self.address, self.port))
File "/usr/lib/python2.7/socket.py", line 224, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
---------------------------------------------------------------------------
Py4JNetworkError Traceback (most recent call last)
<ipython-input-21-4626925bbe8f> in <module>()
----> 1 words.count()
/home/eee/Desktop/NLP/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.pyc in count(self)
930 3
931 """
--> 932 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
933
934 def stats(self):
/home/eee/Desktop/NLP/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.pyc in sum(self)
921 6.0
922 """
--> 923 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
924
925 def count(self):
/home/eee/Desktop/NLP/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.pyc in reduce(self, f)
737 yield reduce(f, iterator, initial)
738
--> 739 vals = self.mapPartitions(func).collect()
740 if vals:
741 return reduce(f, vals)
/home/eee/Desktop/NLP/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.pyc in collect(self)
710 Return a list that contains all of the elements in this RDD.
711 """
--> 712 with SCCallSiteSync(self.context) as css:
713 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
714 return list(_load_from_socket(port, self._jrdd_deserializer))
/home/eee/Desktop/NLP/spark-1.3.1-bin-hadoop2.6/python/pyspark/traceback_utils.pyc in __enter__(self)
70 def __enter__(self):
71 if SCCallSiteSync._spark_stack_depth == 0:
---> 72 self._context._jsc.setCallSite(self._call_site)
73 SCCallSiteSync._spark_stack_depth += 1
74
/usr/local/lib/python2.7/dist-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
534 END_COMMAND_PART
535
--> 536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
538 self.target_id, self.name)
/usr/local/lib/python2.7/dist-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.pyc in send_command(self, command, retry)
360 the Py4J protocol.
361 """
--> 362 connection = self._get_connection()
363 try:
364 response = connection.send_command(command)
/usr/local/lib/python2.7/dist-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.pyc in _get_connection(self)
316 connection = self.deque.pop()
317 except Exception:
--> 318 connection = self._create_connection()
319 return connection
320
/usr/local/lib/python2.7/dist-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.pyc in _create_connection(self)
323 connection = GatewayConnection(self.address, self.port,
324 self.auto_close, self.gateway_property)
--> 325 connection.start()
326 return connection
327
/usr/local/lib/python2.7/dist-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.pyc in start(self)
430 'server'
431 logger.exception(msg)
--> 432 raise Py4JNetworkError(msg)
433
434 def close(self):
Py4JNetworkError: An error occurred while trying to connect to the Java server
Once this happens, other lines working before now raise the same problem,
any ideas?
Specifications for:
pyspark 1.4.1
ipython 4.0.0
[OSX / homebrew]
If you want to launch pyspark within a Jupyter (ex-IPython) Notebook using the IPython kernel, I advise you to launch your notebook directly with the pyspark command:
>>>pyspark
But to do that, you need to add three lines to your bash .profile or zsh .zshrc to set these environment variables:
export SPARK_HOME=/path/to/apache-spark/1.4.1/libexec
export PYSPARK_DRIVER_PYTHON=ipython2 # remember that Apache Spark only works with Python 2.7
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
In my case, given that I'm on OSX and installed apache-spark with Homebrew, this is:
export SPARK_HOME=/usr/local/Cellar/apache-spark/1.4.1/libexec
export PYSPARK_DRIVER_PYTHON=ipython2
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Then, when you execute the command 'pyspark' in your terminal, it will automatically open a Jupyter (ex-IPython) notebook in your default browser.
>>>pyspark
[I 17:51:00.209 NotebookApp] Serving notebooks from local directory: /Users/Thibault/code/kaggle
[I 17:51:00.209 NotebookApp] 0 active kernels
[I 17:51:00.210 NotebookApp] The IPython Notebook is running at: http://localhost:42424/
[I 17:51:00.210 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 17:51:11.980 NotebookApp] Kernel started: 53ad11b1-4fa4-459d-804c-0487036b0f29
15/09/02 17:51:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
I have a network program written in C using TCP sockets. Sometimes the client program hangs forever expecting input from server. Specifically, the client hangs on select() call set on an fd intended to read characters sent by server.
I am using strace to find out where the process got stuck. However, sometimes when I attach the hung client process to strace, it immediately resumes its execution and exits properly. Not all hung processes exhibit this behavior: some stay stuck in select() even when attached to strace, but most resume their execution when attached.
I am curious what causes the processes to resume when attached to strace. It might give me clues about why the client processes are getting hung.
Any ideas? What causes a hung process to resume its execution when attached to strace?
Update:
Here's the output of strace on hung processes.
> sudo strace -p 25645
Process 25645 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[ Process PID=25645 runs in 32 bit mode. ]
select(6, [3 5], NULL, NULL, NULL) = 2 (in [3 5])
read(5, "\0", 8192) = 1
write(2, "", 0) = 0
read(3, "====Setup set_oldtempbehaio"..., 8192) = 555
write(1, "====Setup set_oldtempbehaio"..., 555) = 555
select(6, [3 5], NULL, NULL, NULL) = 2 (in [3 5])
read(5, "", 8192) = 0
read(3, "", 8192) = 0
close(5) = 0
kill(25652, SIGKILL) = 0
exit_group(0) = ?
Process 25645 detached
_
> sudo strace -p 14462
Process 14462 attached - interrupt to quit
[ Process PID=14462 runs in 32 bit mode. ]
read(0, 0xff85fdbc, 8192) = -1 EIO (Input/output error)
shutdown(3, 1 /* send */) = 0
exit_group(0) = ?
_
> sudo strace -p 7517
Process 7517 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[ Process PID=7517 runs in 32 bit mode. ]
connect(3, {sa_family=AF_INET, sin_port=htons(300), sin_addr=inet_addr("100.64.220.98")}, 16) = -1 ETIMEDOUT (Connection timed out)
close(3) = 0
dup(2) = 3
fcntl64(3, F_GETFL) = 0x1 (flags O_WRONLY)
close(3) = 0
write(2, "dsd13: Connection timed out\n", 30) = 30
write(2, "Error code : 110\n", 17) = 17
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(1) = ?
Process 7517 detached
Not just select(): the processes (of the same program) are stuck in various system calls before I attach them to strace. They suddenly resume after attaching. If I don't attach them to strace, they just hang there forever.
Update 2:
I learned that strace can resume a process that was previously stopped (a process in T state). Now I am trying to understand why these processes went into the 'T' state in the first place. Here's the /proc/<pid>/status information:
> cat /proc/12554/status
Name: someone
State: T (stopped)
SleepAVG: 88%
Tgid: 12554
Pid: 12554
PPid: 9754
TracerPid: 0
Uid: 5000 5000 5000 5000
Gid: 48986 48986 48986 48986
FDSize: 256
Groups: 9149 48986
VmPeak: 1992 kB
VmSize: 1964 kB
VmLck: 0 kB
VmHWM: 608 kB
VmRSS: 608 kB
VmData: 156 kB
VmStk: 20 kB
VmExe: 16 kB
VmLib: 1744 kB
VmPTE: 20 kB
Threads: 1
SigQ: 54/73728
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000006
SigCgt: 0000000000004000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
Cpus_allowed: 00000000,00000000,00000000,0000000f
Mems_allowed: 00000000,00000001
strace uses ptrace. The ptrace man page has this:
Since attaching sends SIGSTOP and the tracer usually suppresses it,
this may cause a stray EINTR return from the currently executing system
call in the tracee, as described in the "Signal injection and
suppression" section.
Are you seeing select return EINTR?
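The questioner's T-state observation can also be reproduced without strace. A process that received SIGSTOP sits in state T until something sends it SIGCONT, and attaching/detaching a tracer can deliver that wake-up as a side effect. A minimal sketch on a POSIX system:

```python
import os
import signal
import time

# Fork a child that just sleeps, then put it into the stopped (T) state.
pid = os.fork()
if pid == 0:
    time.sleep(60)
    os._exit(0)

os.kill(pid, signal.SIGSTOP)
_, status = os.waitpid(pid, os.WUNTRACED)
assert os.WIFSTOPPED(status)      # child now shows State: T (stopped)

# SIGCONT is what lets a T-state process run again; attaching and
# detaching a tracer can produce the same wake-up.
os.kill(pid, signal.SIGCONT)
_, status = os.waitpid(pid, os.WCONTINUED)
assert os.WIFCONTINUED(status)

os.kill(pid, signal.SIGKILL)      # clean up the demo child
os.waitpid(pid, 0)
```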
I am currently running Perl 5.8.8 on a server and I'm trying to install 5.14.
I configured it with usethreads and use64bitint, and otherwise accepted the defaults it suggested.
make ran without problems, but make test is failing, on
../cpan/IPC-SysV/t/ipcsysv.t
../cpan/IPC-SysV/t/shm.t
thus:
# ./perl harness ../cpan/IPC-SysV/t/shm.t ../cpan/IPC-SysV/t/ipcsysv.t
../cpan/IPC-SysV/t/shm.t ...... IPC::SharedMem->new failed: Invalid argument at t/shm.t line 54.
../cpan/IPC-SysV/t/shm.t ...... Dubious, test returned 22 (wstat 5632, 0x1600)
No subtests run
../cpan/IPC-SysV/t/ipcsysv.t .. 1/38 shmget failed: Invalid argument at t/ipcsysv.t line 100.
# Looks like you planned 38 tests but ran 17.
# Looks like your test exited with 22 just after 17.
../cpan/IPC-SysV/t/ipcsysv.t .. Dubious, test returned 22 (wstat 5632, 0x1600)
Failed 21/38 subtests
Test Summary Report
-------------------
../cpan/IPC-SysV/t/shm.t (Wstat: 5632 Tests: 0 Failed: 0)
Non-zero exit status: 22
Parse errors: No plan found in TAP output
../cpan/IPC-SysV/t/ipcsysv.t (Wstat: 5632 Tests: 17 Failed: 0)
Non-zero exit status: 22
Parse errors: Bad plan. You planned 38 tests but ran 17.
Files=2, Tests=17, 0 wallclock secs ( 0.01 usr 0.00 sys + 0.13 cusr 0.00 csys = 0.14 CPU)
Result: FAIL
Both of these tests are reporting 'Invalid argument', but when I look at the source, I can't see anything that looks invalid. I'm not really sure how to proceed... any pointers?
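shmget can return EINVAL when the kernel's SysV IPC limits (e.g. the kernel.shmmax/shmall sysctls) are set too low, not because of anything in the test source. One way to exercise the same syscall with Perl taken out of the picture is a ctypes sketch (an assumption-laden illustration; it assumes Linux with a libc exposing shmget/shmctl):

```python
import ctypes
import ctypes.util
import os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
IPC_PRIVATE = 0   # same key the failing tests use
IPC_RMID = 0      # shmctl command to remove a segment

# Equivalent of the strace'd call: shmget(IPC_PRIVATE, 8, 0700)
shmid = libc.shmget(IPC_PRIVATE, 8, 0o700)
if shmid < 0:
    print("shmget failed:", os.strerror(ctypes.get_errno()))
else:
    print("shmget ok, id", shmid)
    libc.shmctl(shmid, IPC_RMID, None)   # remove the segment again
```

If this also reports "Invalid argument" on the bad server, the problem is the kernel configuration rather than the new Perl build.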
UPDATE
I ran
strace perl -MIPC::SysV=IPC_PRIVATE,S_IRWXU -e 'shmget(IPC_PRIVATE, 8, S_IRWXU) or die $!'
on two servers: one which is having these problems and one which is not.
There was a lot of output, but what appears interesting is this:
GOOD:
.
.
.
stat64("/usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/auto/IPC/SysV", 0x9d7f0c8) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/perl5/site_perl/5.8.8/auto/IPC/SysV", 0x9d7f0c8) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/perl5/site_perl/auto/IPC/SysV", 0x9d7f0c8) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/perl5/vendor_perl/5.8.8/i386-linux-thread-multi/auto/IPC/SysV", 0x9d7f0c8) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/perl5/vendor_perl/5.8.8/auto/IPC/SysV", 0x9d7f0c8) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/perl5/vendor_perl/auto/IPC/SysV", 0x9d7f0c8) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/auto/IPC/SysV", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/auto/IPC/SysV/SysV.so", {st_mode=S_IFREG|0755, st_size=15072, ...}) = 0
stat64("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/auto/IPC/SysV/SysV.bs", 0x9d7f0c8) = -1 ENOENT (No such file or directory)
futex(0x4d106c, FUTEX_WAKE_PRIVATE, 2147483647) = 0
open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/auto/IPC/SysV/SysV.so", O_RDONLY) = 4
read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0 \v\0\0004\0\0\0"..., 512) = 512
fstat64(4, {st_mode=S_IFREG|0755, st_size=15072, ...}) = 0
mmap2(NULL, 17948, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x588000
mmap2(0x58c000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x3) = 0x58c000
close(4) = 0
close(3) = 0
shmget(IPC_PRIVATE, 8, 0700) = 7438344
exit_group(0)
BAD:
.
.
.
stat64("/usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/auto/IPC/SysV", 0x8d290c8) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/perl5/site_perl/5.8.8/auto/IPC/SysV", 0x8d290c8) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/perl5/site_perl/auto/IPC/SysV", 0x8d290c8) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/perl5/vendor_perl/5.8.8/i386-linux-thread-multi/auto/IPC/SysV", 0x8d290c8) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/perl5/vendor_perl/5.8.8/auto/IPC/SysV", 0x8d290c8) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/perl5/vendor_perl/auto/IPC/SysV", 0x8d290c8) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/auto/IPC/SysV", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/auto/IPC/SysV/SysV.so", {st_mode=S_IFREG|0755, st_size=15072, ...}) = 0
stat64("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/auto/IPC/SysV/SysV.bs", 0x8d290c8) = -1 ENOENT (No such file or directory)
futex(0x94306c, FUTEX_WAKE_PRIVATE, 2147483647) = 0
open("/usr/lib/perl5/5.8.8/i386-linux-thread-multi/auto/IPC/SysV/SysV.so", O_RDONLY) = 4
read(4, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0 \v\0\0004\0\0\0"..., 512) = 512
fstat64(4, {st_mode=S_IFREG|0755, st_size=15072, ...}) = 0
mmap2(NULL, 17948, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x6a4000
mmap2(0x6a8000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x3) = 0x6a8000
close(4) = 0
close(3) = 0
shmget(IPC_PRIVATE, 8, 0700) = -1 EINVAL (Invalid argument)
open("/usr/share/locale/locale.alias", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=2528, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7dbe000
read(3, "# Locale name alias data base.\n#"..., 4096) = 2528
read(3, "", 4096) = 0
close(3) = 0
munmap(0xb7dbe000, 4096) = 0
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, "Invalid argument at -e line 1.\n", 31Invalid argument at -e line 1.
) = 31
exit_group(22) = ?
So, it appears that the same thing is happening on both servers, it's just that on one, I see
shmget(IPC_PRIVATE, 8, 0700) = 7438344
and the other, I see
shmget(IPC_PRIVATE, 8, 0700) = -1 EINVAL (Invalid argument)
The versions of IPC::SysV are the same on both servers... but it looks to me that this isn't relevant, and that the problem is in the code making the system call... right?
What next?
UPDATE 2
After some googling, I ran the following:
GOOD:
# cat /proc/sys/kernel/shmmax
4294967295
BAD:
# cat /proc/sys/kernel/shmmax
0
So, that explains the EINVAL, since (from the man pages)
EINVAL
A new segment was to be created and size < SHMMIN or size > SHMMAX, or no new segment
was to be created, a segment with given key existed, but size is greater than the size
of that segment.
Now, my question is, is there a good reason why this might be set at zero?
Problem solved.
The /etc/sysctl.conf file contained the following:
kernel.shmmax = 137438953472
This is a 64-bit value (2^37), but the system is a 32-bit system. The value was silently truncated to its low 32 bits, which are all zero, so SHMMAX was being set to 0 and every call to shmget failed.
Changing the line to
kernel.shmmax = 4294967295
and applying it immediately with
echo 4294967295 >/proc/sys/kernel/shmmax
restored SHMMAX, and the tests completed successfully.