Ceph deployment: No cluster conf found in /etc/ceph with fsid

I followed the official documentation for a quick Ceph deployment, and I always get the same error at the step where the OSDs are activated:
ceph-deploy osd activate node2:/var/local/osd0 node3:/var/local/osd1
This command does not work and always produces the same log:
[2016-01-29 14:19:54,024][ceph_deploy.conf][DEBUG ] found configuration file at: /home/admin/.cephdeploy.conf
[2016-01-29 14:19:54,032][ceph_deploy.cli][INFO ] Invoked (1.5.30): /usr/bin/ceph-deploy osd activate node2:/var/local/osd0 node3:/var/local/osd1
[2016-01-29 14:19:54,033][ceph_deploy.cli][INFO ] ceph-deploy options:
[2016-01-29 14:19:54,033][ceph_deploy.cli][INFO ] username : None
[2016-01-29 14:19:54,034][ceph_deploy.cli][INFO ] verbose : False
[2016-01-29 14:19:54,035][ceph_deploy.cli][INFO ] overwrite_conf : False
[2016-01-29 14:19:54,036][ceph_deploy.cli][INFO ] subcommand : activate
[2016-01-29 14:19:54,037][ceph_deploy.cli][INFO ] quiet : False
[2016-01-29 14:19:54,038][ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f866bc90368>
[2016-01-29 14:19:54,040][ceph_deploy.cli][INFO ] cluster : ceph
[2016-01-29 14:19:54,041][ceph_deploy.cli][INFO ] func : <function osd at 0x7f866bee75f0>
[2016-01-29 14:19:54,042][ceph_deploy.cli][INFO ] ceph_conf : None
[2016-01-29 14:19:54,043][ceph_deploy.cli][INFO ] default_release : False
[2016-01-29 14:19:54,044][ceph_deploy.cli][INFO ] disk : [('node2', '/var/local/osd0', None), ('node3', '/var/local/osd1', None)]
[2016-01-29 14:19:54,058][ceph_deploy.osd][DEBUG ] Activating cluster ceph disks node2:/var/local/osd0: node3:/var/local/osd1:
[2016-01-29 14:19:56,498][node2][DEBUG ] connection detected need for sudo
[2016-01-29 14:19:58,497][node2][DEBUG ] connected to host: node2
[2016-01-29 14:19:58,516][node2][DEBUG ] detect platform information from remote host
[2016-01-29 14:19:58,601][node2][DEBUG ] detect machine type
[2016-01-29 14:19:58,609][node2][DEBUG ] find the location of an executable
[2016-01-29 14:19:58,613][ceph_deploy.osd][INFO ] Distro info: debian 8.3 jessie
[2016-01-29 14:19:58,615][ceph_deploy.osd][DEBUG ] activating host node2 disk /var/local/osd0
[2016-01-29 14:19:58,617][ceph_deploy.osd][DEBUG ] will use init type: systemd
[2016-01-29 14:19:58,622][node2][INFO ] Running command: sudo ceph-disk -v activate --mark-init systemd --mount /var/local/osd0
[2016-01-29 14:19:58,816][node2][WARNING] DEBUG:ceph-disk:Cluster uuid is eacfd426-58a3-44e8-a6f0-636a6b23e89e
[2016-01-29 14:19:58,818][node2][WARNING] INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[2016-01-29 14:19:59,401][node2][WARNING] Traceback (most recent call last):
[2016-01-29 14:19:59,403][node2][WARNING] File "/usr/sbin/ceph-disk", line 3576, in <module>
[2016-01-29 14:19:59,405][node2][WARNING] main(sys.argv[1:])
[2016-01-29 14:19:59,406][node2][WARNING] File "/usr/sbin/ceph-disk", line 3530, in main
[2016-01-29 14:19:59,407][node2][WARNING] args.func(args)
[2016-01-29 14:19:59,409][node2][WARNING] File "/usr/sbin/ceph-disk", line 2432, in main_activate
[2016-01-29 14:19:59,410][node2][WARNING] init=args.mark_init,
[2016-01-29 14:19:59,412][node2][WARNING] File "/usr/sbin/ceph-disk", line 2258, in activate_dir
[2016-01-29 14:19:59,413][node2][WARNING] (osd_id, cluster) = activate(path, activate_key_template, init)
[2016-01-29 14:19:59,415][node2][WARNING] File "/usr/sbin/ceph-disk", line 2331, in activate
[2016-01-29 14:19:59,416][node2][WARNING] raise Error('No cluster conf found in ' + SYSCONFDIR + ' with fsid %s' % ceph_fsid)
[2016-01-29 14:19:59,418][node2][WARNING] __main__.Error: Error: No cluster conf found in /etc/ceph with fsid eacfd426-58a3-44e8-a6f0-636a6b23e89e
[2016-01-29 14:19:59,443][node2][ERROR ] RuntimeError: command returned non-zero exit status: 1
[2016-01-29 14:19:59,445][ceph_deploy][ERROR ] RuntimeError: Failed to execute command: ceph-disk -v activate --mark-init systemd --mount /var/local/osd0
I am working on Debian 8.3. I completed all the steps up to OSD activation. I mounted 10 GB ext4 partitions at /var/local/osd0 on node2 and /var/local/osd1 on node3.
After the osd prepare command, some files appeared in those directories, but the osd activate command still fails.
Can somebody help me?

It happened because I had the same disk IDs on all nodes. After I changed the IDs with fdisk, my cluster started working.
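For anyone hitting the same message: the error text points at an fsid mismatch. ceph-disk reads the ceph_fsid file inside the OSD data directory and then looks for a cluster config in /etc/ceph with the same fsid on that node. A minimal diagnostic sketch (paths taken from the log above; the push step assumes ceph-deploy is run from the admin node's working directory):

# On the failing OSD node: compare the fsid recorded in the OSD data dir
# with the fsid in the local cluster config.
cat /var/local/osd0/ceph_fsid
grep fsid /etc/ceph/ceph.conf

# If they differ (or ceph.conf is missing), push the current config from the
# admin node and retry the activation.
ceph-deploy --overwrite-conf config push node2 node3
ceph-deploy osd activate node2:/var/local/osd0 node3:/var/local/osd1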

Related

GitLab K8s Runner fails for get_sources

We are trying to move our GitLab runners from standard CentOS VMs to Kubernetes.
But after setup and registration, the pipeline fails with an unknown error:
Running with gitlab-runner 15.7.0 (259d2fd4)
on Kubernetes-local JXRw3mH1
Preparing the "kubernetes" executor
00:00
Using Kubernetes namespace: gitlab-runner
Using Kubernetes executor with image gitlab-test.domain:5005/image:latest ...
Using attach strategy to execute scripts...
Preparing environment
00:04
Waiting for pod gitlab-runner/runner-jxrw3mh1-project-290-concurrent-0dpd88 to be running, status is Pending
Running on runner-jxrw3mh1-project-290-concurrent-0dpd88 via gitlab-runner-d7df6c548-hsgxg...
Getting source from Git repository
00:00
error: could not lock config file /root/.gitconfig: Read-only file system
Cleaning up project directory and file based variables
00:01
ERROR: Job failed: command terminated with exit code 1
Inside the log of the job pod we found:
helper Running on runner-jxrw3mh1-project-290-concurrent-0dpd88 via gitlab-runner-d7df6c548-hsgxg...
helper
helper {"command_exit_code": 0, "script": "/scripts-290-207166/prepare_script"}
helper error: could not lock config file /root/.gitconfig: Read-only file system
helper
helper {"command_exit_code": 1, "script": "/scripts-290-207166/get_sources"}
helper
helper {"command_exit_code": 0, "script": "/scripts-290-207166/cleanup_file_variables"}
Inside the log of the gitlab-runner pod we found:
Starting in container "helper" the command ["gitlab-runner-build" "<<<" "/scripts-290-207167/get_sources" "2>&1 | tee -a /logs-290-207167/output.log"] with script: #!/usr/bin/env bash
if set -o | grep pipefail > /dev/null; then set -o pipefail; fi; set -o errexit
set +o noclobber
: | eval $'export FF_CMD_DISABLE_DELAYED_ERROR_LEVEL_EXPANSION=$\'false\'\nexport FF_NETWORK_PER_BUILD=$\'false\'\nexport FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=$\'false\'\nexport FF_USE_DIRECT_DOWNLOAD
exit 0
job=207167 project=290 runner=JXRw3mH1
Remote process exited with the status: CommandExitCode: 1, Script: /scripts-290-207167/get_sources job=207167 project=290 runner=JXRw3mH1
Container "helper" exited with error: command terminated with exit code 1 job=207167 project=290 runner=JXRw3mH1
Notes:
the error "error: could not lock config file /root/.gitconfig: Read-only file system" occurs because the current user inside the container is not root
the file /logs-290-207167/output.log contains the log of the job pod
From a shell inside the job pod we also tested some git commands and successfully performed fetch and clone using our personal credentials (the same user that runs the pipeline from the GitLab GUI).
We think the problem may be related to the gitlab-ci-token, but we have run out of ideas... :frowning:
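One way to confirm the read-only $HOME theory is to open a shell in the helper container of a running job pod and check the effective user and whether $HOME is writable. A minimal diagnostic sketch (the namespace and pod name are taken from the log above and will differ per job):

kubectl -n gitlab-runner exec -it runner-jxrw3mh1-project-290-concurrent-0dpd88 -c helper -- \
  sh -c 'id; echo "HOME=$HOME"; touch "$HOME/.gitconfig" && echo "HOME is writable" || echo "HOME is read-only"'

If $HOME resolves to /root while the job runs as a non-root user, pointing HOME at a writable path (for example through the runner's environment settings in config.toml) is a common workaround for the .gitconfig lock error.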

airflow SSH operator error: [Errno 2] No such file or directory:

airflow 1.10.10
minikube 1.22.0
amazon emr
I am running Airflow on Kubernetes (minikube).
DAGs are synced from GitHub.
spark-submit is run on Amazon EMR in CLI mode.
In order to do that, I attach the EMR .pem key.
So I fetch the .pem key from AWS S3 in an extra init container that pulls the awscli image and mounts the volume at airflow/sshpem.
The error is reported when I create a connection from the Airflow web UI with:
"con_type": "ssh"
"key_file": "/opt/sshepm/emr.pem"
SSH operator error: [Errno 2] No such file or directory: '/opt/airflow/sshpem/emr.pem'
The file is there. I think it is related to some path or permission issue, since I fetch emr.pem in the extra init container and its owner was root. Although I temporarily changed the owner to 1000:1000, there is still some issue: the Airflow web UI can't access this directory when it reads the key.
The full log is below:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/contrib/operators/ssh_operator.py", line 108, in execute
    with self.ssh_hook.get_conn() as ssh_client:
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/contrib/hooks/ssh_hook.py", line 194, in get_conn
    client.connect(**connect_kwargs)
  File "/home/airflow/.local/lib/python3.6/site-packages/paramiko/client.py", line 446, in connect
    passphrase,
  File "/home/airflow/.local/lib/python3.6/site-packages/paramiko/client.py", line 677, in _auth
    key_filename, pkey_class, passphrase
  File "/home/airflow/.local/lib/python3.6/site-packages/paramiko/client.py", line 586, in _key_from_filepath
    key = klass.from_private_key_file(key_path, password)
  File "/home/airflow/.local/lib/python3.6/site-packages/paramiko/pkey.py", line 235, in from_private_key_file
    key = cls(filename=filename, password=password)
  File "/home/airflow/.local/lib/python3.6/site-packages/paramiko/rsakey.py", line 55, in __init__
    self._from_private_key_file(filename, password)
  File "/home/airflow/.local/lib/python3.6/site-packages/paramiko/rsakey.py", line 175, in _from_private_key_file
    data = self._read_private_key_file("RSA", filename, password)
  File "/home/airflow/.local/lib/python3.6/site-packages/paramiko/pkey.py", line 307, in _read_private_key_file
    with open(filename, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/sshpem/emr-pa.pem'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 979, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/opt/airflow/class101-airflow/plugins/operators/emr_ssh_operator.py", line 107, in execute
    super().execute(context)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/contrib/operators/ssh_operator.py", line 177, in execute
    raise AirflowException("SSH operator error: {0}".format(str(e)))
airflow.exceptions.AirflowException: SSH operator error: [Errno 2] No such file or directory: '/opt/airflow/sshpem/emr-pa.pem'
[2021-07-14 05:40:31,624] Marking task as UP_FOR_RETRY. dag_id=test_staging, task_id=extract_categories_from_mongo, execution_date=20210712T190000, start_date=20210714T054031, end_date=20210714T054031
[2021-07-14 05:40:36,303] Task exited with return code 1
airflow home: /opt/airflow
dags : /opt/airflow//dags
pemkey : /opt/sshpem/
airflow.cfg: /opt/airflow
airflow_env:
export PATH="/home/airflow/.local/bin:$PATH"
My YAML file:
airflow:
  image:
    repository: airflow
  executor: KubernetesExecutor
  extraVolumeMounts:
    - name: sshpem
      mountPath: /opt/airflow/sshpem
  extraVolumes:
    - name: sshpem
      emptyDir: {}
scheduler:
  extraInitContainers:
    - name: emr-key-file-download
      image: amazon/aws-cli
      command: [
        "sh",
        "-c",
        "aws s3 cp s3://mykeyfile/path.my.pem&& \
         chown -R 1000:1000 /opt/airflow/sshpem/"
      ]
      volumeMounts:
        - mountPath: /opt/airflow/sshpem
          name: sshpem
Are you using KubernetesExecutor or CeleryExecutor?
If the former, then you have to make sure the extra init container is added to the pod_template you are using (tasks in KubernetesExecutor run as separate pods).
If the latter, you should make sure the extra init container is also added for the workers, not only for the scheduler.
By the way, Airflow 1.10 reached end-of-life on June 17th, 2021, and it will not receive even critical security fixes. You can watch our talk from the recent Airflow Summit, "Keep your Airflow Secure" - https://airflowsummit.org/sessions/2021/panel-airflow-security/ - to learn why it is important to upgrade to Airflow 2.
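To narrow down the missing-key error, it can also help to check whether the .pem file actually ends up inside the pod that executes the task (with KubernetesExecutor that is the per-task worker pod, not the scheduler pod). A rough diagnostic sketch, assuming kubectl access; the namespace and pod name below are placeholders:

# List the Airflow pods and pick the one that ran the failing task.
kubectl get pods -n airflow

# Check that the key is mounted where the connection expects it and is readable
# by the airflow user (uid 1000 in the question).
kubectl exec -n airflow <task-pod-name> -- ls -l /opt/airflow/sshpem/
kubectl exec -n airflow <task-pod-name> -- id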

Can't run Yocto image with runqemu as described in the documentation

I would like to run a Yocto image in QEMU, but the way it is described in the documentation doesn't work.
To verify that I'm not doing anything wrong, I followed the steps in the quick build guide:
install required packages
clone poky
checkout correct version
source build environment
set machine to qemux86 in local.conf
add sstate-mirrors and allow parallel build
start bitbake core-image-sato
do something else for the next few hours
When I now try to run that image in QEMU as described in the documentation:
runqemu qemux86
I just get the following output, nothing happens:
runqemu - INFO - Running MACHINE=qemux86 bitbake -e...
runqemu - INFO - Continuing with the following parameters:
KERNEL: [/mnt/wwn-0x50014ee0576fe9ef-part1/test_python3_in_yocto/build/tmp/deploy/images/qemux86/bzImage--4.14.76+git0+3435617380_2c5caa7e84-r0-qemux86-20190305114605.bin]
MACHINE: [qemux86]
FSTYPE: [ext4]
ROOTFS: [/mnt/wwn-0x50014ee0576fe9ef-part1/test_python3_in_yocto/build/tmp/deploy/images/qemux86/core-image-base-qemux86-20190305151244.rootfs.ext4]
CONFFILE: [/mnt/wwn-0x50014ee0576fe9ef-part1/test_python3_in_yocto/build/tmp/deploy/images/qemux86/core-image-base-qemux86-20190305151244.qemuboot.conf]
runqemu - INFO - Setting up tap interface under sudo
runqemu - INFO - Network configuration: 192.168.7.2::192.168.7.1:255.255.255.0
runqemu - INFO - Running /mnt/wwn-0x50014ee0576fe9ef-part1/test_python3_in_yocto/build/tmp/work/x86_64-linux/qemu-helper-native/1.0-r1/recipe-sysroot-native/usr/bin/qemu-system-i386 -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:02 -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -drive file=/mnt/wwn-0x50014ee0576fe9ef-part1/test_python3_in_yocto/build/tmp/deploy/images/qemux86/core-image-base-qemux86-20190305151244.rootfs.ext4,if=virtio,format=raw -vga vmware -show-cursor -usb -device usb-tablet -device virtio-rng-pci -cpu pentium2 -m 256 -serial mon:vc -serial null -kernel /mnt/wwn-0x50014ee0576fe9ef-part1/test_python3_in_yocto/build/tmp/deploy/images/qemux86/bzImage--4.14.76+git0+3435617380_2c5caa7e84-r0-qemux86-20190305114605.bin -append 'root=/dev/vda rw highres=off mem=256M ip=192.168.7.2::192.168.7.1:255.255.255.0 vga=0 uvesafb.mode_option=640x480-32 oprofile.timer=1 uvesafb.task_timeout=-1 '
When I try to run qemu without graphics I get a kernel panic:
runqemu nographic qemux86
...
[ 6.171521] EXT4-fs (vda): mounted filesystem with ordered data mode. Opts: (null)
[ 6.172937] VFS: Mounted root (ext4 filesystem) on device 253:0.
[ 6.175806] devtmpfs: error mounting -2
[ 6.237143] Freeing unused kernel memory: 852K
[ 6.238001] Write protecting the kernel text: 8752k
[ 6.238722] Write protecting the kernel read-only data: 2376k
[ 6.244382] Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentation/admin.
[ 6.245455] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.76-yocto-standard #1
[ 6.245913] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/4
[ 6.246730] Call Trace:
[ 6.247788] dump_stack+0x58/0x72
[ 6.248071] ? rest_init+0x90/0xc0
[ 6.248320] panic+0x94/0x1c6
[ 6.248529] ? rest_init+0xc0/0xc0
[ 6.248807] kernel_init+0xda/0xf0
[ 6.249046] ret_from_fork+0x2e/0x38
[ 6.249834] Kernel Offset: 0xd800000 from 0xc1000000 (relocation range: 0xc0000000-0xd07dbfff)
[ 6.250595] ---[ end Kernel panic - not syncing: No working init found. Try passing init= option to kernel. See Linux Documentat.
Is there something missing in the documentation?
I also tried different images...
core-image-sato
core-image-base
core-image-minimal
and finally I tried it with version 2.5.2 (sumo) instead of 2.6.1 (thud)... but no change...
When I googled the issue I didn't really find anything helpful except increasing the memory, which didn't change anything, so I hope someone here knows what's wrong...
You are probably missing a suffix on the second parameter.
Browse to the folder in the error message and check whether the qemux86 directory exists; chances are it doesn't.
I had to
runqemu qemux86-64
and not
runqemu qemux86
If you haven't changed the MACHINE variable in the local.conf file in your build directory, check the file and make sure the machine name you pass to runqemu is identical.
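In other words, the MACHINE the image was built for and the argument passed to runqemu have to match. A quick sanity check from the build directory before launching (a sketch assuming the default deploy layout):

# Which MACHINE was the image built for?
grep '^MACHINE' conf/local.conf

# Which machine directories actually exist in the deploy area?
ls tmp/deploy/images/

# Then pass exactly that machine name (and optionally the image) to runqemu, e.g.:
runqemu qemux86-64 core-image-sato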

django celery daemon doesn't work: it can't create pid file

I can't init my celeryd and celerybeat services. I used the same code in another environment (configuring everything from scratch), but here it doesn't work. I think it is a permissions problem, but I couldn't get it running. Please help me.
This is my Celery config in settings.py:
CELERY_RESULT_BACKEND = 'djcelery.backends.database:DatabaseBackend'
CELERY_BROKER_URL = 'amqp://localhost'
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERYBEAT_SCHEDULER = 'djcelery.schedulers.DatabaseScheduler'
CELERY_ENABLE_UTC = True
CELERY_TIMEZONE = TIME_ZONE  # 'America/Lima'
CELERY_BEAT_SCHEDULE = {}
This is my /etc/init.d/celeryd file:
https://github.com/celery/celery/blob/master/extra/generic-init.d/celeryd
Then I ran:
sudo chmod 755 /etc/init.d/celeryd
sudo chown admin1:admin1 /etc/init.d/celeryd
and I created /etc/default/celeryd:
CELERY_BIN="/home/admin1/Env/tos/bin/celery"
# App instance to use
CELERY_APP="tos"
# Where to chdir at start.
CELERYD_CHDIR="/home/admin1/webapps/tos/"
# Extra command-line arguments to the worker
CELERYD_OPTS="--time-limit=300 --concurrency=8"
# %n will be replaced with the first part of the nodename.
CELERYD_LOG_FILE="/var/log/celery/%n%I.log"
CELERYD_PID_FILE="/var/run/celery/%n.pid"
# Workers should run as an unprivileged user.
# You need to create this user manually (or you can choose
# a user/group combination that already exists (e.g., nobody).
CELERYD_USER="admin1"
CELERYD_GROUP="admin1"
# If enabled pid and log directories will be created if missing,
# and owned by the userid/group configured.
CELERY_CREATE_DIRS=1
export SECRET_KEY="foobar"
For celerybeat I created /etc/init.d/celerybeat with:
https://github.com/celery/celery/blob/master/extra/generic-init.d/celerybeat
and started the services like this:
sudo /etc/init.d/celeryd start
sudo /etc/init.d/celerybeat start
and I get this error:
sudo: unable to resolve host SIO
celery init v10.1.
Using config script: /etc/default/celeryd
celery multi v3.1.25 (Cipater)
> Starting nodes...
> celery#SIO-PRODUCION: OK
ERROR: Pidfile (celery.pid) already exists.
Seems we're already running? (pid: 30198)
/etc/init.d/celeryd: 515: /etc/init.d/celeryd: --pidfile=/var/run/celery/%n.pid: not found
I also get it when I check with:
sudo C_FAKEFORK=1 sh -x /etc/init.d/celeryd start
some data .....
starting nodes...
ERROR: Pidfile (celery.pid) already exists.
Seems we're already running? (pid: 30198)
> celery#SIO-PRODUCION: * Child terminated with errorcode 73
FAILED
+ --pidfile=/var/run/celery/%n.pid
/etc/init.d/celeryd: 515: /etc/init.d/celeryd: --pidfile=/var/run/celery/%n.pid: not found
+ --logfile=/var/log/celery/%n%I.log
/etc/init.d/celeryd: 517: /etc/init.d/celeryd: --logfile=/var/log/celery/%n%I.log: not found
+ --loglevel=INFO
/etc/init.d/celeryd: 519: /etc/init.d/celeryd: --loglevel=INFO: not found
+ --app=tos
/etc/init.d/celeryd: 521: /etc/init.d/celeryd: --app=tos: not found
+ --time-limit=300 --concurrency=8
/etc/init.d/celeryd: 523: /etc/init.d/celeryd: --time-limit=300: not found
+ exit 0
I had the same problem and I resolved it like this:
rm -f /webapps/celery.pid && /etc/init.d/celeryd start
You can try doing this: before running Celery, clean up the pid files by gluing the commands together with &&.
Another way is to create a Django management command:
import shlex
import subprocess

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    def handle(self, *args, **options):
        kill_worker_cmd = 'pkill -9 celery'
        subprocess.call(shlex.split(kill_worker_cmd))
Call it before you start, or just
pkill -9 celery
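Since the original symptom was the pid file, it is also worth making sure the directories referenced by CELERYD_PID_FILE and CELERYD_LOG_FILE exist and are owned by the worker user before the init script starts. A small sketch based on the /etc/default/celeryd values above (note that /var/run is often a tmpfs, so these directories can vanish after a reboot):

sudo mkdir -p /var/run/celery /var/log/celery
sudo chown admin1:admin1 /var/run/celery /var/log/celery
sudo /etc/init.d/celeryd start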

MongoDB thinks it's running a replica?

I'm running MongoDB (version 2.4) behind a Django application served by Apache.
In the past few days, I'm seeing the following error come up many times an hour in the Apache logs, on all sorts of different requests:
AutoReconnect: not master and slaveOk=false
I did not explicitly create a replica set, and to the best of my knowledge am not running one. rs.status() says that we are not running with --replSet.
Mongo is run with:
'mongod --dbpath /srv/db/mongodb/ --fork --logpath /var/log/mongodb.log --logappend --auth'
There is one mongod process running on the server.
What's going on here?
Edit -
Here's the tail end of the stack trace, as requested.
File "/var/www/sefaria_dev/sefaria/texts.py", line 916, in parse_ref
shorthand = db.index.find_one({"maps": {"$elemMatch": {"from": pRef["book"]}}})
File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 604, in find_one
for result in self.find(spec_or_id, *args, **kwargs).limit(-1):
File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 904, in next
if len(self.__data) or self._refresh():
File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 848, in _refresh
self.__uuid_subtype))
File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 800, in __send_message
self.__uuid_subtype)
File "/usr/local/lib/python2.7/dist-packages/pymongo/helpers.py", line 98, in _unpack_response
raise AutoReconnect(error_object["$err"])
AutoReconnect: not master and slaveOk=false
rs.status() returns:
{
"ok" : 0,
"errmsg" : "not running with --replSet"
}
rs.conf() returns null.
I haven't seen any indication of an error in mongodb.log that corresponds to the apache.log errors.
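One way to narrow this down is to ask the mongod the application actually talks to whether it considers itself a master, and to check whether the application code still passes replica-set options when connecting. A hedged diagnostic sketch (the application path is taken from the traceback above):

# A standalone mongod should report "ismaster" : true here.
mongo --eval 'printjson(db.isMaster())'

# Look for leftover replica-set options in the application's Mongo connection code.
grep -rn "replicaSet\|ReplicaSetConnection\|MongoReplicaSetClient" /var/www/sefaria_dev/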