New ceph cluster keeps pgs undersized+peered and rbd ls is stuck

I've created a new Ceph cluster with 1 mon, 1 mds, 1 mgr and 15 OSDs. After setup everything looks right, but the PGs stay in the undersized+peered state. All disks are freshly formatted, stand-alone XFS, ranging from 3 TB to 4 TB, with no partition table. The OSD logs show nothing useful.
Here is my ceph -s output:
cluster:
id: 19e50b60-31b0-467a-8ea9-6c37742a1f77
health: HEALTH_WARN
Reduced data availability: 8 pgs inactive
Degraded data redundancy: 8 pgs undersized
1 monitors have not enabled msgr2
services:
mon: 1 daemons, quorum wuminghan-K600-1G (age 25m)
mgr: wuminghan-K600-1G(active, since 24m)
osd: 15 osds: 15 up (since 23m), 15 in (since 8h)
data:
pools: 1 pools, 8 pgs
objects: 0 objects, 0 B
usage: 15 GiB used, 135 GiB / 150 GiB avail
pgs: 100.000% pgs not active
8 undersized+peered
Here is my ceph.conf:
[global]
fsid = 19e50b60-31b0-467a-8ea9-6c37742a1f77
mon initial members = wuminghan-K600-1G
mon host = 192.168.0.237
public network = 192.168.0.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 1024
osd pool default size = 1
osd pool default min size = 1
osd pool default pg num = 333
osd pool default pgp num = 333
osd crush chooseleaf type = 1
mon allow pool delete = true
[mgr.wuminghan-K600-1G]
host = wuminghan-K600-1G
[mds.wuminghan-K600-1G]
host = wuminghan-K600-1G
[osd.0]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee2112b4c78
debug osd = 20
debug filestore = 20
[osd.1]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee26601e571
debug osd = 20
debug filestore = 20
[osd.2]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee266717eb3
debug osd = 20
debug filestore = 20
[osd.3]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee266806f32
debug osd = 20
debug filestore = 20
[osd.4]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee266808610
debug osd = 20
debug filestore = 20
[osd.5]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee266808651
debug osd = 20
debug filestore = 20
[osd.6]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee266808b36
debug osd = 20
debug filestore = 20
[osd.7]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee2bb578f3e
debug osd = 20
debug filestore = 20
[osd.8]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee2bb57915c
debug osd = 20
debug filestore = 20
[osd.9]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee2bbc75bb4
debug osd = 20
debug filestore = 20
[osd.10]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee2bbd63771
debug osd = 20
debug filestore = 20
[osd.11]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee2bbd63795
debug osd = 20
debug filestore = 20
[osd.12]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee2bbd64ee9
debug osd = 20
debug filestore = 20
[osd.13]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee2bbd64fe8
debug osd = 20
debug filestore = 20
[osd.14]
host = wuminghan-K600-1G
devs = /dev/disk/by-id/wwn-0x50014ee2bbd65340
debug osd = 20
debug filestore = 20
Every step followed the manual setup instructions in the official Ceph documentation. I had also tried a ceph-deploy setup earlier, without success.
I can create a pool successfully, but running rbd ls or rbd pool init rbd hangs forever with no output.
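For reference, the pool replication settings and CRUSH placement can be inspected with commands like these (a generic sketch, output omitted; rbd is the pool name used above):
ceph osd pool ls detail          # shows size/min_size and crush_rule per pool
ceph osd pool get rbd size
ceph osd crush rule dump         # shows the failure domain each rule selects
ceph pg dump_stuck inactive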

Related

Why is my ceph cluster's raw used value (964G) in the global section far higher than the used value (244G) in the pools section?
[en@ceph01 ~]$ sudo ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
6.00TiB 5.06TiB 964GiB 15.68
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
.rgw.root 1 1.09KiB 0 1.56TiB 4
default.rgw.control 2 0B 0 1.56TiB 8
default.rgw.meta 3 0B 0 1.56TiB 0
default.rgw.log 4 0B 0 1.56TiB 207
cephfs_data 5 244GiB 9.22 2.34TiB 4829661
cephfs_meta 6 168MiB 0 2.34TiB 4160
[en@ceph01 ~]$ sudo ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE USE DATA OMAP META AVAIL %USE VAR PGS
0 hdd 2.00000 1.00000 2.00TiB 331GiB 326GiB 1.64GiB 3.38GiB 1.68TiB 16.17 1.03 77
1 hdd 2.00000 1.00000 2.00TiB 346GiB 341GiB 1.69GiB 3.51GiB 1.66TiB 16.90 1.08 78
2 hdd 2.00000 1.00000 2.00TiB 286GiB 282GiB 1.31GiB 2.96GiB 1.72TiB 13.97 0.89 69
TOTAL 6.00TiB 964GiB 949GiB 4.64GiB 9.86GiB 5.06TiB 15.68
MIN/MAX VAR: 0.89/1.08 STDDEV: 1.24
info about ceph cluster:
>pool 5 'cephfs_data' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 33 flags hashpspool stripe_width 0 application cephfs
>pool 6 'cephfs_meta' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 31 flags hashpspool stripe_width 0 application cephfs
> max_osd 3
This is most likely due to bluestore_min_alloc_size_hdd being set to 64K.
More info here: ceph df (octopus) shows USED is 7 times higher than STORED in erasure coded pool
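If you want to verify this, the effective value can be read back from a running OSD, for example (assuming the OSD admin socket is reachable, or a release new enough for the central config store):
ceph config get osd bluestore_min_alloc_size_hdd
ceph daemon osd.0 config get bluestore_min_alloc_size_hdd
Note that bluestore_min_alloc_size_hdd only takes effect when an OSD is created, so changing it means redeploying the OSDs.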

Kubernetes: kube-scheduler is not correctly scoring nodes for pod assignment

I am running Kubernetes with Rancher, and I am seeing weird behavior from the kube-scheduler. After adding a third node, I expect pods to start getting scheduled and assigned to it. However, the kube-scheduler gives this new third node, node3, the lowest score, even though it has almost no pods running on it, and I expect it to receive the highest score.
Here are the logs from the kube-scheduler:
scheduling_queue.go:815] About to try and schedule pod namespace1/pod1
scheduler.go:456] Attempting to schedule pod: namespace1/pod1
predicates.go:824] Schedule Pod namespace1/pod1 on Node node1 is allowed, Node is running only 94 out of 110 Pods.
predicates.go:1370] Schedule Pod namespace1/pod1 on Node node1 is allowed, existing pods anti-affinity terms satisfied.
predicates.go:824] Schedule Pod namespace1/pod1 on Node node3 is allowed, Node is running only 4 out of 110 Pods.
predicates.go:1370] Schedule Pod namespace1/pod1 on Node node3 is allowed, existing pods anti-affinity terms satisfied.
predicates.go:824] Schedule Pod namespace1/pod1 on Node node2 is allowed, Node is running only 95 out of 110 Pods.
predicates.go:1370] Schedule Pod namespace1/pod1 on Node node2 is allowed, existing pods anti-affinity terms satisfied.
resource_allocation.go:78] pod1 -> node1: BalancedResourceAllocation, capacity 56000 millicores 270255251456 memory bytes, total request 40230 millicores 122473676800 memory bytes, score 7
resource_allocation.go:78] pod1 -> node1: LeastResourceAllocation, capacity 56000 millicores 270255251456 memory bytes, total request 40230 millicores 122473676800 memory bytes, score 3
resource_allocation.go:78] pod1 -> node3: BalancedResourceAllocation, capacity 56000 millicores 270255251456 memory bytes, total request 800 millicores 807403520 memory bytes, score 9
resource_allocation.go:78] pod1 -> node3: LeastResourceAllocation, capacity 56000 millicores 270255251456 memory bytes, total request 800 millicores 807403520 memory bytes, score 9
resource_allocation.go:78] pod1 -> node2: BalancedResourceAllocation, capacity 56000 millicores 270255247360 memory bytes, total request 43450 millicores 133693440000 memory bytes, score 7
resource_allocation.go:78] pod1 -> node2: LeastResourceAllocation, capacity 56000 millicores 270255247360 memory bytes, total request 43450 millicores 133693440000 memory bytes, score 3
generic_scheduler.go:748] pod1_namespace1 -> node1: TaintTolerationPriority, Score: (10)
generic_scheduler.go:748] pod1_namespace1 -> node3: TaintTolerationPriority, Score: (10)
generic_scheduler.go:748] pod1_namespace1 -> node2: TaintTolerationPriority, Score: (10)
selector_spreading.go:146] pod1 -> node1: SelectorSpreadPriority, Score: (10)
selector_spreading.go:146] pod1 -> node3: SelectorSpreadPriority, Score: (10)
selector_spreading.go:146] pod1 -> node2: SelectorSpreadPriority, Score: (10)
generic_scheduler.go:748] pod1_namespace1 -> node1: SelectorSpreadPriority, Score: (10)
generic_scheduler.go:748] pod1_namespace1 -> node3: SelectorSpreadPriority, Score: (10)
generic_scheduler.go:748] pod1_namespace1 -> node2: SelectorSpreadPriority, Score: (10)
generic_scheduler.go:748] pod1_namespace1 -> node1: NodeAffinityPriority, Score: (0)
generic_scheduler.go:748] pod1_namespace1 -> node3: NodeAffinityPriority, Score: (0)
generic_scheduler.go:748] pod1_namespace1 -> node2: NodeAffinityPriority, Score: (0)
interpod_affinity.go:232] pod1 -> node1: InterPodAffinityPriority, Score: (0)
interpod_affinity.go:232] pod1 -> node3: InterPodAffinityPriority, Score: (0)
interpod_affinity.go:232] pod1 -> node2: InterPodAffinityPriority, Score: (10)
generic_scheduler.go:803] Host node1 => Score 100040
generic_scheduler.go:803] Host node3 => Score 100038
generic_scheduler.go:803] Host node2 => Score 100050
scheduler_binder.go:256] AssumePodVolumes for pod "namespace1/pod1", node "node2"
scheduler_binder.go:266] AssumePodVolumes for pod "namespace1/pod1", node "node2": all PVCs bound and nothing to do
factory.go:727] Attempting to bind pod1 to node2
I can tell from the logs that your pod will always be scheduled on node2, because it seems like you have some sort of pod affinity that scores an additional 10 points, bringing it to 50.
What's odd is that by my count node3 should score 48, but it seems like 10 points are lost somewhere (totaling 38). Perhaps it's because of the affinity, some entry not shown in the logs, or simply a bug in the way the scheduler does the calculation. You'll probably have to dig deep into the kube-scheduler code if you'd like to find out more.
This is what I have:
node1 7 + 3 + 10 + 10 + 10 = 40
node2 7 + 3 + 10 + 10 + 10 + 10 = 50
node3 9 + 9 + 10 + 10 + 10 = 48
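If you want to confirm the pod affinity that gives node2 the extra 10 points, you can inspect the pod's affinity spec and what is already running on node2 (names taken from the logs above):
kubectl -n namespace1 get pod pod1 -o jsonpath='{.spec.affinity}'
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=node2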

CEPH raw space usage

I can't understand where my ceph raw space has gone.
cluster 90dc9682-8f2c-4c8e-a589-13898965b974
health HEALTH_WARN 72 pgs backfill; 26 pgs backfill_toofull; 51 pgs backfilling; 141 pgs stuck unclean; 5 requests are blocked > 32 sec; recovery 450170/8427917 objects degraded (5.341%); 5 near full osd(s)
monmap e17: 3 mons at {enc18=192.168.100.40:6789/0,enc24=192.168.100.43:6789/0,enc26=192.168.100.44:6789/0}, election epoch 734, quorum 0,1,2 enc18,enc24,enc26
osdmap e3326: 14 osds: 14 up, 14 in
pgmap v5461448: 1152 pgs, 3 pools, 15252 GB data, 3831 kobjects
31109 GB used, 7974 GB / 39084 GB avail
450170/8427917 objects degraded (5.341%)
18 active+remapped+backfill_toofull
1011 active+clean
64 active+remapped+wait_backfill
8 active+remapped+wait_backfill+backfill_toofull
51 active+remapped+backfilling
recovery io 58806 kB/s, 14 objects/s
OSD tree (each host has 2 OSD):
# id weight type name up/down reweight
-1 36.45 root default
-2 5.44 host enc26
0 2.72 osd.0 up 1
1 2.72 osd.1 up 0.8227
-3 3.71 host enc24
2 0.99 osd.2 up 1
3 2.72 osd.3 up 1
-4 5.46 host enc22
4 2.73 osd.4 up 0.8
5 2.73 osd.5 up 1
-5 5.46 host enc18
6 2.73 osd.6 up 1
7 2.73 osd.7 up 1
-6 5.46 host enc20
9 2.73 osd.9 up 0.8
8 2.73 osd.8 up 1
-7 0 host enc28
-8 5.46 host archives
12 2.73 osd.12 up 1
13 2.73 osd.13 up 1
-9 5.46 host enc27
10 2.73 osd.10 up 1
11 2.73 osd.11 up 1
Real usage:
/dev/rbd0 14T 7.9T 5.5T 59% /mnt/ceph
Pool size:
osd pool default size = 2
Pools:
ceph osd lspools
0 data,1 metadata,2 rbd,
rados df
pool name category KB objects clones degraded unfound rd rd KB wr wr KB
data - 0 0 0 0 0 0 0 0 0
metadata - 0 0 0 0 0 0 0 0 0
rbd - 15993591918 3923880 0 444545 0 82936 1373339 2711424 849398218
total used 32631712348 3923880
total avail 8351008324
total space 40982720672
Raw usage is 4x the real usage. As I understand it, it should be 2x?
Yes, it should be 2x. I'm not really sure the real usage is 7.9T, though. Why do you check this value on the mapped disk?
These are my pools:
pool name KB objects clones degraded unfound rd rd KB wr wr KB
admin-pack 7689982 1955 0 0 0 693841 3231750 40068930 353462603
public-cloud 105432663 26561 0 0 0 13001298 638035025 222540884 3740413431
rbdkvm_sata 32624026697 7968550 31783 0 0 4950258575 232374308589 12772302818 278106113879
total used 98289353680 7997066
total avail 34474223648
total space 132763577328
You can see that the total amount of used space is roughly 3 times the used space in the pool rbdkvm_sata.
ceph -s shows the same result too:
pgmap v11303091: 5376 pgs, 3 pools, 31220 GB data, 7809 kobjects
93736 GB used, 32876 GB / 123 TB avail
I don't think you have just one rbd image. The output of "ceph osd lspools" indicates that you have 3 pools and one of them is named "metadata" (maybe you were using CephFS). /dev/rbd0 appeared because you mapped that image, but you could have other images as well. To list the images you can use "rbd list -p <pool-name>". You can see an image's info with "rbd info -p <pool-name> <image-name>".
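For example, with the rbd pool from the lspools output above (rbd du needs a reasonably recent release; on older clusters, run rbd info per image instead):
rbd list -p rbd
rbd du -p rbd                    # provisioned vs. actual usage per image
rbd info -p rbd <image-name>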

Hung processes resume if attached to strace

I have a network program written in C using TCP sockets. Sometimes the client program hangs forever waiting for input from the server. Specifically, the client hangs in a select() call on an fd intended to read characters sent by the server.
I am using strace to find out where the process got stuck. However, sometimes when I attach strace to the hung client process, it immediately resumes its execution and exits properly. Not all hung processes exhibit this behavior; some stay stuck in select() even when I attach strace to them. But most of the processes resume their execution when attached to strace.
I am curious what causes the processes to resume when attached to strace. It might give me clues about why the client processes are hanging.
Any ideas what causes a hung process to resume its execution when attached to strace?
Update:
Here's the output of strace on hung processes.
> sudo strace -p 25645
Process 25645 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) # 0 (0) ---
--- SIGSTOP (Stopped (signal)) # 0 (0) ---
[ Process PID=25645 runs in 32 bit mode. ]
select(6, [3 5], NULL, NULL, NULL) = 2 (in [3 5])
read(5, "\0", 8192) = 1
write(2, "", 0) = 0
read(3, "====Setup set_oldtempbehaio"..., 8192) = 555
write(1, "====Setup set_oldtempbehaio"..., 555) = 555
select(6, [3 5], NULL, NULL, NULL) = 2 (in [3 5])
read(5, "", 8192) = 0
read(3, "", 8192) = 0
close(5) = 0
kill(25652, SIGKILL) = 0
exit_group(0) = ?
Process 25645 detached
_
> sudo strace -p 14462
Process 14462 attached - interrupt to quit
[ Process PID=14462 runs in 32 bit mode. ]
read(0, 0xff85fdbc, 8192) = -1 EIO (Input/output error)
shutdown(3, 1 /* send */) = 0
exit_group(0) = ?
_
> sudo strace -p 7517
Process 7517 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) # 0 (0) ---
--- SIGSTOP (Stopped (signal)) # 0 (0) ---
[ Process PID=7517 runs in 32 bit mode. ]
connect(3, {sa_family=AF_INET, sin_port=htons(300), sin_addr=inet_addr("100.64.220.98")}, 16) = -1 ETIMEDOUT (Connection timed out)
close(3) = 0
dup(2) = 3
fcntl64(3, F_GETFL) = 0x1 (flags O_WRONLY)
close(3) = 0
write(2, "dsd13: Connection timed out\n", 30) = 30
write(2, "Error code : 110\n", 17) = 17
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(1) = ?
Process 7517 detached
It's not just select(): the processes (of the same program) are stuck in various system calls before I attach strace. They suddenly resume once strace is attached. If I don't attach strace, they just hang there forever.
Update 2:
I learned that strace can resume a process that was previously stopped (a process in the T state). Now I am trying to understand why these processes went into the T state and what the cause is. Here's the /proc/<pid>/status information:
> cat /proc/12554/status
Name: someone
State: T (stopped)
SleepAVG: 88%
Tgid: 12554
Pid: 12554
PPid: 9754
TracerPid: 0
Uid: 5000 5000 5000 5000
Gid: 48986 48986 48986 48986
FDSize: 256
Groups: 9149 48986
VmPeak: 1992 kB
VmSize: 1964 kB
VmLck: 0 kB
VmHWM: 608 kB
VmRSS: 608 kB
VmData: 156 kB
VmStk: 20 kB
VmExe: 16 kB
VmLib: 1744 kB
VmPTE: 20 kB
Threads: 1
SigQ: 54/73728
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000006
SigCgt: 0000000000004000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
Cpus_allowed: 00000000,00000000,00000000,0000000f
Mems_allowed: 00000000,00000001
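For reference, a couple of generic ways to look at the stopped process itself before attaching anything (PID taken from the status output above):
ps -o pid,stat,wchan:30,cmd -p 12554
cat /proc/12554/wchan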
strace uses ptrace. The ptrace man page has this:
Since attaching sends SIGSTOP and the tracer usually suppresses it,
this may cause a stray EINTR return from the currently executing system
call in the tracee, as described in the "Signal injection and
suppression" section.
Are you seeing select return EINTR?
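One way to separate "stopped" from "blocked in select()" without tracing is to send SIGCONT instead of attaching strace and watch the state change, e.g.:
grep State: /proc/12554/status
kill -CONT 12554                 # resumes a T-state process without a tracer
grep State: /proc/12554/status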

uwsgi long timeouts

I am using Ubuntu 12, nginx, uwsgi 1.9 with a socket, and Django 1.5.
Config:
[uwsgi]
base_path = /home/someuser/web/
module = server.manage_uwsgi
uid = www-data
gid = www-data
virtualenv = /home/someuser
master = true
vacuum = true
harakiri = 20
harakiri-verbose = true
log-x-forwarded-for = true
profiler = true
no-orphans = true
max-requests = 10000
cpu-affinity = 1
workers = 4
reload-on-as = 512
listen = 3000
Client tests from Windows7:
C:\Users\user>C:\AppServ\Apache2.2\bin\ab.exe -c 255 -n 5000 http://www.someweb.com/about/
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/
Benchmarking www.someweb.com (be patient)
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Finished 5000 requests
Server Software: nginx
Server Hostname: www.someweb.com
Server Port: 80
Document Path: /about/
Document Length: 1881 bytes
Concurrency Level: 255
Time taken for tests: 66.669814 seconds
Complete requests: 5000
Failed requests: 1
(Connect: 1, Length: 0, Exceptions: 0)
Write errors: 0
Total transferred: 10285000 bytes
HTML transferred: 9405000 bytes
Requests per second: 75.00 [#/sec] (mean)
Time per request: 3400.161 [ms] (mean)
Time per request: 13.334 [ms] (mean, across all concurrent requests)
Transfer rate: 150.64 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 8 207.8 1 9007
Processing: 10 3380 11480.5 440 54421
Waiting: 6 1060 3396.5 271 48424
Total: 11 3389 11498.5 441 54423
Percentage of the requests served within a certain time (ms)
50% 441
66% 466
75% 499
80% 519
90% 3415
95% 36440
98% 54407
99% 54413
100% 54423 (longest request)
I have set following options too:
echo 3000 > /proc/sys/net/core/netdev_max_backlog
echo 3000 > /proc/sys/net/core/somaxconn
So:
1) The first ~3000 requests complete super fast. I see progress in ab and in the uwsgi request logs -
[pid: 5056|app: 0|req: 518/4997] 80.114.157.139 () {30 vars in 378 bytes} [Thu Mar 21 12:37:31 2013] GET /about/ => generated 1881 bytes in 4 msecs (HTTP/1.0 200) 3 headers in 105 bytes (1 switches on core 0)
[pid: 5052|app: 0|req: 512/4998] 80.114.157.139 () {30 vars in 378 bytes} [Thu Mar 21 12:37:31 2013] GET /about/ => generated 1881 bytes in 4 msecs (HTTP/1.0 200) 3 headers in 105 bytes (1 switches on core 0)
[pid: 5054|app: 0|req: 353/4999] 80.114.157.139 () {30 vars in 378 bytes} [Thu Mar 21 12:37:31 2013] GET /about/ => generated 1881 bytes in 4 msecs (HTTP/1.0 200) 3 headers in 105 bytes (1 switches on core 0)
I don't have any broken pipes or worker respawns.
2) The next requests run very slowly or time out. It looks like some buffer becomes full and I have to wait for it to empty.
3) Some buffer becomes empty.
4) ~500 requests are processed super fast.
5) Some timeout.
6) see Nr. 4
7) see Nr. 5
8) see Nr. 4
9) see Nr. 5
....
....
I need your help.
Check with netstat and dmesg. You have probably exhausted ephemeral ports or filled the conntrack table.
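For example (the conntrack sysctls assume the nf_conntrack module is loaded; exact names vary a little between kernels):
netstat -ant | awk '{print $6}' | sort | uniq -c           # connections per TCP state (look for TIME_WAIT / SYN_RECV pile-ups)
dmesg | grep -iE 'conntrack|table full'                    # "nf_conntrack: table full" means connections are being dropped
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
cat /proc/sys/net/ipv4/ip_local_port_range                 # size of the ephemeral port range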