pyspark conf and yarn top memory discrepancies

On an EMR cluster, the main node shows the following after running yarn top:
YARN top - 13:27:57, up 0d, 1:34, 1 active users, queue(s): root
NodeManager(s): 6 total, 6 active, 0 unhealthy, 2 decommissioned, 0 lost, 0 rebooted
Queue(s) Applications: 3 running, 8 submitted, 0 pending, 5 completed, 0 killed, 0 failed
Queue(s) Mem(GB): 18 available, 189 allocated, 1555 pending, 0 reserved
Queue(s) VCores: 44 available, 20 allocated, 132 pending, 0 reserved
Queue(s) Containers: 20 allocated, 132 pending, 0 reserved
APPLICATIONID USER TYPE QUEUE PRIOR #CONT #RCONT VCORES RVCORES MEM RMEM VCORESECS MEMSECS %PROGR TIME NAME
application_1663674823778_0002 hadoop spark default 0 10 0 10 0 99G 0G 18754 187254 10.00 00:00:33 PyS
application_1663674823778_0003 hadoop spark default 0 9 0 9 0 88G 0G 9446 84580 10.00 00:00:32 PyS
application_1663674823778_0008 hadoop spark default 0 1 0 1 0 0G 0G 382 334 10.00 00:00:06 PyS
Note that the PySpark apps application_1663674823778_0002 and application_1663674823778_0003 were started from the main node command line by simply executing pyspark (with no explicit config changes).
However, application_1663674823778_0008 was started with the following command: pyspark --conf spark.executor.memory=11g --conf spark.driver.memory=12g. Despite this (test) config customization, that app shows nothing but 0 in yarn top for both the regular and reserved memory columns.
Why is this?
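For reference, the configuration the shell actually applied can be checked from inside the session (a minimal sketch, assuming the default spark object that the pyspark shell creates):
# Run inside the pyspark shell that was started with the --conf flags above.
# The shell exposes a SparkSession as `spark`; its SparkConf lists the applied settings.
for key, value in spark.sparkContext.getConf().getAll():
    if "memory" in key:
        print(key, value)
# If the flags were picked up, this should include
# spark.executor.memory=11g and spark.driver.memory=12g.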

Related

ceph raw used is more than sum of used in all pools (ceph df detail)

First of all sorry for my poor English
In my Ceph cluster, when I run the ceph df detail command, it shows the following result:
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 62 TiB 52 TiB 10 TiB 10 TiB 16.47
ssd 8.7 TiB 8.4 TiB 370 GiB 377 GiB 4.22
TOTAL 71 TiB 60 TiB 11 TiB 11 TiB 14.96
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
rbd-kubernetes 36 288 GiB 71.56k 865 GiB 1.73 16 TiB N/A N/A 71.56k 0 B 0 B
rbd-cache 41 2.4 GiB 208.09k 7.2 GiB 0.09 2.6 TiB N/A N/A 205.39k 0 B 0 B
cephfs-metadata 51 529 MiB 221 1.6 GiB 0 16 TiB N/A N/A 221 0 B 0 B
cephfs-data 52 1.0 GiB 424 3.1 GiB 0 16 TiB N/A N/A 424 0 B 0 B
So I have a question about this result.
As you can see, the sum of the used storage across my pools is less than 1 TiB, but in the RAW STORAGE section the used space on the HDDs is 10 TiB, and it is growing every day. I think this is unusual and something is wrong with this Ceph cluster.
Also, FYI, the output of ceph osd dump | grep replicated is:
pool 36 'rbd-kubernetes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 244 pg_num_target 64 pgp_num_target 64 last_change 1376476 lfor 2193/2193/2193 flags hashpspool,selfmanaged_snaps,creating tiers 41 read_tier 41 write_tier 41 stripe_width 0 application rbd
pool 41 'rbd-cache' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 1376476 lfor 2193/2193/2193 flags hashpspool,incomplete_clones,selfmanaged_snaps,creating tier_of 36 cache_mode writeback target_bytes 1000000000000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 decay_rate 0 search_last_n 0 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 51 'cephfs-metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 31675 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 52 'cephfs-data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 742334 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
Ceph Version ceph -v
ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
Ceph OSD versions: ceph tell osd.* version returns the following for every OSD:
osd.0: {
"version": "ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)"
}
Ceph status ceph -s
cluster:
id: 6a86aee0-3171-4824-98f3-2b5761b09feb
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph-sn-03,ceph-sn-02,ceph-sn-01 (age 37h)
mgr: ceph-sn-01(active, since 4d), standbys: ceph-sn-03, ceph-sn-02
mds: cephfs-shared:1 {0=ceph-sn-02=up:active} 2 up:standby
osd: 63 osds: 63 up (since 41h), 63 in (since 41h)
task status:
scrub status:
mds.ceph-sn-02: idle
data:
pools: 4 pools, 384 pgs
objects: 280.29k objects, 293 GiB
usage: 11 TiB used, 60 TiB / 71 TiB avail
pgs: 384 active+clean
Based on the provided data, you should consider the following scenarios:
All of your pools are replicated with size 3, so every object is stored multiple times. A write is acknowledged as soon as min_size copies are on disk, but Ceph keeps replicating up to the full replication size, so expect raw consumption of between min_size and size times the stored data (a quick check of this arithmetic is sketched below).
Ceph also stores metadata and logs for housekeeping purposes, which consumes additional storage.
If you run benchmarks via rados bench (or a similar tool) with the --no-cleanup parameter, the benchmark objects remain permanently in the cluster and keep consuming storage.
These are just a few of the possible explanations.
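To illustrate the first point (this only accounts for replication, not the full HDD gap), the USED column of a replicated pool is roughly STORED multiplied by the pool's size. A small Python sketch using the rbd-kubernetes figures from the output above:
# USED is approximately STORED * replica count for a replicated pool.
replica_count = 3        # "replicated size 3" from ceph osd dump
stored_gib = 288         # STORED column for rbd-kubernetes
used_gib = 865           # USED column for rbd-kubernetes
print(stored_gib * replica_count)   # 864, close to the 865 GiB reported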

RookIO Ceph cluster: mon c is low on available space message

I have set up a RookIO 1.4 cluster in Kubernetes 1.18, with 3 nodes and 1 TB of storage allocated on each of them.
After creating the cluster, when I run ceph status, the cluster status shows HEALTH_WARN with mon c is low on available space.
There is no data stored yet. Why does the status show low on available space? How do I clear this error?
[root@rook-ceph-tools-6bdcd78654-sfjvl /]# ceph status
cluster:
id: ad42764d-aa28-4da5-a828-2d87205aff08
health: HEALTH_WARN
mon c is low on available space
services:
mon: 3 daemons, quorum a,b,c (age 37m)
mgr: a(active, since 36m)
osd: 3 osds: 3 up (since 37m), 3 in (since 37m)
data:
pools: 1 pools, 1 pgs
objects: 0 objects, 0 B
usage: 3.0 GiB used, 3.6 TiB / 3.6 TiB avail
pgs: 1 active+clean
All three nodes have the same size of storage:
sdb 8:16 0 1.2T 0 disk
└─ceph--a6cd601d--7584--4b1f--bf82--48c95437f351-osd--data--ae1bc856--8ded--4b1e--8c87--30ca0f0959a3 253:3 0 1.2T 0 lvm
sdb 8:16 0 1.2T 0 disk
└─ceph--ccaf7144--d6a0--441c--bcd5--6a09d056bd7a-osd--data--36a9b28c--7207--400a--936b--edfb3255ce0b 253:3 0 1.2T 0 lvm
sdb 8:16 0 1.2T 0 disk
└─ceph--53e9b8a9--8925--4b21--a6ea--f8e17a322d5c-osd--data--6b1e779c--a18a--4e4d--960e--73ca9473d02f 253:3 0 1.2T 0 lvm
Thanks
SR
This alert is about the monitor's disk space, which is normally stored in /var/lib/ceph/mon. That path is on the root filesystem, which is unrelated to your OSDs' block devices. The warning is raised when the path has less than 30% available space (see mon_data_avail_warn, which defaults to 30).
You can change that threshold to silence the alert, or resize the filesystem behind that path to give the monitor's RocksDB data more space.
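To make the threshold concrete, here is a small Python sketch of the check (an illustration only, not Ceph's actual implementation; the path and default value are the ones mentioned above):
import shutil

MON_DATA_AVAIL_WARN = 30  # default percentage threshold of mon_data_avail_warn
usage = shutil.disk_usage("/var/lib/ceph/mon")
free_pct = usage.free / usage.total * 100
if free_pct < MON_DATA_AVAIL_WARN:
    print(f"mon is low on available space ({free_pct:.1f}% free)")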
As Seena explained, the warning appears because the available space is less than 30%. In this case, you can compact the mon data with the following command:
ceph tell mon.`hostname -s` compact
Another way to trigger data compaction for the mon is to add the following to ceph.conf and then restart the mon:
[mon]
mon compact on start = true

joblib Parallel running out of memory

I have something like this
outputs = Parallel(n_jobs=12, verbose=10)(delayed(_process_article)(article, config) for article in data)
Case 1: Run on ubuntu with 80 cores:
CPU(s): 80
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
There are a total of 90,000 tasks. At around 67k it fails and is terminated.
joblib.externals.loky.process_executor.BrokenProcessPool: A process in the executor was terminated abruptly, the pool is not usable anymore.
When I monitor top at around 67k tasks, I see a sharp fall in available memory:
top - 11:40:25 up 2 days, 18:35, 4 users, load average: 7.09, 7.56, 7.13
Tasks: 32 total, 3 running, 29 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.6 us, 2.6 sy, 0.0 ni, 89.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 33554432 total, 40 free, 33520996 used, 33396 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 40 avail Mem
Case 2: Mac with 8 cores
hw.physicalcpu: 4
hw.logicalcpu: 8
But on the Mac it is much, much slower, and surprisingly it does not get killed at 67k.
Additionally, I reduced the parallelism (in case 1) to 2 and 4, and it still fails :(
Why is this happening? Has anyone faced this issue before and has a fix?
Note: when I run for 50,000 tasks it runs well and does not give any problems.
Thank you!
Got a machine with an increased memory of 128GB and that solved the problem!
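If adding RAM is not an option, a common workaround is to submit the work in chunks, so the parent process never accumulates all 90,000 results at once and the worker pool is recycled between chunks. A sketch only, reusing the _process_article, config and data names from the question:
from joblib import Parallel, delayed

CHUNK_SIZE = 5000  # illustrative; tune to the available memory
outputs = []
for start in range(0, len(data), CHUNK_SIZE):
    chunk = data[start:start + CHUNK_SIZE]
    # A fresh Parallel call per chunk bounds the number of in-flight results.
    outputs.extend(
        Parallel(n_jobs=12, verbose=10)(
            delayed(_process_article)(article, config) for article in chunk
        )
    )
    # Optionally persist `outputs` to disk here and clear the list to cap memory.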

Why my new Ceph cluster status never shows 'HEALTH_OK'?

I'm working on setting up a Ceph cluster with Docker and the image 'ceph/daemon:v3.1.0-stable-3.1-luminous-centos-7'. But after the cluster has been set up, the ceph status command never reaches HEALTH_OK. Here is my cluster's information; it has enough disk space and the network is all right.
My questions are:
Why does Ceph not replicate the 'undersized' PGs?
How can I fix it?
Thank you very much!
➜ ~ ceph -s
cluster:
id: 483a61c4-d3c7-424d-b96b-311d2c6eb69b
health: HEALTH_WARN
Degraded data redundancy: 3 pgs undersized
services:
mon: 3 daemons, quorum pc-10-10-0-13,pc-10-10-0-89,pc-10-10-0-160
mgr: pc-10-10-0-89(active), standbys: pc-10-10-0-13, pc-10-10-0-160
mds: cephfs-1/1/1 up {0=pc-10-10-0-160=up:active}, 2 up:standby
osd: 5 osds: 5 up, 5 in
rbd-mirror: 3 daemons active
rgw: 3 daemons active
data:
pools: 6 pools, 68 pgs
objects: 212 objects, 5.27KiB
usage: 5.02GiB used, 12.7TiB / 12.7TiB avail
pgs: 65 active+clean
3 active+undersized
➜ ~ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 12.73497 root default
-5 0.90959 host pc-10-10-0-13
3 hdd 0.90959 osd.3 up 1.00000 1.00000
-7 0.90959 host pc-10-10-0-160
4 hdd 0.90959 osd.4 up 1.00000 1.00000
-3 10.91579 host pc-10-10-0-89
0 hdd 3.63860 osd.0 up 1.00000 1.00000
1 hdd 3.63860 osd.1 up 1.00000 1.00000
2 hdd 3.63860 osd.2 up 1.00000 1.00000
➜ ~ ceph osd pool ls detail
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 24 flags hashpspool stripe_width 0 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 24 flags hashpspool stripe_width 0 application cephfs
pool 3 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 13 pgp_num 13 last_change 27 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 13 pgp_num 13 last_change 30 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 13 pgp_num 13 last_change 32 owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 13 pgp_num 13 last_change 34 flags hashpspool stripe_width 0 application rgw
@itsafire This is not the solution. He is asking for a solution, not for a hardware recommendation.
I'm running multiple Ceph clusters with 8 nodes and 5 nodes. I always use 2 replicas with multiple CRUSH maps (for SSD, SAS and 72k drives).
Why do you need 3 replicas if you are using a small cluster with limited resources?
Could you please explain why my solution is a recipe for disaster? You have a good reputation and I'm not sure how you earned it. Maybe just by replying with recommendations, not solutions.
Create a new pool with size 2 and min_size 1.
For pg_num, use the Ceph PG Calculator: https://ceph.com/pgcalc/
It seems you created a three-node cluster with different OSD configurations and sizes. The standard CRUSH rule tells Ceph to keep 3 copies of a PG on different hosts. If there is not enough space to spread the PGs over the three hosts, your cluster will never be healthy.
It is always a good idea to start with a set of equally sized hosts (RAM, CPU, OSDs).
Update for the discussion about clusters with size 2 vs 3:
Don't use 2 replicas. Go for 3. Ceph started out with a default size of 2, but this was changed to 3 in Ceph 0.82 (Firefly release).
Why? Because if one drive fails, you are left with only one drive containing your data. Should that drive also fail while recovery is running, your data is gone for good.
See this thread on the ceph-users mailing list:
2 replicas isn't safe, no matter how big or small the cluster is. With
disks becoming larger recovery times will grow. In that window you don't
want to run on a single replica.

ZooKeeper node counter?

I have a ZooKeeper cluster with just 2 nodes; each zoo.conf has the following:
# Servers
server.1=10.138.0.8:2888:3888
server.2=10.138.0.9:2888:3888
The same two lines are present in both configs.
[root@zk1-prod supervisor.d]# echo mntr | nc 10.138.0.8 2181
zk_version 3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
zk_avg_latency 0
zk_max_latency 0
zk_min_latency 0
zk_packets_received 5
zk_packets_sent 4
zk_num_alive_connections 1
zk_outstanding_requests 0
zk_server_state follower
zk_znode_count 4
zk_watch_count 0
zk_ephemerals_count 0
zk_approximate_data_size 27
zk_open_file_descriptor_count 28
zk_max_file_descriptor_count 4096
[root@zk1-prod supervisor.d]# echo mntr | nc 10.138.0.9 2181
zk_version 3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
zk_avg_latency 0
zk_max_latency 0
zk_min_latency 0
zk_packets_received 3
zk_packets_sent 2
zk_num_alive_connections 1
zk_outstanding_requests 0
zk_server_state leader
zk_znode_count 4
zk_watch_count 0
zk_ephemerals_count 0
zk_approximate_data_size 27
zk_open_file_descriptor_count 29
zk_max_file_descriptor_count 4096
zk_followers 1
zk_synced_followers 1
zk_pending_syncs 0
So why is zk_znode_count == 4?
Znodes are not ZooKeeper servers.
From Hadoop: The Definitive Guide:
Zookeeper doesn’t have files and directories, but a unified concept of
a node, called a znode, which acts both as a container of data (like a
file) and a container of other znodes (like a directory).
zk_znode_count refers to the number of znodes stored on that ZooKeeper server. In your ZK ensemble, each server has four znodes.
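For illustration, the znodes behind that counter can be listed with the third-party kazoo client (an assumption on my part; the question only uses the four-letter mntr command). A minimal sketch:
from kazoo.client import KazooClient  # pip install kazoo

zk = KazooClient(hosts="10.138.0.8:2181")
zk.start()

def walk(path="/"):
    # Print every znode reachable from `path`; each printed line is one znode
    # counted by zk_znode_count.
    print(path)
    for child in zk.get_children(path):
        walk(path.rstrip("/") + "/" + child)

walk()
zk.stop()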