Why does Podman report "Not enough IDs available in namespace" with different UIDs? - CentOS

Facts:
Rootless podman works perfectly for uid 1480
Rootless podman fails for uid 2088
CentOS 7
Kernel 3.10.0-1062.1.2.el7.x86_64
podman version 1.4.4
Almost the entire environment is stripped (via env -i) when testing the two users, ruling out environment differences
The filesystem for /tmp is xfs
The capsh output of the two users is identical but for uid / username
Both UIDs have identical entries in /etc/sub{u,g}id files
The $HOME/.config/containers/storage.conf is the default and is identical between the two with the exception of the uids. The storage.conf is below for reference.
I wrote the following shell script to demonstrate just how similar an environment the two are operating in:
#!/bin/sh
for i in 1480 2088; do
sudo chroot --userspec "$i":10 / env -i /bin/sh <<EOF
echo -------------- $i ----------------
/usr/sbin/capsh --print
grep "$i" /etc/subuid /etc/subgid
mkdir /tmp/"$i"
HOME=/tmp/"$i"
export HOME
podman --root=/tmp/"$i" info > /tmp/podman."$i"
podman run --rm --root=/tmp/"$i" docker.io/library/busybox printf "\tCOMPLETE\n"
echo -----------END $i END-------------
EOF
sudo rm -rf /tmp/"$i"
done
Here's the output of the script:
$ sh /tmp/podman-fail.sh
[sudo] password for functional:
-------------- 1480 ----------------
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,35,36
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=1480(functional)
gid=10(wheel)
groups=0(root)
/etc/subuid:1480:100000:65536
/etc/subgid:1480:100000:65536
Trying to pull docker.io/library/busybox...Getting image source signatures
Copying blob 7c9d20b9b6cd done
Copying config 19485c79a9 done
Writing manifest to image destination
Storing signatures
ERRO[0003] could not find slirp4netns, the network namespace won't be configured: exec: "slirp4netns": executable file not found in $PATH
COMPLETE
-----------END 1480 END-------------
-------------- 2088 ----------------
Current: =
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,35,36
Securebits: 00/0x0/1'b0
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
uid=2088(broken)
gid=10(wheel)
groups=0(root)
/etc/subuid:2088:100000:65536
/etc/subgid:2088:100000:65536
Trying to pull docker.io/library/busybox...Getting image source signatures
Copying blob 7c9d20b9b6cd done
Copying config 19485c79a9 done
Writing manifest to image destination
Storing signatures
ERRO[0003] Error while applying layer: ApplyLayer exit status 1 stdout: stderr: there might not be enough IDs available in the namespace (requested 65534:65534 for /home): lchown /home: invalid argument
ERRO[0003] Error pulling image ref //busybox:latest: Error committing the finished image: error adding layer with blob "sha256:7c9d20b9b6cda1c58bc4f9d6c401386786f584437abbe87e58910f8a9a15386b": ApplyLayer exit status 1 stdout: stderr: there might not be enough IDs available in the namespace (requested 65534:65534 for /home): lchown /home: invalid argument
Failed
Error: unable to pull docker.io/library/busybox: unable to pull image: Error committing the finished image: error adding layer with blob "sha256:7c9d20b9b6cda1c58bc4f9d6c401386786f584437abbe87e58910f8a9a15386b": ApplyLayer exit status 1 stdout: stderr: there might not be enough IDs available in the namespace (requested 65534:65534 for /home): lchown /home: invalid argument
Here's the storage.conf for the 1480 uid. It's identical except s/1480/2088/:
[storage]
driver = "vfs"
runroot = "/run/user/1480"
graphroot = "/tmp/1480/.local/share/containers/storage"
[storage.options]
size = ""
remap-uids = ""
remap-gids = ""
remap-user = ""
remap-group = ""
ostree_repo = ""
skip_mount_home = ""
mount_program = ""
mountopt = ""
[storage.options.thinpool]
autoextend_percent = ""
autoextend_threshold = ""
basesize = ""
blocksize = ""
directlvm_device = ""
directlvm_device_force = ""
fs = ""
log_level = ""
min_free_space = ""
mkfsarg = ""
mountopt = ""
use_deferred_deletion = ""
use_deferred_removal = ""
xfs_nospace_max_retries = ""
You can see there's basically no difference between the two podman info outputs for the users:
$ diff -u /tmp/podman.1480 /tmp/podman.2088
--- /tmp/podman.1480 2019-10-17 22:41:21.991573733 -0400
+++ /tmp/podman.2088 2019-10-17 22:41:26.182584536 -0400
@@ -7,7 +7,7 @@
Distribution:
distribution: '"centos"'
version: "7"
- MemFree: 45654056960
+ MemFree: 45652697088
MemTotal: 67306323968
OCIRuntime:
package: containerd.io-1.2.6-3.3.el7.x86_64
@@ -24,7 +24,7 @@
kernel: 3.10.0-1062.1.2.el7.x86_64
os: linux
rootless: true
- uptime: 30h 17m 50.23s (Approximately 1.25 days)
+ uptime: 30h 17m 54.42s (Approximately 1.25 days)
registries:
blocked: null
insecure: null
@@ -35,14 +35,14 @@
- quay.io
- registry.centos.org
store:
- ConfigFile: /tmp/1480/.config/containers/storage.conf
+ ConfigFile: /tmp/2088/.config/containers/storage.conf
ContainerStore:
number: 0
GraphDriverName: vfs
GraphOptions: null
- GraphRoot: /tmp/1480
+ GraphRoot: /tmp/2088
GraphStatus: {}
ImageStore:
number: 0
- RunRoot: /run/user/1480
- VolumePath: /tmp/1480/volumes
+ RunRoot: /run/user/2088
+ VolumePath: /tmp/2088/volumes
I refuse to believe there's an if (2088 == uid) { abort(); } or similar nonsense somewhere in podman's source code. What am I missing?

Does podman system migrate fix the "there might not be enough IDs available in the namespace" error for you?
It did for me and others:
https://github.com/containers/libpod/issues/3421
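A minimal sketch of trying that here (the user name broken / uid 2088 comes from the question; the command must be run as that user, not as root):
sudo -iu broken podman system migrate
sudo -iu broken podman run --rm docker.io/library/busybox printf '\tCOMPLETE\n'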

AFAICT, sub-UID and GID ranges should not overlap between users. For reference, here is what the useradd manpage has to say about the matter:
SUB_GID_MIN (number), SUB_GID_MAX (number), SUB_GID_COUNT (number)
If /etc/subuid exists, the commands useradd and newusers (unless the user already have subordinate group IDs) allocate SUB_GID_COUNT unused group IDs from the range SUB_GID_MIN to SUB_GID_MAX for each new user.
The default values for SUB_GID_MIN, SUB_GID_MAX, SUB_GID_COUNT are respectively 100000, 600100000 and 65536.
SUB_UID_MIN (number), SUB_UID_MAX (number), SUB_UID_COUNT (number)
If /etc/subuid exists, the commands useradd and newusers (unless the user already have subordinate user IDs) allocate SUB_UID_COUNT unused user IDs from the range SUB_UID_MIN to SUB_UID_MAX for each new user.
The default values for SUB_UID_MIN, SUB_UID_MAX, SUB_UID_COUNT are respectively 100000, 600100000 and 65536.
The key word is unused.
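For example, a hypothetical non-overlapping layout for the two users from the question would look like the following; the exact ranges are illustrative, the point is that each user gets its own unused 65536-ID block:
$ grep -E '^(functional|broken):' /etc/subuid /etc/subgid
/etc/subuid:functional:100000:65536
/etc/subuid:broken:165536:65536
/etc/subgid:functional:100000:65536
/etc/subgid:broken:165536:65536
After changing the ranges, podman system migrate (see the other answer) can be run as each user so the new mapping is picked up by existing rootless storage.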

CentOS 7.6 does not support rootless buildah by default - see https://github.com/containers/buildah/pull/1166 and https://www.redhat.com/en/blog/preview-running-containers-without-root-rhel-76

Related

Celery: Routing tasks issue - only one worker consume all tasks from all queues

I have some tasks with manually configured routes and 3 workers, each configured to consume tasks from a specific queue. But only one worker is consuming all of the tasks and I have no idea how to fix this issue.
My celeryconfig.py
class CeleryConfig:
enable_utc = True
timezone = 'UTC'
imports = ('events.tasks')
broker_url = Config.BROKER_URL
broker_transport_options = {'visibility_timeout': 10800} # 3H
worker_hijack_root_logger = False
task_protocol = 2
task_ignore_result = True
task_publish_retry_policy = {'max_retries': 3, 'interval_start': 0, 'interval_step': 0.2, 'interval_max': 0.2}
task_time_limit = 30 # sec
task_soft_time_limit = 15 # sec
task_default_queue = 'low'
task_default_exchange = 'low'
task_default_routing_key = 'low'
task_queues = (
Queue('daily', Exchange('daily'), routing_key='daily'),
Queue('high', Exchange('high'), routing_key='high'),
Queue('normal', Exchange('normal'), routing_key='normal'),
Queue('low', Exchange('low'), routing_key='low'),
Queue('service', Exchange('service'), routing_key='service'),
Queue('award', Exchange('award'), routing_key='award'),
)
task_route = {
# -- SCHEDULE QUEUE --
base_path.format(task='refresh_rank'): {'queue': 'daily'},
# -- HIGH QUEUE --
base_path.format(task='execute_order'): {'queue': 'high'},
# -- NORMAL QUEUE --
base_path.format(task='calculate_cost'): {'queue': 'normal'},
# -- SERVICE QUEUE --
base_path.format(task='send_pin'): {'queue': 'service'},
# -- LOW QUEUE
base_path.format(task='invite_to_tournament'): {'queue': 'low'},
# -- AWARD QUEUE
base_path.format(task='get_lesson_award'): {'queue': 'award'},
# -- TEST TASK
}
worker_concurrency = multiprocessing.cpu_count() * 2 + 1
worker_prefetch_multiplier = 1
worker_max_tasks_per_child = 1
worker_max_memory_per_child = 90000 # 90MB
beat_max_loop_interval = 60 * 5 # 5 min
I run workers in a docker, part of my stack.yml
version: "3.7"
services:
worker_high:
command: celery worker -l debug -A runcelery.celery -Q high -n worker.high@%h
worker_normal:
command: celery worker -l debug -A runcelery.celery -Q normal,award,service,low -n worker.normal@%h
worker_schedule:
command: celery worker -l debug -A runcelery.celery -Q daily -n worker.schedule@%h
beat:
command: celery beat -l debug -A runcelery.celery
flower:
command: flower -l debug -A runcelery.celery --port=5555
broker:
image: redis:5.0-alpine
I thought my config and run commands were correct, but docker logs and flower showed that only worker.normal consumes all of the tasks.
Update
Here is part of task.py:
def refresh_rank_in_tournaments():
logger.debug(f'Start task refresh_rank_in_tournaments')
return AnalyticBackgroundManager.refresh_tournaments_rank()
base_path is a shortcut for the full task path:
base_path = 'events.tasks.{task}'
execute_order task code:
@celery.task(bind=True, default_retry_delay=5)
def execute_order(self, private_id, **kwargs):
try:
return OrderBackgroundManager.execute_order(private_id, **kwargs)
except IEXException as exc:
raise self.retry(exc=exc)
This task is called in a view as tasks.execute_order.delay(id).
Your worker.normal is subscribed to the normal, award, service and low queues. Furthermore, the low queue is the default one, so every task that does not have an explicitly set queue will be executed on worker.normal.
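If the intent is to keep those queues on separate workers, one option is to subscribe each worker to exactly one queue. This is only a sketch: the module path runcelery.celery comes from the question, while the node names and log level are illustrative.
celery worker -l info -A runcelery.celery -Q high -n worker.high@%h
celery worker -l info -A runcelery.celery -Q normal -n worker.normal@%h
celery worker -l info -A runcelery.celery -Q low -n worker.low@%h
Every queue declared in task_queues still needs at least one subscribed worker; tasks routed to an unconsumed queue simply wait in the broker.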

AWS Glue job failing with OOM exception when changing column names

I have an ETL job where I load some data from S3 into a dynamic frame, relationalize it, and iterate through the dynamic frames returned. I want to query the result of this in Athena later, so I want to change the column names to replace '.' with '_' and lowercase them. To do this transformation I convert the DynamicFrame into a Spark DataFrame and rename the columns there. I've also seen another SO question where it turned out there is a reported problem with the AWS Glue rename field transform, so I've stayed away from that.
I've tried a couple of things, including adding a load size limit of 50MB, repartitioning the dataframe, using both dataframe.schema.names and dataframe.columns, using reduce instead of loops, and using Spark SQL to change the names - nothing has worked. I'm fairly certain that it's this transformation that's failing, because I've put some print statements in and the print right after the completion of this transformation never shows up. I used a UDF at one point but that also failed. I've tried the actual transformation using df.toDF(new_column_names) and df.withColumnRenamed(), but it never gets that far because I've not seen it get past retrieving the column names. Here's the code I've been using. I've been changing the actual name transformation as I said above, but the rest of it has stayed pretty much the same.
I've seen some people try to use spark.executor.memory, spark.driver.memory, spark.executor.memoryOverhead and spark.driver.memoryOverhead. I've set those to the maximum AWS Glue will allow, but to no avail.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import explode, col, lower, trim, regexp_replace
import copy
import json
import boto3
import botocore
import time
# ========================================================
# UTILITY FUNCTIONS
# ========================================================
def lower_and_pythonize(s=None):
if s is not None:
return s.replace('.', '_').lower()
else:
return None
# pyspark implementation of renaming
# exprs = [
# regexp_replace(lower(trim(col(c))),'\.' , '_').alias(c) if t == "string" else col(c)
# for (c, t) in data_frame.dtypes
# ]
# ========================================================
# END UTILITY FUNCTIONS
# ========================================================
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
#my params
bucket_name = '<my-s3-bucket>' # name of the bucket. do not include 's3://' thats added later
output_key = '<my-output-path>' # key where all of the output is saved
input_keys = ["<root-directory-i'm-using>"] # highest level key that holds all of the desired data
s3_exclusions = "[\"*.orc\"]" # list of strings to exclude. Documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-s3
s3_exclusions = s3_exclusions.replace('\n', '')
dfc_root_table_name = 'root' # name of the root table generated in the relationalize process
input_paths = ['s3://' + bucket_name + '/' + x for x in input_keys] # turn input keys into s3 paths
output_connection_opts = {"path": "s3://" + bucket_name + "/" + output_key} # dict of options. Documentation link found above the write_dynamic_frame.from_options line
s3_client = boto3.client('s3', 'us-east-1') # s3 client used for writing to s3
s3_resource = boto3.resource('s3', 'us-east-1') # s3 resource used for checking if key exists
group_mb = 50 # NOTE: 75 has proven to be too much when running on all of the april data
group_size = str(group_mb * 1024 * 1024)
input_connection_opts = {'paths': input_paths,
'groupFiles': 'inPartition',
'groupSize': group_size,
'recurse': True,
'exclusions': s3_exclusions} # dict of options. Documentation link found above the create_dynamic_frame_from_options line
print(sc._conf.get('spark.executor.cores'))
num_paritions = int(sc._conf.get('spark.executor.cores')) * 4
print('Loading all json files into DynamicFrame...')
loading_time = time.time()
df = glueContext.create_dynamic_frame_from_options(connection_type='s3', connection_options=input_connection_opts, format='json')
print('Done. Time to complete: {}s'.format(time.time() - loading_time))
# using the list of known null fields (at least on small sample size) remove them
#df = df.drop_fields(drop_paths)
# drop any remaining null fields. The above covers known problems that this step doesn't fix
print('Dropping null fields...')
dropping_time = time.time()
df_without_null = DropNullFields.apply(frame=df, transformation_ctx='df_without_null')
print('Done. Time to complete: {}s'.format(time.time() - dropping_time))
df = None
print('Relationalizing dynamic frame...')
relationalizing_time = time.time()
dfc = Relationalize.apply(frame=df_without_null, name=dfc_root_table_name, info="RELATIONALIZE", transformation_ctx='dfc', stageThreshold=3)
print('Done. Time to complete: {}s'.format(time.time() - relationalizing_time))
keys = dfc.keys()
keys.sort(key=lambda s: len(s))
print('Writting all dynamic frames to s3...')
writting_time = time.time()
for key in keys:
good_key = lower_and_pythonize(s=key)
data_frame = dfc.select(key).toDF()
# lowercase all the names and remove '.'
print('Removing . and _ from names for {} frame...'.format(key))
df_fix_names_time = time.time()
print('Repartitioning data frame...')
data_frame.repartition(num_paritions)
print('Done.')
#
print('Changing names...')
for old_name in data_frame.schema.names:
data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.','_').lower())
print('Done.')
#
df_now = DynamicFrame.fromDF(dataframe=data_frame, glue_ctx=glueContext, name='df_now')
print('Done. Time to complete: {}'.format(time.time() - df_fix_names_time))
# if a conflict of types appears, make it 2 columns
# https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
print('Fixing any type conficts for {} frame...'.format(key))
df_resolve_time = time.time()
resolved = ResolveChoice.apply(frame = df_now, choice = 'make_cols', transformation_ctx = 'resolved')
print('Done. Time to complete: {}'.format(time.time() - df_resolve_time))
# check if key exists in s3. if not make one
out_connect = copy.deepcopy(output_connection_opts)
out_connect['path'] = out_connect['path'] + '/' + str(good_key)
try:
s3_resource.Object(bucket_name, output_key + '/' + good_key + '/').load()
except botocore.exceptions.ClientError as e:
if e.response['Error']['Code'] == '404' or 'NoSuchKey' in e.response['Error']['Code']:
# object doesn't exist
s3_client.put_object(Bucket=bucket_name, Key=output_key+'/'+good_key + '/')
else:
print(e)
## https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html
print('Writing {} frame to S3...'.format(key))
df_writing_time = time.time()
datasink4 = glueContext.write_dynamic_frame.from_options(frame = df_now, connection_type = "s3", connection_options = out_connect, format = "orc", transformation_ctx = "datasink4")
out_connect = None
datasink4 = None
print('Done. Time to complete: {}'.format(time.time() - df_writing_time))
print('Done. Time to complete: {}s'.format(time.time() - writting_time))
job.commit()
Here is the error I'm getting
19/06/07 16:33:36 DEBUG Client:
client token: N/A
diagnostics: Application application_1559921043869_0001 failed 1 times due to AM Container for appattempt_1559921043869_0001_000001 exited with exitCode: -104
For more detailed output, check application tracking page:http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001Then, click on links to logs of each attempt.
Diagnostics: Container [pid=9630,containerID=container_1559921043869_0001_01_000001] is running beyond physical memory limits. Current usage: 5.6 GB of 5.5 GB physical memory used; 8.8 GB of 27.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1559921043869_0001_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 9630 9628 9630 9630 (bash) 0 0 115822592 675 /bin/bash -c LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m -Djava.io.tmpdir=/mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/tmp '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' '-Djavax.net.ssl.trustStore=ExternalAndAWSTrustStore.jks' '-Djavax.net.ssl.trustStoreType=JKS' '-Djavax.net.ssl.trustStorePassword=amazon' '-DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem' '-DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem' '-DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.deploy.PythonRunner' --primary-py-file runscript.py --arg 'script_2019-06-07-15-29-50.py' --arg '--JOB_NAME' --arg 'tss-json-to-orc' --arg '--JOB_ID' --arg 'j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe' --arg '--JOB_RUN_ID' --arg 'jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233' --arg '--job-bookmark-option' --arg 'job-bookmark-disable' --arg '--TempDir' --arg 's3://aws-glue-temporary-059866946490-us-east-1/zmcgrath' --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stdout 2> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stderr
|- 9677 9648 9630 9630 (python) 12352 2628 1418354688 261364 python runscript.py script_2019-06-07-15-29-50.py --JOB_NAME tss-json-to-orc --JOB_ID j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --JOB_RUN_ID jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --job-bookmark-option job-bookmark-disable --TempDir s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath
|- 9648 9630 9630 9630 (java) 265906 3083 7916974080 1207439 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m -Djava.io.tmpdir=/mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/tmp -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=kill -9 %p -Djavax.net.ssl.trustStore=ExternalAndAWSTrustStore.jks -Djavax.net.ssl.trustStoreType=JKS -Djavax.net.ssl.trustStorePassword=amazon -DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem -DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem -DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.deploy.PythonRunner --primary-py-file runscript.py --arg script_2019-06-07-15-29-50.py --arg --JOB_NAME --arg tss-json-to-orc --arg --JOB_ID --arg j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --arg --JOB_RUN_ID --arg jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --arg --job-bookmark-option --arg job-bookmark-disable --arg --TempDir --arg s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/__spark_conf__.properties
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1559921462650
final status: FAILED
tracking URL: http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001
user: root
Here are the log contents from the job
LogType:stdout
Log Upload Time:Fri Jun 07 16:33:36 +0000 2019
LogLength:487
Log Contents:
4
Loading all json files into DynamicFrame...
Done. Time to complete: 59.5056920052s
Dropping null fields...
null_fields [<some fields that were dropped>]
Done. Time to complete: 529.95293808s
Relationalizing dynamic frame...
Done. Time to complete: 2773.11689401s
Writting all dynamic frames to s3...
Removing . and _ from names for root frame...
Repartitioning data frame...
Done.
Changing names...
End of LogType:stdout
As I said earlier, the Done. print after changing the names never appears in the logs. I've seen plenty of people getting the same error I'm seeing, and I've tried a fair few of their suggestions with no success. Any help you can provide would be much appreciated. Let me know if you need any more information. Thanks
Edit
Prabhakar's comment reminded me that I have tried the memory worker type in AWS Glue and it still failed. As stated above, I have tried raising the memoryOverhead from 5 to 12, but to no avail. Neither of these made the job complete successfully.
Update
I put in the following code for the column name change, instead of the code above, for easier debugging:
print('Changing names...')
name_counter = 0
for old_name in data_frame.schema.names:
print('Name number {}. name being changed: {}'.format(name_counter, old_name))
data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.','_').lower())
name_counter += 1
print('Done.')
And I got the following output
Removing . and _ from names for root frame...
Repartitioning data frame...
Done.
Changing names...
End of LogType:stdout
So it must be a problem with the data_frame.schema.names part. Could it be this line with my loop through all of the DynamicFrames? Am I looping through the DynamicFrames from the relationalize transformation correctly?
Update 2
Glue recently added more verbose logs and I found this
ERROR YarnClusterScheduler: Lost executor 396 on ip-172-32-78-221.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
This happens for more than just this executor too; it looks like almost all of them.
I can try to increase the executor memory overhead, but I would like to know why getting the column names results in an OOM error. I wouldn't think that something that trivial would take up that much memory?
Update
I attempted to run the job with both spark.driver.memoryOverhead=7g and spark.yarn.executor.memoryOverhead=7g and I again got an OOM error

Audit daemon does not take rules from audit.rules

I am unable to add rules to audit daemon using /etc/audit/audit.rules
Every time I add the rules using auditctl, they get removed on reboot or when the audit daemon restarts. I have attached /etc/audit/audit.rules and /etc/audit/auditd.conf below.
$ cat /etc/audit/auditd.conf
#
# This file controls the configuration of the audit daemon
#
local_events = yes
write_logs = yes
log_file = /NU_Application/audit.log
log_group = root
log_format = RAW
flush = INCREMENTAL_ASYNC
freq = 50
max_log_file = 8
num_logs = 5
priority_boost = 4
disp_qos = lossy
dispatcher = /sbin/audispd
name_format = NONE
##name = mydomain
max_log_file_action = ROTATE
space_left = 75
space_left_action = SYSLOG
verify_email = yes
action_mail_acct = root
admin_space_left = 50
admin_space_left_action = SUSPEND
disk_full_action = SUSPEND
disk_error_action = SUSPEND
use_libwrap = yes
##tcp_listen_port = 22
tcp_listen_queue = 5
tcp_max_per_addr = 1
##tcp_client_ports = 1024-65535
tcp_client_max_idle = 0
enable_krb5 = no
krb5_principal = auditd
##krb5_key_file = /etc/audit/audit.key
distribute_network = no
$ cat /etc/audit/audit.rules
## First rule - delete all
## Increase the buffers to survive stress events.
## Make this bigger for busy systems
-b 8192
## This determine how long to wait in burst of events
--backlog_wait_time 0
## Set failure mode to syslog
-f 1
-w /var/log/lastlog -p wa
root@iWave-G22M:~# auditctl
When I restart the audit daemon (i.e. /etc/init.d/auditd restart) and try to list the rules, I get the message "No rules":
$ /etc/init.d/auditd restart
Restarting audit daemon auditd
type=1305 audit(1558188111.980:3): audit_pid=0 old=1148 auid=4294967295 ses=4294967295
res=1
type=1305 audit(1558188112.010:4): audit_enabled=1 old=1 auid=4294967295 ses=4294967295
res=1
type=1305 audit(1558188112.020:5): audit_pid=30342 old=0 auid=4294967295 ses=4294967295
res=1
1
$ auditctl -l
No rules
OS INFO
$ uname -a
Linux iWave-G22M 3.10.31-ltsi-svn743 #5 SMP PREEMPT Mon May 27 18:28:01 IST 2019 armv7l GNU/Linux
The audit_2.8.4.bb recipe was used to install the auditd daemon via Yocto.
Path of audit_2.8.4.bb: http://git.yoctoproject.org/cgit/cgit.cgi/meta-selinux/tree/recipes-security/audit/audit_2.8.4.bb?h=master
Audit rules added via /etc/audit/audit.rules or the auditctl command are not permanent. To make them permanent across reboots, you have to add them to the /etc/audit/rules.d/audit.rules file.
After adding the rule, restart the auditd service and run auditctl -l; it will list all the rules, and they will also be reflected in /etc/audit/audit.rules.
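A minimal sketch, reusing the watch rule from the question (the rules.d path is the stock auditd layout; adjust as needed):
echo '-w /var/log/lastlog -p wa' >> /etc/audit/rules.d/audit.rules
/etc/init.d/auditd restart
auditctl -l
On systems where auditd loads rules through augenrules (the default for audit 2.8.x), /etc/audit/audit.rules is regenerated from the files under /etc/audit/rules.d/ at service start, which is why edits made directly to /etc/audit/audit.rules or via auditctl alone do not survive a restart.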

snmpget “No such object available on this agent at this OID”

I have been working on trying to get a value from a custom MIB through SNMP, following this tutorial. I'm using net-snmp 5.7.3. I have used this MIB along with the corresponding code.
My setup: I have two VMs, both Ubuntu 16; one is snmp-server with IP 192.168.5.20 and the other snmp-agent with IP 192.168.5.21.
I set MIBS=ALL and exported MIBS. On the agent, I have successfully compiled my MIB and got the expected output. My code files are in the path /home/snmp-agent/Desktop/proj/net-snmp-5.7.3/agent/mibgroup/. I reconfigured net-snmp like this:
./configure --with-mib-modules="pingCtlTable pingResultsTable pingProbeHistoryTable"
make
make install
Everything worked fine. And when I do snmptranslate on the agent I get:
root@snmpagent:~# snmptranslate -On DISMAN-PING-MIB::pingResultsTable
.1.3.6.1.2.1.80.1.3
root@snmpagent:~# snmptranslate -On DISMAN-PING-MIB::pingCtlTable
.1.3.6.1.2.1.80.1.2
root@snmpagent:~# snmptranslate -On DISMAN-PING-MIB::pingProbeHistoryTable
.1.3.6.1.2.1.80.1.4
Output on the server, when I do this:
snmpget -v2c -c public 192.168.5.21 sysDescr.0
SNMPv2-MIB::sysDescr.0 = STRING: Linux snmp-agent 4.10.0-33-generic #37~16.04.1-Ubuntu SMP Fri Aug 11 14:07:24 UTC 2017 x86_64
But when I run snmpget from the server, I get this error for all three:
snmpget -v2c -c public 192.168.5.21 DISMAN-PING-MIB::pingCtlTable.0
DISMAN-PING-MIB::pingCtlTable = No Such Object available on this agent at this OID
Output when I specify the OIDs directly:
snmpget -v2c -c public 192.168.5.21 .1.3.6.1.2.1.80.1.2
DISMAN-PING-MIB::pingCtlTable = No Such Object available on this agent at this OID
NOTE: I have already seen these posts with no luck for me.
add new mib master agent
snmpget-no-such-object... and snmpget-returns-no-such-object...
UPDATE 2:
When I do snmpwalk from the server, I get:
root@snmp-server:~# snmpwalk -v 2c -c ncs -m DISMAN-PING-MIB 192.168.5.22 .1.3.6.1.2.1.80
DISMAN-PING-MIB::pingObjects.0 = INTEGER: 1
DISMAN-PING-MIB::pingFullCompliance.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 = STRING: "/bin/echo"
DISMAN-PING-MIB::pingMinimumCompliance.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 = ""
DISMAN-PING-MIB::pingCompliances.4.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 = ""
DISMAN-PING-MIB::pingCompliances.5.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 = INTEGER: 5
DISMAN-PING-MIB::pingCompliances.6.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 = INTEGER: 1
DISMAN-PING-MIB::pingCompliances.7.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 = INTEGER: 1
DISMAN-PING-MIB::pingCompliances.20.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 = INTEGER: 4
DISMAN-PING-MIB::pingCompliances.21.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 = INTEGER: 1
DISMAN-PING-MIB::pingIcmpEcho.1.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 = ""
DISMAN-PING-MIB::pingIcmpEcho.2.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 = ""
DISMAN-PING-MIB::pingIcmpEcho.3.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 = INTEGER: 1
DISMAN-PING-MIB::pingIcmpEcho.4.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 = INTEGER: 0
DISMAN-PING-MIB::pingMIB.4.1.2.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48.1 = ""
So accordingly when I follow the output of snmpwalk, I get:
root@snmp-server:~# snmpget 192.168.5.22 DISMAN-PING-MIB::pingFullCompliance.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48
DISMAN-PING-MIB::pingFullCompliance.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 = Wrong Type (should be INTEGER): STRING: "/bin/echo"
I don't understand why I have such a long OID? What is pingFullCompliance.15.46.49.46.51.46.54.46.49.46.50.46.49.46.56.48 ? Where am I going wrong? Can anyone please point me in the right direction... Thank you.

SNMP on Linux (Centos) not showing load average

I installed SNMP on my Linux server in order to monitor load average, disk space, etc.
yum install net-snmp net-snmp-utils
vim /etc/snmp/snmpd.conf # added "rocommunity public"
/etc/init.d/snmpd start
Walking the SNMP tree returns very few entries - no load average, disk space, etc.:
snmpwalk -v 1 -c public localhost
SNMPv2-MIB::sysDescr.0 = STRING: Linux XXX 2.6.35.4-rscloud #8 SMP Mon Sep 20 15:54:33 UTC 2010 x86_64
SNMPv2-MIB::sysObjectID.0 = OID: NET-SNMP-MIB::netSnmpAgentOIDs.10
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (232884) 0:38:48.84
SNMPv2-MIB::sysContact.0 = STRING: Root <root@localhost> (configure /etc/snmp/snmp.local.conf)
SNMPv2-MIB::sysName.0 = STRING: XXX
SNMPv2-MIB::sysLocation.0 = STRING: Unknown (edit /etc/snmp/snmpd.conf)
SNMPv2-MIB::sysORLastChange.0 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORID.1 = OID: SNMPv2-MIB::snmpMIB
SNMPv2-MIB::sysORID.2 = OID: TCP-MIB::tcpMIB
SNMPv2-MIB::sysORID.3 = OID: IP-MIB::ip
SNMPv2-MIB::sysORID.4 = OID: UDP-MIB::udpMIB
SNMPv2-MIB::sysORID.5 = OID: SNMP-VIEW-BASED-ACM-MIB::vacmBasicGroup
SNMPv2-MIB::sysORID.6 = OID: SNMP-FRAMEWORK-MIB::snmpFrameworkMIBCompliance
SNMPv2-MIB::sysORID.7 = OID: SNMP-MPD-MIB::snmpMPDCompliance
SNMPv2-MIB::sysORID.8 = OID: SNMP-USER-BASED-SM-MIB::usmMIBCompliance
SNMPv2-MIB::sysORDescr.1 = STRING: The MIB module for SNMPv2 entities
SNMPv2-MIB::sysORDescr.2 = STRING: The MIB module for managing TCP implementations
SNMPv2-MIB::sysORDescr.3 = STRING: The MIB module for managing IP and ICMP implementations
SNMPv2-MIB::sysORDescr.4 = STRING: The MIB module for managing UDP implementations
SNMPv2-MIB::sysORDescr.5 = STRING: View-based Access Control Model for SNMP.
SNMPv2-MIB::sysORDescr.6 = STRING: The SNMP Management Architecture MIB.
SNMPv2-MIB::sysORDescr.7 = STRING: The MIB for Message Processing and Dispatching.
SNMPv2-MIB::sysORDescr.8 = STRING: The management information definitions for the SNMP User-based Security Model.
SNMPv2-MIB::sysORUpTime.1 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.2 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.3 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.4 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.5 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.6 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.7 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORUpTime.8 = Timeticks: (0) 0:00:00.00
Found the problem:
I needed to reset the .conf file completely:
echo "rocommunity public" > /etc/snmp/snmp.conf
and I needed to specify the exact branch to walk:
snmpwalk -v 1 -c public localhost .1.3.6.1.4.1.2021.10
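The same branch can also be walked by name, assuming the standard UCD-SNMP-MIB shipped with net-snmp is installed (.1.3.6.1.4.1.2021.10 is UCD-SNMP-MIB::laTable):
snmpwalk -v 1 -c public localhost UCD-SNMP-MIB::laTable
snmpwalk -v 1 -c public localhost .1.3.6.1.4.1.2021.9
The second OID (dskTable) only returns data if "disk" directives are configured in snmpd.conf.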
For others like me who might search for this:
It is due to the rocommunity being restricted to basic access in the default snmpd config by the line:
rocommunity public default -V systemonly
To give full access to localhost, just uncomment the line below it:
rocommunity public localhost
Or you can give all hosts full access, like the OP did, via:
rocommunity public
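Putting the alternatives side by side, the relevant snmpd.conf lines look roughly like this; this is a sketch of the stock file rather than the OP's exact configuration, and only one of the three should be active at a time:
rocommunity public default -V systemonly    # shipped default: restricted to the system subtree
rocommunity public localhost                # full read access, but only from localhost
rocommunity public                          # full read access from any host (what the OP used)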