I am trying to run a simple Spring Batch task on Spring Cloud Data Flow for YARN. Unfortunately, when I run it I get the following error message in the ResourceManager UI:
Application application_1473838120587_5156 failed 1 times due to AM Container for appattempt_1473838120587_5156_000001 exited with exitCode: 1
For more detailed output, check application tracking page:http://ip-10-249-9-50.gc.stepstone.com:8088/cluster/app/application_1473838120587_5156Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1473838120587_5156_01_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
More information from Appmaster.stderr:
Log Type: Appmaster.stderr
Log Upload Time: Mon Nov 07 12:59:57 +0000 2016
Log Length: 106
Error: Unable to access jarfile spring-cloud-deployer-yarn-tasklauncherappmaster-1.0.0.BUILD-SNAPSHOT.jar
As for Spring Cloud Data Flow, this is what I run in the dataflow-shell:
app register --type task --name simple_batch_job --uri https://github.com/spring-cloud/spring-cloud-dataflow-samples/raw/master/tasks/simple-batch-job/batch-job-1.0.0.BUILD-SNAPSHOT.jar
task create foo --definition "simple_batch_job"
task launch foo
It's really hard to tell why this error occurs. I'm sure the connection from the dataflow-server to YARN works, because some files (servers.yml, jars with jobs and utilities) were copied to the standard HDFS location (/dataflow), yet they seem to be inaccessible in some way.
My servers.yml config:
logging:
  level:
    org.apache.hadoop: DEBUG
    org.springframework.yarn: DEBUG
maven:
  remoteRepositories:
    springRepo:
      url: https://repo.spring.io/libs-snapshot
spring:
  main:
    show_banner: false
  hadoop:
    fsUri: hdfs://HOST:8020
    resourceManagerHost: HOST
    resourceManagerPort: 8032
    resourceManagerSchedulerAddress: HOST:8030
  datasource:
    url: jdbc:h2:tcp://localhost:19092/mem:dataflow
    username: sa
    password:
    driverClassName: org.h2.Driver
I'd be glad to hear any information or spring-yarn tips & tricks to make this work.
PS: As the Hadoop environment I use Amazon EMR 5.0.
EDIT: Recursive listing of the HDFS path:
drwxrwxrwx - user hadoop 0 2016-11-07 15:02 /dataflow/apps
drwxrwxrwx - user hadoop 0 2016-11-07 15:02 /dataflow/apps/stream
drwxrwxrwx - user hadoop 0 2016-11-07 15:04 /dataflow/apps/stream/app
-rwxrwxrwx 3 user hadoop 121 2016-11-07 15:05 /dataflow/apps/stream/app/application.properties
-rwxrwxrwx 3 user hadoop 1177 2016-11-07 15:04 /dataflow/apps/stream/app/servers.yml
-rwxrwxrwx 3 user hadoop 60202852 2016-11-07 15:04 /dataflow/apps/stream/app/spring-cloud-deployer-yarn-appdeployerappmaster-1.0.0.RELEASE.jar
drwxrwxrwx - user hadoop 0 2016-11-04 14:22 /dataflow/apps/task
drwxrwxrwx - user hadoop 0 2016-11-04 14:24 /dataflow/apps/task/app
-rwxrwxrwx 3 user hadoop 121 2016-11-04 14:25 /dataflow/apps/task/app/application.properties
-rwxrwxrwx 3 user hadoop 2101 2016-11-04 14:24 /dataflow/apps/task/app/servers.yml
-rwxrwxrwx 3 user hadoop 60198804 2016-11-04 14:24 /dataflow/apps/task/app/spring-cloud-deployer-yarn-tasklauncherappmaster-1.0.0.RELEASE.jar
drwxrwxrwx - user hadoop 0 2016-11-04 14:25 /dataflow/artifacts
drwxrwxrwx - user hadoop 0 2016-11-07 15:06 /dataflow/artifacts/cache
-rwxrwxrwx 3 user hadoop 12323493 2016-11-04 14:25 /dataflow/artifacts/cache/https-c84ea9dc0103a4754aeb9a28bbc7a4f33c835854-batch-job-1.0.0.BUILD-SNAPSHOT.jar
-rwxrwxrwx 3 user hadoop 22139318 2016-11-07 15:07 /dataflow/artifacts/cache/log-sink-rabbit-1.0.0.BUILD-SNAPSHOT.jar
-rwxrwxrwx 3 user hadoop 12590921 2016-11-07 12:59 /dataflow/artifacts/cache/timestamp-task-1.0.0.BUILD-SNAPSHOT.jar
There's clearly a mix of wrong versions, as HDFS has spring-cloud-deployer-yarn-tasklauncherappmaster-1.0.0.RELEASE.jar while the error complains about spring-cloud-deployer-yarn-tasklauncherappmaster-1.0.0.BUILD-SNAPSHOT.jar.
I'm not sure how you got snapshots unless you built the distribution manually?
I'd recommend picking 1.0.2 from http://cloud.spring.io/spring-cloud-dataflow-server-yarn. See "Download and Extract Distribution" in the reference docs. Also delete the old /dataflow directory from HDFS.
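A minimal sketch of that cleanup, assuming the default /dataflow base directory and the hdfs CLI on the path:

hdfs dfs -rm -r /dataflow    # removes the stale app master jars and cached artifacts

After that, start the Data Flow YARN server from the 1.0.2 distribution so it copies a matching set of jars back to HDFS.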
Related
We have installed PostgreSQL v12 on Ubuntu 20.04 (with apt install -y postgresql postgresql-contrib) and wish to enable archiving to /data/db/postgres/archive by setting the following in postgresql.conf:
max_wal_senders=2
wal_keep_segments=256
wal_sender_timeout=60s
archive_mode=on
archive_command=cp %p /data/db/postgres/archive/%f
However the postgres service fails to write there:
2022-11-15 15:02:26.212 CET [392860] FATAL: archive command failed with exit code 126
2022-11-15 15:02:26.212 CET [392860] DETAIL: The failed archive command was: archive_command=cp pg_wal/000000010000000000000002 /data/db/postgres/archive/000000010000000000000002
2022-11-15 15:02:26.213 CET [392605] LOG: archiver process (PID 392860) exited with exit code 1
sh: 1: pg_wal/000000010000000000000002: Permission denied
This directory /data/db/postgres/archive/ is owned by the postgres user and when we su postgres we are able to create and delete files without a problem.
Why can the postgresql service (running as postgres) not write to a directory it owns?
Here are the permissions on all the parents of the archive directory:
drwxr-xr-x 2 postgres root 6 Nov 15 14:59 /data/db/postgres/archive
drwxr-xr-x 3 root root 21 Nov 15 14:29 /data/db/postgres
drwxr-xr-x 3 root root 22 Nov 15 14:29 /data/db
drwxr-xr-x 5 root root 44 Nov 15 14:29 /data
2022-11-15 15:02:26.212 CET [392860] DETAIL: The failed archive command was: archive_command=cp pg_wal/000000010000000000000002 /data/db/postgres/archive/000000010000000000000002
So, your archive_command is apparently set to the peculiar string archive_command=cp %p /data/db/postgres/archive/%f.
After the %-variables are substituted, the result is passed to the shell. The shell does what it was told: it sets the (unused) environment variable 'archive_command' to 'cp', and then tries to execute the file pg_wal/000000010000000000000002, which fails because that file doesn't have the execute bit set.
I don't know how you managed to get such a deformed archive_command, but it didn't come from anything you showed us.
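For comparison, a sketch of what the setting in postgresql.conf was presumably meant to look like (same paths as above; the value is quoted and does not repeat the archive_command= prefix):

archive_command = 'cp %p /data/db/postgres/archive/%f'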
I am trying to configure Fluentd to send logs to Elasticsearch. After configuring it, I could not see any pod logs in Elasticsearch.
While debugging what is happening, I have seen that there are no logs on the node under /var/log/pods:
cd /var/log/pods
ls -la
drwxr-xr-x. 34 root root 8192 Dec 9 12:26 .
drwxr-xr-x. 14 root root 4096 Dec 9 02:21 ..
drwxr-xr-x. 3 root root 21 Dec 7 03:14 pod1
drwxr-xr-x. 6 root root 119 Dec 7 11:17 pod2
cd pod1/containerName
ls -la
total 0
drwxr-xr-x. 2 root root 6 Dec 7 03:14 .
drwxr-xr-x. 3 root root 21 Dec 7 03:14 ..
But I can see the logs when executing kubectl logs pod1
According to the documentation, logs should be in this path. Do you have any idea why no logs are stored on the node?
I have found what was happening. The problem was related to the logging driver: it was configured to send the logs to journald:
docker inspect -f '{{.HostConfig.LogConfig.Type}}' ID
journald
I have changed it to json-file. Now it writes logs to /var/log/pods.
Here are the different logging configuration options.
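For reference, a minimal sketch of switching the Docker daemon's default logging driver via /etc/docker/daemon.json (the options shown are standard Docker settings, not taken from the original setup):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

After editing the file, restart the Docker daemon and recreate the containers, since the default driver only applies to newly created containers.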
I am trying to run Kafka with a mounted NFS volume, but I am facing an exception and cannot start Kafka:
[2020-03-15 09:36:11,580] ERROR There was an error in one of the threads during logs loading: org.apache.kafka.common.KafkaException: Found directory /var/lib/kafka/data/.snapshot, '.snapshot' is not in the form of topic-partition or topic-partition.uniqueId-delete (if marked for deletion).
Kafka's log directories (and children) should only contain Kafka topic data. (kafka.log.LogManager)
[2020-03-15 09:36:11,582] ERROR [KafkaServer id=1] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.kafka.common.KafkaException: Found directory /var/lib/kafka/data/.snapshot, '.snapshot' is not in the form of topic-partition or topic-partition.uniqueId-delete (if marked for deletion).
Kafka's log directories (and children) should only contain Kafka topic data.
at kafka.log.Log$.exception$1(Log.scala:2150)
at kafka.log.Log$.parseTopicPartitionName(Log.scala:2157)
at kafka.log.LogManager.kafka$log$LogManager$$loadLog(LogManager.scala:260)
at kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$11$$anonfun$apply$15$$anonfun$apply$2.apply$mcV$sp(LogManager.scala:345)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This is my docker-compose script:
zookeeper:
  image: confluentinc/cp-zookeeper:5.3.2
  environment:
    ZOOKEEPER_CLIENT_PORT: 2181
  volumes:
    - zk-data:/var/lib/zookeeper/data:nocopy
    - zk-log:/var/lib/zookeeper/log:nocopy
kafka:
  image: confluentinc/cp-kafka:5.3.2
  environment:
    KAFKA_ADVERTISED_HOST_NAME: kafka
    KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT
    KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
  volumes:
    - kf-data:/var/lib/kafka/data:nocopy
volumes:
  zk-data:
    driver: local
    driver_opts:
      type: "nfs"
      o: addr=18.0.3.227   # IP of NFS
      device: ":/opt/data/zk-data"
  zk-log:
    driver: local
    driver_opts:
      type: "nfs"
      o: addr=18.0.3.227
      device: ":/opt/data/zk-log"
  kf-data:
    driver: local
    driver_opts:
      type: "nfs"
      o: addr=18.0.3.227
      device: ":/opt/data/kf-data"
If I go to my NFS server,
ls -la /opt/data/kf-data/.snapshot
total 80
drwxrwxrwx 33 root root 12288 Mar 28 00:10 .
drwx------ 2 root domain^users 4096 Feb 21 19:20 ..
drwx------ 2 root domain^users 4096 Feb 13 11:06 daily.2020-02-14_0010
drwx------ 2 root domain^users 4096 Feb 13 11:06 daily.2020-02-15_0010
drwx------ 2 root domain^users 4096 Feb 13 11:06 daily.2020-02-16_0010
drwx------ 2 root domain^users 4096 Feb 13 11:06 daily.2020-02-17_0010
drwx------ 2 root domain^users 4096 Feb 21 19:20 snapmirror.ka938443-8ea1-22e8-6608-00a067d1a20a_2148891236.2020-02-27_180700
There is a hidden folder named .snapshot; this folder is generated by NFS automatically and cannot be removed. This is why Kafka complains: Found directory /var/lib/kafka/data/.snapshot, '.snapshot' is not in the form of topic-partition or topic-partition.uniqueId-delete (if marked for deletion).
This could be a general Kafka problem: is there any special configuration or solution that lets Kafka use an external NFS volume?
Any ideas will be appreciated!
As advised above, Kafka on NFS is a flawed solution due to the way the NFS file system works. You will run into issues with repartitioning and expansion. This has to do with the way NFS handles deletion of open files (the "silly rename" behaviour). You can read about it in this blog post (Kafka on NFS).
There is a hidden folder named .snapshot, this folder is generated by NFS automatically and can not be removed
Well, if there is no way around that, then Kafka will not be able to start, period.
To my knowledge, nowhere in the docs does it say remote attached storage is supported.
If you are using NetApp as the NFS platform, this info could help: disabling .snapshot access in NetApp is a global vFiler function, not a per-folder or per-share function.
If you cannot turn off access to .snapshot, there is no solution, unless you use another NFS platform that does not generate a .snapshot folder in every directory.
Have you tried not using the root of the volume? The .snapshot directory is only accessible at the root.
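A sketch of that idea against the compose file above, assuming a kafka-logs subdirectory has already been created on the NFS export:

kf-data:
  driver: local
  driver_opts:
    type: "nfs"
    o: addr=18.0.3.227
    device: ":/opt/data/kf-data/kafka-logs"   # mount a subdirectory instead of the export root where .snapshot appears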
Zookeeper's rapidly pooping its internal binary files all over our production environment.
According to: http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html
and
http://dougchang333.blogspot.com/2013/02/zookeeper-cleaning-logs-snapshots.html
this is expected behavior and you must call org.apache.zookeeper.server.PurgeTxnLog
regularly to rotate its poop.
So:
% ls -l1rt /tmp/zookeeper/version-2/
total 314432
-rw-r--r-- 1 root root 67108880 Jun 26 18:00 log.1
-rw-r--r-- 1 root root 947092 Jun 26 18:00 snapshot.e99b
-rw-r--r-- 1 root root 67108880 Jun 27 05:00 log.e99d
-rw-r--r-- 1 root root 1620918 Jun 27 05:00 snapshot.1e266
... many more
% sudo java -cp zookeeper-3.4.6.jar::lib/jline-0.9.94.jar:lib/log4j-1.2.16.jar:lib/netty-3.7.0.Final.jar:lib/slf4j-api-1.6.1.jar:lib/slf4j-log4j12-1.6.1.jar:conf \
org.apache.zookeeper.server.PurgeTxnLog \
/tmp/zookeeper/version-2 /tmp/zookeeper/version-2 -n 3
but I get:
% ls -l1rt /tmp/zookeeper/version-2/
... all the existing logs plus a new directory
/tmp/zookeeper/version-2/version-2
Am I doing something wrong?
(the command above was run from the zookeeper-3.4.6/ directory)
ZooKeeper now has an Autopurge feature as of 3.4.0. Take a look at https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html
It says you can use autopurge.snapRetainCount and autopurge.purgeInterval
autopurge.snapRetainCount
New in 3.4.0: When enabled, ZooKeeper auto purge feature retains the autopurge.snapRetainCount most recent snapshots and the corresponding transaction logs in the dataDir and dataLogDir respectively and deletes the rest. Defaults to 3. Minimum value is 3.
autopurge.purgeInterval
New in 3.4.0: The time interval in hours for which the purge task has to be triggered. Set to a positive integer (1 and above) to enable the auto purging. Defaults to 0.
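A minimal zoo.cfg sketch enabling this (the values are the documented default and an example, not taken from the question):

# keep the 3 most recent snapshots and their transaction logs
autopurge.snapRetainCount=3
# run the purge task every 24 hours (0 disables auto purging)
autopurge.purgeInterval=24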
Since I'm not hearing a fix via Zookeeper, this was an easy workaround:
COUNT=6
DATADIR=/tmp/zookeeper/version-2/
ls -1drt ${DATADIR}/* | head --lines=-${COUNT} | xargs sudo rm -f
Run it once a day from a cron job or Jenkins to prevent ZooKeeper from exploding.
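A possible crontab sketch for that, assuming it goes into root's crontab (so sudo is unnecessary) and uses the same directory and count as above:

# purge all but the 6 newest ZooKeeper snapshot/log files, daily at 03:00
0 3 * * * ls -1drt /tmp/zookeeper/version-2/* | head --lines=-6 | xargs rm -f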
You need to pass the dataDir and snapDir parameters with the value that is configured as dataDir in your ZooKeeper .properties file.
If your configuration looks like the following:
dataDir=/data/zookeeper
You need to call PurgeTxnLog (version 3.5.9) like the following if you want to keep the last 10 logs/snapshots:
java -cp zookeeper.jar:lib/slf4j-api-1.7.5.jar:lib/slf4j-log4j12-1.7.5.jar:lib/log4j-1.2.17.jar:conf org.apache.zookeeper.server.PurgeTxnLog /data/zookeeper /data/zookeeper -n 10
I'm using ThoughtWorks Go for a build pipeline as shown below:
The "Test" stage fetches artefacts from the build stage and runs each of its jobs in parallel (unit tests, integration tests, acceptance tests, package) on different agents. However, each of those jobs is a shell script.
When those tasks run on a different agent they fail because permission is denied. Each job is a shell script, and when I SSH into the agent I can see the scripts do not have executable permissions, as shown below:
drwxrwxr-x 2 go go 4096 Mar 4 09:48 .
drwxrwxr-x 8 go go 4096 Mar 4 09:48 ..
-rw-rw-r-- 1 go go 408 Mar 4 09:48 aa_tests.sh
-rw-rw-r-- 1 go go 443 Mar 4 09:48 Dockerfile
-rw-rw-r-- 1 go go 121 Mar 4 09:48 run.sh
However, in the git repository they have executable permission and they seem to execute fine on the build agent that clones the git repository.
I solved the problem by executing the script with bash, e.g. "bash scriptname.sh" as the command for the task.
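Another workaround sketch, assuming the lost execute bit is the only problem: restore it in the task itself before invoking the script (the file name is taken from the listing above):

chmod +x aa_tests.sh   # re-add the execute bit dropped when the artefact was fetched
./aa_tests.sh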