Kafka doesn't work with external NFS volume

I am trying to run Kafka with a mounted NFS volume, but I am facing an exception and cannot start Kafka:
[2020-03-15 09:36:11,580] ERROR There was an error in one of the threads during logs loading: org.apache.kafka.common.KafkaException: Found directory /var/lib/kafka/data/.snapshot, '.snapshot' is not in the form of topic-partition or topic-partition.uniqueId-delete (if marked for deletion).
Kafka's log directories (and children) should only contain Kafka topic data. (kafka.log.LogManager)
[2020-03-15 09:36:11,582] ERROR [KafkaServer id=1] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.kafka.common.KafkaException: Found directory /var/lib/kafka/data/.snapshot, '.snapshot' is not in the form of topic-partition or topic-partition.uniqueId-delete (if marked for deletion).
Kafka's log directories (and children) should only contain Kafka topic data.
at kafka.log.Log$.exception$1(Log.scala:2150)
at kafka.log.Log$.parseTopicPartitionName(Log.scala:2157)
at kafka.log.LogManager.kafka$log$LogManager$$loadLog(LogManager.scala:260)
at kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$11$$anonfun$apply$15$$anonfun$apply$2.apply$mcV$sp(LogManager.scala:345)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This is my docker-compose file:
zookeeper:
  image: confluentinc/cp-zookeeper:5.3.2
  environment:
    ZOOKEEPER_CLIENT_PORT: 2181
  volumes:
    - zk-data:/var/lib/zookeeper/data:nocopy
    - zk-log:/var/lib/zookeeper/log:nocopy
kafka:
  image: confluentinc/cp-kafka:5.3.2
  environment:
    KAFKA_ADVERTISED_HOST_NAME: kafka
    KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT
    KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
  volumes:
    - kf-data:/var/lib/kafka/data:nocopy
volumes:
  zk-data:
    driver: local
    driver_opts:
      type: "nfs"
      o: addr=18.0.3.227  # IP of NFS
      device: ":/opt/data/zk-data"
  zk-log:
    driver: local
    driver_opts:
      type: "nfs"
      o: addr=18.0.3.227
      device: ":/opt/data/zk-log"
  kf-data:
    driver: local
    driver_opts:
      type: "nfs"
      o: addr=18.0.3.227
      device: ":/opt/data/kf-data"
If I go to my NFS server and list the directory:
ls -la /opt/data/kf-data/.snapshot
total 80
drwxrwxrwx 33 root root 12288 Mar 28 00:10 .
drwx------ 2 root domain^users 4096 Feb 21 19:20 ..
drwx------ 2 root domain^users 4096 Feb 13 11:06 daily.2020-02-14_0010
drwx------ 2 root domain^users 4096 Feb 13 11:06 daily.2020-02-15_0010
drwx------ 2 root domain^users 4096 Feb 13 11:06 daily.2020-02-16_0010
drwx------ 2 root domain^users 4096 Feb 13 11:06 daily.2020-02-17_0010
drwx------ 2 root domain^users 4096 Feb 21 19:20 snapmirror.ka938443-8ea1-22e8-6608-00a067d1a20a_2148891236.2020-02-27_180700
There is a hidden folder named .snapshot; this folder is generated automatically by the NFS server and cannot be removed. This is the reason why Kafka complains: Found directory /var/lib/kafka/data/.snapshot, '.snapshot' is not in the form of topic-partition or topic-partition.uniqueId-delete (if marked for deletion).
Could this be a general Kafka problem? Is there any special configuration or solution that lets Kafka use an external NFS volume?
Any ideas would be greatly appreciated!

As advised above, Kafka on NFS is a flawed solution because of the way the NFS file system works. You will run into issues with repartitioning and expansion. This is due to the way NFS handles deletion of open files (the "silly rename" behaviour). You can read about it in this blog post (Kafka on NFS).

There is a hidden folder named .snapshot; this folder is generated by NFS automatically and cannot be removed
Well, if there is no way around that, then Kafka will not be able to start, period.
To my knowledge, nowhere in the docs does it say that remote-attached storage is supported.

If you are using NetApp as the NFS platform, this info could help: disabling
.snapshot access in NetApp is a global vFiler function, not something that can be set per folder or share.
If you cannot turn off access to .snapshot, there is no solution, unless you use another NFS platform that does not generate a .snapshot folder in every directory.

Have you tried not using the root of the volume? The .snapshot directory is only accessible at the root of the export.
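If that is an option for you, a hedged sketch of what the volume definition could look like, pointing the NFS device at a subdirectory of the export rather than its root (the kafka-logs subdirectory is an assumption and would have to exist on the NFS server):
volumes:
  kf-data:
    driver: local
    driver_opts:
      type: "nfs"
      o: addr=18.0.3.227
      device: ":/opt/data/kf-data/kafka-logs"  # a subdirectory, so the Kafka log dir is not the export root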

Related

kafka + which files should be created under kafka-logs

Usually, after a Kafka cluster scratch installation, I see these files under /data/kafka-logs (the Kafka broker log directory, where all topics should be located):
ls -ltr
-rw-r--r-- 1 kafka hadoop 0 Jan 9 10:07 cleaner-offset-checkpoint
-rw-r--r-- 1 kafka hadoop 57 Jan 9 10:07 meta.properties
drwxr-xr-x 2 kafka hadoop 4096 Jan 9 10:51 _schemas-0
-rw-r--r-- 1 kafka hadoop 17 Jan 10 07:39 recovery-point-offset-checkpoint
-rw-r--r-- 1 kafka hadoop 17 Jan 10 07:39 replication-offset-checkpoint
But on some other Kafka scratch installations we saw that the folder /data/kafka-logs is empty.
Does this indicate a problem?
Note: we have not created the topics yet.
I'm not sure when each checkpoint file is created (they track log-cleaner and replication offsets), but I assume meta.properties is created at broker startup.
Otherwise, you would see one folder per topic-partition; for example, it looks like you had one topic created, _schemas.
If you only see one partition folder across multiple brokers, then the replication factor for that topic is set to 1.
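As a quick check (a sketch; the broker address is an assumption, and on brokers older than 2.2 you would pass --zookeeper <host>:2181 instead of --bootstrap-server):
# Show partition count, replication factor and replica placement for the _schemas topic
bin/kafka-topics.sh --describe --topic _schemas --bootstrap-server localhost:9092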

Docker named volume targeting a Windows local folder

In a docker-compose file I want to create a named volume that targets a local drive for test purposes. For production we will use NFS.
I created the compose file as follows:
version: '3.3'
services:
  test:
    build: .
    volumes:
      - type: volume
        source: data_volume
        target: /data
    networks:
      - network
volumes:
  data_volume:
    driver: local
    driver_opts:
      o: bind
      type: none
      device: c:/data
networks:
  network:
    driver: overlay
    attachable: true
When I run docker-compose up, I get the following error:
for test_test_1 Cannot create container for service test: failed to mount local volume:
mount c:/data:/var/lib/docker/volumes/test_data_volume/_data, flags: 0x1000: no such file
or directory
Even with the error, it still creates the named volume. So when I inspect it:
{
    "CreatedAt": "2019-10-07T09:10:14Z",
    "Driver": "local",
    "Labels": {
        "com.docker.compose.project": "test",
        "com.docker.compose.version": "1.24.1",
        "com.docker.compose.volume": "data_volume"
    },
    "Mountpoint": "/var/lib/docker/volumes/test_data_volume/_data",
    "Name": "test_data_volume",
    "Options": {
        "device": "c:/data",
        "o": "bind",
        "type": "none"
    },
    "Scope": "local"
}
I'm still not sure why the Mountpoint is targeting that location.
I know I can achieve this without a named volume (which I already did), but later in the project we will definitely need named volumes.
Any suggestions on how to achieve this?
Same here. Using Docker Desktop for Windows, I tried to mount the local path E:\Project\MyWebsite\code to the named volume but failed. Here's how I sorted this out.
First, I changed the path to ".":
volumes:
  website:
    driver: local
    driver_opts:
      type: none
      device: "."
      o: bind
This time docker-compose up ran successfully, so I logged into the shell and checked what the mounted directory looks like:
bash-5.0# ls -l
total 62
lrwxrwxrwx 1 root root 11 Oct 1 15:15 E -> /host_mnt/e
drwxr-xr-x 2 root root 14336 Sep 11 15:27 bin
drwxr-xr-x 4 root root 2048 Apr 19 2017 dev
lrwxrwxrwx 1 root root 11 Oct 1 15:15 e -> /host_mnt/e
drwxr-xr-x 1 root root 180 Sep 30 11:53 etc
drwxr-xr-x 2 root root 2048 Sep 11 15:27 home
drw-r--r-- 4 root root 80 Oct 8 22:52 host_mnt
drwxr-xr-x 1 root root 60 Sep 30 11:53 lib
drwxr-xr-x 5 root root 2048 Sep 11 15:27 media
...
drwxrwxrwt 1 root root 40 Oct 11 19:37 tmp
drwxr-xr-x 1 root root 80 Sep 11 15:27 usr
drwxr-xr-x 13 root root 2048 Sep 11 15:27 var
Obviously this is not a Windows volume; it is probably a Linux VM created by Docker. But the paths /host_mnt/e and /host_mnt/E seem indicative, so I tried changing the docker-compose definition to:
volumes:
  website:
    driver: local
    driver_opts:
      type: none
      device: "/host_mnt/e/Project/MyWebsite/code"
      o: bind
And it worked! It looks like named volumes don't resolve Windows paths the same way an ordinary bind mount does.
This /host_mnt/e probably won't exist unless you've granted Docker access to the drive letter beforehand. But this shouldn't be an issue for you, since you had already tried the ordinary way of mounting a local drive and it worked.
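For the production case mentioned in the question, the same named volume can be backed by NFS with the local driver; a hedged sketch (the server address and export path are placeholders, not values from the question):
volumes:
  website:
    driver: local
    driver_opts:
      type: "nfs"
      o: addr=10.0.0.10,rw          # NFS server address (placeholder)
      device: ":/exports/mywebsite"  # export path (placeholder)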

OpenShift Postgres persistent volume permissions

The Postgres image I am currently deploying with OpenShift is generally working great. However, I need to store the database data persistently (of course), and to do so I created a persistent volume claim and mounted it to the Postgres data directory like so:
- mountPath: /var/lib/pgsql/data/userdata
  name: db-storage-volume
and
- name: db-storage-volume
  persistentVolumeClaim:
    claimName: db-storage
The problem I am facing now is that the initdb script wants to change the permissions of that data folder, but it can't, and the directory is assigned to a very odd user/group, as the output of ls -la /var/lib/pgsql/data shows (including the failing command's output):
total 12
drwxrwxr-x. 3 postgres root 21 Aug 30 13:06 .
drwxrwx---. 3 postgres root 17 Apr 5 09:55 ..
drwxrwxrwx. 2 nobody nobody 12288 Jun 26 11:11 userdata
chmod: changing permissions of '/var/lib/pgsql/data/userdata': Permission denied
How can I handle this? The permissions are enough to read and write, but initdb (and the base image's initialization functions) really wants to change the permissions of that folder.
Just as I had sent my question, I had an idea, and it turns out it worked:
Change the mount to the parent folder /var/lib/pgsql/data/
Modify my entry script to run mkdir /var/lib/pgsql/data/userdata on its first run (i.e. when the folder does not exist yet); a sketch of this is shown after the listing below
Now it is:
total 16
drwxrwxrwx. 3 nobody nobody 12288 Aug 30 13:19 .
drwxrwx---. 3 postgres root 17 Apr 5 09:55 ..
drwxr-xr-x. 2 1001320000 nobody 4096 Aug 30 13:19 userdata
Which works. Notice that the folder itself is still owned by nobody:nobody and is 777, but the created userdata folder is owned by the correct user.
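For reference, a minimal sketch of the kind of entrypoint wrapper step 2 describes (the script layout and the exec "$@" hand-off are assumptions, not the exact script used):
#!/bin/sh
# Entrypoint wrapper: make sure the userdata subdirectory exists inside the
# mounted parent volume before handing control to the original entrypoint.
DATA_DIR=/var/lib/pgsql/data/userdata

if [ ! -d "$DATA_DIR" ]; then
    mkdir -p "$DATA_DIR"
fi

# Run whatever command the image was started with (placeholder hand-off pattern)
exec "$@"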

Spring Data Flow Yarn - unable to access jarfile

I am trying to run a simple Spring Batch task on Spring Cloud Data Flow for YARN. Unfortunately, while running it I get this error message in the ResourceManager UI:
Application application_1473838120587_5156 failed 1 times due to AM Container for appattempt_1473838120587_5156_000001 exited with exitCode: 1
For more detailed output, check application tracking page:http://ip-10-249-9-50.gc.stepstone.com:8088/cluster/app/application_1473838120587_5156Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1473838120587_5156_01_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
More information from Appmaster.stderr:
Log Type: Appmaster.stderr
Log Upload Time: Mon Nov 07 12:59:57 +0000 2016
Log Length: 106
Error: Unable to access jarfile spring-cloud-deployer-yarn-tasklauncherappmaster-1.0.0.BUILD-SNAPSHOT.jar
As for Spring Cloud Data Flow, this is what I'm running in dataflow-shell:
app register --type task --name simple_batch_job --uri https://github.com/spring-cloud/spring-cloud-dataflow-samples/raw/master/tasks/simple-batch-job/batch-job-1.0.0.BUILD-SNAPSHOT.jar
task create foo --definition "simple_batch_job"
task launch foo
It's really hard to tell why this error occurs. I'm sure the connection from dataflow-server to YARN works fine, because some files (servers.yml, jars with jobs and utilities) were copied to the standard HDFS location (/dataflow), yet they are inaccessible in some way.
My servers.yml config:
logging:
  level:
    org.apache.hadoop: DEBUG
    org.springframework.yarn: DEBUG
maven:
  remoteRepositories:
    springRepo:
      url: https://repo.spring.io/libs-snapshot
spring:
  main:
    show_banner: false
  hadoop:
    fsUri: hdfs://HOST:8020
    resourceManagerHost: HOST
    resourceManagerPort: 8032
    resourceManagerSchedulerAddress: HOST:8030
  datasource:
    url: jdbc:h2:tcp://localhost:19092/mem:dataflow
    username: sa
    password:
    driverClassName: org.h2.Driver
I'd be glad to hear any information or spring-yarn tips & tricks to make this work.
PS: As the Hadoop environment I use Amazon EMR 5.0.
EDIT: Recursive listing from HDFS:
drwxrwxrwx - user hadoop 0 2016-11-07 15:02 /dataflow/apps
drwxrwxrwx - user hadoop 0 2016-11-07 15:02 /dataflow/apps/stream
drwxrwxrwx - user hadoop 0 2016-11-07 15:04 /dataflow/apps/stream/app
-rwxrwxrwx 3 user hadoop 121 2016-11-07 15:05 /dataflow/apps/stream/app/application.properties
-rwxrwxrwx 3 user hadoop 1177 2016-11-07 15:04 /dataflow/apps/stream/app/servers.yml
-rwxrwxrwx 3 user hadoop 60202852 2016-11-07 15:04 /dataflow/apps/stream/app/spring-cloud-deployer-yarn-appdeployerappmaster-1.0.0.RELEASE.jar
drwxrwxrwx - user hadoop 0 2016-11-04 14:22 /dataflow/apps/task
drwxrwxrwx - user hadoop 0 2016-11-04 14:24 /dataflow/apps/task/app
-rwxrwxrwx 3 user hadoop 121 2016-11-04 14:25 /dataflow/apps/task/app/application.properties
-rwxrwxrwx 3 user hadoop 2101 2016-11-04 14:24 /dataflow/apps/task/app/servers.yml
-rwxrwxrwx 3 user hadoop 60198804 2016-11-04 14:24 /dataflow/apps/task/app/spring-cloud-deployer-yarn-tasklauncherappmaster-1.0.0.RELEASE.jar
drwxrwxrwx - user hadoop 0 2016-11-04 14:25 /dataflow/artifacts
drwxrwxrwx - user hadoop 0 2016-11-07 15:06 /dataflow/artifacts/cache
-rwxrwxrwx 3 user hadoop 12323493 2016-11-04 14:25 /dataflow/artifacts/cache/https-c84ea9dc0103a4754aeb9a28bbc7a4f33c835854-batch-job-1.0.0.BUILD-SNAPSHOT.jar
-rwxrwxrwx 3 user hadoop 22139318 2016-11-07 15:07 /dataflow/artifacts/cache/log-sink-rabbit-1.0.0.BUILD-SNAPSHOT.jar
-rwxrwxrwx 3 user hadoop 12590921 2016-11-07 12:59 /dataflow/artifacts/cache/timestamp-task-1.0.0.BUILD-SNAPSHOT.jar
There's clearly a mix of versions: HDFS has spring-cloud-deployer-yarn-tasklauncherappmaster-1.0.0.RELEASE.jar, while the error complains about spring-cloud-deployer-yarn-tasklauncherappmaster-1.0.0.BUILD-SNAPSHOT.jar.
Not sure how you got snapshots, unless you built the distribution manually?
I'd recommend picking 1.0.2 from http://cloud.spring.io/spring-cloud-dataflow-server-yarn. See "Download and Extract Distribution" in the reference docs. Also delete the old /dataflow directory from HDFS; a command sketch follows.
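A sketch of that cleanup step (assuming the /dataflow path from the listing above and a user with write access to it):
# Remove the stale /dataflow directory; the server will re-copy matching jars on the next deployment
hdfs dfs -rm -r /dataflow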

FAILED TO WRITE PID installing Zookeeper

I am new to ZooKeeper and it has been a real struggle to install and run it. I am not sure what is wrong here, but I will explain what I've been doing to make it clearer:
1.- I've followed the installation guide provided by Apache: download the ZooKeeper distribution (stable release), extract the archive and move it into the home directory.
2.- As I am using Ubuntu 12.04, I've modified the .bashrc file to include this:
export ZOOKEEPER_INSTALL=/home/myusername/zookeeper-3.4.5
export PATH=$PATH:$ZOOKEEPER_INSTALL/bin
3.- Created a config file at conf/zoo.cfg:
tickTime=2000
dataDir=/var/zookeeper
clientPort=2181
and also tried with:
dataDir=/var/log/zookeeper
and
dataDir=/var/bin/zookeeper
4.- When running the start command (zkServer.sh start or bin/zkServer.sh start), nothing happens and it always returns this:
JMX enabled by default
Using config: /home/sasuke/zookeeper-3.4.5/bin/../conf/zoo.cfg
mkdir: cannot create directory `/var/zookeeper': Permission denied
Starting zookeeper ... /home/sasuke/zookeeper-3.4.5/bin/zkServer.sh: line 113: /var/zookeeper/zookeeper_server.pid: No such file or directory
FAILED TO WRITE PID
I have Java installed, and inside the zookeeper directory there is a zookeeper.jar file which I think is not being run.
Checking here on Stack Overflow, someone said he could run ZooKeeper after typing:
ssh localhost
But when I try to do that I get this error:
ssh: connect to host localhost port 22: Connection refused
Please help. I've been trying to solve this for too long.
Getting started guide of zookeeper:
http://zookeeper.apache.org/doc/r3.1.2/zookeeperStarted.html
A previous case was solved with ssh localhost:
Zookeeper: FAILED TO WRITE PID
UPDATE:
The permissions for log are:
drwxr-xr-x 19 root root 4096 Oct 10 07:52 log
and for zookeeper:
drwxr-xr-x 2 zookeeper zookeeper 4096 Mar 23 2012 zookeeper
Should I change any of these?
I have had the same problem. In my case it helped to start ZooKeeper and directly specify the configuration file:
bin/zkServer.sh start conf/zoo.cfg
It seems you do not have the required permissions. The owner of /var/log is going to be root. ZooKeeper stores the process id and a snapshot of its data in that directory. The process id of the spawned ZooKeeper server is stored in the file zookeeper_server.pid (as of 3.3.6).
If you have root privileges, you could start ZooKeeper with sudo; it should work, but it is definitely not recommended. Make sure you start ZooKeeper as a user with the same (or higher) permissions as the owner of the directory.
Create a new directory in your home folder, such as /home/username/zookeeper-data.
Let dataDir point to that directory and it should work; see the sketch below.
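A minimal sketch of that approach (the username and path are examples, not values from the question):
# Create a data directory owned by the current user
mkdir -p /home/username/zookeeper-data

# Then point ZooKeeper at it in conf/zoo.cfg:
#   dataDir=/home/username/zookeeper-data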
The default ZooKeeper installation (tar extract) ships its config file as conf/zoo_sample.cfg, while the same extract's bin/zkServer.sh expects the config file to be called zoo.cfg, which results in the "No such file or dir" and "FAILED TO WRITE PID" errors. So before running zkServer.sh to start or stop a ZooKeeper instance, either:
rename zoo_sample.cfg in the conf dir to zoo.cfg (see the command below), or
give the name (and path) of the conf file on the command line (as suggested by Ilya Lapitan), or, of course,
edit zkServer.sh ;-)
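A sketch of the first option, run from the extracted ZooKeeper directory:
# Copy the sample config to the name zkServer.sh expects
cp conf/zoo_sample.cfg conf/zoo.cfg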
When you create the directory for dataDir, make sure to use the -p option. This creates any missing parent directories as required, so the application can place its files there.
mkdir -p /var/log/zookeeperData
Then set:
dataDir=/var/log/zookeeperData
Seems there's all kinds of reasons this can happen. So many helpful answers here!
For me, the problem was improper line endings in my zoo.cfg file, and possibly invisible characters, so ZooKeeper was trying to create directories like /var/zookeeper? and /var/zookeeper\r. Reworking my zoo.cfg a bit fixed it for me, along with deleting zoo_sample.cfg.
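If you hit the same thing, a quick way to strip Windows line endings from the config (a sketch; this uses GNU sed, and dos2unix works just as well):
# Replace CRLF line endings with LF in zoo.cfg
sed -i 's/\r$//' conf/zoo.cfg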
This happened to me due to low disk space, because ZooKeeper can't create the pid file inside the ZooKeeper data folder.
I faced the same issue while starting ZooKeeper with this command:
hadoop@ubuntu:~/hadoop/zookeeper/zookeeper-3.4.8$ bin/zkServer.sh start
ERROR [main] client.ConnectionManager$HConnectionImplementation:
The node /hbase is not in ZooKeeper.
It should have been written by the master. Check the value configured in zookeeper.znode.parent. There could be a mismatch with the one configured in the master.
But running the script with sudo rectified the issue:
hadoop@ubuntu:~/hadoop/zookeeper/zookeeper-3.4.8$ sudo bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /home/hadoop/hadoop/zookeeper/zookeeper-3.4.8/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
Go to /usr/local/etc/. You will find a zookeeper directory there; delete that directory and restart the server with zkServer start.
Change the path to dataDir=/tmp/zookeeper. If that works, then it is clearly an access issue.
But it is generally not advisable to use the tmp directory.
This seems to be an ownership issue; running the following solved this for me.
$ sudo chown -R $USER /var/lib/zookeeper
N.B.
I've outlined my steps below, which show the error I was getting (the same as the error in this question) and my attempt at the solution proposed by a user above, which advised providing zoo.cfg as an argument.
13:01:29 ✔ ~ :: $ZK/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /usr/local/Cellar/zookeeper/3.4.14/libexec/bin/../conf/zoo.cfg
Starting zookeeper ... /usr/local/Cellar/zookeeper/3.4.14/libexec/bin/zkServer.sh: line 149: /var/lib/zookeeper/zookeeper_server.pid: Permission denied
FAILED TO WRITE PID
13:01:32 ✘ ~ :: $ZK/bin/zkServer.sh start $ZK/conf/zoo.cfg
ZooKeeper JMX enabled by default
Using config: /usr/local/Cellar/zookeeper/3.4.14/libexec/conf/zoo.cfg
Starting zookeeper ... /usr/local/Cellar/zookeeper/3.4.14/libexec/bin/zkServer.sh: line 149: /var/lib/zookeeper/zookeeper_server.pid: Permission denied
FAILED TO WRITE PID
13:04:45 ✔ /var/lib :: ls -la
total 0
drwxr-xr-x 4 root wheel 128 Apr 19 18:55 .
drwxr-xr-x 27 root wheel 864 Apr 19 18:55 ..
drwxr--r-- 3 root wheel 96 Mar 24 15:07 zookeeper
13:04:48 ✔ /var/lib :: echo $USER
tallamjr
13:06:03 ✔ /var/lib :: sudo chown -R $USER zookeeper
Password:
13:06:44 ✔ /var/lib :: ls -la
total 0
drwxr-xr-x 4 root wheel 128 Apr 19 18:55 .
drwxr-xr-x 27 root wheel 864 Apr 19 18:55 ..
drwxr--r-- 3 tallamjr wheel 96 Mar 24 15:07 zookeeper
13:06:48 ✔ ~ :: $ZK/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /usr/local/Cellar/zookeeper/3.4.14/libexec/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
REF:
- https://askubuntu.com/questions/6723/change-folder-permissions-and-ownership
For me this solution worked:
I granted read, write and execute permissions to everyone with sudo chmod 777 zookeeper, run for the zookeeper directory from inside /var (i.e. on /var/zookeeper).
After executing this command, try running ZooKeeper; it ran in my case. The commands are sketched below.
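The steps described above, as a command sketch (note that 777 is very permissive; the chown approach shown in an earlier answer is the safer fix):
cd /var
sudo chmod 777 zookeeper   # grant read/write/execute to everyone on the data directory
zkServer.sh start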
Try using sudo -E bin/zkServer.sh start.