Kafka's log directories should only contain Kafka topic data - apache-kafka

I'm trying to run the Confluent Kafka image in a Kubernetes environment and I'm facing:
FATAL [KafkaServer id=0] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
kafka.common.KafkaException: Found directory /var/lib/kafka/data, 'data' is not in the form of topic-partition or topic-partition.uniqueId-delete (if marked for deletion).
Kafka's log directories (and children) should only contain Kafka topic data.
My deployment config:
apiVersion: apps/v1beta2 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
name: kafka-confluent
labels:
app: kafka-confluent
spec:
replicas: 1
selector:
matchLabels:
app: kafka-confluent
template:
metadata:
labels:
app: kafka-confluent
spec:
containers:
- name: zookeeper-kafka
image: zookeeper:3.5
ports:
- containerPort: 2181
- name: kafka-confluent
image: confluentinc/cp-kafka:4.0.0
ports:
- containerPort: 9092
command:
- sh
- -c
- "exec kafka-server-start /etc/kafka/server.properties \
--override reserved.broker.max.id=2147483647 \
--override zookeeper.connect=localhost:2181 \
--override listeners=PLAINTEXT://:9092 \
"
To solve this I've tried to mount an ephemeral volume like this:
volumes:
- name: kafka-data
emptyDir: {}
...
volumeMounts:
- mountPath: /var/lib/kafka/data
name: kafka-data
And to clear the data dir with an init container:
containers:
- name: cleaner
image: busybox
command: ['rm', '-rf', '/var/lib/kafka/data/*']
Both attempts failed with the same result.
Also, if I run the image and list /var/lib/kafka/data/, the directory looks empty:
docker run --rm -it confluentinc/cp-kafka:4.0.0 bash
root@35087653f43a:/# ls /var/lib/kafka/data/ -al
total 8
drwxrwxrwx 2 root root 4096 Apr 16 09:59 .
drwxr-xr-x 3 root root 4096 Jan 3 19:20 ..

Is this your error? To fix it, add this line to server.properties:
log.dirs=/var/lib/kafka/data
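Since the stock cp-kafka image reads /etc/kafka/server.properties, here is a minimal sketch of applying the same setting without editing the file, through the question's own Deployment command (assuming the emptyDir volume stays mounted at /var/lib/kafka/data):
command:
- sh
- -c
- "exec kafka-server-start /etc/kafka/server.properties \
  --override reserved.broker.max.id=2147483647 \
  --override zookeeper.connect=localhost:2181 \
  --override listeners=PLAINTEXT://:9092 \
  --override log.dirs=/var/lib/kafka/data \
  "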

I have this working configuration:
volumeMounts:
- name: data
mountPath: /var/lib/kafka
and in the command, override the log.dirs
kafka-server-start /etc/kafka/server.properties \
--override log.dirs=/var/lib/kafka

On my Windows system I added this line to server.properties:
log.dirs=F:\softwares\kafka_2.12-2.5.1\data
where
F:\softwares\kafka_2.12-2.5.1
is my Kafka directory, and the issue was resolved.
I created the data folder inside the Kafka directory myself; it was not there when I extracted Kafka with the 7-Zip tool.

Related

Worker unable to connect to the master and invalid args in webport for Locust

I am trying to set up a load test for an endpoint. This is what I have followed so far:
Dockerfile
FROM python:3.8
# Add the external tasks directory into /tasks
WORKDIR /src
ADD requirements.txt .
RUN pip install --no-cache-dir --upgrade locust==2.10.1
ADD run.sh .
ADD load_test.py .
ADD load_test.conf .
# Expose the required Locust ports
EXPOSE 5557 5558 8089
# Set script to be executable
RUN chmod 755 run.sh
# Start Locust using LOCUS_OPTS environment variable
CMD ["bash", "run.sh"]
# Modified from:
# https://github.com/scrollocks/locust-loadtesting/blob/master/locust/docker/Dockerfile
run.sh
#!/bin/bash
LOCUST="locust"
LOCUS_OPTS="--config=load_test.conf"
LOCUST_MODE=${LOCUST_MODE:-standalone}
if [[ "$LOCUST_MODE" = "master" ]]; then
LOCUS_OPTS="$LOCUS_OPTS --master"
elif [[ "$LOCUST_MODE" = "worker" ]]; then
LOCUS_OPTS="$LOCUS_OPTS --worker --master-host=$LOCUST_MASTER_HOST"
fi
echo "${LOCUST} ${LOCUS_OPTS}"
$LOCUST $LOCUS_OPTS
# Copied from
# https://github.com/scrollocks/locust-loadtesting/blob/master/locust/docker/locust/run.sh
This is how I have written the load test locust script:
import json
from locust import HttpUser, constant, task
class CategorizationUser(HttpUser):
wait_time = constant(1)
@task
def predict(self):
payload = json.dumps(
{
"text": "Face and Body Paint washable Rubies Halloween item 91#385"
}
)
_ = self.client.post("/predict", data=payload)
I am invoking that with a configuration:
locustfile = load_test.py
headless = false
users = 1000
spawn-rate = 1
run-time = 5m
host = IP
html = locust_report.html
So, after building and pushing the Docker image and creating a k8s cluster on GKE, I am deploying it. This is what the deployment.yaml looks like:
# Copied from
# https://github.com/scrollocks/locust-loadtesting/blob/master/locust/kubernetes/templates/deployment.yaml
---
kind: Deployment
apiVersion: apps/v1
metadata:
namespace: default
name: locust-master-deployment
labels:
name: locust
role: master
spec:
replicas: 1
selector:
matchLabels:
name: locust
role: master
template:
metadata:
labels:
name: locust
role: master
spec:
containers:
- name: locust
image: gcr.io/PROJECT_ID/IMAGE_URI
imagePullPolicy: Always
env:
- name: LOCUST_MODE
value: master
- name: LOCUST_LOG_LEVEL
value: DEBUG
ports:
- name: loc-master-web
containerPort: 8089
protocol: TCP
- name: loc-master-p1
containerPort: 5557
protocol: TCP
- name: loc-master-p2
containerPort: 5558
protocol: TCP
---
kind: Deployment
apiVersion: apps/v1
metadata:
namespace: default
name: locust-worker-deployment
labels:
name: locust
role: worker
spec:
replicas: 2
selector:
matchLabels:
name: locust
role: worker
template:
metadata:
labels:
name: locust
role: worker
spec:
containers:
- name: locust
image: gcr.io/PROJECT_ID/IMAGE_URI
imagePullPolicy: Always
env:
- name: LOCUST_MODE
value: worker
- name: LOCUST_MASTER
value: locust-master-service
- name: LOCUST_LOG_LEVEL
value: DEBUG
After deployment, I am exposing the required ports like so:
kubectl expose pod locust-master-deployment-f9d4c4f59-8q6wk \
--type NodePort \
--port 5558 \
--target-port 5558 \
--name locust-5558
kubectl expose pod locust-master-deployment-f9d4c4f59-8q6wk \
--type NodePort \
--port 5557 \
--target-port 5557 \
--name locust-5557
kubectl expose pod locust-master-deployment-f9d4c4f59-8q6wk \
--type LoadBalancer \
--port 80 \
--target-port 8089 \
--name locust-web
The cluster and the nodes provision successfully. But the moment I hit the IP of locust-web, I am getting:
Any suggestions on how to resolve the bug?
Since you are exposing your pods and trying to access them from outside the cluster (your web application), you have to port-forward them or add an Ingress in order to reach your Locust pods.
My first approach would be to try to ping or send requests to your Locust pods through a simple port-forward.
More info about port forwarding is in the Kubernetes documentation.
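For a quick check without any Service in between, a sketch (the Deployment name is taken from your manifest; the local port is arbitrary):
kubectl port-forward deployment/locust-master-deployment 8089:8089
and then open http://localhost:8089 in your browser.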
The environment variables set by k8s are probably colliding with Locust's own (LOCUST_WEB_PORT specifically). Change your setup so that nothing is named "locust".
See https://github.com/locustio/locust/issues/1226 for a similar issue.
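For example, your kubectl expose commands create Services named locust-5557, locust-5558 and locust-web, and a Service named locust-web makes Kubernetes inject a LOCUST_WEB_PORT=tcp://... variable into pods in the same namespace, which Locust then tries to read as one of its own settings. A sketch of exposing the web UI under a name that does not start with "locust" (lt-web is a hypothetical name):
kubectl expose pod locust-master-deployment-f9d4c4f59-8q6wk \
  --type LoadBalancer \
  --port 80 \
  --target-port 8089 \
  --name lt-web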

The server must be started by the user that owns the data directory

I am trying to get some persistent storage for a Docker instance of PostgreSQL running on Kubernetes. However, the pod fails with:
FATAL: data directory "/var/lib/postgresql/data" has wrong ownership
HINT: The server must be started by the user that owns the data directory.
This is the NFS configuration:
% exportfs -v
/srv/nfs/postgresql/postgres-registry
kubehost*.example.com(rw,wdelay,insecure,no_root_squash,no_subtree_check,sec=sys,rw,no_root_squash,no_all_squash)
$ ls -ldn /srv/nfs/postgresql/postgres-registry
drwxrwxrwx. 3 999 999 4096 Jul 24 15:02 /srv/nfs/postgresql/postgres-registry
$ ls -ln /srv/nfs/postgresql/postgres-registry
total 4
drwx------. 2 999 999 4096 Jul 25 08:36 pgdata
The full log from the pod:
2019-07-25T07:32:50.617532000Z The files belonging to this database system will be owned by user "postgres".
2019-07-25T07:32:50.618113000Z This user must also own the server process.
2019-07-25T07:32:50.619048000Z The database cluster will be initialized with locale "en_US.utf8".
2019-07-25T07:32:50.619496000Z The default database encoding has accordingly been set to "UTF8".
2019-07-25T07:32:50.619943000Z The default text search configuration will be set to "english".
2019-07-25T07:32:50.620826000Z Data page checksums are disabled.
2019-07-25T07:32:50.621697000Z fixing permissions on existing directory /var/lib/postgresql/data ... ok
2019-07-25T07:32:50.647445000Z creating subdirectories ... ok
2019-07-25T07:32:50.765065000Z selecting default max_connections ... 20
2019-07-25T07:32:51.035710000Z selecting default shared_buffers ... 400kB
2019-07-25T07:32:51.062039000Z selecting default timezone ... Etc/UTC
2019-07-25T07:32:51.062828000Z selecting dynamic shared memory implementation ... posix
2019-07-25T07:32:51.218995000Z creating configuration files ... ok
2019-07-25T07:32:51.252788000Z 2019-07-25 07:32:51.251 UTC [79] FATAL: data directory "/var/lib/postgresql/data" has wrong ownership
2019-07-25T07:32:51.253339000Z 2019-07-25 07:32:51.251 UTC [79] HINT: The server must be started by the user that owns the data directory.
2019-07-25T07:32:51.262238000Z child process exited with exit code 1
2019-07-25T07:32:51.263194000Z initdb: removing contents of data directory "/var/lib/postgresql/data"
2019-07-25T07:32:51.380205000Z running bootstrap script ...
The deployment has the following in:
securityContext:
runAsUser: 999
supplementalGroups: [999,1000]
fsGroup: 999
What am I doing wrong?
Edit: Added storage.yaml file:
kind: PersistentVolume
apiVersion: v1
metadata:
name: postgres-registry-pv-volume
spec:
capacity:
storage: 5Gi
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
nfs:
server: 192.168.3.7
path: /srv/nfs/postgresql/postgres-registry
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: postgres-registry-pv-claim
labels:
app: postgres-registry
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 5Gi
Edit: And the full deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: postgres-registry
spec:
replicas: 1
template:
metadata:
labels:
app: postgres-registry
spec:
securityContext:
runAsUser: 999
supplementalGroups: [999,1000]
fsGroup: 999
containers:
- name: postgres-registry
image: postgres:latest
imagePullPolicy: "IfNotPresent"
ports:
- containerPort: 5432
env:
- name: POSTGRES_DB
value: postgresdb
- name: POSTGRES_USER
value: postgres
- name: POSTGRES_PASSWORD
value: Sekret
volumeMounts:
- mountPath: /var/lib/postgresql/data
subPath: "pgdata"
name: postgredb-registry-persistent-storage
volumes:
- name: postgredb-registry-persistent-storage
persistentVolumeClaim:
claimName: postgres-registry-pv-claim
Even more debugging, adding:
command: ["/bin/bash", "-c"]
args: ["id -u; ls -ldn /var/lib/postgresql/data"]
Which returned:
999
drwx------. 2 99 99 4096 Jul 25 09:11 /var/lib/postgresql/data
Clearly, the UID/GID are wrong. Why?
Even with the workaround suggested by Jakub Bujny, I get this:
2019-07-25T09:32:08.734807000Z The files belonging to this database system will be owned by user "postgres".
2019-07-25T09:32:08.735335000Z This user must also own the server process.
2019-07-25T09:32:08.736976000Z The database cluster will be initialized with locale "en_US.utf8".
2019-07-25T09:32:08.737416000Z The default database encoding has accordingly been set to "UTF8".
2019-07-25T09:32:08.737882000Z The default text search configuration will be set to "english".
2019-07-25T09:32:08.738754000Z Data page checksums are disabled.
2019-07-25T09:32:08.739648000Z fixing permissions on existing directory /var/lib/postgresql/data ... ok
2019-07-25T09:32:08.766606000Z creating subdirectories ... ok
2019-07-25T09:32:08.852381000Z selecting default max_connections ... 20
2019-07-25T09:32:09.119031000Z selecting default shared_buffers ... 400kB
2019-07-25T09:32:09.145069000Z selecting default timezone ... Etc/UTC
2019-07-25T09:32:09.145730000Z selecting dynamic shared memory implementation ... posix
2019-07-25T09:32:09.168161000Z creating configuration files ... ok
2019-07-25T09:32:09.200134000Z 2019-07-25 09:32:09.199 UTC [70] FATAL: data directory "/var/lib/postgresql/data" has wrong ownership
2019-07-25T09:32:09.200715000Z 2019-07-25 09:32:09.199 UTC [70] HINT: The server must be started by the user that owns the data directory.
2019-07-25T09:32:09.208849000Z child process exited with exit code 1
2019-07-25T09:32:09.209316000Z initdb: removing contents of data directory "/var/lib/postgresql/data"
2019-07-25T09:32:09.274741000Z running bootstrap script ... 999
2019-07-25T09:32:09.278124000Z drwx------. 2 99 99 4096 Jul 25 09:32 /var/lib/postgresql/data
Using your setup and ensuring the NFS mount is owned by 999:999, it worked just fine for me.
You're also missing an 's' in the volume name: postgredb-registry-persistent-storage
And with your subPath: "pgdata", do you need to change $PGDATA? I didn't include the subPath for this; a common alternative is sketched right below.
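A minimal sketch of that alternative: drop the subPath, mount the whole volume, and point PGDATA at a subdirectory instead (PGDATA is an environment variable the official postgres image honors; the subdirectory name here is an assumption):
env:
- name: PGDATA
  # hypothetical subdirectory inside the mounted volume
  value: /var/lib/postgresql/data/pgdata
volumeMounts:
- mountPath: /var/lib/postgresql/data
  name: postgresdb-registry-persistent-storage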
$ sudo mount 172.29.0.218:/test/nfs ./nfs
$ sudo su -c "ls -al ./nfs" postgres
total 8
drwx------ 2 postgres postgres 4096 Jul 25 14:44 .
drwxrwxr-x 3 rei rei 4096 Jul 25 14:44 ..
$ kubectl apply -f nfspv.yaml
persistentvolume/postgres-registry-pv-volume created
persistentvolumeclaim/postgres-registry-pv-claim created
$ kubectl apply -f postgres.yaml
deployment.extensions/postgres-registry created
$ sudo su -c "ls -al ./nfs" postgres
total 124
drwx------ 19 postgres postgres 4096 Jul 25 14:46 .
drwxrwxr-x 3 rei rei 4096 Jul 25 14:44 ..
drwx------ 3 postgres postgres 4096 Jul 25 14:46 base
drwx------ 2 postgres postgres 4096 Jul 25 14:46 global
drwx------ 2 postgres postgres 4096 Jul 25 14:46 pg_commit_ts
. . .
I noticed using nfs: directly in the persistent volume took significantly longer to initialize the database, whereas using hostPath: to the mounted nfs volume behaved normally.
So after a few minutes:
$ kubectl logs postgres-registry-675869694-9fp52 | tail -n 3
2019-07-25 21:50:57.181 UTC [30] LOG: database system is ready to accept connections
done
server started
$ kubectl exec -it postgres-registry-675869694-9fp52 psql
psql (11.4 (Debian 11.4-1.pgdg90+1))
Type "help" for help.
postgres=#
Checking the uid/gid
$ kubectl exec -it postgres-registry-675869694-9fp52 bash
postgres@postgres-registry-675869694-9fp52:/$ whoami && id -u && id -g
postgres
999
999
nfspv.yaml:
kind: PersistentVolume
apiVersion: v1
metadata:
name: postgres-registry-pv-volume
spec:
capacity:
storage: 5Gi
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
nfs:
server: 172.29.0.218
path: /test/nfs
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: postgres-registry-pv-claim
labels:
app: postgres-registry
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 5Gi
postgres.yaml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: postgres-registry
spec:
replicas: 1
template:
metadata:
labels:
app: postgres-registry
spec:
securityContext:
runAsUser: 999
supplementalGroups: [999,1000]
fsGroup: 999
containers:
- name: postgres-registry
image: postgres:latest
imagePullPolicy: "IfNotPresent"
ports:
- containerPort: 5432
env:
- name: POSTGRES_DB
value: postgresdb
- name: POSTGRES_USER
value: postgres
- name: POSTGRES_PASSWORD
value: Sekret
volumeMounts:
- mountPath: /var/lib/postgresql/data
name: postgresdb-registry-persistent-storage
volumes:
- name: postgresdb-registry-persistent-storage
persistentVolumeClaim:
claimName: postgres-registry-pv-claim
I cannot explain why those two IDs are different, but as a workaround I would try to override the postgres entrypoint with:
command: ["/bin/bash", "-c"]
args: ["chown -R 999:999 /var/lib/postgresql/data && ./docker-entrypoint.sh postgres"]
This type of error is quite common when you mount an NTFS directory into your Docker container. NTFS directories don't support ext3 file and directory access controls. The only way to make it work is to mount a directory from an ext3 drive into your container.
I got a bit desperate when I was playing around with Apache/PHP containers and linking the www folder. After I moved the files onto an ext3 filesystem the problem disappeared.
I published a short Docker tutorial on YouTube that may help to understand this problem: https://www.youtube.com/watch?v=eS9O05TTFjM

Combining multiple Local-SSD on a node in Kubernetes (GKE)

The data required by my container is too large to fit on one local SSD. I also need to access the SSDs as one filesystem from my container, so I would need to attach multiple of them. How do I combine them (single partition, RAID 0, etc.) and make them accessible as one volume mount in my container?
This link shows how to mount a single SSD to a mount path: https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/local-ssd. I am not sure how you would merge multiple.
edit
The question asks how one would "combine" multiple SSD devices, individually mounted, on a single node in GKE.
WARNING
This is experimental, not intended for production use unless you know what you are doing, and only tested on GKE version 1.16.x.
The approach uses a DaemonSet with a ConfigMap and nsenter (with wait tricks) for host-namespace, privileged access so you can manage the devices. Specifically for GKE local SSDs, we can unmount those devices and then RAID 0 them. An initContainer does the dirty work, since this kind of task is something you need to mark complete and then drop privileged container access for (or even the whole Pod). Here is how it is done.
The example assumes 16 SSDs; adjust the hardcoded values as necessary. Also check your OS image requirements (I use Ubuntu), and make sure the GKE version you use enumerates local SSDs starting at /dev/sdb.
ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: local-ssds-setup
namespace: search
data:
setup.sh: |
#!/bin/bash
# returns exit codes: 0 = found, 1 = not found
isMounted() { findmnt -rno SOURCE,TARGET "$1" >/dev/null;} #path or device
# existing disks & mounts
SSDS=(/dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq)
# install mdadm utility
apt-get -y update && apt-get -y install mdadm --no-install-recommends
apt-get autoremove
# OPTIONAL: determine what to do with existing, I wipe it here
if [ -b "/dev/md0" ]
then
echo "raid array already created"
if isMounted "/dev/md0"; then
echo "already mounted - unmounting"
umount /dev/md0 &> /dev/null || echo "soft error - assumed device was mounted"
fi
mdadm --stop /dev/md0
mdadm --zero-superblock "${SSDS[@]}"
fi
# unmount disks from host filesystem
for i in {0..15}
do
umount "${SSDS[i]}" &> /dev/null || echo "${SSDS[i]} already unmounted"
done
if isMounted "/dev/sdb";
then
echo ""
echo "unmount failure - prevent raid0" 1>&2
exit 1
fi
# raid0 array
yes | mdadm --create /dev/md0 --force --level=0 --raid-devices=16 "${SSDS[@]}"
echo "raid array created"
# format
mkfs.ext4 -F /dev/md0
# mount, change /mnt/ssd-array to whatever
mkdir -p /mnt/ssd-array
mount /dev/md0 /mnt/ssd-array
chmod a+w /mnt/ssd-array
wait.sh: |
#!/bin/bash
while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do sleep 1; done
DaemonSet pod spec
spec:
hostPID: true
nodeSelector:
cloud.google.com/gke-local-ssd: "true"
volumes:
- name: setup-script
configMap:
name: local-ssds-setup
- name: host-mount
hostPath:
path: /tmp/setup
initContainers:
- name: local-ssds-init
image: marketplace.gcr.io/google/ubuntu1804
securityContext:
privileged: true
volumeMounts:
- name: setup-script
mountPath: /tmp
- name: host-mount
mountPath: /host
command:
- /bin/bash
- -c
- |
set -e
set -x
# Copy setup script to the host
cp /tmp/setup.sh /host
# Copy wait script to the host
cp /tmp/wait.sh /host
# Give execute priv to wait script
/usr/bin/nsenter -m/proc/1/ns/mnt -- chmod u+x /tmp/setup/wait.sh
# Give execute priv to script
/usr/bin/nsenter -m/proc/1/ns/mnt -- chmod u+x /tmp/setup/setup.sh
# Wait for Node updates to complete
/usr/bin/nsenter -m/proc/1/ns/mnt /tmp/setup/wait.sh
# If the /tmp folder is mounted on the host then it can run the script
/usr/bin/nsenter -m/proc/1/ns/mnt /tmp/setup/setup.sh
containers:
- image: "gcr.io/google-containers/pause:2.0"
name: pause
For high performance use cases, use the Ephemeral storage on local SSDs GKE feature. All local SSDs will be configured as a (striped) RAID 0 array and mounted into the pod.
Quick summary:
Create the node pool or cluster with the option --ephemeral-storage local-ssd-count=X (see the sketch after this list).
Schedule to nodes with cloud.google.com/gke-ephemeral-storage-local-ssd.
Add an emptyDir volume.
Mount it with volumeMounts.
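A sketch of that first step (the pool and cluster names are placeholders; the flag spelling follows the summary above):
gcloud container node-pools create ssd-pool \
  --cluster=my-cluster \
  --ephemeral-storage local-ssd-count=2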
Here's how I used it with a DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: myapp
labels:
app: myapp
spec:
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
spec:
nodeSelector:
cloud.google.com/gke-ephemeral-storage-local-ssd: "true"
volumes:
- name: localssd
emptyDir: {}
containers:
- name: myapp
image: <IMAGE>
volumeMounts:
- mountPath: /scratch
name: localssd
You can use a DaemonSet YAML file to deploy a pod that will run on startup, assuming you have already created a cluster with 2 local SSDs (this pod will be in charge of creating the RAID 0 array):
kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
name: ssd-startup-script
labels:
app: ssd-startup-script
spec:
template:
metadata:
labels:
app: ssd-startup-script
spec:
hostPID: true
containers:
- name: ssd-startup-script
image: gcr.io/google-containers/startup-script:v1
imagePullPolicy: Always
securityContext:
privileged: true
env:
- name: STARTUP_SCRIPT
value: |
#!/bin/bash
sudo curl -s https://get.docker.com/ | sh
echo Done
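A minimal sketch of the commands such a startup script would additionally need in order to build and mount the array at /mnt/disks/ssd-array (my assumption: two local SSDs at /dev/sdb and /dev/sdc, following the same mdadm steps as the ConfigMap earlier in this section):
#!/bin/bash
set -e
# install the mdadm utility on the node
sudo apt-get -y update && sudo apt-get -y install mdadm --no-install-recommends
# unmount the individual local SSDs if they are already mounted
sudo umount /dev/sdb /dev/sdc || true
# stripe the two devices into a single RAID 0 array
yes | sudo mdadm --create /dev/md0 --force --level=0 --raid-devices=2 /dev/sdb /dev/sdc
# format the array and mount it where the test pod expects it
sudo mkfs.ext4 -F /dev/md0
sudo mkdir -p /mnt/disks/ssd-array
sudo mount /dev/md0 /mnt/disks/ssd-array
sudo chmod a+w /mnt/disks/ssd-array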
The pod below will have access to the disk array from the above example at /mnt/disks/ssd-array:
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
containers:
- name: test-container
image: ubuntu
volumeMounts:
- mountPath: /mnt/disks/ssd-array
name: ssd-array
args:
- sleep
- "1000"
nodeSelector:
cloud.google.com/gke-local-ssd: "true"
tolerations:
- key: "local-ssd"
operator: "Exists"
effect: "NoSchedule"
volumes:
- name: ssd-array
hostPath:
path: /mnt/disks/ssd-array
After deploying the test-pod, open a shell in it from Cloud Shell or any instance with access to the cluster. Then run:
kubectl exec -it test-pod -- /bin/bash
After that you should be able to see any file created on the ssd-array disk, for example:
cat test-file.txt

Kubernetes with logrotate sidecar mount point issue

I am trying to deploy a test pod with nginx and a logrotate sidecar.
The logrotate sidecar is taken from: logrotate
My Pod yaml configuration:
apiVersion: v1
kind: Pod
metadata:
name: nginx-apache-log
labels:
app: nginx-apache-log
spec:
containers:
- name: nginx
image: nginx:latest
ports:
- containerPort: 80
volumeMounts:
- name: logs
mountPath: /var/log
- name: logrotate
image: path/to/logrtr:sidecar
volumeMounts:
- name: logs
mountPath: /var/log
volumes:
- name: logs
emptyDir: {}
What I'd like to achieve is the logrotate container watching /var/log/*/*.log; however, with the configuration above the nginx container fails because there is no /var/log/nginx:
nginx: [alert] could not open error log file: open() "/var/log/nginx/error.log" failed (2: No such file or directory)
2018/10/15 10:22:12 [emerg] 1#1: open() "/var/log/nginx/error.log" failed (2: No such file or directory)
However if I change mountPath for nginx from
mountPath: /var/log
to:
mountPath: /var/log/nginx
then it is starting, logging to /var/log/nginx/access.log and error.log, but logrotate sidecar sees all logs in /var/log not /var/log/nginx/. It is not a problem with just one nginx container, but I am planning to have more container apps logging to their own /var/log/appname folders.
Is there any way to fix or work around that? I don't want to run a sidecar for each app.
If I change my pod configuration to:
- name: nginx
image: nginx:latest
ports:
- containerPort: 80
volumeMounts:
- name: logs
mountPath: /var/log
initContainers:
- name: install
image: busybox
command:
- mkdir -p /var/log/nginx
volumeMounts:
- name: logs
mountPath: "/var/log"
then it is failing with:
Warning Failed 52s (x4 over 105s) kubelet, k8s-slave1 Error: failed to start container "install": Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "exec: \"mkdir -p /var/log/nginx\": stat mkdir -p /var/log/nginx: no such file or directory": unknown
Leave the mount path as /var/log. In your nginx container, execute mkdir /var/log/nginx in a startup script. You might have to tweak directory permissions a bit to make this work.
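Alternatively, the initContainer from the question fails only because the exec-form command is not run through a shell; a minimal sketch of the corrected version, using the same image and volume as the question:
initContainers:
- name: install
  image: busybox
  # wrap the command in a shell so "mkdir -p ..." is parsed as a command line
  command: ['sh', '-c', 'mkdir -p /var/log/nginx']
  volumeMounts:
  - name: logs
    mountPath: /var/log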
If you are running nginx in kubernetes, it is probably logging to stdout. When you run kubectl logs <nginx pod> nginx it will show you access and error logs. These logs are automatically logrotated by kubernetes, so you will not need a logrotate sidecar in this case.
If you are ever running pods that are not logging to stdout, this is a bit of an antipattern in kubernetes. It is more to your advantage to always log to stdout: kubernetes can take care of log rotation for you, and it is also easier to see logs with kubectl logs than by running kubectl exec and rummaging around in a running container

Why Kubernetes Pod gets into Terminated state giving Completed reason and exit code 0?

I am struggling to find any answer to this in the Kubernetes documentation. The scenario is the following:
Kubernetes version 1.4 over AWS
8 pods running a NodeJS API (Express) deployed as a Kubernetes Deployment
One of the pods gets restarted for no apparent reason late at night (no traffic, no CPU spikes, no memory pressure, no alerts...). The restart count increases as a result.
Logs don't show anything abnormal (I ran kubectl logs -p to see the previous logs; no errors at all in there)
Resource consumption is normal; I cannot see any events about Kubernetes rescheduling the pod onto another node or similar
Describing the pod shows a TERMINATED state with a COMPLETED reason and exit code 0. I don't have the exact output from kubectl as this pod has been replaced multiple times now.
The pods are NodeJS server instances; they cannot complete, they are always running and waiting for requests.
Would this be internal Kubernetes rearranging of pods? Is there any way to know when this happens? Shouldn't there be an event somewhere saying why it happened?
Update
This just happened in our prod environment. The result of describing the offending pod is:
api:
Container ID: docker://7a117ed92fe36a3d2f904a882eb72c79d7ce66efa1162774ab9f0bcd39558f31
Image: 1.0.5-RC1
Image ID: docker://sha256:XXXX
Ports: 9080/TCP, 9443/TCP
State: Running
Started: Mon, 27 Mar 2017 12:30:05 +0100
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 24 Mar 2017 13:32:14 +0000
Finished: Mon, 27 Mar 2017 12:29:58 +0100
Ready: True
Restart Count: 1
Update 2
Here it is the deployment.yaml file used:
apiVersion: "extensions/v1beta1"
kind: "Deployment"
metadata:
namespace: "${ENV}"
name: "${APP}${CANARY}"
labels:
component: "${APP}${CANARY}"
spec:
replicas: ${PODS}
minReadySeconds: 30
revisionHistoryLimit: 1
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
maxSurge: 1
template:
metadata:
labels:
component: "${APP}${CANARY}"
spec:
serviceAccount: "${APP}"
${IMAGE_PULL_SECRETS}
containers:
- name: "${APP}${CANARY}"
securityContext:
capabilities:
add:
- IPC_LOCK
image: "134078050561.dkr.ecr.eu-west-1.amazonaws.com/${APP}:${TAG}"
env:
- name: "KUBERNETES_CA_CERTIFICATE_FILE"
value: "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
- name: "NAMESPACE"
valueFrom:
fieldRef:
fieldPath: "metadata.namespace"
- name: "ENV"
value: "${ENV}"
- name: "PORT"
value: "${INTERNAL_PORT}"
- name: "CACHE_POLICY"
value: "all"
- name: "SERVICE_ORIGIN"
value: "${SERVICE_ORIGIN}"
- name: "DEBUG"
value: "http,controllers:recommend"
- name: "APPDYNAMICS"
value: "true"
- name: "VERSION"
value: "${TAG}"
ports:
- name: "http"
containerPort: ${HTTP_INTERNAL_PORT}
protocol: "TCP"
- name: "https"
containerPort: ${HTTPS_INTERNAL_PORT}
protocol: "TCP"
The Dockerfile of the image referenced in the above Deployment manifest:
FROM ubuntu:14.04
ENV NVM_VERSION v0.31.1
ENV NODE_VERSION v6.2.0
ENV NVM_DIR /home/app/nvm
ENV NODE_PATH $NVM_DIR/v$NODE_VERSION/lib/node_modules
ENV PATH $NVM_DIR/v$NODE_VERSION/bin:$PATH
ENV APP_HOME /home/app
RUN useradd -c "App User" -d $APP_HOME -m app
RUN apt-get update; apt-get install -y curl
USER app
# Install nvm with node and npm
RUN touch $HOME/.bashrc; curl https://raw.githubusercontent.com/creationix/nvm/${NVM_VERSION}/install.sh | bash \
&& /bin/bash -c 'source $NVM_DIR/nvm.sh; nvm install $NODE_VERSION'
ENV NODE_PATH $NVM_DIR/versions/node/$NODE_VERSION/lib/node_modules
ENV PATH $NVM_DIR/versions/node/$NODE_VERSION/bin:$PATH
# Create app directory
WORKDIR /home/app
COPY . /home/app
# Install app dependencies
RUN npm install
EXPOSE 9080 9443
CMD [ "npm", "start" ]
npm start is an alias for a regular node app.js command that starts a NodeJS server on port 9080.
Check the version of Docker you run, and whether the Docker daemon was restarted during that time.
If the Docker daemon was restarted, all the containers would be terminated (unless you use the new "live restore" feature in 1.12). In some Docker versions, Docker may incorrectly report "exit code 0" for all containers terminated in this situation. See https://github.com/docker/docker/issues/31262 for more details.
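A quick way to check both on the node, assuming a systemd-managed Docker daemon:
# version of the Docker daemon the node is running
docker version --format '{{.Server.Version}}'
# when the Docker daemon was last (re)started
systemctl show docker --property=ExecMainStartTimestamp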
If this is still relevant, we just had a similar problem in our cluster.
We managed to find more information by inspecting the logs from Docker itself. SSH onto your k8s node and run the following:
sudo journalctl -fu docker.service
I had a similar problem when we upgraded to Airflow 2.x: pods got restarted even after the DAGs ran successfully.
I eventually resolved it, after a long time of debugging, by overriding the pod template and specifying it in the airflow.cfg file.
[kubernetes]
....
pod_template_file = {{ .Values.airflow.home }}/pod_template.yaml
---
# pod_template.yaml
apiVersion: v1
kind: Pod
metadata:
name: dummy-name
spec:
serviceAccountName: default
restartPolicy: Never
containers:
- name: base
image: dummy_image
imagePullPolicy: IfNotPresent
ports: []
command: []