Minimum privileges for CSI sidecars - Kubernetes

I'm building my own CSI driver following the CSI standard, and I'm wondering about the security context to set for the CSI sidecar containers.
I'm going to use:
Node Driver Registrar
CSI provisioner
CSI attacher
CSI liveness probe.
Some of them need to run as root, and I'm wondering how to configure the security context to grant them the minimum Linux capabilities and to ensure that root capabilities are held for the minimum time.
Am I forced to set the security context as follows? Is there any way to restrict it further?
securityContext:
  allowPrivilegeEscalation: true
  privileged: false
  runAsNonRoot: true
  capabilities:
    drop:
      - all
    add:
      - SYS_ADMIN
Thanks in advance,
Antonio

Based on research about Kubernetes and Linux capabilities, it looks like you've already found the least possible privileges.
Your example contains the minimum needed capability, CAP_SYS_ADMIN, which is used primarily for mounting and unmounting filesystems.
In more detail, CAP_SYS_ADMIN is used to:
perform a range of system administration operations including: quotactl(2), mount(2), umount(2), pivot_root(2), swapon(2), swapoff(2), sethostname(2), and setdomainname(2);
use ioprio_set(2) to assign IOPRIO_CLASS_RT and (before Linux 2.6.25) IOPRIO_CLASS_IDLE I/O scheduling classes;
exceed /proc/sys/fs/file-max, the system-wide limit on the number of open files, in system calls that open files (e.g., accept(2), execve(2), open(2), pipe(2));
call setns(2) (requires CAP_SYS_ADMIN in the target namespace);
employ the TIOCSTI ioctl(2) to insert characters into the input queue of a terminal other than the caller's controlling terminal;
employ the obsolete nfsservctl(2) system call;
employ the obsolete bdflush(2) system call;
perform various privileged block-device ioctl(2) operations;
perform various privileged filesystem ioctl(2) operations;
perform privileged ioctl(2) operations on the /dev/random device (see random(4));
perform administrative operations on many device drivers;
Source - capabilities(7) — Linux manual page
There is also a very good article with many details about Docker images and security aspects: Towards unprivileged container builds.
One more article, which explains Linux capabilities and their use with examples, may also be helpful: HackTricks - Linux Capabilities.
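If it helps, here is a minimal sketch of where that security context would sit in a node-plugin DaemonSet, applied to the container that actually performs the mount/unmount calls; the names and image are placeholders, not a definitive manifest:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-csi-node                          # hypothetical name
spec:
  selector:
    matchLabels:
      app: my-csi-node
  template:
    metadata:
      labels:
        app: my-csi-node
    spec:
      containers:
        - name: my-csi-driver                # the container doing mount(2)/umount(2)
          image: example.com/my-csi-driver:dev   # placeholder image
          securityContext:
            allowPrivilegeEscalation: true
            privileged: false
            runAsNonRoot: true
            capabilities:
              drop:
                - all
              add:
                - SYS_ADMIN

Keeping the SYS_ADMIN addition on only the container that needs it, rather than on every sidecar, keeps the privileged window as small as possible, which seems to be what you are after.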

Related

Does NFS retrans affect the application services?

We had an issue where one of our Kubernetes services was not able to read the certificates stored in an NFS volume. I could see from the NFS stats that retrans happened (33 times) around that particular time. Do the retrans cause any issue with the application service?
Also, we had the issue for a service only on one VM; other services running on different VMs that use the same NFS don't have any issues.
In the above scenario, we were able to fix the issue by restarting the service.
Yes, NFS retrans can affect an application's service. The configured NFS retrans/timeout settings vary. If there are more timeouts than an application can endure gracefully, then yes, this could be a problem. NFS performance can depend on the network (e.g., the proximity of one server to another, and congestion between the two servers).
NFS facilitates synchronous communication; it relies on network connectivity, and its performance varies. Applications that rely on NFS may have minimum performance requirements that NFS cannot meet due to delays or other performance issues.
Oracle says "one of the major factors affecting NFS performance is the retransmission rate."
RedHat in their man page for NFS says that file operations can be aborted or will involve a "server not responding" message if enough retransmissions happen.
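For reference, if the certificates volume is consumed through a Kubernetes PersistentVolume, the retrans/timeo behaviour comes from the NFS mount options, which can be set on the PV; this is only a sketch with placeholder values, not a recommendation for your environment:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: certs-nfs-pv                 # hypothetical name
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadOnlyMany
  mountOptions:                      # passed through to the NFS mount
    - hard                           # keep retrying instead of failing I/O
    - timeo=600                      # tenths of a second before a retransmission (60s here)
    - retrans=2                      # retransmissions before logging "server not responding"
  nfs:
    server: nfs.example.internal     # placeholder
    path: /exports/certs             # placeholder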
Modern messaging systems facilitate asynchronous communication, and there has been widespread adoption of messaging tools in the industry recently; some companies re-architect back-end systems to leverage them. However, NFS can still be useful to support application services and their data requirements.
This 20 year old book may be helpful: Managing NFS and NIS: Help for Unix System Administrators Second Edition by Mike Eisler, Ricardo Labiaga, and Hal Stern.
Please refer to doc1 and doc2.

Tomcat in k8s pod and db in cloud - slow connection

I have Tomcat, ZooKeeper and Kafka deployed in a local k8s (kind) cluster. The database is remote, i.e. in the cloud. The pages load very slowly.
But when I moved Tomcat outside of the pod and started it manually, with ZooKeeper and Kafka still in the local k8s cluster and the DB in the remote cloud, the pages load fine.
Why is Tomcat very slow when inside a Kubernetes pod?
In theory, a program running in a container can run as fast as a program running on the host machine.
In practice, there are many things that can affect the performance.
When running on Windows or macOS (for instance with Docker Desktop), containers don't run directly on the machine, but in a small Linux virtual machine. This VM adds a bit of overhead, and it might not have as much CPU and RAM as the host environment. One way to have a look at the resource usage of containers is to use docker stats; or run docker run -ti --pid host alpine and then use classic UNIX tools like free, top, vmstat, etc. to see the resource usage in the VM.
In most environments (at least with Docker, and with Kubernetes clusters in their most common default configurations), containers run without resource constraints and limits. However, it is fairly common (and, in fact, highly recommended!) to set resource requests and limits when running containers on Kubernetes. You can check resource limits of a pod with kubectl describe. If metrics-server is installed (which is recommended, even on dev/staging environments), you can check resource usage with kubectl top. Tools like k9s will show you resource requests, limits, and usage in a comprehensive way (as long as the data is available; i.e. you still need to install metrics-server to obtain pod metrics, for instance).
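For illustration, a minimal sketch of what such requests and limits look like on a container (the values are placeholders to be tuned for the actual workload):

apiVersion: v1
kind: Pod
metadata:
  name: tomcat-example               # placeholder name
spec:
  containers:
    - name: tomcat
      image: tomcat:9                # placeholder image/tag
      resources:
        requests:
          cpu: "500m"                # used by the scheduler to place the pod
          memory: "1Gi"
        limits:
          cpu: "2"                   # CPU is throttled above this
          memory: "2Gi"              # the container is OOM-killed above this

kubectl describe pod tomcat-example then shows the configured requests and limits, and kubectl top pod tomcat-example (with metrics-server installed) shows actual usage.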
In addition to the VM overhead described above, if the container does a lot of I/O (whether it's disk or network), there might be a bit of overhead in comparison to a native process. This can become noticeable if the container writes on the container copy-on-write filesystem (instead of a volume), especially when using the device-mapper storage driver.
Applications that use "live reload" techniques (that automatically rebuild or restart when source code is edited) are particularly prone to this I/O issue, because there are unfortunately no efficient methods to watch file modifications across a virtual machine boundary. This means that many web frameworks exhibit extreme performance degradations when running in containers on Mac or Windows when the source code is mounted to the container.
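If heavy writes to the container's copy-on-write filesystem are a suspect, one low-effort experiment is to move the write-heavy paths onto a volume; a minimal sketch using an emptyDir (image and paths are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: tomcat-scratch-example       # hypothetical name
spec:
  containers:
    - name: tomcat
      image: tomcat:9                # placeholder image/tag
      volumeMounts:
        - name: work
          mountPath: /usr/local/tomcat/work   # an example write-heavy path
  volumes:
    - name: work
      emptyDir: {}                   # node-local storage, bypasses the copy-on-write layer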
In addition to these factors, there can be other subtle differences that might affect the overall performance of a containerized application. When observing performance issues, it is very helpful to use a profiler (or some kind of APM solution) to see which parts of the code take longer to execute. If no profiler or APM is available, try to execute individual portions of the code independently to compare their performance. For instance, have a small piece of code that executes a single query to the database; or executes a single task from a job queue, etc.
Good luck!

Pin Kubernetes pods/deployments/replica sets/daemon sets to run on specific CPUs only

I need to restrict an app/deployment to run on specific CPUs only (say 0-3, or just 1 or 2, etc.). I found out about the CPU Manager and tried implementing it with the static policy, but I'm not able to achieve what I intend to.
I tried the following so far:
Enabled the CPU Manager static policy on the kubelet and verified that it is enabled
Reserved CPUs with the --reserved-cpus=0-3 option on the kubelet
Ran a sample nginx deployment with limits equal to requests and an integer CPU value (i.e. Guaranteed QoS is ensured), and validated the CPU affinity with taskset -c -p $(pidof nginx)
So, this restricts my nginx app to run on any CPU other than the reserved CPUs (0-3); i.e. if my machine has 32 CPUs, the app can run on any of CPUs 4-31, and so can any other apps/deployments. As I understand it, the reserved CPUs 0-3 are kept for system daemons, OS daemons, etc.
My questions:
Using the Kubernetes CPU Manager features, is it possible to pin an app/pod (in this case, my nginx app) to run on a specific CPU only (say 2, or 3, or 4-5)? If yes, how?
If point 1 is possible, can we perform the pinning at the container level too? Say Pod A has two containers, Container B and Container D. Is it possible to pin CPUs 0-3 to Container B and CPU 4 to Container D?
If none of this is possible using the Kubernetes CPU Manager, what alternatives are available at this point in time, if any?
As I understand your question, you want to dedicate a specific set of CPUs to each app/pod.
From what I've searched, I was only able to find some documentation that might help. The other resource is a GitHub topic that I think describes a workaround to your problem.
As a disclaimer: based on what I've read, searched and understood, there is no direct solution for this issue, only workarounds. I am still searching further.
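For illustration, this is a minimal sketch of the Guaranteed-QoS shape described in the question (integer CPU count, requests equal to limits), which the static CPU Manager policy pins to exclusive CPUs; as far as I know, it does not let you choose which CPU IDs are assigned, and all names/values below are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-pinned                 # placeholder name
spec:
  containers:
    - name: nginx
      image: nginx:1.25              # placeholder image/tag
      resources:
        requests:
          cpu: "2"                   # integer CPU count, equal to the limit
          memory: "512Mi"
        limits:
          cpu: "2"                   # Guaranteed QoS: static policy grants 2 exclusive CPUs
          memory: "512Mi"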

Performance Postgresql on Local Volume K8s

I recently switched our PostgreSQL cluster from a simple "bare-metal" (VMs) workload to a containerised K8s cluster (also on VMs).
Currently we run zalando-incubator/postgres-operator and use local volumes with volumeMode: FileSystem; the volume itself is a "simple" xfs volume mounted on the host.
However, we have actually seen performance drops of up to 50% on the Postgres cluster inside k8s.
Some heavy join workloads actually perform way worse than on the old cluster which did not use containers at all.
Is there a way to tune the behavior, or at least measure the I/O performance to find the actual bottleneck (i.e. what is a good way to measure I/O, etc.)?
Is there a way to tune the behavior
Be cognizant of two things that might be impacting your in-cluster behavior: increased cache thrashing and the inherent problem of running concurrent containers on a Node. If you haven't already tried it, you may want to use taints and tolerations to sequester your PG Pods away from other Pods and see if that helps.
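As a sketch of that taint/toleration approach (the key, value and label below are made up): after tainting the database nodes with kubectl taint nodes <node> dedicated=postgres:NoSchedule and labelling them accordingly, the Postgres pods would carry a matching toleration plus a node selector; in practice these settings would be supplied through the operator's configuration rather than a hand-written Pod:

apiVersion: v1
kind: Pod
metadata:
  name: pg-example                   # for illustration only
spec:
  tolerations:
    - key: dedicated                 # hypothetical taint key
      operator: Equal
      value: postgres
      effect: NoSchedule
  nodeSelector:
    dedicated: postgres              # hypothetical node label on the tainted nodes
  containers:
    - name: postgres
      image: postgres:15             # placeholder image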
what is a good way to measure I/O, etc.
I would expect the same iostat tools one is used to using would work on the Node, since no matter how much kernel namespace trickery is going on, it's still the Linux kernel.
Prometheus (and likely a ton of other such toys) surfaces some I/O specific metrics for containers, and I would presume they are at the scrape granularity, meaning you can increase the scrape frequency, bearing in mind the observation cost impacting your metrics :-(
It appears new docker daemons ship with Prom metrics, although I don't know what version introduced that functionality. There is a separate page discussing the implications of high frequency metric collection. There also appears to be a Prometheus exporter for monitoring arbitrary processes, above and beyond the PostgreSQL specific exporter.
Getting into my opinion, it may be a very reasonable experiment to go head-to-head with ext4 versus a non-traditional FS like xfs. I can't even fathom how much extra production experience has gone into ext4, merely by the virtue of almost every Linux on the planet deploying on it by default. You may have great reasons for using xfs, but I just wanted to ensure you had at least considered that xfs might have performance characteristics that make it problematic in a shared environment like a kubernetes cluster.

What is the Google Cloud Platform's "Managed Infrastructure Mixer Client"?

Can someone tell me what the purpose of the "Managed Infrastructure Mixer Client" is? I have it showing up in my GCE logs and I can't find any information on it. It is adding and removing GCE instances.
I believe it is related to GCP's recommended settings:
Automatic restart - On (recommended)
On host maintenance - Migrate VM instance (recommended)
This is the User Agent used by Managed Instance Groups when performing operations on instances. These operations can result both from users operating on the MIG (e.g. resizing, recreating instances) and from operations performed by the Autoscaler, Autohealer, Updater, etc.
Note that this string may change in the future.