GKE Dataplane v2 NetworkPolicies not working - kubernetes

I am currently trying to move my Calico-based clusters to the new Dataplane V2, which is basically a managed Cilium offering.
For local testing, I am running k3d with open source Cilium installed, and I created a set of NetworkPolicies (native Kubernetes ones, not CiliumNetworkPolicies) that lock down the desired namespaces.
My current issue is that when I port the same policies to a GKE cluster (with Dataplane V2 enabled), they don't work.
As an example let's take a look into the connection between some app and a database:
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: db-server.db-client
  namespace: BAR
spec:
  podSelector:
    matchLabels:
      policy.ory.sh/db: server
  policyTypes:
    - Ingress
  ingress:
    - ports: []
      from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: FOO
          podSelector:
            matchLabels:
              policy.ory.sh/db: client
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: db-client.db-server
  namespace: FOO
spec:
  podSelector:
    matchLabels:
      policy.ory.sh/db: client
  policyTypes:
    - Egress
  egress:
    - ports:
        - port: 26257
          protocol: TCP
      to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: BAR
          podSelector:
            matchLabels:
              policy.ory.sh/db: server
Moreover, using GCP monitoring tools we can see the expected and actual effect the policies have on connectivity:
Expected:
Actual:
And logs from the application trying to connect to the DB, and getting denied:
{
"insertId": "FOO",
"jsonPayload": {
"count": 3,
"connection": {
"dest_port": 26257,
"src_port": 44506,
"dest_ip": "172.19.0.19",
"src_ip": "172.19.1.85",
"protocol": "tcp",
"direction": "egress"
},
"disposition": "deny",
"node_name": "FOO",
"src": {
"pod_name": "backoffice-automigrate-hwmhv",
"workload_kind": "Job",
"pod_namespace": "FOO",
"namespace": "FOO",
"workload_name": "backoffice-automigrate"
},
"dest": {
"namespace": "FOO",
"pod_namespace": "FOO",
"pod_name": "cockroachdb-0"
}
},
"resource": {
"type": "k8s_node",
"labels": {
"project_id": "FOO",
"node_name": "FOO",
"location": "FOO",
"cluster_name": "FOO"
}
},
"timestamp": "FOO",
"logName": "projects/FOO/logs/policy-action",
"receiveTimestamp": "FOO"
}
EDIT:
My local env is a k3d cluster created via:
k3d cluster create --image ${K3SIMAGE} --registry-use k3d-localhost -p "9090:30080#server:0" \
-p "9091:30443#server:0" foobar \
--k3s-arg=--kube-apiserver-arg="enable-admission-plugins=PodSecurityPolicy,NodeRestriction,ServiceAccount#server:0" \
--k3s-arg="--disable=traefik#server:0" \
--k3s-arg="--disable-network-policy#server:0" \
--k3s-arg="--flannel-backend=none#server:0" \
--k3s-arg=feature-gates="NamespaceDefaultLabelName=true#server:0"
docker exec k3d-server-0 sh -c "mount bpffs /sys/fs/bpf -t bpf && mount --make-shared /sys/fs/bpf"
kubectl taint nodes k3d-ory-cloud-server-0 node.cilium.io/agent-not-ready=true:NoSchedule --overwrite=true
skaffold run --cache-artifacts=true -p cilium --skip-tests=true --status-check=false
docker exec k3d-server-0 sh -c "mount --make-shared /run/cilium/cgroupv2"
Where cilium itself is being installed by skaffold, via helm with the following parameters:
name: cilium
remoteChart: cilium/cilium
namespace: kube-system
version: 1.11.0
upgradeOnChange: true
wait: false
setValues:
  externalIPs.enabled: true
  nodePort.enabled: true
  hostPort.enabled: true
  hubble.relay.enabled: true
  hubble.ui.enabled: true
UPDATE:
I have set up a third environment: a GKE cluster using the old Calico CNI (legacy dataplane) with Cilium installed manually as shown here. Cilium is working fine, even Hubble works out of the box (unlike with Dataplane V2...), and I found something interesting. The rules behave the same as with the GKE-managed Cilium, but with Hubble working I was able to see this:
For some reason Cilium/Hubble cannot identify the db pod and decipher its labels. And since the labels don't work, the policies that rely on those labels also don't work.
Another proof of this would be the trace log from hubble:
Here the destination app is only identified via an IP, and not labels.
The question now is why is this happening?
Any idea how to debug this problem? Where could the difference be coming from? Do the policies need some tuning for the managed Cilium, or is this a bug in GKE?
Any help/feedback/suggestion appreciated!

Update: I was able to solve the mystery and it was ArgoCD all along. Cilium is creating an Endpoint and Identity for each object in the namespace, and Argo was deleting them after deploying the applications.
For anyone who stumbles on this, the solution is to add this exclusion to ArgoCD:
resource.exclusions: |
  - apiGroups:
      - cilium.io
    kinds:
      - CiliumIdentity
      - CiliumEndpoint
    clusters:
      - "*"
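To illustrate why a deleted CiliumIdentity/CiliumEndpoint breaks label-based policies, here is a toy model in Python of how one NetworkPolicy peer is matched. It is only a sketch of the Kubernetes matchLabels semantics, not Cilium's actual implementation; the function names are made up for this example.

```python
# Toy model of NetworkPolicy peer matching; NOT Cilium's real datapath logic.

def selector_matches(match_labels, labels):
    """A matchLabels selector matches when every key/value pair is present."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def peer_allowed(peer, pod_labels, namespace_labels):
    """namespaceSelector and podSelector inside one peer entry are ANDed."""
    return (selector_matches(peer["namespaceSelector"], namespace_labels)
            and selector_matches(peer["podSelector"], pod_labels))


# The `to` peer of the db-client.db-server egress policy from the question.
egress_peer = {
    "namespaceSelector": {"kubernetes.io/metadata.name": "BAR"},
    "podSelector": {"policy.ory.sh/db": "server"},
}

# Identity present: the destination's labels are known.
known = peer_allowed(egress_peer,
                     {"policy.ory.sh/db": "server"},
                     {"kubernetes.io/metadata.name": "BAR"})

# Identity deleted: the destination is only an IP, so there are no labels.
unknown = peer_allowed(egress_peer, {}, {})

print(known, unknown)  # True False
```

With no labels attached to the destination, no selector can match, so the egress rule never applies and the connection is denied, which is consistent with the `deny` entries in the log above.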

Related

how to use kubernetes scheduler.alpha.kubernetes.io/preferAvoidPods?

First of all, for some reasons, I'm using an unsupported and obsolete version of Kubernetes (1.12), and I can't upgrade.
I'm trying to configure the scheduler to avoid running pods on some nodes by changing the node score when the scheduler tries to find the best available node. I would like to do that at the scheduler level and not by using nodeAffinity at the deployment, replicaset, pod, etc. level (so that all pods are affected by this change).
After reading the k8s docs here: https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins and checking that some options were already present in 1.12, I'm trying to use the NodePreferAvoidPods plugin.
In the documentation the plugin is described as follows:
Scores nodes according to the node annotation scheduler.alpha.kubernetes.io/preferAvoidPods
Which, if I understand correctly, should do the job.
So, I've updated the static manifest kube-scheduler.yaml to use the following config:
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
profiles:
  - plugins:
      score:
        enabled:
          - name: NodePreferAvoidPods
            weight: 100
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
But adding the following annotation
scheduler.alpha.kubernetes.io/preferAvoidPods: to the node doesn't seem to work.
For testing, I made a basic nginx deployment with a replica count equal to the number of worker nodes (4).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.21
          ports:
            - containerPort: 80
Then I check where the pods were created with kubectl get pods -owide
So, I believe some options are required for this annotation to work.
I've tried setting the annotation to "true" and "1", but k8s refuses my change, and I can't figure out what the valid values for this annotation are or find any documentation about it.
I've checked the git release for 1.12: this plugin was already present (at least there are some lines of code), and I don't think the behavior or settings have changed much since.
Thanks.
From the Kubernetes source code, here is a valid value for this annotation:
{
"preferAvoidPods": [
{
"podSignature": {
"podController": {
"apiVersion": "v1",
"kind": "ReplicationController",
"name": "foo",
"uid": "abcdef123456",
"controller": true
}
},
"reason": "some reason",
"message": "some message"
}
]
}
But there are no details on how to predict the uid, and no answer was given when someone else asked about it on GitHub years ago: https://github.com/kubernetes/kubernetes/issues/41630
For my initial question, which was how to avoid scheduling pods on a node, I found another method: using the well-known taint node.kubernetes.io/unschedulable with the effect PreferNoSchedule.
Tainting a node with this command does the job, and the taint seems persistent across cordon/uncordon (a cordon sets it to NoSchedule and an uncordon sets it back to PreferNoSchedule).
kubectl taint node NODE_NAME node.kubernetes.io/unschedulable=:PreferNoSchedule
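For completeness, the annotation value shown earlier can be generated rather than typed by hand. The Python sketch below only reproduces the JSON structure quoted from the source code; the helper name is made up, and the uid is an assumption: presumably it has to be the metadata.uid of an already existing controller, which is why it cannot be predicted up front.

```python
import json

ANNOTATION_KEY = "scheduler.alpha.kubernetes.io/preferAvoidPods"


def prefer_avoid_pods_value(kind, name, uid, api_version="v1",
                            reason="", message=""):
    """Build the JSON string for the preferAvoidPods node annotation."""
    entry = {
        "podSignature": {
            "podController": {
                "apiVersion": api_version,
                "kind": kind,
                "name": name,
                "uid": uid,  # assumed: uid of the existing controller object
                "controller": True,
            }
        },
        "reason": reason,
        "message": message,
    }
    # Compact separators keep the value short enough for a command line.
    return json.dumps({"preferAvoidPods": [entry]}, separators=(",", ":"))


value = prefer_avoid_pods_value("ReplicationController", "foo", "abcdef123456",
                                reason="some reason", message="some message")
print(f"kubectl annotate node NODE_NAME {ANNOTATION_KEY}='{value}'")
```

Note that the signature names one specific controller, so this only steers pods owned by that controller and does not help with the "all pods" goal of the question; the taint above is the simpler route for that.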

How to reference kubernetes docker-registry

I have installed docker-registry on Kubernetes via helm.
I am able to docker push 0.0.0.0:5000/<my-container>:v1 using port-forward.
Now how do I reference the images in the registry from a deployment.yaml?
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: <my-container>-deployment-v1
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: <my-container>-deployment
        version: v1
    spec:
      containers:
        - name: <my-container>
          image: 0.0.0.0:5000/<my-container>:v1 # <<< ????
          imagePullPolicy: Always
          ports:
            - containerPort: 80
      imagePullSecrets:
        - name: private-docker-registry-secret
This does list my containers:
curl -X GET http://0.0.0.0:5000/v2/_catalog
I keep getting ImagePullBackOff when deploying.
I tried using the internal service name and the cluster IP address, still not working.
Then I tried using secrets:
{
"kind": "Secret",
"apiVersion": "v1",
"metadata": {
"name": "running-buffoon-docker-registry-secret",
"namespace": "default",
"selfLink": "/api/v1/namespaces/default/secrets/running-buffoon-docker-registry-secret",
"uid": "127c93c1-53df-11e9-8ede-a63ad724d5b9",
"resourceVersion": "216488",
"creationTimestamp": "2019-03-31T18:01:56Z",
"labels": {
"app": "docker-registry",
"chart": "docker-registry-1.7.0",
"heritage": "Tiller",
"release": "running-buffoon"
}
},
"data": {
"haSharedSecret": "xxx"
},
"type": "Opaque"
}
And added the secret to deployment.yaml:
imagePullSecrets:
  - name: running-buffoon-docker-registry-secret
Then I get:
image "x.x.x.x/:<my-container>v1": rpc error: code = Unknown desc = Error response from daemon: Get https://x.x.x.x/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
You need to get the cluster IP of your local docker registry.
You will find this in the dashboard: visit the registry pod page and then go to the associated service. Replace the 0.0.0.0 in your image spec with the cluster IP. Also make sure the port matches, since the port exposed by the registry service is generally different from the port the registry listens on inside the cluster. If you have authentication set up in your registry, you will need an imagePullSecrets entry as well.
I have blogged about minikube setup with a local registry - might be helpful. https://amritbera.com/journal/minikube-insecure-registry.html
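One thing worth noting: the Opaque secret with haSharedSecret shown in the question is not a pull secret. imagePullSecrets expects a secret of type kubernetes.io/dockerconfigjson. A rough Python sketch of how such a secret is put together follows; the registry address 10.0.0.10:5000 and the credentials are placeholders, not values from the question.

```python
import base64
import json


def docker_config_json(registry, username, password):
    """Return the .dockerconfigjson payload for a single registry."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    return json.dumps({
        "auths": {
            registry: {"username": username, "password": password, "auth": auth}
        }
    })


def pull_secret_manifest(name, registry, username, password):
    """Build a Secret manifest usable under imagePullSecrets."""
    payload = docker_config_json(registry, username, password)
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "kubernetes.io/dockerconfigjson",
        # Secret data values are base64-encoded.
        "data": {".dockerconfigjson": base64.b64encode(payload.encode()).decode()},
    }


# Placeholder registry address and credentials (assumptions for illustration).
secret = pull_secret_manifest("private-docker-registry-secret",
                              "10.0.0.10:5000", "user", "pass")
print(json.dumps(secret, indent=2))
```

The registry key has to match the host:port used in the image field of the deployment, otherwise the credentials are not picked up for that image.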

Kubernetes Deployments and Init Containers

I learned recently that Kubernetes has a feature called Init Containers. Awesome, because I can use this feature to wait for my postgres service and create/migrate the database before my web application service runs.
However, it appears that Init Containers can only be configured in a Pod yaml file. Is there a way I can do this via a Deployment yaml file? Or do I have to choose?
To avoid confusion, I'll answer your specific question. I agree with oswin that you may want to consider another method.
Yes, you can use init containers with a deployment. This is an example using the old style (pre 1.6), but it should work:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: 'nginx'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: 'nginx'
  template:
    metadata:
      labels:
        app: 'nginx'
      annotations:
        pod.beta.kubernetes.io/init-containers: '[
          {
            "name": "install",
            "image": "busybox",
            "imagePullPolicy": "IfNotPresent",
            "command": ["wget", "-O", "/application/index.html", "http://kubernetes.io/index.html"],
            "volumeMounts": [
              {
                "name": "application",
                "mountPath": "/application"
              }
            ]
          }
        ]'
    spec:
      volumes:
        - name: 'application'
          emptyDir: {}
      containers:
        - name: webserver
          image: 'nginx'
          ports:
            - name: http
              containerPort: 80
          volumeMounts:
            - name: 'application'
              mountPath: '/application'
You probably want to use readiness probes instead of init containers for this use case. Check out this link and a blog. Also note that a Service will not send traffic to a pod that is not reported ready, if that was your worry.
This is a well known pattern and a readiness probe in the web server would simply check the DB endpoint / data availability before reporting ready. This is a simple solution as opposed to the complexity of an extra init container and has the advantage of detecting DB outages correctly as well.
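Whichever route you pick (init container or readiness check), the core of it is a "can I reach the DB yet" test. A minimal Python sketch of such a check is below; the function names are made up, and it only verifies that the TCP port accepts connections, not that the database is migrated or healthy.

```python
import socket
import time


def tcp_ready(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def wait_for(host, port, attempts=30, delay=2.0):
    """Poll until the port accepts connections or attempts run out."""
    for _ in range(attempts):
        if tcp_ready(host, port):
            return True
        time.sleep(delay)
    return False
```

In an init container you would call something like wait_for("postgres", 5432) (service name and port are assumptions here) and exit non-zero on False; in a readiness endpoint you would call tcp_ready once per probe and report the result.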

Kubernetes/Prometheus - Unable to Annotate in service file

My Kubernetes version is :
# kubectl --version
Kubernetes v1.4.0
I am planning to use Prometheus to monitor my Kube cluster. For this, I need to annotate the metrics URL.
My current metrics URL is like :
http://172.16.33.7:8080/metrics
But I want it to be like :
http://172.16.33.7:8080/websocket/metrics
First I tried to do this manually:
kubectl annotate pods websocket-backend-controller-db83999c5b534b277b82badf6c152cb9m1 prometheus.io/path=/websocket/metrics
kubectl annotate pods websocket-backend-controller-db83999c5b534b277b82badf6c152cb9m1 prometheus.io/scrape='true'
kubectl annotate pods websocket-backend-controller-db83999c5b534b277b82badf6c152cb9m1 prometheus.io/port='8080'
All these commands work perfectly fine and I am able to see the annotations.
{
"metadata": {
"name": "websocket-backend-controller-v1krf",
"generateName": "websocket-backend-controller-",
"namespace": "default",
"selfLink": "/api/v1/namespaces/default/pods/websocket-backend-controller-v1krf",
"uid": "e323994b-4081-11e7-8bd0-0050569b6f44",
"resourceVersion": "27534379",
"creationTimestamp": "2017-05-24T13:07:06Z",
"labels": {
"name": "websocket-backend"
},
"annotations": {
"kubernetes.io/created-by": "{\"kind\":\"SerializedReference\",\"apiVersion\":\"v1\",\"reference\":{\"kind\":\"ReplicationController\",\"namespace\":\"default\",\"name\":\"websocket-backend-controller\",\"uid\":\"e321f1a8-4081-11e7-8bd0-0050569b6f44\",\"apiVersion\":\"v1\",\"resourceVersion\":\"27531840\"}}\n",
"prometheus.io/path": "/websocket/metrics",
"prometheus.io/port": "8080",
"prometheus.io/scrape": "true"
}
But since I want this configuration to remain permanent, I am setting the following annotations in my services files.
# cat websocket-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: websocket-service
  labels:
    baseApi: websocket
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/path: /websocket/metrics
    prometheus.io/port: '8080'
spec:
  selector:
    name: websocket-backend
  ports:
    - port: 8080
      targetPort: 8080
      nodePort: 30800
      protocol: TCP
  type: NodePort
  clusterIP: 10.100.10.45
I restarted my websocket service and the corresponding pods but these configs do not seem to be taking effect.
kubectl create -f websocket-service.yaml
kubectl create -f ../controllers/websocket-replication-controller.yaml
The result does not show the annotations configured.
{
"metadata": {
"name": "websocket-backend-controller-v1krf",
"generateName": "websocket-backend-controller-",
"namespace": "default",
"selfLink": "/api/v1/namespaces/default/pods/websocket-backend-controller-v1krf",
"uid": "e323994b-4081-11e7-8bd0-0050569b6f44",
"resourceVersion": "27531879",
"creationTimestamp": "2017-05-24T13:07:06Z",
"labels": {
"name": "websocket-backend"
},
"annotations": {
"kubernetes.io/created-by": "{\"kind\":\"SerializedReference\",\"apiVersion\":\"v1\",\"reference\":{\"kind\":\"ReplicationController\",\"namespace\":\"default\",\"name\":\"websocket-backend-controller\",\"uid\":\"e321f1a8-4081-11e7-8bd0-0050569b6f44\",\"apiVersion\":\"v1\",\"resourceVersion\":\"27531840\"}}\n"
}
All I am doing is setting the configs in the service config file rather than on the command line, but it does not seem to be working.
If you annotate the service, it has no effect on the pods it matches. Your pods are managed either by a ReplicationController or by a ReplicaSet / Deployment. In that case, annotate the pod template of these resources so that the annotations reach the pods. For example, with deployments you must use the template section, like:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  # Unique key of the Deployment instance
  name: deployment-example
spec:
  # 3 Pods should exist at all times.
  replicas: 3
  # Keep record of 2 revisions for rollback
  revisionHistoryLimit: 2
  template:
    metadata:
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/path: /websocket/metrics
        prometheus.io/port: '8080'
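To make it clearer what these annotations end up doing: the prometheus.io/* keys are only a convention that the relabeling rules in a typical Prometheus scrape config act on, and they are read from the pod itself. Roughly, in Python (a simplified model, not Prometheus code; the default port and path here are assumptions):

```python
def scrape_target(pod_ip, annotations, default_port="80", default_path="/metrics"):
    """Derive the scrape URL for one pod from its prometheus.io/* annotations.

    Returns None when the pod has not opted in via prometheus.io/scrape.
    """
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    port = annotations.get("prometheus.io/port", default_port)
    path = annotations.get("prometheus.io/path", default_path)
    return f"http://{pod_ip}:{port}{path}"


# Pod annotated through the controller's pod template.
annotated = {
    "prometheus.io/scrape": "true",
    "prometheus.io/path": "/websocket/metrics",
    "prometheus.io/port": "8080",
}
print(scrape_target("172.16.33.7", annotated))
# http://172.16.33.7:8080/websocket/metrics

# Pod whose Service carries the annotations but the pod itself does not.
print(scrape_target("172.16.33.7", {}))
# None
```

The second call is the situation from the question: the annotations live on the Service, so a pod-based scrape job never sees them.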

Is there a way to add arbitrary records to kube-dns?

I will use a very specific example to explain my problem, because I think being specific is better than explaining it in the abstract.
Say there is a MongoDB replica set outside of a Kubernetes cluster but on a network. The IP addresses of all members of the replica set are resolved via /etc/hosts on the app servers and db servers.
In an experiment/transition phase, I need to access those MongoDB servers from Kubernetes pods.
However, Kubernetes doesn't seem to allow adding custom entries to /etc/hosts in pods/containers.
The MongoDB replica set is already working with a large data set, so creating a new replica set in the cluster is not an option.
Because I use GKE, I suppose changing any resources in the kube-dns namespace should be avoided. Configuring or replacing kube-dns to suit my needs is the last thing I want to try.
Is there a way to resolve the IP addresses of custom hostnames in a Kubernetes cluster?
It is just an idea, but it would be great if kube2sky could read some entries from a ConfigMap and use them as DNS records,
e.g. repl1.mongo.local: 192.168.10.100.
EDIT: I referenced this question from https://github.com/kubernetes/kubernetes/issues/12337
There are now 2 possible solutions to this problem:
1) Pod-wise (adding the changes to every pod that needs to resolve these domains)
2) Cluster-wise (adding the changes to a central place that all pods have access to, which in our case is the DNS)
Let's begin with the pod-wise solution:
As of Kubernetes 1.7, it is possible to add entries to a Pod's /etc/hosts directly using .spec.hostAliases
For example: to resolve foo.local, bar.local to 127.0.0.1 and foo.remote,
bar.remote to 10.1.2.3, you can configure HostAliases for a Pod under
.spec.hostAliases:
apiVersion: v1
kind: Pod
metadata:
  name: hostaliases-pod
spec:
  restartPolicy: Never
  hostAliases:
    - ip: "127.0.0.1"
      hostnames:
        - "foo.local"
        - "bar.local"
    - ip: "10.1.2.3"
      hostnames:
        - "foo.remote"
        - "bar.remote"
  containers:
    - name: cat-hosts
      image: busybox
      command:
        - cat
      args:
        - "/etc/hosts"
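As a rough picture of what the pod above should print, here is a small Python sketch that turns hostAliases entries into hosts-file lines. This only models the mapping (one line per IP, followed by its hostnames); the exact layout and comments of the kubelet-managed file may differ, and the function name is made up.

```python
def render_host_aliases(host_aliases):
    """Render hostAliases entries as /etc/hosts style lines."""
    lines = []
    for alias in host_aliases:
        # One line per IP: the address followed by all of its hostnames.
        lines.append("\t".join([alias["ip"], *alias["hostnames"]]))
    return "\n".join(lines)


host_aliases = [
    {"ip": "127.0.0.1", "hostnames": ["foo.local", "bar.local"]},
    {"ip": "10.1.2.3", "hostnames": ["foo.remote", "bar.remote"]},
]
print(render_host_aliases(host_aliases))
```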
The cluster-wise solution:
As of Kubernetes v1.12, CoreDNS is the recommended DNS server, replacing kube-dns. If your cluster originally used kube-dns, you may still have kube-dns deployed rather than CoreDNS. I'm going to assume that you're using CoreDNS as your K8S DNS.
In CoreDNS it is possible to add arbitrary entries inside the cluster domain, so that all pods resolve these entries directly from the DNS without the need to change each and every /etc/hosts file in every pod.
First:
Let's edit the coredns ConfigMap and add the required changes:
kubectl edit cm coredns -n kube-system
apiVersion: v1
kind: ConfigMap
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        hosts /etc/coredns/customdomains.db example.org {
            fallthrough
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . "/etc/resolv.conf"
        cache 30
        loop
        reload
        loadbalance
    }
  customdomains.db: |
    10.10.1.1 mongo-en-1.example.org
    10.10.1.2 mongo-en-2.example.org
    10.10.1.3 mongo-en-3.example.org
    10.10.1.4 mongo-en-4.example.org
Basically we added two things:
1) The hosts plugin before the kubernetes plugin, using the fallthrough option of the hosts plugin to satisfy our case.
To shed some more light on the fallthrough option: any given backend is usually the final word for its zone. It either returns a result, or it returns NXDOMAIN for the query. However, occasionally this is not the desired behavior, so some of the plugins support a fallthrough option. When fallthrough is enabled, instead of returning NXDOMAIN when a record is not found, the plugin passes the request down the chain. A backend further down the chain then has the opportunity to handle the request, and that backend in our case is kubernetes.
2) A new file in the ConfigMap (customdomains.db) containing our custom domains (mongo-en-*.example.org).
The last thing is to remember to add the customdomains.db file to the config-volume in the CoreDNS pod template:
kubectl edit -n kube-system deployment coredns
volumes:
  - name: config-volume
    configMap:
      name: coredns
      items:
        - key: Corefile
          path: Corefile
        - key: customdomains.db
          path: customdomains.db
And finally, make Kubernetes restart CoreDNS so that every running pod picks up the change:
$ kubectl rollout restart -n kube-system deployment/coredns
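If the fallthrough behaviour is still a bit abstract, this toy Python model of a plugin chain may help. It is not how CoreDNS is implemented and it ignores zones entirely; the plugin functions, the PASS sentinel and the 10.96.0.1 address are all made up for the illustration.

```python
NXDOMAIN = "NXDOMAIN"
PASS = object()  # sentinel: "not mine, let the next plugin try"


def hosts(table, fallthrough):
    """Toy hosts plugin: answers from a static table."""
    def plugin(name):
        if name in table:
            return table[name]
        # Without fallthrough the plugin is the final word and says NXDOMAIN.
        return PASS if fallthrough else NXDOMAIN
    return plugin


def kubernetes(services):
    """Toy kubernetes plugin: last in the chain, so it never passes on."""
    def plugin(name):
        return services.get(name, NXDOMAIN)
    return plugin


def make_chain(plugins):
    """Try each plugin in order until one gives a final answer."""
    def resolve(name):
        for plugin in plugins:
            answer = plugin(name)
            if answer is not PASS:
                return answer
        return NXDOMAIN
    return resolve


custom = {"mongo-en-1.example.org": "10.10.1.1"}
cluster = {"kubernetes.default.svc.cluster.local": "10.96.0.1"}  # hypothetical IP

with_ft = make_chain([hosts(custom, True), kubernetes(cluster)])
without_ft = make_chain([hosts(custom, False), kubernetes(cluster)])

print(with_ft("mongo-en-1.example.org"))                  # 10.10.1.1
print(with_ft("kubernetes.default.svc.cluster.local"))    # 10.96.0.1
print(without_ft("kubernetes.default.svc.cluster.local")) # NXDOMAIN
```

The last line is what you want to avoid: with hosts placed before kubernetes and no fallthrough, cluster-internal names would stop resolving.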
@OxMH's answer is fantastic, and can be simplified for brevity. CoreDNS allows you to specify hosts directly in the hosts plugin (https://coredns.io/plugins/hosts/#examples).
The ConfigMap can therefore be edited like so:
$ kubectl edit cm coredns -n kube-system
apiVersion: v1
kind: ConfigMap
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        hosts {
            10.10.1.1 mongo-en-1.example.org
            10.10.1.2 mongo-en-2.example.org
            10.10.1.3 mongo-en-3.example.org
            10.10.1.4 mongo-en-4.example.org
            fallthrough
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . "/etc/resolv.conf"
        cache 30
        loop
        reload
        loadbalance
    }
You will still need to restart coredns so it rereads the config:
$ kubectl rollout restart -n kube-system deployment/coredns
Inlining the contents of the hosts file removes the need to map the hosts file from the ConfigMap. Both approaches achieve the same outcome; where you define the hosts is a matter of personal preference.
A Service of type ExternalName is required to access hosts or IPs outside of the Kubernetes cluster.
The following worked for me.
{
"kind": "Service",
"apiVersion": "v1",
"metadata": {
"name": "tiny-server-5",
"namespace": "default"
},
"spec": {
"type": "ExternalName",
"externalName": "192.168.1.15",
"ports": [{ "port": 80 }]
}
}
For the record, here is an alternative solution for those not checking the referenced GitHub issue.
You can define an "external" Service in Kubernetes by not specifying any selector or ClusterIP. You also have to define a corresponding Endpoints object pointing to your external IP.
From the Kubernetes documentation:
{
"kind": "Service",
"apiVersion": "v1",
"metadata": {
"name": "my-service"
},
"spec": {
"ports": [
{
"protocol": "TCP",
"port": 80,
"targetPort": 9376
}
]
}
}
{
"kind": "Endpoints",
"apiVersion": "v1",
"metadata": {
"name": "my-service"
},
"subsets": [
{
"addresses": [
{ "ip": "1.2.3.4" }
],
"ports": [
{ "port": 9376 }
]
}
]
}
With this, you can point your app inside the containers to my-service:80 (the Service port), and the traffic should be forwarded to 1.2.3.4:9376 (the targetPort on the external IP).
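The two objects only work together if their names match and the Endpoints port equals the Service's targetPort. A small Python sketch that generates the pair from one set of inputs keeps them in sync; the helper name is made up, and the structure simply mirrors the JSON above.

```python
import json


def external_service(name, ip, port, target_port):
    """Build a selector-less Service and its matching Endpoints object."""
    service = {
        "kind": "Service",
        "apiVersion": "v1",
        "metadata": {"name": name},
        "spec": {
            "ports": [{"protocol": "TCP", "port": port, "targetPort": target_port}]
        },
    }
    endpoints = {
        "kind": "Endpoints",
        "apiVersion": "v1",
        # Must carry the same name as the Service to be associated with it.
        "metadata": {"name": name},
        "subsets": [
            {"addresses": [{"ip": ip}], "ports": [{"port": target_port}]}
        ],
    }
    return service, endpoints


service, endpoints = external_service("my-service", "1.2.3.4", 80, 9376)
print(json.dumps(service, indent=2))
print(json.dumps(endpoints, indent=2))
```

The printed JSON can be saved to files and applied with kubectl like any other manifest.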
Limitations:
The DNS name used may contain only letters, numbers or dashes. You can't use multi-level names (something.like.this). This means you probably have to modify your app to point just to your-service, and not to yourservice.domain.tld.
You can only point to a specific IP, not a DNS name. For that, you can define a kind of DNS alias with an ExternalName type Service.
UPDATE: 2017-07-03 Kubernetes 1.7 now supports adding entries to a Pod's /etc/hosts with HostAliases.
The solution is not about kube-dns, but /etc/hosts.
Anyway, the following trick seems to work so far...
EDIT: Changing /etc/hosts may race with the Kubernetes system, so let it retry.
1) create a configMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: db-hosts
data:
  hosts: |
    10.0.0.1 db1
    10.0.0.2 db2
2) Add a script named ensure_hosts.sh.
#!/bin/sh
while true
do
    grep db1 /etc/hosts > /dev/null || cat /mnt/hosts.append/hosts >> /etc/hosts
    sleep 5
done
Don't forget chmod a+x ensure_hosts.sh.
3) Add a wrapper script start.sh to your image
#!/bin/sh
$(dirname "$(realpath "$0")")/ensure_hosts.sh &
exec your-app args...
Don't forget chmod a+x start.sh
4) Use the configmap as a volume and run start.sh
apiVersion: extensions/v1beta1
kind: Deployment
...
spec:
  template:
    ...
    spec:
      volumes:
        - name: hosts-volume
          configMap:
            name: db-hosts
      ...
      containers:
        - command:
            - ./start.sh
          ...
          volumeMounts:
            - name: hosts-volume
              mountPath: /mnt/hosts.append
          ...
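If your image ships Python rather than a shell you are comfortable with, the check-then-append step of ensure_hosts.sh can be sketched like this. It is only an equivalent of the single grep/cat line, under the same assumption that the marker (db1 here) appears in the appended block; the paths in the demo are temporary files, not the real /etc/hosts.

```python
import os
import tempfile


def ensure_hosts(hosts_path, append_path, marker):
    """Append the extra entries unless a line containing marker already exists.

    Returns True when the entries were appended, False when already present.
    """
    with open(hosts_path) as hosts_file:
        if any(marker in line for line in hosts_file):
            return False
    with open(append_path) as src, open(hosts_path, "a") as dst:
        dst.write(src.read())
    return True


# Demo against temporary files standing in for /etc/hosts and the mounted ConfigMap.
with tempfile.TemporaryDirectory() as workdir:
    hosts_path = os.path.join(workdir, "hosts")
    append_path = os.path.join(workdir, "hosts.append")
    with open(hosts_path, "w") as f:
        f.write("127.0.0.1 localhost\n")
    with open(append_path, "w") as f:
        f.write("10.0.0.1 db1\n10.0.0.2 db2\n")

    first = ensure_hosts(hosts_path, append_path, "db1")
    second = ensure_hosts(hosts_path, append_path, "db1")
    with open(hosts_path) as f:
        content = f.read()

print(first, second)  # True False
```

In the pod you would call it in a loop with a sleep, as the shell version does, so that the entries come back if the file gets rewritten.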
Using a ConfigMap seems like a better way to set DNS, but in my opinion it is a little heavy when you only need to add a few records. So I add records to /etc/hosts with a shell script executed by the docker CMD.
For example:
Dockerfile
...(ignore)
COPY run.sh /tmp/run.sh
CMD bash /tmp/run.sh
run.sh
#!/bin/bash
# /etc/hosts expects the IP address first, then the hostname
echo "192.168.10.100 repl1.mongo.local" >> /etc/hosts
# some other commands...
Note that if you run MORE THAN ONE container in a pod, you have to add the script to each container, because Kubernetes starts containers in no guaranteed order and /etc/hosts may be overwritten by another container (one which starts later).