Kubernetes 1.18.4, iSCSI - kubernetes

I have problems attaching a volume via iSCSI from Kubernetes. When I try with iscsiadm from the worker node, it works. This is what I get from kubectl describe pod.
Normal Scheduled <unknown> default-scheduler Successfully assigned default/iscsipd to k8s-worker-2
Normal SuccessfulAttachVolume 4m2s attachdetach-controller AttachVolume.Attach succeeded for volume "iscsipd-rw"
Warning FailedMount 119s kubelet, k8s-worker-2 Unable to attach or mount volumes: unmounted volumes=[iscsipd-rw], unattached volumes=[iscsipd-rw default-token-d5glz]: timed out waiting for the condition
Warning FailedMount 105s (x9 over 3m54s) kubelet, k8s-worker-2 MountVolume.WaitForAttach failed for volume "iscsipd-rw" : failed to get any path for iscsi disk, last err seen:iscsi: failed to attach disk: Error: iscsiadm: No records found(exit status 21)
I'm just using the iscsi.yaml example file from kubernetes.io:
---
apiVersion: v1
kind: Pod
metadata:
  name: iscsipd
spec:
  containers:
  - name: iscsipd-rw
    image: kubernetes/pause
    volumeMounts:
    - mountPath: "/mnt/iscsipd"
      name: iscsipd-rw
  volumes:
  - name: iscsipd-rw
    iscsi:
      targetPortal: 192.168.34.32:3260
      iqn: iqn.2020-07.int.example:sql
      lun: 0
      fsType: ext4
      readOnly: true
Open-iscsi is installed on all worker nodes (there are just two of them).
● iscsid.service - iSCSI initiator daemon (iscsid)
Loaded: loaded (/lib/systemd/system/iscsid.service; enabled; vendor preset: e
Active: active (running) since Fri 2020-07-03 10:24:26 UTC; 4 days ago
Docs: man:iscsid(8)
Process: 20507 ExecStart=/sbin/iscsid (code=exited, status=0/SUCCESS)
Process: 20497 ExecStartPre=/lib/open-iscsi/startup-checks.sh (code=exited, st
Main PID: 20514 (iscsid)
Tasks: 2 (limit: 4660)
CGroup: /system.slice/iscsid.service
├─20509 /sbin/iscsid
└─20514 /sbin/iscsid
The iSCSI target is created on an IBM Storwize V7000, without CHAP.
I tried to connect with iscsiadm from the worker node and it works.
sudo iscsiadm -m discovery -t sendtargets -p 192.168.34.32
192.168.34.32:3260,1 iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1
192.168.34.34:3260,1 iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1
sudo iscsiadm -m node --login
Logging in to [iface: default, target: iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1, portal: 192.168.34.32,3260] (multiple)
Logging in to [iface: default, target: iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1, portal: 192.168.34.34,3260] (multiple)
Login to [iface: default, target: iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1, portal: 192.168.34.32,3260] successful.
Login to [iface: default, target: iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1, portal: 192.168.34.34,3260] successful.
Disk /dev/sdb: 100 GiB, 107374182400 bytes, 209715200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 32768 bytes / 32768 bytes
Disklabel type: dos
Disk identifier: 0x5b3d0a3a
Device Boot Start End Sectors Size Id Type
/dev/sdb1 2048 209715199 209713152 100G 83 Linux
Is anyone facing the same problem?

Remember not to use a hostname for the target; use the IP. For some reason, if the target is a hostname, it barfs with an error about requesting a duplicate session. If the target is an IP, it works fine. I now have multiple iSCSI targets mounted in various pods, and I am absolutely ecstatic.
You may also have an authentication issue with your iSCSI target.
If you don't use CHAP authentication yet, you still have to disable authentication explicitly. For example, if you use targetcli, you can run the commands below to disable it.
$ sudo targetcli
/> /iscsi/iqn.2003-01.org.xxxx/tpg1 set attribute authentication=0 # will disable auth
/> /iscsi/iqn.2003-01.org.xxxx/tpg1 set attribute generate_node_acls=1 # will force to use tpg1 auth mode by default
If this doesn't help, please share your iSCSI target configuration or the guide you followed.
It is also important to check that all of your nodes have the open-iscsi package installed (see the sketch below).
Take a look: kubernetes-iSCSI, volume-failed-iscsi-disk, iscsi-into-container-fails.
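A minimal sketch of those checks, to run on each worker node (the package command assumes Debian/Ubuntu; the portal address is the one from the pod spec above):
# Confirm the initiator tooling is installed and the daemon is running
dpkg -l open-iscsi                          # Debian/Ubuntu; on RHEL/CentOS check iscsi-initiator-utils
systemctl status iscsid
# kubelet's "No records found (exit status 21)" usually means the initiator
# database has no record for the portal/IQN referenced in the pod spec
sudo iscsiadm -m node -p 192.168.34.32:3260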

Related

Kubernetes: specify CPUs for cpumanager

Is it possible to specify a CPU ID list to the Kubernetes cpumanager? The goal is to make sure pods get CPUs from a single socket (0). I brought all the CPUs on the peer socket offline as mentioned here, for example:
$ echo 0 > /sys/devices/system/cpu/cpu5/online
After doing this, the Kubernetes master indeed sees the remaining online CPUs:
kubectl describe node foo
Capacity:
cpu: 56 <<< socket 0 CPU count
ephemeral-storage: 958774760Ki
hugepages-1Gi: 120Gi
memory: 197524872Ki
pods: 110
Allocatable:
cpu: 54 <<< 2 system reserved CPUs
ephemeral-storage: 958774760Ki
hugepages-1Gi: 120Gi
memory: 71490952Ki
pods: 110
System Info:
Machine ID: 1155420082478559980231ba5bc0f6f2
System UUID: 4C4C4544-0044-4210-8031-C8C04F584B32
Boot ID: 7fa18227-748f-496c-968c-9fc82e21ecd5
Kernel Version: 4.4.13
OS Image: Ubuntu 16.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.3.3
Kubelet Version: v1.11.1
Kube-Proxy Version: v1.11.1
However, cpumanager still seems to think there are 112 CPUs (socket 0 + socket 1):
cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-111"}
As a result, the kubelet system pods are throwing the following error:
kube-system kube-proxy-nk7gc 0/1 rpc error: code = Unknown desc = failed to update container "eb455f81a61b877eccda0d35eea7834e30f59615346140180f08077f64896760": Error response from daemon: Requested CPUs are not available - requested 0-111, available: 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110 762 36d <IP address> foo <none>
I was able to get this working. Posting this as an answer so that someone in need might benefit.
It appears the CPU set is read from the /var/lib/kubelet/cpu_manager_state file and is not updated across kubelet restarts, so this file needs to be removed before restarting the kubelet.
The following worked for me:
# On a running worker node, bring desired CPUs offline. (run as root)
$ cpu_list=`lscpu | grep "NUMA node1 CPU(s)" | awk '{print $4}'`
$ chcpu -d $cpu_list
$ rm -f /var/lib/kubelet/cpu_manager_state
$ systemctl restart kubelet.service
# Check the CPU set seen by the CPU manager
$ cat /var/lib/kubelet/cpu_manager_state
# Try creating pods and check the syslog:
Dec 3 14:36:05 k8-2-w1 kubelet[8070]: I1203 14:36:05.122466 8070 state_mem.go:84] [cpumanager] updated default cpuset: "0,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110"
Dec 3 14:36:05 k8-2-w1 kubelet[8070]: I1203 14:36:05.122643 8070 policy_static.go:198] [cpumanager] allocateCPUs: returning "2,4,6,8,58,60,62,64"
Dec 3 14:36:05 k8-2-w1 kubelet[8070]: I1203 14:36:05.122660 8070 state_mem.go:76] [cpumanager] updated desired cpuset (container id: 356939cdf32d0f719e83b0029a018a2ca2c349fc0bdc1004da5d842e357c503a, cpuset: "2,4,6,8,58,60,62,64")
I have reported a bug here as I think the CPU set should be updated after kubelet restarts.
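For reference, undoing the change follows the same pattern (a sketch, run as root; chcpu -e re-enables the CPUs that chcpu -d took offline):
# Bring the socket-1 CPUs back online and clear the stale state file again
chcpu -e $cpu_list
rm -f /var/lib/kubelet/cpu_manager_state
systemctl restart kubelet.service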

Unable to launch Alluxio on Kubernetes

I am trying Alluxio 1.7.1 with Docker 1.13.1 and Kubernetes 1.9.6/1.10.1.
I created the Alluxio Docker image as per the instructions on https://www.alluxio.org/docs/1.7/en/Running-Alluxio-On-Docker.html
Then I followed the https://www.alluxio.org/docs/1.7/en/Running-Alluxio-On-Kubernetes.html guide to run Alluxio on Kubernetes. I was able to bring up the Alluxio master pod properly, but when I try to bring up the Alluxio worker I get an "Address in use" error. I have not modified anything in the YAMLs I downloaded from the Alluxio git repo; the only changes I made were the Alluxio Docker image name and the API version in the k8s YAMLs so they match.
I checked the ports in use across my k8s cluster, including on the nodes. No port that Alluxio wants is used by any other process, yet I still get the "Address in use" error. I don't have any other application running on the cluster. I tried both single-node and multi-node k8s clusters, and k8s versions 1.9 and 1.10, and I am unable to figure out what to debug further or what to change to make this work.
There is definitely some issue on the Alluxio worker side that I am unable to debug.
This is the log that I get from worker pod:
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# kubectl logs po/alluxio-worker-knqt4
Formatting Alluxio Worker @ vm-sushil-scrum1-08062018-alluxio-1
2018-06-08 10:09:55,723 INFO Configuration - Configuration file /opt/alluxio/conf/alluxio-site.properties loaded.
2018-06-08 10:09:55,845 INFO Format - Formatting worker data folder: /alluxioworker/
2018-06-08 10:09:55,845 INFO Format - Formatting Data path for tier 0:/dev/shm/alluxioworker
2018-06-08 10:09:55,856 INFO Format - Formatting complete
2018-06-08 10:09:56,357 INFO Configuration - Configuration file /opt/alluxio/conf/alluxio-site.properties loaded.
2018-06-08 10:09:56,549 INFO TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=10.194.11.7, rack=null)
2018-06-08 10:09:56,866 INFO BlockWorkerFactory - Creating alluxio.worker.block.BlockWorker
2018-06-08 10:09:56,866 INFO FileSystemWorkerFactory - Creating alluxio.worker.file.FileSystemWorker
2018-06-08 10:09:56,942 WARN StorageTier - Failed to verify memory capacity
2018-06-08 10:09:57,082 INFO log - Logging initialized #1160ms
2018-06-08 10:09:57,509 INFO AlluxioWorkerProcess - Domain socket data server is enabled at /opt/domain.
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address in use
at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:164)
at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:45)
at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:37)
at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:56)
Caused by: java.lang.RuntimeException: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address in use
at alluxio.util.CommonUtils.createNewClassInstance(CommonUtils.java:224)
at alluxio.worker.DataServer$Factory.create(DataServer.java:45)
at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:159)
... 3 more
Caused by: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address in use
at io.netty.channel.unix.Errors.newIOException(Errors.java:117)
at io.netty.channel.unix.Socket.bind(Socket.java:259)
at io.netty.channel.epoll.EpollServerDomainSocketChannel.doBind(EpollServerDomainSocketChannel.java:75)
at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:504)
at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1226)
at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:495)
at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:480)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:213)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:305)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at java.lang.Thread.run(Thread.java:748)
-----------------------
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# kubectl get all
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
ds/alluxio-worker 1 1 0 1 0 <none> 42m
ds/alluxio-worker 1 1 0 1 0 <none> 42m
NAME DESIRED CURRENT AGE
statefulsets/alluxio-master 1 1 44m
NAME READY STATUS RESTARTS AGE
po/alluxio-master-0 1/1 Running 0 44m
po/alluxio-worker-knqt4 0/1 Error 12 42m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/alluxio-master ClusterIP None <none> 19998/TCP,19999/TCP 44m
svc/kubernetes ClusterIP 10.254.0.1 <none> 443/TCP 1h
---------------------
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# kubectl describe po/alluxio-worker-knqt4
Name: alluxio-worker-knqt4
Namespace: default
Node: vm-sushil-scrum1-08062018-alluxio-1/10.194.11.7
Start Time: Fri, 08 Jun 2018 10:09:05 +0000
Labels: app=alluxio
controller-revision-hash=3081903053
name=alluxio-worker
pod-template-generation=1
Annotations: <none>
Status: Running
IP: 10.194.11.7
Controlled By: DaemonSet/alluxio-worker
Containers:
alluxio-worker:
Container ID: docker://40a1eff2cd4dff79d9189d7cb0c4826a6b6e4871fbac65221e7cdd341240e358
Image: alluxio:1.7.1
Image ID: docker://sha256:b080715bd53efc783ee5f54e7f1c451556f93e7608e60e05b4615d32702801af
Ports: 29998/TCP, 29999/TCP, 29996/TCP
Command:
/entrypoint.sh
Args:
worker
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 08 Jun 2018 11:01:37 +0000
Finished: Fri, 08 Jun 2018 11:02:02 +0000
Ready: False
Restart Count: 14
Limits:
cpu: 1
memory: 2G
Requests:
cpu: 500m
memory: 2G
Environment Variables from:
alluxio-config ConfigMap Optional: false
Environment:
ALLUXIO_WORKER_HOSTNAME: (v1:status.hostIP)
Mounts:
/dev/shm from alluxio-ramdisk (rw)
/opt/domain from alluxio-domain (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-7xlz7 (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
alluxio-ramdisk:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
alluxio-domain:
Type: HostPath (bare host directory volume)
Path: /tmp/domain
HostPathType: Directory
default-token-7xlz7:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-7xlz7
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulMountVolume 56m kubelet, vm-sushil-scrum1-08062018-alluxio-1 MountVolume.SetUp succeeded for volume "alluxio-domain"
Normal SuccessfulMountVolume 56m kubelet, vm-sushil-scrum1-08062018-alluxio-1 MountVolume.SetUp succeeded for volume "alluxio-ramdisk"
Normal SuccessfulMountVolume 56m kubelet, vm-sushil-scrum1-08062018-alluxio-1 MountVolume.SetUp succeeded for volume "default-token-7xlz7"
Normal Pulled 53m (x5 over 56m) kubelet, vm-sushil-scrum1-08062018-alluxio-1 Container image "alluxio:1.7.1" already present on machine
Normal Created 53m (x5 over 56m) kubelet, vm-sushil-scrum1-08062018-alluxio-1 Created container
Normal Started 53m (x5 over 56m) kubelet, vm-sushil-scrum1-08062018-alluxio-1 Started container
Warning BackOff 1m (x222 over 55m) kubelet, vm-sushil-scrum1-08062018-alluxio-1 Back-off restarting failed container
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# lsof -n -i :19999 | grep LISTEN
java 8949 root 29u IPv4 12518521 0t0 TCP *:dnp-sec (LISTEN)
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# lsof -n -i :19998 | grep LISTEN
java 8949 root 19u IPv4 12520458 0t0 TCP *:iec-104-sec (LISTEN)
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# lsof -n -i :29998 | grep LISTEN
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# lsof -n -i :29999 | grep LISTEN
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# lsof -n -i :29996 | grep LISTEN
The alluxio-worker container keeps restarting and failing with the same error every time.
Please guide me on what I can do to solve this.
Thanks
The problem was the short-circuit Unix domain socket path. I was using whatever was present by default in the Alluxio git repo. In the default integration/kubernetes/conf/alluxio.properties.template, the address for ALLUXIO_WORKER_DATA_SERVER_DOMAIN_SOCKET_ADDRESS was not complete. This is properly explained in https://www.alluxio.org/docs/1.7/en/Running-Alluxio-On-Docker.html for enabling short-circuit reads in Alluxio worker containers using Unix domain sockets.
Just because of an incomplete Unix domain socket path, the Alluxio worker was unable to come up in Kubernetes when short-circuit reads were enabled.
Once I corrected the path in integration/kubernetes/conf/alluxio.properties to ALLUXIO_WORKER_DATA_SERVER_DOMAIN_SOCKET_ADDRESS=/opt/domain/d, things started working properly. Some tests are still failing, but at least the Alluxio setup now comes up properly; next I will debug why those tests fail.
I have submitted this fix to the Alluxio git repo for them to merge into the master branch:
https://github.com/Alluxio/alluxio/pull/7376
On the node where your worker is running, it seems that you have a port already in use.
Try to find which process is using it:
sudo lsof -n -i :80 | grep LISTEN
I read the Alluxio configuration files: try ports 19998, 19999, 29996, 29998 and 29999, substituting each for 80 in the command above (or loop over them as sketched below).
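A minimal loop to check all of those ports at once (assuming lsof is available on the node):
for port in 19998 19999 29996 29998 29999; do
  echo "== port $port =="
  sudo lsof -n -i :$port | grep LISTEN   # no output means nothing is listening there
done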

kubernetes pods stuck at containercreating

I have a Raspberry Pi cluster (one master, 3 nodes).
My base image is: raspbian stretch lite.
I already have a basic Kubernetes setup where the master can see all its nodes (kubectl get nodes) and they're all running.
I used the weave network plugin for network communication.
When everything was set up, I tried to run an nginx pod (first with some replicas, but now just 1 pod) on my cluster as follows:
kubectl run my-nginx --image=nginx
But somehow the pod gets stuck in the status "ContainerCreating". When I run docker images I can't see the nginx image being pulled, and normally an nginx image is not that large, so it should have been pulled by now (15 minutes).
kubectl describe pods gives the error that the pod sandbox failed to be created and Kubernetes will re-create it.
I searched everywhere about this issue and tried the solutions on Stack Overflow (rebooting to restart the cluster, reading the describe pods output, trying a new network plugin, flannel), but I can't see what the actual problem is.
I did the exact same thing in VirtualBox (just Ubuntu, not ARM) and everything worked.
First I thought it was a permission issue because I ran everything as a normal user, but in the VM I did the same thing and nothing changed.
Then I checked kubectl get pods --all-namespaces to verify that the pods for the weave network and kube-dns are running, and nothing is wrong there either.
Is this a firewall issue on the Raspberry Pi?
Is the weave network plugin not compatible with ARM devices (even though the Kubernetes website says it is)?
I am guessing there is an API network problem and that's why I can't get my pod running on a node.
[EDIT]
Log files
kubectl describe podName
Name: my-nginx-9d5677d94-g44l6
Namespace: default
Node: kubenode1/10.1.88.22
Start Time: Tue, 06 Mar 2018 08:24:13 +0000
Labels: pod-template-hash=581233850
run=my-nginx
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/my-nginx-9d5677d94
Containers:
my-nginx:
Container ID:
Image: nginx
Image ID:
Port: 80/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-phdv5 (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-phdv5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-phdv5
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m default-scheduler Successfully assigned my-nginx-9d5677d94-g44l6 to kubenode1
Normal SuccessfulMountVolume 5m kubelet, kubenode1 MountVolume.SetUp succeeded for volume "default-token-phdv5"
Warning FailedCreatePodSandBox 1m kubelet, kubenode1 Failed create pod sandbox.
Normal SandboxChanged 1m kubelet, kubenode1 Pod sandbox changed, it will be killed and re-created.
kubectl logs podName
Error from server (BadRequest): container "my-nginx" in pod "my-nginx-9d5677d94-g44l6" is waiting to start: ContainerCreating
journalctl -u kubelet gives this error:
Mar 12 13:42:45 kubeMaster kubelet[16379]: W0312 13:42:45.824314 16379 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Mar 12 13:42:45 kubeMaster kubelet[16379]: E0312 13:42:45.824816 16379 kubelet.go:2104] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
The problem seems to be with my network plugin. In my /etc/systemd/system/kubelet.service.d/10-kubeadm.conf the flags for the network plugin are present: Environment="KUBELET_NETWORK_ARGS=--cni-bin-dir=/etc/cni/net.d --network-plugin=cni". A quick check of what the kubelet actually finds in that directory is sketched below.
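Since the kubelet log complains "No networks found in /etc/cni/net.d", listing that directory on the affected node shows whether the plugin ever wrote its config (a sketch; the exact file name depends on the plugin and version, e.g. weave typically drops a 10-weave.conf or 10-weave.conflist there):
ls -l /etc/cni/net.d/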
Thank you all for responding to my question.
I solved my problem now. For anyone who comes to this question in the future, the solution was as follows.
I cloned my Raspberry Pi image because I wanted a basicConfig.img for when I need to add a new node to my cluster or when one goes down.
Weave Net (the plugin I used) got confused because the OS on every node and the master had the same machine-id. When I deleted the machine-id and created a new one (and rebooted the nodes), my error was fixed.
The commands to do this were:
sudo rm /etc/machine-id
sudo rm /var/lib/dbus/machine-id
sudo dbus-uuidgen --ensure=/etc/machine-id
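A quick check that the fix took (run on every node; each must now report a different ID):
cat /etc/machine-id   # compare across all nodes and the master; the values must all differ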
Once again my patience was tested, because my Kubernetes setup was normal and my Raspberry Pi OS was normal. I found this with the help of someone in the Kubernetes community, which again shows how important and great our IT community is. To the people of the future who come to this question: I hope this solution fixes your error and cuts down the time you spend searching for a stupidly small thing.
You can see if it's network related by finding the node trying to pull the image:
kubectl describe pod <name> -n <namespace>
SSH to the node, and run docker pull nginx on it. If it's having issues pulling the image manually, then it might be network related.
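For example (a sketch; kubenode1 is the node from the describe output above, and the default Raspbian user "pi" is an assumption, adjust to your setup):
ssh pi@kubenode1      # the node the pod was scheduled to
docker pull nginx     # if this fails, the node's network/DNS is the problem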

Kubernetes ACS engine: containers (pods) do not have internet access

I'm using a Kubernetes cluster, deployed on Azure using the ACS-engine. My cluster is composed of 5 nodes.
1 master (unix VM) (v1.6.2)
2 unix agent (v1.6.2)
2 windows agent (v1.6.0-alpha.1.2959+451473d43a2072)
I have created a unix pod; kubectl describe shows the following:
Name: ping-with-unix
Node: k8s-linuxpool1-25103419-0/10.240.0.5
Start Time: Fri, 30 Jun 2017 14:27:28 +0200
Status: Running
IP: 10.244.2.6
Controllers: <none>
Containers:
ping-with-unix-2:
Container ID:
Image: willfarrell/ping
Port:
State: Running
Started: Fri, 30 Jun 2017 14:27:29 +0200
Ready: True
Restart Count: 0
Environment:
HOSTNAME: google.com
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-1nrh5 (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
default-token-1nrh5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-1nrh5
Optional: false
QoS Class: BestEffort
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: <none>
Events: <none>
This pod does not have internet access.
2017-06-30T12:27:29.512885000Z ping google.com every 300 sec
2017-06-30T12:27:29.521968000Z PING google.com (172.217.17.78): 56 data bytes
2017-06-30T12:27:39.698081000Z --- google.com ping statistics ---
2017-06-30T12:27:39.698305000Z 1 packets transmitted, 0 packets received, 100% packet loss
I created a 2nd pod, targeting a Windows container, with a custom Docker image. This image instantiates an HttpClient and requests an endpoint. It also does not have internet access. I can access the container and run an interactive PowerShell, but I cannot resolve any DNS name (due to the lack of internet access).
PS C:\app> ping github.com
Ping request could not find host github.com. Please check the name and try again.
PS C:\app> ping 192.30.253.112
Pinging 192.30.253.112 with 32 bytes of data:
Request timed out.
Request timed out.
Ping statistics for 192.30.253.112:
Packets: Sent = 2, Received = 0, Lost = 2 (100% loss),
Control-C
What do I have to configure to allow my containers to access internet?
Remarks: I have not defined any Network policy.
I've updated my cluster using API version '2017-07-01' and Kubernetes version '1.6.6'. Both my unix and Windows pods now have internet access.
Note, for Windows pods:
Internet is available 2 or 3 minutes after the pod starts.
I can't set the dnsPolicy to "Default"; only "ClusterFirst" or "ClusterFirstWithHostNet" works (see the sketch below).
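For illustration, a minimal pod spec with a working dnsPolicy (a sketch: the name, image and HOSTNAME env come from the describe output above; the rest is assumed):
apiVersion: v1
kind: Pod
metadata:
  name: ping-with-unix
spec:
  dnsPolicy: ClusterFirst   # "Default" did not work on this cluster
  containers:
  - name: ping-with-unix-2
    image: willfarrell/ping
    env:
    - name: HOSTNAME
      value: google.com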

How to create an kubernetes NFS volume on Google Container Engine

I am trying to create a Kubernetes NFS volume on Google Container Engine (GKE) and have it used by a deployment.
I did this in several steps, as shown in the GitHub repository kubernetes-nfs-volume-on-gke:
Create a GKE cluster and a GCE persistent disk
Configure the kubectl context to work with the GKE cluster
Create the PersistentVolume (PV) and the PersistentVolumeClaim (PVC)
Create an NFS server
Create a service for the NFS server to expose it (the IP address of that service is used for the creation of the NFS PV and NFS PVC)
Create the NFS volume
Create a Deployment of a busybox to check that the NFS volume is accessible.
After following these steps, this is the error I got:
$ kubectl describe pods nfs-busybox-2762569073-lhb5p
Name: nfs-busybox-2762569073-lhb5p
Namespace: default
Node: gke-mappedinn-cluster-default-pool-f94cb0d4-fmfb/10.240.0.3
Start Time: Wed, 12 Apr 2017 04:12:20 +0400
Labels: name=nfs-busybox
pod-template-hash=2762569073
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"nfs-busybox-2762569073","uid":"b1e523ae-1f14-11e7-a084-42010a8e0...
kubernetes.io/limit-ranger=LimitRanger plugin set: cpu request for container busybox
Status: Pending
IP:
Controllers: ReplicaSet/nfs-busybox-2762569073
Containers:
busybox:
Container ID:
Image: busybox
Image ID:
Port:
Command:
sh
-c
while true; do date > /mnt/index.html; hostname >> /mnt/index.html; sleep $(($RANDOM % 5 + 5)); done
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Requests:
cpu: 100m
Environment: <none>
Mounts:
/mnt from my-pvc-nfs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-20n4b (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
my-pvc-nfs:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: nfs
ReadOnly: false
default-token-20n4b:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-20n4b
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
5m 5m 1 default-scheduler Normal Scheduled Successfully assigned nfs-busybox-2762569073-lhb5p to gke-mappedinn-cluster-default-pool-f94cb0d4-fmfb
3m 48s 2 kubelet, gke-mappedinn-cluster-default-pool-f94cb0d4-fmfb Warning FailedMount Unable to mount volumes for pod "nfs-busybox-2762569073-lhb5p_default(b1e7c901-1f14-11e7-a084-42010a8e0116)": timeout expired waiting for volumes to attach/mount for pod "default"/"nfs-busybox-2762569073-lhb5p". list of unattached/unmounted volumes=[my-pvc-nfs]
3m 48s 2 kubelet, gke-mappedinn-cluster-default-pool-f94cb0d4-fmfb Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "default"/"nfs-busybox-2762569073-lhb5p". list of unattached/unmounted volumes=[my-pvc-nfs]
37s 37s 1 kubelet, gke-mappedinn-cluster-default-pool-f94cb0d4-fmfb Warning FailedMount MountVolume.SetUp failed for volume "kubernetes.io/nfs/b1e7c901-1f14-11e7-a084-42010a8e0116-nfs" (spec.Name: "nfs") pod "b1e7c901-1f14-11e7-a084-42010a8e0116" (UID: "b1e7c901-1f14-11e7-a084-42010a8e0116") with: mount failed: exit status 32
Mounting command: /home/kubernetes/bin/mounter
Mounting arguments: 10.247.250.208:/exports /var/lib/kubelet/pods/b1e7c901-1f14-11e7-a084-42010a8e0116/volumes/kubernetes.io~nfs/nfs nfs []
Output: Running mount using a rkt fly container
run: group "rkt" not found, will use default gid when rendering images
In the kubernetes dashboard, the error is as follows:
Unable to mount volumes for pod "nfs-busybox-2762569073-lhb5p_default(b1e7c901-1f14-11e7-a084-42010a8e0116)": timeout expired waiting for volumes to attach/mount for pod "default"/"nfs-busybox-2762569073-lhb5p". list of unattached/unmounted volumes=[my-pvc-nfs]
Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "default"/"nfs-busybox-2762569073-lhb5p". list of unattached/unmounted volumes=[my-pvc-nfs]
Have I missed something?
Thanks,
This comment in the issue on kubernetes seems to solve this NFS issue on GKE.
Quoting that comment:
Edit examples/volumes/nfs/nfs-pv.yaml change the last line to path: "/".
Edit examples/volumes/nfs/nfs-server-rc.yaml change the image to the one that enabled NFSv4 image: gcr.io/google_containers/volume-nfs:0.8
Also there are other issues where this is tracked here and here.
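For reference, a sketch of the PV after applying those two edits (the metadata and capacity follow the upstream example and are assumptions; the server value here is the NFS service IP seen in the mount error above, yours will differ):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  capacity:
    storage: 1Mi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.247.250.208   # the ClusterIP of the NFS server service
    path: "/"                # changed from "/exports" per the linked comment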