Kubeflow missing .kube/config files on local setup (Laptop/Desktop) - kubernetes

I have installed Kubeflow via MiniKF on my laptop.
I am trying to run a CNN on some MNIST data that uses TensorFlow 1.4.0. Currently, I am following this tutorial: https://codelabs.arrikto.com/codelabs/minikf-kale-katib-kfserving/index.html#0
The code is in a Jupyter Notebook server, and it runs completely fine. When I build a pipeline, it was completed successfully. But at the step for "Serving," when I run the kfserver command on my model, I get strange behavior: it gets stuck at the "Waiting for InferenceService."
An example screenshot from a successful run is shown below, where the process must end with the service being created:
enter image description here

Stuck at the "waiting for inferenceservice" means that the pipeline doesn't get the hostname of the servable. You can also validate this by running kubectl get inferenceservices, your new inference service should have it's READY state to True. If this is not the case, the deployment has failed.
There might be many different reasons why the inference service is not in ready state, there is a nice troubleshooting guide at kfserving's repo.

Related

How to wait for full cloud-initialization before VM is marked as running

I am currently configuring a virtual machine to work as an agent within Azure (with Ubuntu as image). In which the additional configuration is running through a cloud init file.
In which, among others, I have the below 'fix' within bootcmd and multiple steps within runcmd.
However the machine already gives the state running within the azure portal, while still running the cloud configuration phase (cloud_config_modules). This has as a result pipelines see the machine as ready for usage while not everything is installed/configured yet and breaks.
I tried a couple of things which did not result in the desired effect. After which I stumbled on the following article/bug;
The proposed solution worked, however I switched to a rhel image and it stopped working.
I noticed this image is not using walinuxagent as the solution states but waagent, so I tried to replacing that like the example below without any success.
bootcmd:
- mkdir -p /etc/systemd/system/waagent.service.d
- echo "[Unit]\nAfter=cloud-final.service" > /etc/systemd/system/waagent.service.d/override.conf
- sed "s/After=multi-user.target//g" /lib/systemd/system/cloud-final.service > /etc/systemd/system/cloud-final.service
- systemctl daemon-reload
After this, also tried to set the runcmd steps to the bootcmd steps. This resulted in a boot which took ages and eventually froze.
Since I am not that familiar with rhel and Linux overall, I wanted to ask help if anyone might have some suggestions which I can additionally try.
(Apply some other configuration to ensure await on the cloud-final.service within a waagent?)
However the machine already had the state running, while still running the cloud configuration phase (cloud_config_modules).
Could you please be more specific? Where did you read the machine state?
The reason I ask is that cloud-init status will report status: running until cloud-init is done running, at which point it will report status: done
I what is the purpose of waiting until cloud-init is done? I'm not sure exactly what you are expecting to happen, but here are a couple of things that might help.
If you want to execute a script "at the end" of cloud-init initialization, you could put the script directly in runcmd, and if you want to wait for cloud-init in an external script you could do cloud-init status --wait, which will print a visual indicator and eventually return once cloud-init is complete.
On not too old Azure Linux VM images, cloud-init rather than WALinuxAgent acts as the VM provisioner. The VM is marked provisioned by the Azure cloud-init datasource module very early during cloud-init processing (source), before any cloud-init modules configurable with user data. WALinuxAgent is only responsible for provisioning Azure VM extensions. It does not appear to be possible to delay sending the 'VM ready' signal to Azure without modifying the VM image and patching the source code of cloud-init Azure datasource.

Connection from VS Code to Kubernetes failing

I am receiving an error message when trying to access details from VS Code of my Azure Kubernetes Cluster. This problem prevents me from attaching a debugger to the pod.
I receive the following error message:
Error loading document: Error: cannot open k8smsx://loadkubernetescore/pod-kube-system%20%20%20coredns-748cdb7bf4-q9f9x.yaml?ns%3Dall%26value%3Dpod%2Fkube-system%20%20%20coredns-748cdb7bf4-q9f9x%26_%3D1611398456559. Detail: Unable to read file 'k8smsx://loadkubernetescore/pod-kube-system coredns-748cdb7bf4-q9f9x.yaml?ns=all&value=pod/kube-system coredns-748cdb7bf4-q9f9x&_=1611398456559' (Get command failed: error: there is no need to specify a resource type as a separate argument when passing arguments in resource/name form (e.g. 'kubectl get resource/<resource_name>' instead of 'kubectl get resource resource/<resource_name>'
)
My Setup
I have VS Code installed, with "Kubernetes", "Bridge to Kubernetes" and "Azure Kubernetes Service" installed
I have connected my Cluster through az login and can already access different information (e.g. my nodes, etc.)
When trying to access the workloads / pods on my cluster, I receive the above error message - and in the Kubernetes View in VS Code I get an error for the details of the pod.
Error in Kubernetes-View in VS Code
What I tried
I tried to reinstall the AKS Cluster and completely logging in freshly to it
I tried to reinstall all extensions mentioned above in VS Code
Browsing the internet, I do not find any comparable error message
The strange thing is that it used to work two weeks ago - and I did not change or update anything (as far as I remember)
Any ideas / hints that I can try further?
Thank you
As #mdaniel wrote: the Node view is just for human consumption, and that the tree item you actually want to click on is underneath Namespaces / kube-system / coredns-748cdb7bf4-q9f9x. Give that a try, and consider reporting your bad experience to their issue tracker since it looks like release 1.2.2 just came out 2 days ago and might not have been tested well.
final solution is to attach debugger in the other way - through Workloads / Deployments.

Deployed jobs stopped working with an image error?

In the last few hours I am no longer able to execute deployed Data Fusion pipeline jobs - they just end in an error state almost instantly.
I can run the jobs in Preview mode, but when trying to run deployed jobs this error appears in the logs:
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Selected software image version '1.2.65-deb9' can no longer be used to create new clusters. Please select a more recent image
I've tried with both an existing instance and a new instance, and all deployed jobs including the sample jobs give this error.
Any ideas? I cannot find any config options for what image is used for execution
We are currently investigating an issue with the image for Cloud Dataproc used by Cloud Data Fusion. We had pinned a version of Dataproc VM image for the launch that is causing an issue.
We apologize for you inconvenience. We are working to resolve the issue as soon as possible for you.
Will provide update on this thread.
Nitin

Why would running a container on GCE get stuck Metadata request unsuccessful forbidden (403)

I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process so I need a large machine but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same google container registry by the same computer using the same Google login. The older one works fine but the newer one fails by getting stuck in an endless list of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Can anyone please explain why one of my images doesn't have this problem (well it gives a few of these messages but gets past them) and the other does have this problem (thousands of this message and taking over 24 hours before I killed it).
If I ssh in to a GCE instance then both versions of the container pull and run just fine. I'm suspecting the INTEGRITY_RULE checking from the logs but I know nothing about how that works.
MORE INFO: this is down to "restart policy: never". Even a simple Centos:7 container that says "hello world" deployed from the console triggers this if the restart policy is never. At least in the short term I can fix this in the entrypoint script as the instance will be destroyed when the monitor realises that the process has finished
I suggest you try creating a 3rd container that's focused on the metadata service functionality to isolate the issue. It may be that there's a timing difference between the 2 containers that's not being overcome.
Make sure you can ‘curl’ the metadata service from the VM and that the request to the metadata service is using the VM's service account.

Publish a Service Fabric Container project

I don't make it to publish a container. In my case I want to put an MVC4 Webrole into the container. ...but actually, what's inside the container does not matter.
Their primary tutorial for using a container to lift-and-shift old apps uses Continuous Delivery. The average user does not always need this.
Instead of Continuous Delivery one may use the Visual Studio's support for Docker Compose:
Connect-ServiceFabricCluster <mycluster> and then New-ServiceFabricComposeApplication -ApplicationName> <mytestapp> -Compose docker-compose.yml
But following exactly their tutorial still leads to errors. The applicaton appears in the cluster but outputs immediatly an error event:
"SourceId='System.Hosting', Property='Download:1.0:1.0'. Error during
download. Failed to download container image fabrikamfiber.web"
Do I miss a whole step, which they expect to be obvious? But even placing the image in my Docker Hub registry myself did not help? Or does it need to be Azure Container Registry?
Docker Hub should work fine, ACR is not required.
These blog posts may help:
about running containers
about docker compose on Service Fabric