Spark Operator and jmx_exporter failing - Kubernetes

I've just migrated Kubernetes to 1.22, and spark-operator:1.2.3 doesn't work with this version.
I followed the information on the internet and upgraded to 1.3.3; however, all my Spark apps are now failing with the same error:
Caused by: java.io.FileNotFoundException: /etc/metrics/conf/prometheus.yaml (No such file or directory)
at java.base/java.io.FileInputStream.open0(Native Method)
at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
at java.base/java.io.FileReader.<init>(FileReader.java:75)
at io.prometheus.jmx.shaded.io.prometheus.jmx.JmxCollector.<init>(JmxCollector.java:78)
at io.prometheus.jmx.shaded.io.prometheus.jmx.JavaAgent.premain(JavaAgent.java:29)
... 6 more
*** java.lang.instrument ASSERTION FAILED ***: "result" with message agent load/premain call failed at ./src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 422
FATAL ERROR in native method: processing of -javaagent failed, processJavaStart failed
It used to work on the previous version.
Unfortunately, I cannot downgrade Kubernetes.
Can you please assist?
PS: there are no additional options passed to the executor, just a path to jmx_exporter_0.15.

I think your new application requires Prometheus to be running in your cluster, and it also expects to find the Prometheus configuration file at the path /etc/metrics/conf/prometheus.yaml. Such files are generally set up by creating a ConfigMap in your cluster and then mounting it into every pod that needs it.
My guess is that during the Spark upgrade a step was missed or not documented, namely installing Prometheus in your cluster before installing the Spark applications that depend on it. Since you are using a Prometheus exporter, it will not work if a Prometheus installation and its configuration are not already in place.
You can try going through the installation again, checking where Prometheus comes into play, and ensuring that this configuration file is provided to your applications.
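If the application is deployed through the spark-on-k8s-operator, that file is normally generated and mounted by the operator itself from the SparkApplication's monitoring section, so it is worth checking that this block survived the upgrade. A minimal sketch, based on how I understand the operator's v1beta2 API (the application name, jar path and port are illustrative):

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-spark-app                  # illustrative name
spec:
  # ... image, mainApplicationFile, driver/executor specs ...
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      # Path to the jmx_exporter agent jar baked into the image (illustrative)
      jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.15.0.jar"
      port: 8090

As far as I understand, when this block is present the operator creates a ConfigMap with a default jmx_exporter configuration and mounts it at /etc/metrics/conf/prometheus.yaml, which is exactly the path the agent is complaining about.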

Related

Azure Data Factory run New Job Cluster Mode Databricks Python Wheel

We are trying to install external libraries via Azure Data Factory and after that we plan to execute our notebook. Inside the notebook we will be using many different libraries to implement the business logic.
In Azure Data Factory there is an Append libraries option from which it is possible to install new runtime libraries on the job cluster.
Our linked service always connects to a NEW JOB CLUSTER, but we get the error below while executing the ADF pipelines.
Run result unavailable: job failed with error message Library installation failed for library due to user error for whl: "dbfs:/FileStore/jars/ephem-4.1.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl". Error messages: Library installation attempted on the driver node of cluster 1226-023738-9cm6lm7d and failed. Please refer to the following error message to fix the library or contact Databricks support.
Error Code: DRIVER_LIBRARY_INSTALLATION_FAILURE.
Error Message: java.util.concurrent.ExecutionException: java.io.FileNotFoundException: dbfs:/FileStore/jars/ephem-4.1.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
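Since the failure is a FileNotFoundException for the wheel on DBFS, one quick check (a sketch, assuming the Databricks CLI is installed and configured against the workspace) is to confirm the file really exists at the path the pipeline references, and re-upload it if it does not:

# List the DBFS folder referenced by the pipeline to confirm the wheel is there
databricks fs ls dbfs:/FileStore/jars/

# Re-upload the wheel if it is missing (local path is hypothetical)
databricks fs cp ./ephem-4.1.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl dbfs:/FileStore/jars/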

Can't submit new job via GUI on standalone Kubernetes Flink deployment (session mode)

After deploying Flink in standalone Kubernetes mode (session cluster), I can't upload any new job using the Flink GUI. After clicking the +Add New button and choosing a JAR file, the progress bar finishes and nothing happens.
There is no information or error about this in the Job Manager logs.
When I try to upload any other kind of file (e.g. a text file), I get an error and there is an entry in the log:
"Exception occured in REST handler: Only Jar files are allowed."
I've also tried to upload a fake JAR (an empty file with a .jar extension) and that works; I can upload that kind of file.
I have a brand new, clean Apache Flink cluster running on a Kubernetes cluster.
I have used the Docker Hub images and tried two different versions: 1.13.2-scala_2.12-java8 and 1.13-scala_2.11-java8, but the result was the same for both.
My deployment is based on this how-to:
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/
and I've used the YAML files provided in the "Appendix: Common cluster resource definitions" section of that article:
flink-configuration-configmap.yaml
jobmanager-service.yaml
taskmanager-session-deployment.yaml
jobmanager-session-deployment-non-ha.yaml
I've also used an ingress controller to publish the GUI running on port 8081 on the jobmanager.
I have three pods (1 jobmanager, 2 taskmanagers) and can't see any errors in the Flink logs.
Any suggestions on what I'm missing, or where to find any errors?
Problem solved. The problem was caused by the nginx upload limit (the default is 1024 kB). The Flink GUI is published outside Kubernetes using an ingress controller and nginx.
When we tried to upload job files bigger than 1 MB (1024 kB), the nginx limit prevented it. Jobs below this limit (for example the fake 0 kB JAR) were uploaded successfully.
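For reference, with the ingress-nginx controller the limit can be raised per Ingress through an annotation; a minimal sketch, in which the Ingress name, host and size value are illustrative and the backend service is assumed to be the one from jobmanager-service.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: flink-jobmanager-ui                      # illustrative name
  annotations:
    # Raise the request body limit so larger job JARs can be uploaded
    nginx.ingress.kubernetes.io/proxy-body-size: "200m"
spec:
  rules:
  - host: flink.example.com                      # illustrative host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: flink-jobmanager               # assumed service name
            port:
              number: 8081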

'confluent local services start' returns a 'not a directory' error

I'm trying to use the Confluent CLI for the first time. I've successfully installed it and I can run the confluent command, but when I try to start Confluent with confluent local services start I get the 'not a directory' error mentioned in the title.
Obviously there is a configuration issue with the path to zookeeper.properties, but I can't find how this path is built. It seems that the CLI takes my CONFLUENT_HOME variable and appends "etc/kafka/zookeeper.properties" to it, but I don't understand why.
Can someone explain what I'm doing wrong?
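A quick way to check that assumption (a sketch; the installation path is illustrative) is to verify that CONFLUENT_HOME points at the root of the Confluent Platform distribution, so that the path the CLI builds resolves to a real file:

# CONFLUENT_HOME should point at the extracted Confluent Platform directory
echo "$CONFLUENT_HOME"

# The CLI appends etc/kafka/zookeeper.properties to it; this must be a regular file
ls -l "$CONFLUENT_HOME/etc/kafka/zookeeper.properties"

# If it does not resolve, point CONFLUENT_HOME at the platform root (path illustrative)
export CONFLUENT_HOME=/opt/confluent-7.0.1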

Connection from VS Code to Kubernetes failing

I am receiving an error message when trying to access details of my Azure Kubernetes cluster from VS Code. This problem prevents me from attaching a debugger to the pod.
I receive the following error message:
Error loading document: Error: cannot open k8smsx://loadkubernetescore/pod-kube-system%20%20%20coredns-748cdb7bf4-q9f9x.yaml?ns%3Dall%26value%3Dpod%2Fkube-system%20%20%20coredns-748cdb7bf4-q9f9x%26_%3D1611398456559. Detail: Unable to read file 'k8smsx://loadkubernetescore/pod-kube-system coredns-748cdb7bf4-q9f9x.yaml?ns=all&value=pod/kube-system coredns-748cdb7bf4-q9f9x&_=1611398456559' (Get command failed: error: there is no need to specify a resource type as a separate argument when passing arguments in resource/name form (e.g. 'kubectl get resource/<resource_name>' instead of 'kubectl get resource resource/<resource_name>'
)
My Setup
I have VS Code with the "Kubernetes", "Bridge to Kubernetes" and "Azure Kubernetes Service" extensions installed
I have connected to my cluster through az login and can already access other information (e.g. my nodes)
When trying to access the workloads / pods on my cluster, I receive the above error message, and in the Kubernetes view in VS Code I get an error for the details of the pod.
Error in Kubernetes-View in VS Code
What I tried
I tried reinstalling the AKS cluster and logging in to it again from scratch
I tried reinstalling all the VS Code extensions mentioned above
Browsing the internet, I cannot find any comparable error message
The strange thing is that it used to work two weeks ago, and I did not change or update anything (as far as I remember)
Any ideas / hints that I can try further?
Thank you
As mdaniel wrote: the Node view is just for human consumption, and the tree item you actually want to click on is underneath Namespaces / kube-system / coredns-748cdb7bf4-q9f9x. Give that a try, and consider reporting your bad experience to the extension's issue tracker, since it looks like release 1.2.2 came out just two days ago and might not have been tested well.
The final solution was to attach the debugger the other way: through Workloads / Deployments.
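For context, the kubectl error quoted above comes from mixing the two argument forms when fetching a resource; a minimal illustration (the pod name is taken from the error message, and the namespace is assumed to be kube-system):

# Fails: the resource type is given twice (as a type and in resource/name form)
kubectl get pod pod/coredns-748cdb7bf4-q9f9x -n kube-system

# Works: use either form on its own
kubectl get pod coredns-748cdb7bf4-q9f9x -n kube-system
kubectl get pod/coredns-748cdb7bf4-q9f9x -n kube-system -o yaml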

kubelet Error while processing event /sys/fs/cgroup/memory/libcontainer_10010_systemd_test_default.slice

I have set up a Kubernetes 1.15.3 cluster on CentOS 7 using the systemd cgroup driver. On all my nodes, syslog has started logging this message frequently.
How do I fix this error message?
kubelet: W0907 watcher.go:87 Error while processing event ("/sys/fs/cgroup/memory/libcontainer_10010_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): readdirent: no such file or directory
Thanks
It's a known issue caused by a bad interaction with runc; someone observed that it is actually triggered by a repeated etcd health check, but that wasn't my experience on Ubuntu, which exhibits the same behavior on every node.
They allege that updating the runc binary on your hosts will make the problem go away, but I haven't tried that myself.
I had exactly the same problem with the same Kubernetes version and in the same context, that is, changing the cgroup driver to systemd. A GitHub ticket for this error has been created here.
After changing the container runtime's cgroup driver to systemd, as described in this tutorial, the error started popping up in the kubelet service log.
What worked for me was to update Docker and containerd to the following versions:
docker: v19.03.5
containerd: v1.2.10
I assume that any version newer than the above will fix the problem as well.
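To see what you are currently running before upgrading, something like the following should work (a sketch; the yum filter assumes the packages came from the docker-ce repositories):

# Check the currently installed runtime versions
docker version --format '{{.Server.Version}}'
containerd --version
runc --version

# On CentOS 7, list the installed packages that provide them
yum list installed | grep -E 'docker|containerd|runc'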