Argo Workflow + Spark operator + App logs not generated - kubernetes

I am in the very early stages of exploring Argo with the Spark operator, running the Spark samples on a minikube setup on my EC2 instance.
The resource details follow. I am not sure why I am not able to see the Spark app logs.
WORKFLOW.YAML
kind: Workflow
metadata:
  name: spark-argo-groupby
spec:
  entrypoint: sparkling-operator
  templates:
  - name: spark-groupby
    resource:
      action: create
      manifest: |
        apiVersion: "sparkoperator.k8s.io/v1beta2"
        kind: SparkApplication
        metadata:
          generateName: spark-argo-groupby
        spec:
          type: Scala
          mode: cluster
          image: gcr.io/spark-operator/spark:v3.0.3
          imagePullPolicy: Always
          mainClass: org.apache.spark.examples.GroupByTest
          mainApplicationFile: local:///opt/spark/spark-examples_2.12-3.1.1-hadoop-2.7.jar
          sparkVersion: "3.0.3"
          driver:
            cores: 1
            coreLimit: "1200m"
            memory: "512m"
            labels:
              version: 3.0.0
          executor:
            cores: 1
            instances: 1
            memory: "512m"
            labels:
              version: 3.0.0
  - name: sparkling-operator
    dag:
      tasks:
      - name: SparkGroupBY
        template: spark-groupby
ROLES
# Role for spark-on-k8s-operator to create resources on cluster
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: spark-cluster-cr
  labels:
    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true"
rules:
  - apiGroups:
      - sparkoperator.k8s.io
    resources:
      - sparkapplications
    verbs:
      - '*'
---
# Allow airflow-worker service account access for spark-on-k8s
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: argo-spark-crb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: spark-cluster-cr
subjects:
  - kind: ServiceAccount
    name: default
    namespace: argo
ARGO UI
To dig deeper I tried all the steps listed at https://dev.to/crenshaw_dev/how-to-debug-an-argo-workflow-31ng, yet I could not get the app logs.
Basically, when I run these examples I expect the Spark app logs to be printed, in this case the output of the following Scala example:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/GroupByTest.scala
Interestingly, when I list the pods I expect to see driver pods and executor pods, but I always see only one pod, and it has the logs shown in the attached image. Please help me understand why the logs are not generated and how I can get them.
RAW LOGS
$ kubectl logs spark-pi-dag-739246604 -n argo
time="2021-12-10T13:28:09.560Z" level=info msg="Starting Workflow Executor" version="{v3.0.3 2021-05-11T21:14:20Z 02071057c082cf295ab8da68f1b2027ff8762b5a v3.0.3 clean go1.15.7 gc linux/amd64}"
time="2021-12-10T13:28:09.581Z" level=info msg="Creating a docker executor"
time="2021-12-10T13:28:09.581Z" level=info msg="Executor (version: v3.0.3, build_date: 2021-05-11T21:14:20Z) initialized (pod: argo/spark-pi-dag-739246604) with template:\n{\"name\":\"sparkpi\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"resource\":{\"action\":\"create\",\"manifest\":\"apiVersion: \\\"sparkoperator.k8s.io/v1beta2\\\"\\nkind: SparkApplication\\nmetadata:\\n generateName: spark-pi-dag\\nspec:\\n type: Scala\\n mode: cluster\\n image: gjeevanm/spark:v3.1.1\\n imagePullPolicy: Always\\n mainClass: org.apache.spark.examples.SparkPi\\n mainApplicationFile: local:///opt/spark/spark-examples_2.12-3.1.1-hadoop-2.7.jar\\n sparkVersion: 3.1.1\\n driver:\\n cores: 1\\n coreLimit: \\\"1200m\\\"\\n memory: \\\"512m\\\"\\n labels:\\n version: 3.0.0\\n executor:\\n cores: 1\\n instances: 1\\n memory: \\\"512m\\\"\\n labels:\\n version: 3.0.0\\n\"},\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio:9000\",\"bucket\":\"my-bucket\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"my-minio-cred\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"my-minio-cred\",\"key\":\"secretkey\"},\"key\":\"spark-pi-dag/spark-pi-dag-739246604\"}}}"
time="2021-12-10T13:28:09.581Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2021-12-10T13:28:09.581Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o json"
time="2021-12-10T13:28:10.348Z" level=info msg=argo/SparkApplication.sparkoperator.k8s.io/spark-pi-daghhl6s
time="2021-12-10T13:28:10.348Z" level=info msg="Starting SIGUSR2 signal monitor"
time="2021-12-10T13:28:10.348Z" level=info msg="No output parameters"

As Michael mentioned in his answer, Argo Workflows does not know how other CRDs (such as the SparkApplication you used) work, and thus cannot pull the logs from the pods created by that particular CRD.
However, you can add the label workflows.argoproj.io/workflow: {{workflow.name}} to the pods generated by the SparkApplication to let Argo Workflows know about them, and then use argo logs -c <container-name> to pull the logs from those pods.
You can find an example here that uses a Kubeflow CRD; in your case you'll want to add the label to the executor and driver sections of your SparkApplication in the resource template: https://github.com/argoproj/argo-workflows/blob/master/examples/k8s-resource-log-selector.yaml
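As a rough sketch of what that could look like for the template in the question (untested; it only adds the label to the existing driver and executor label blocks, and quoting the template tag is my own choice):

```yaml
# Sketch: the spark-groupby resource template from the question, with the
# workflow label added so that `argo logs` can select the Spark pods.
- name: spark-groupby
  resource:
    action: create
    manifest: |
      apiVersion: "sparkoperator.k8s.io/v1beta2"
      kind: SparkApplication
      metadata:
        generateName: spark-argo-groupby
      spec:
        type: Scala
        mode: cluster
        image: gcr.io/spark-operator/spark:v3.0.3
        imagePullPolicy: Always
        mainClass: org.apache.spark.examples.GroupByTest
        mainApplicationFile: local:///opt/spark/spark-examples_2.12-3.1.1-hadoop-2.7.jar
        sparkVersion: "3.0.3"
        driver:
          cores: 1
          coreLimit: "1200m"
          memory: "512m"
          labels:
            version: 3.0.0
            # Argo substitutes {{workflow.name}} before the resource is created
            workflows.argoproj.io/workflow: "{{workflow.name}}"
        executor:
          cores: 1
          instances: 1
          memory: "512m"
          labels:
            version: 3.0.0
            workflows.argoproj.io/workflow: "{{workflow.name}}"
```

With the label in place, the driver and executor pods carry the same workflow label as the Workflow's own pods, which is what the log selector relies on.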

Argo Workflows' resource templates (like your spark-groupby template) are simplistic. The Workflow controller is running kubectl create, and that's where its involvement in the SparkApplication ends.
The logs you're seeing from the Argo Workflow pod describe the kubectl create process. Your resource is written to a temporary yaml file and then applied to the cluster.
time="2021-12-10T13:28:09.581Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2021-12-10T13:28:09.581Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o json"
time="2021-12-10T13:28:10.348Z" level=info msg=argo/SparkApplication.sparkoperator.k8s.io/spark-pi-daghhl6s
Old answer:
To view the logs generated by your SparkApplication, you'll need to
follow the Spark docs. I'm not familiar, but I'm guessing the
application gets run in a Pod somewhere. If you can find that pod, you
should be able to view the logs with kubectl logs.
It would be really cool if Argo Workflows could pull Spark logs into
its UI. But building a generic solution would probably be
prohibitively difficult.
Update:
Check Yuan's answer. There's a way to pull the Spark logs into the Workflows CLI!


How do I fix a role-based problem when my role appears to have the correct permissions?

I set up the namespace "sandbox" in Kubernetes and have been using it for several days without issue. Today I got the error below.
I have checked to make sure that I have all of the requisite configmaps in place.
Is there a log or something where I can find what this is referring to?
panic: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
I did find this (MountVolume.SetUp failed for volume "kube-api-access-fcz9j" : object "default"/"kube-root-ca.crt" not registered) thread and have applied the below patch to my service account, but I am still getting the same error.
automountServiceAccountToken: false
UPDATE:
In answer to @p10l: I am working in a bare-metal cluster, version 1.23.0. No Terraform.
I am getting closer, but still not there.
This appears to be another RBAC problem, but the error does not make sense to me.
I have a user "dma". I am running workflows in the "sandbox" namespace using the context dma@kubernetes.
The error now is
Create request failed: workflows.argoproj.io is forbidden: User "dma" cannot create resource "workflows" in API group "argoproj.io" in the namespace "sandbox"
but that user indeed appears to have the correct permissions.
This is the output of
kubectl get role dma -n sandbox -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"annotations":{},"name":"dma","namespace":"sandbox"},"rules":[{"apiGroups":["","apps","autoscaling","batch","extensions","policy","rbac.authorization.k8s.io","argoproj.io"],"resources":["pods","configmaps","deployments","events","pods","persistentvolumes","persistentvolumeclaims","services","workflows"],"verbs":["get","list","watch","create","update","patch","delete"]}]}
  creationTimestamp: "2021-12-21T19:41:38Z"
  name: dma
  namespace: sandbox
  resourceVersion: "1055045"
  uid: 94191881-895d-4457-9764-5db9b54cdb3f
rules:
- apiGroups:
  - ""
  - apps
  - autoscaling
  - batch
  - extensions
  - policy
  - rbac.authorization.k8s.io
  - argoproj.io
  - workflows.argoproj.io
  resources:
  - pods
  - configmaps
  - deployments
  - events
  - pods
  - persistentvolumes
  - persistentvolumeclaims
  - services
  - workflows
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
This is the output of kubectl get rolebinding -n sandbox dma-sandbox-rolebinding -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"RoleBinding","metadata":{"annotations":{},"name":"dma-sandbox-rolebinding","namespace":"sandbox"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"dma"},"subjects":[{"kind":"ServiceAccount","name":"dma","namespace":"sandbox"}]}
  creationTimestamp: "2021-12-21T19:56:06Z"
  name: dma-sandbox-rolebinding
  namespace: sandbox
  resourceVersion: "1050593"
  uid: d4d53855-b5fc-4f29-8dbd-17f682cc91dd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dma
subjects:
- kind: ServiceAccount
  name: dma
  namespace: sandbox
The issue you are describing is a recurring one, described here and here, where your cluster lacks the KUBECONFIG environment variable.
First, run echo $KUBECONFIG on all your nodes to see if it's empty.
If it is, look for the config file in your cluster, copy it to all the nodes, and then export the variable by running export KUBECONFIG=/path/to/config. This file can usually be found at ~/.kube/config, or at /etc/kubernetes/admin.conf on master nodes.
Let me know if this solution works in your case.
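Regarding the later Forbidden error in the update: one detail visible in the posted output is that the RoleBinding's subject has kind ServiceAccount, while the error message names User "dma". In Kubernetes RBAC these are different subject kinds, so a binding to the service account does not grant anything to a user of the same name. A minimal sketch of a binding for the user, assuming the dma@kubernetes context really authenticates as user dma (the binding name is hypothetical):

```yaml
# Sketch: bind the existing "dma" Role to a User instead of a ServiceAccount.
# The user name is taken from the error message; the binding name is made up.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dma-user-sandbox-rolebinding   # hypothetical name
  namespace: sandbox
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dma
subjects:
  - kind: User
    name: dma
    apiGroup: rbac.authorization.k8s.io
```

Running kubectl auth can-i create workflows.argoproj.io -n sandbox --as dma as a cluster admin should show whether the permission is then in effect.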

Openshift - Run pod only for specific time period

I'm new to OpenShift. We are using OpenShift deployments to deploy our multiple microservices (Spring Boot applications). The deployment is done from a Docker image.
We have a situation where we need to stop one microservice alone from midnight until 5 AM (due to an external dependency).
Could someone suggest a way to do this automatically?
I was able to run
oc scale deployment/sampleservice --replicas=0 manually to bring the number of pods to zero, and to scale back up to 1 manually later.
I'm not sure how to run this command automatically at a specific time. A CronJob in OpenShift should be able to do this, but I'm not sure how to configure a CronJob to execute an oc command.
Any guidance will be of great help.
Using a cronjob is a good option.
First, you'll need an image that has the oc command line client available. I'm sure there's a prebuilt one out there somewhere, but since this will be running with privileges in your OpenShift cluster you want something you trust, which probably means building it yourself. I used:
FROM quay.io/centos/centos:8
RUN curl -o /tmp/openshift-client.tar.gz \
    https://mirror.openshift.com/pub/openshift-v4/clients/ocp/latest/openshift-client-linux.tar.gz; \
    tar -C /bin -xf /tmp/openshift-client.tar.gz oc kubectl; \
    rm -f /tmp/openshift-client.tar.gz
ENTRYPOINT ["/bin/oc"]
In order to handle authentication correctly, you'll need to create a ServiceAccount and then assign it appropriate privileges through a Role and a RoleBinding. I created a ServiceAccount named oc-client-sa:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: oc-client-sa
  namespace: oc-client-example
A Role named oc-client-role that grants privileges to Pod and Deployment objects:
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: oc-client-role
  namespace: oc-client-example
rules:
  - verbs:
      - get
      - list
      - create
      - watch
      - patch
    apiGroups:
      - ''
    resources:
      - pods
  - verbs:
      - get
      - list
      - create
      - watch
      - patch
    apiGroups:
      - 'apps'
    resources:
      - deployments
      - deployments/scale
And a RoleBinding that connects the oc-client-sa ServiceAccount
to the oc-client-role Role:
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: oc-client-rolebinding
  namespace: oc-client-example
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: oc-client-role
subjects:
  - kind: ServiceAccount
    name: oc-client-sa
With all this in place, we can write a CronJob like this that will
scale down a deployment at a specific time. Note that we're running
the jobs using the oc-client-sa ServiceAccount we created earlier:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-web-down
  namespace: oc-client-example
spec:
  schedule: "00 00 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: oc-client-sa
          restartPolicy: Never
          containers:
            - image: docker.io/larsks/openshift-client
              args:
                - scale
                - deployment/sampleservice
                - --replicas=0
              name: oc-scale-down
You would write a similar one to scale things back up at 5AM.
The oc client will automatically use the credentials provided to your pod by Kubernetes because of the serviceAccountName setting.
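For completeness, the matching scale-up job might look roughly like this. It is a sketch that mirrors scale-web-down: the names scale-web-up and oc-scale-up are made up for illustration, and --replicas=1 follows the replica count mentioned in the question.

```yaml
# Sketch: counterpart to scale-web-down that restores the deployment at 05:00.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-web-up              # hypothetical name
  namespace: oc-client-example
spec:
  # 05:00 in the time zone used by the cluster's CronJob controller
  schedule: "00 05 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: oc-client-sa
          restartPolicy: Never
          containers:
            - image: docker.io/larsks/openshift-client
              args:
                - scale
                - deployment/sampleservice
                - --replicas=1
              name: oc-scale-up   # hypothetical name
```

Both jobs share the same ServiceAccount, so no additional Role or RoleBinding is needed.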
API
You can use the OpenShift REST API client and write some simple Python code that scales down the replicas. Pack this Python code into a Docker image and run it as a CronJob inside the OpenShift cluster.
Simple Curl
Run a simple curl inside the cronjob to scale up & down deployment at a certain time.
Here is a simple Curl to scale the deployment: https://docs.openshift.com/container-platform/3.7/rest_api/apis-apps/v1beta1.Deployment.html#Get-apis-apps-v1beta1-namespaces-namespace-deployments-name-scale
API documentation : https://docs.openshift.com/container-platform/3.7/rest_api/apis-apps/v1beta1.Deployment.html
CLI
If you don't want to package your own code into an image, you can also run the oc command directly: use an image that ships the CLI inside the CronJob and fire the command from there.
OC-cli : https://hub.docker.com/r/widerin/openshift-cli
Don't forget that authentication is required in both cases, whether you call the API or run a command inside the CronJob.

Kubernetes: Cannot deploy a simple "Couchbase" service

I am new to Kubernetes. I am trying to mimic what I do with docker-compose when I serve a Couchbase database in a Docker container.
couchbase:
  image: couchbase
  volumes:
    - ./couchbase:/opt/couchbase/var
  ports:
    - 8091-8096:8091-8096
    - 11210-11211:11210-11211
I managed to create a cluster in my localhost using a tool called "kind"
kind create cluster --name my-cluster
kubectl config use-context my-cluster
Then I am trying to use that cluster to deploy a Couchbase service
I created a file named couchbase.yaml with the following content (I am trying to mimic what I do with my docker-compose file).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: couchbase
  namespace: my-project
  labels:
    platform: couchbase
spec:
  replicas: 1
  selector:
    matchLabels:
      platform: couchbase
  template:
    metadata:
      labels:
        platform: couchbase
    spec:
      volumes:
        - name: couchbase-data
          hostPath:
            # directory location on host
            path: /home/me/my-project/couchbase
            # this field is optional
            type: Directory
      containers:
        - name: couchbase
          image: couchbase
          volumeMounts:
            - mountPath: /opt/couchbase/var
              name: couchbase-data
Then I start the deployment like this:
kubectl create namespace my-project
kubectl apply -f couchbase.yaml
kubectl expose deployment -n my-project couchbase --type=LoadBalancer --port=8091
However, my deployment never actually starts:
kubectl get deployments -n my-project couchbase
NAME        READY   UP-TO-DATE   AVAILABLE   AGE
couchbase   0/1     1            0           6m14s
And when I look for the logs I see this:
kubectl logs -n my-project -lplatform=couchbase --all-containers=true
Error from server (BadRequest): container "couchbase" in pod "couchbase-589f7fc4c7-th2r2" is waiting to start: ContainerCreating
As the OP mentioned in a comment, the issue was solved using an extra mount, as explained in the documentation: https://kind.sigs.k8s.io/docs/user/configuration/#extra-mounts
Here is the OP's comment, formatted so it's more readable:
the error shows up when I run this command:
kubectl describe pods -n my-project couchbase
I could fix it by creating a new kind cluster:
kind create cluster --config cluster.yaml
Passing this content in cluster.yaml:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: inf
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /home/me/my-project/couchbase
    containerPath: /couchbase
In couchbase.yaml the path becomes path: /couchbase of course.
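As a sketch, the volumes section of couchbase.yaml would then read as follows (only the path changes; the rest of the Deployment from the question is assumed to stay as it was):

```yaml
# Fragment of couchbase.yaml (pod template spec); other fields unchanged.
      volumes:
        - name: couchbase-data
          hostPath:
            # Path inside the kind node container, i.e. the containerPath
            # declared under extraMounts in cluster.yaml.
            path: /couchbase
            type: Directory
```

The reason is that hostPath is resolved on the Kubernetes node, and with kind the node is itself a container, so the laptop directory is only visible there through the extra mount.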

kubectl apply -f works on PC but not in Gitlab Runner

I am trying to deploy to kubernetes using Gitlab CICD. No matter what I do, kubectl apply -f helloworld-deployment.yml --record in my .gitlab-ci.yml always returns that the deployment is unchanged:
$ kubectl apply -f helloworld-deployment.yml --record
deployment.apps/helloworld-deployment unchanged
This happens even if I change the tag on the image, or if the deployment doesn't exist at all. However, if I run kubectl apply -f helloworld-deployment.yml --record from my own computer, it works fine: it updates when a tag changes and creates the deployment when no deployment exists. Below is the .gitlab-ci.yml that I'm testing with:
image: docker:dind
services:
  - docker:dind
stages:
  - deploy
deploy-prod:
  stage: deploy
  image: google/cloud-sdk
  environment: production
  script:
    - kubectl apply -f helloworld-deployment.yml --record
Below is helloworld-deployment.yml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helloworld-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: helloworld
  template:
    metadata:
      labels:
        app: helloworld
    spec:
      containers:
        - name: helloworld
          image: registry.gitlab.com/repo/helloworld:test
          imagePullPolicy: Always
          ports:
            - containerPort: 3000
      imagePullSecrets:
        - name: regcred
Update:
This is what I see if I run kubectl rollout history deployments/helloworld-deployment and there is no existing deployment:
Error from server (NotFound): deployments.apps "helloworld-deployment" not found
If the deployment already exists, I see this:
REVISION  CHANGE-CAUSE
1         kubectl apply --filename=helloworld-deployment.yml --record=true
With only one revision.
I did notice this time that when I changed the tag, the output from my Gitlab Runner was:
deployment.apps/helloworld-deployment configured
However, there were no new pods. When I ran it from my PC, then I did see new pods created.
Update:
Running kubectl get pods shows two different pods in Gitlab runner than I see on my PC.
I definitely only have one kubernetes cluster, but kubectl config view shows some differences (the server url is the same). The output for contexts shows different namespaces. Does this mean I need to set a namespace either in my yml file or pass it in the command? Here is the output from the Gitlab runner:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: URL
  name: gitlab-deploy
contexts:
- context:
    cluster: gitlab-deploy
    namespace: helloworld-16393682-production
    user: gitlab-deploy
  name: gitlab-deploy
current-context: gitlab-deploy
kind: Config
preferences: {}
users:
- name: gitlab-deploy
  user:
    token: [MASKED]
And here is the output from my PC:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: URL
contexts:
- context:
    cluster: do-nyc3-helloworld
    user: do-nyc3-helloworld-admin
  name: do-nyc3-helloworld
current-context: do-nyc3-helloworld
kind: Config
preferences: {}
users:
- name: do-nyc3-helloworld-admin
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - kubernetes
      - cluster
      - kubeconfig
      - exec-credential
      - --version=v1beta1
      - --context=default
      - VALUE
      command: doctl
      env: null
It looks like Gitlab adds their own default for namespace:
<project_name>-<project_id>-<environment>
Because of this, I put this in the metadata section of helloworld-deployment.yml:
namespace: helloworld-16393682-production
And then it worked as expected. It was deploying before, but kubectl get pods didn't show it since that command was using the default namespace.
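For reference, a sketch of where that line goes in helloworld-deployment.yml (only the metadata block is shown; the spec from the question is assumed unchanged):

```yaml
# Top of helloworld-deployment.yml with the namespace pinned explicitly.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helloworld-deployment
  # Namespace GitLab derives as <project_name>-<project_id>-<environment>
  namespace: helloworld-16393682-production
```

Pinning the namespace in the manifest means the same file lands in the same place whether it is applied from the runner or from a workstation whose context has no default namespace.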
Since GitLab uses a custom namespace, you need to add a namespace flag to your command to display your pods:
kubectl get pods -n helloworld-16393682-production
You can set the default namespace for kubectl commands. See here.
You can permanently save the namespace for all subsequent kubectl commands in that context.
In your case it could be:
kubectl config set-context --current --namespace=helloworld-16393682-production
Or, if you are using multiple clusters, you can switch between contexts (each of which can carry its own default namespace), assuming a context with this name exists:
kubectl config use-context helloworld-16393682-production
In this link you can see a lot of useful commands and configurations.
I hope it helps! =)

ScheduledJobs on Google Container Engine (kubernetes)

Does anyone have experience running scheduled jobs? According to the guide, ScheduledJobs have been available since 1.4 when the batch/v2alpha1 runtime API is enabled.
So I checked with the kubectl api-versions command:
autoscaling/v1
batch/v1
batch/v2alpha1
extensions/v1beta1
storage.k8s.io/v1beta1
v1
But when I tried sample template below with command kubectl apply -f job.yaml
apiVersion: batch/v2alpha1
kind: ScheduledJob
metadata:
  name: hello
spec:
  schedule: 0/1 * * * ?
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
I got this error:
error validating "job.yaml": error validating data: couldn't find type: v2alpha1.ScheduledJob; if you choose to ignore these errors, turn validation off with --validate=false
Is it possible that the feature is still not implemented? Or did I make some error while creating the template?
Thank you in advance.
Okay, I think I resolved this issue. ScheduledJobs are currently in alpha state, and Google Container Engine supports this feature only for clusters with the alpha APIs additionally enabled. I was able to create such a cluster with the command:
gcloud alpha container clusters create my-cluster --enable-kubernetes-alpha
As a result I now have a cluster, limited to 30 days, with full feature support. I can see scheduled jobs with kubectl get scheduledjobs as well as create new ones from templates.
You can find more info about alpha clusters here.
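A note for readers on newer clusters: ScheduledJob was renamed to CronJob in Kubernetes 1.5, and CronJob has been served from the stable batch/v1 API since 1.21, so no alpha cluster is needed there. A sketch of the same template in that form follows; the quoted five-field schedule is my substitution for the original expression, everything else is carried over from the question.

```yaml
# Sketch: the "hello" job from the question expressed as a batch/v1 CronJob.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  # Run every minute (conventional cron form of the original schedule)
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: hello
              image: busybox
              args:
                - /bin/sh
                - -c
                - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
```

On such a cluster the jobs are listed with kubectl get cronjobs rather than kubectl get scheduledjobs.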