Go through all kubernetes Jobs using google cloud functions - kubernetes

I'm using kubernetes==10.1.0.
My code runs on Python 3.7 and looks something like this:
from kubernetes import client as kubernetes_client

BatchV1_api = kubernetes_client.BatchV1Api()
api_response = BatchV1_api.list_namespaced_job(namespace="default", watch=False, pretty='true', async_req=False)
The problem is that I have about 1500 Jobs in Kubernetes, but api_response contains only 20 of them.
The goal is to implement a program that goes through all Jobs and deletes old ones, taking the job name as a parameter.
Any idea why I'm only getting partial data back from BatchV1_api.list_namespaced_job?
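One likely culprit worth checking is server-side pagination: list_namespaced_job accepts limit and _continue parameters, and when the API server (or a proxy in front of it) truncates the list, the response carries a continue token for the next page. Below is a minimal sketch of paging through every Job and deleting the ones whose names match; it assumes kubeconfig or in-cluster auth is available, and the old_job_names set is just a placeholder for your own criterion.

from kubernetes import client as kubernetes_client, config

# Load credentials; use config.load_incluster_config() when running inside the cluster.
config.load_kube_config()

BatchV1_api = kubernetes_client.BatchV1Api()

def delete_old_jobs(namespace, old_job_names):
    """Page through every Job in the namespace and delete those listed in old_job_names."""
    continue_token = None
    while True:
        if continue_token:
            page = BatchV1_api.list_namespaced_job(namespace=namespace, limit=100,
                                                   _continue=continue_token)
        else:
            page = BatchV1_api.list_namespaced_job(namespace=namespace, limit=100)
        for job in page.items:
            if job.metadata.name in old_job_names:
                BatchV1_api.delete_namespaced_job(
                    name=job.metadata.name,
                    namespace=namespace,
                    propagation_policy="Background",  # also clean up the Job's Pods
                )
        continue_token = page.metadata._continue  # empty/None when there are no more pages
        if not continue_token:
            break

delete_old_jobs("default", {"old-job-1", "old-job-2"})  # placeholder job names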

Related

Run cronjobs in a sequential manner using kubernetes scheduler

I am new to Kubernetes. I want to run a cron job, using the Kubernetes scheduler, that performs the following tasks in sequence:
1. Ingest data from a data warehouse into a MySQL RDBMS
2. Process the data
3. Index the data using Elasticsearch
Step 2 should only start after step 1 completes, and step 3 only after step 2 completes. How can I achieve this with plain Kubernetes CronJobs, without using Argo or Brigade (the idea is to keep the ingestion-processing-indexing workflow simple)? Is there sample code available for it?
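One pattern that keeps the ordering inside plain Kubernetes is to run all three steps in a single CronJob Pod: init containers run strictly one after another, and each must exit successfully before the next starts. Below is a hedged sketch using the Python client; the image names, commands, and schedule are placeholders, and it assumes a client version that serves CronJobs from batch/v1 via BatchV1Api (older clients expose them through BatchV1beta1Api instead).

from kubernetes import client, config

config.load_kube_config()

# Init containers run strictly in order; each must exit successfully before
# the next one starts, which gives the ingest -> process -> index sequence.
ingest = client.V1Container(name="ingest", image="example/ingest:latest",
                            command=["python", "ingest.py"])
process = client.V1Container(name="process", image="example/process:latest",
                             command=["python", "process.py"])
index = client.V1Container(name="index", image="example/index:latest",
                           command=["python", "index.py"])

pod_spec = client.V1PodSpec(
    restart_policy="Never",
    init_containers=[ingest, process],
    containers=[index],  # the final step runs as the Pod's main container
)

job_spec = client.V1JobSpec(
    backoff_limit=1,
    template=client.V1PodTemplateSpec(spec=pod_spec),
)

cron_job = client.V1CronJob(
    api_version="batch/v1",
    kind="CronJob",
    metadata=client.V1ObjectMeta(name="ingest-process-index"),
    spec=client.V1CronJobSpec(
        schedule="0 2 * * *",  # assumed: run nightly at 02:00
        job_template=client.V1JobTemplateSpec(spec=job_spec),
    ),
)

client.BatchV1Api().create_namespaced_cron_job(namespace="default", body=cron_job)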

How to Properly Update the Status of a Job

As far as I know, when most people want to know whether a Kubernetes (or even Spark) Job is done, they set up a loop somewhere that periodically asks the respective API whether the Job has finished.
Right now, I'm doing that with Kubernetes by backgrounding the call with the & operator (bash invoked from Python below):
import subprocess

cmd = f'''
kubectl wait \\
    --for=condition=complete \\
    --timeout=-1s \\
    job/job_name \\
    > logs/kube_wait_log.txt \\
    &
'''
kube_listen = subprocess.run(
    cmd,
    shell=True,
    stdout=subprocess.PIPE,
)
So... I actually have two (correlated) questions:
Is there a better way of doing this in the background with shell other than with the & operator?
The option I think would be best is to use cURL from inside the Job to update my local server's API, which interacts with Kubernetes.
However, I don't know how I can perform a cURL call from a Job. Is it possible?
I imagine I would have to expose ports somewhere, but where? Is it really supported? Could I create a Kubernetes Service to manage the ports and connections?
If you don't want to block on a process running to completion, you can create a subprocess.Popen instance instead. Once you have this, you can poll() it to see if it's completed. (You should try really really really hard to avoid using shell=True if at all possible.) So one variation of this could look like (untested):
import subprocess
import time

with open('logs/kube_wait_log.txt', 'w') as f:
    with subprocess.Popen(['kubectl', 'wait',
                           '--for=condition=complete',
                           '--timeout=-1s',
                           'job/job_name'],
                          stdin=subprocess.DEVNULL,
                          stdout=f,
                          stderr=subprocess.STDOUT) as p:
        while True:
            # poll() returns None while the process is still running,
            # and the exit code (0 on success) once it has finished.
            if p.poll() is not None:
                job_is_complete()
                break
            time.sleep(1)
Better than shelling out to kubectl, though, is using the official Kubernetes Python client library. Rather than using this "wait" operation, you would watch the job object in question and see if its status changes to "completed". This could look roughly like (untested):
from kubernetes import client, config, watch

config.load_kube_config()  # or config.load_incluster_config() when running in a Pod

jobsv1 = client.BatchV1Api()
w = watch.Watch()
# Watch just this Job by filtering the list call on its name.
for event in w.stream(jobsv1.list_namespaced_job, namespace='default',
                      field_selector='metadata.name=job_name'):
    job = event['object']
    if job.status.completion_time is not None:
        job_is_complete()
        break
The Job's Pod doesn't need to update its own status with the Kubernetes server. It just needs to exit with a successful status code (0) when it's done, and that will get reflected in the Job's status field.

Is it possible to wait until an EMR cluster is terminated?

I'm trying to write a component that will start up an EMR cluster, run a Spark pipeline on that cluster, and then shut that cluster down once the pipeline completes.
I've gotten as far as creating the cluster and setting permissions to allow my main cluster's worker machines to start EMR clusters. However, I'm struggling with debugging the created cluster and waiting until the pipeline has concluded. Here is the code I have now. Note that I'm writing Scala for Spark, but this is very close to the equivalent Java code:
val runSparkJob = new StepConfig()
  .withName("Run Pipeline")
  .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
  .withHadoopJarStep(
    new HadoopJarStepConfig()
      .withJar("/path/to/jar")
      .withArgs(
        "spark-submit",
        "etc..."
      )
  )

// Create a cluster and run the Spark job on it
val clusterName = "REDACTED Cluster"
val createClusterRequest =
  new RunJobFlowRequest()
    .withName(clusterName)
    .withReleaseLabel(Configs.EMR_RELEASE_LABEL)
    .withSteps(enableDebugging, runSparkJob)
    .withApplications(new Application().withName("Spark"))
    .withLogUri(Configs.LOG_URI_PREFIX)
    .withServiceRole(Configs.SERVICE_ROLE)
    .withJobFlowRole(Configs.JOB_FLOW_ROLE)
    .withInstances(
      new JobFlowInstancesConfig()
        .withEc2SubnetId(Configs.SUBNET)
        .withInstanceCount(Configs.INSTANCE_COUNT)
        .withKeepJobFlowAliveWhenNoSteps(false)
        .withMasterInstanceType(Configs.MASTER_INSTANCE_TYPE)
        .withSlaveInstanceType(Configs.SLAVE_INSTANCE_TYPE)
    )

val newCluster = emr.runJobFlow(createClusterRequest)
I have two concrete questions:
The call to emr.runJobFlow returns immediately after submitting the request. Is there any way I can make it block until the cluster is shut down, or otherwise wait until the workflow has concluded?
My cluster is actually not coming up, and when I go to the AWS Console -> EMR -> Events view I see a failure:
Amazon EMR Cluster j-XXX (REDACTED...) has terminated with errors at 2019-06-13 19:50 UTC with a reason of VALIDATION_ERROR.
Is there any way I can get my hands on this error programmatically in my Java/Scala application?
Yes, it is very possible to wait until an EMR cluster is terminated.
There are waiters that will block execution until the cluster (i.e. job flow) reaches a certain state.
val newCluster = emr.runJobFlow(createClusterRequest)

val describeRequest = new DescribeClusterRequest()
  .withClusterId(newCluster.getClusterId())

// Wait until terminated
emr.waiters().clusterTerminated().run(new WaiterParameters(describeRequest))
Also, if you want to get the status of the cluster (i.e. job flow), you can call the describeCluster function of the EMR client. Check the DescribeCluster documentation: you can read the cluster's state and status information to determine whether it succeeded or failed.
val result = emr.describeCluster(describeRequest)
Note: I'm not the strongest Java developer, so the above is my best guess at how it would work based on the documentation; I have not tested it.
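For what it's worth, the same wait-then-inspect pattern is also easy to prototype in Python with boto3 rather than the Java SDK. A hedged sketch, assuming you already have the cluster id returned by run_job_flow; it reads the StateChangeReason, which is where failures such as VALIDATION_ERROR show up:

import boto3
from botocore.exceptions import WaiterError

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"  # placeholder: the id returned by run_job_flow

# Block until the cluster reaches a terminal state; the waiter treats
# TERMINATED_WITH_ERRORS as a failure and raises, but we still want the reason.
try:
    emr.get_waiter("cluster_terminated").wait(ClusterId=cluster_id)
except WaiterError:
    pass

status = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]
reason = status.get("StateChangeReason", {})
print(status["State"], reason.get("Code"), reason.get("Message"))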

Launch function on cluster with DASK

I am new to Dask and would like to test running Dask on a cluster. The cluster has a head server and several other nodes; once I log into the head server, I can ssh into the other nodes without a password.
I would like to run a simple function that iterates over a large array. The function, defined below, converts numpy datetime64 values to Python datetime objects.
import xarray as xr
import numpy as np
from dask import compute, delayed
import dask.multiprocessing
from datetime import datetime, timedelta

def convertdt64(dt64):
    ts = (dt64 - np.datetime64('1970-01-01T00:00:00Z')) / np.timedelta64(1, 's')
    return datetime.utcfromtimestamp(ts)
Then, on the terminal, I apply this function over a 1D array of size N:
values = [delayed(convertdt64)(x) for x in arraydata]
results1 = compute(*values, scheduler='processes')
This uses some cores on the head server and it works, though slowly. I then tried to launch the function on several nodes of the cluster using the distributed Client, as below:
from dask.distributed import Client

client = Client("10.140.251.254:8786")
results = compute(*values, scheduler='distributed')
It does not work at all. There are some warnings and one error message, shown below:
distributed.comm.tcp - WARNING - Could not set timeout on TCP stream: [Errno 92] Protocol not available
distributed.comm.tcp - WARNING - Closing dangling stream in <TCP local=tcp://10.140.251.254:57257 remote=tcp://10.140.251.254:8786>
CancelledError: convertdt64-0205ad5e-214b-4683-b5c4-b6a2a6d8e52f
I also tried dask.bag and got the same error message. What could be the reason that the parallel computation on the cluster does not work? Is it due to some server or network configuration, or to my incorrect use of the Dask client? Thanks in advance for your help!
Best wishes
Shannon X
...then I tried to launch the function on several nodes of the cluster by using the Client as below:
I had similar issues when trying to run tasks against the scheduler. The nodes connect just fine, but attempting to submit tasks results in cancellation.
The documented examples were either local or run from the same node as the scheduler. When I moved my client to the scheduler node, the problem went away.
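In other words, the quickest check is to run the same script from the scheduler node itself (or from a machine that has unrestricted TCP access to both the scheduler and the workers). A minimal sketch, assuming the scheduler address from the question and some sample datetime64 data:

from datetime import datetime

import numpy as np
from dask import compute, delayed
from dask.distributed import Client

def convertdt64(dt64):
    ts = (dt64 - np.datetime64('1970-01-01T00:00:00Z')) / np.timedelta64(1, 's')
    return datetime.utcfromtimestamp(ts)

# Run this from the scheduler node; creating the Client registers it as the
# default scheduler, so compute() below runs on the distributed cluster.
client = Client("tcp://10.140.251.254:8786")

arraydata = np.arange('2005-02', '2005-03', dtype='datetime64[D]')  # assumed sample data
values = [delayed(convertdt64)(x) for x in arraydata]
results = compute(*values)
print(results[:3])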

Get Redshift cluster status in outputs of cloudformation

I am creating a Redshift cluster using CloudFormation, and I then need to output the cluster status (basically whether it is available or not). There are ways to output the endpoint and port, but I could not find any way of outputting the status.
How can I get that, or is it not possible?
You are correct. According to AWS::Redshift::Cluster - AWS CloudFormation, the only available outputs are Endpoint.Address and Endpoint.Port.
Status is not something you'd normally want to output from CloudFormation, because the value changes over time.
If you really want to wait until the cluster is available, you could create a WaitCondition and then have something monitor the status and signal the Wait Condition to continue. This would probably need to be an Amazon EC2 instance with some User Data, roughly as sketched below. Linux instances are charged per second, so this would be quite feasible.
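As a rough illustration of that approach, the instance's User Data could run a small script that polls the cluster and then signals the WaitConditionHandle's pre-signed URL. A hedged sketch, assuming boto3 and requests are available on the instance; the cluster identifier and handle URL are placeholders you would pass in from the template:

import json
import time

import boto3
import requests

CLUSTER_ID = "my-redshift-cluster"                        # placeholder cluster identifier
WAIT_HANDLE_URL = "<pre-signed WaitConditionHandle URL>"  # placeholder, from the stack

redshift = boto3.client("redshift")

# Poll until the cluster reports "available".
while True:
    cluster = redshift.describe_clusters(ClusterIdentifier=CLUSTER_ID)["Clusters"][0]
    if cluster["ClusterStatus"] == "available":
        break
    time.sleep(30)

# Signal the WaitCondition so the stack can proceed; the handle expects a PUT
# with an empty Content-Type header and a JSON body in this shape.
requests.put(
    WAIT_HANDLE_URL,
    data=json.dumps({
        "Status": "SUCCESS",
        "Reason": "Redshift cluster is available",
        "UniqueId": CLUSTER_ID,
        "Data": cluster["ClusterStatus"],
    }),
    headers={"Content-Type": ""},
)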