Is it possible to submit a job to a cluster using an initialization script on Google Dataproc? - google-cloud-dataproc

I am using Dataproc with 1 job on 1 cluster.
I would like to start my job as soon as the cluster is created. I found that the best way to achieve this is to submit a job using an initialization script like below.
function submit_job() {
  echo "Submitting job..."
  gcloud dataproc jobs submit pyspark ...
}
export -f submit_job

function check_running() {
  echo "checking..."
  gcloud dataproc clusters list --region='asia-northeast1' --filter='clusterName = {{ cluster_name }}' |
  tail -n 1 |
  while read name platform worker_count preemptive_worker_count status others
  do
    if [ "$status" = "RUNNING" ]; then
      return 0
    fi
  done
}
export -f check_running

function after_initialization() {
  local role
  role=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
  if [[ "${role}" == 'Master' ]]; then
    echo "monitoring the cluster..."
    while true; do
      if check_running; then
        submit_job
        break
      fi
      sleep 5
    done
  fi
}
export -f after_initialization

echo "start monitoring..."
bash -c after_initialization & disown -h
Is this possible? When I ran this on Dataproc, the job was not submitted...
Thank you!

Consider using a Dataproc Workflow Template; it is designed for multi-step workflows: creating a cluster, submitting jobs, and deleting the cluster. It is a better fit than init actions because it is a first-class Dataproc feature: each step becomes a Dataproc job resource, and you can view its history.

Please consider using Cloud Composer - then you can write a single script that creates the cluster, runs the job, and terminates the cluster.

I found a way.
Put a shell script named await_cluster_and_run_command.sh on GCS. Then, add the following lines to the initialization script:
gsutil cp gs://...../await_cluster_and_run_command.sh /usr/local/bin/
chmod 750 /usr/local/bin/await_cluster_and_run_command.sh
nohup /usr/local/bin/await_cluster_and_run_command.sh &>>/var/log/master-post-init.log &
reference: https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/post-init/master-post-init.sh


Buildah vs Kaniko

I'm using ArgoWorkflow to automate our CI/CD chains.
In order to build images and push them to our private registry, we are faced with the choice of either buildah or kaniko, but I can't put my finger on the main difference between the two, pros and cons wise, nor on how these tools handle parallel builds and cache management. Can anyone clarify these points? Or even suggest another tool that can maybe do the job in a simpler way.
Some clarifications on the subject would be really helpful.
Thanks in advance.
buildah will require either a privileged container with more than one UID or a container running with CAP_SETUID and CAP_SETGID to build container images.
It does not hack on the file system like kaniko does to get around these requirements; it runs full containers when building.
--isolation chroot will make it a little easier to get buildah to work within Kubernetes.
kaniko is very simple to set up and has some magic that lets it work with no special requirements in Kubernetes :)
I also tried buildah but was unable to configure it and found it too complex to set up in a Kubernetes environment.
You can use an internal Docker registry for kaniko's cache management, but local storage can be configured instead (not tried yet). Just use the latest version of kaniko (v1.7.0), which fixes an important bug in cached-layer management.
These are some functions (declared in the file ci/libkaniko.sh) that I use in my GitLab CI pipelines, executed by a GitLab Kubernetes runner. They should hopefully clarify the setup and usage of kaniko.
function kaniko_config
{
  local docker_auth="$(echo -n "$CI_REGISTRY_USER:$CI_REGISTRY_PASSWORD" | base64)"

  mkdir -p $DOCKER_CONFIG
  [ -e $DOCKER_CONFIG/config.json ] || \
    cat <<JSON > $DOCKER_CONFIG/config.json
{
  "auths": {
    "$CI_REGISTRY": {
      "auth": "$docker_auth"
    }
  }
}
JSON
}

# Usage example (.gitlab-ci.yml)
#
# build php:
#   extends: .build
#   variables:
#     DOCKER_CONFIG: "$CI_PROJECT_DIR/php/.docker"
#     DOCKER_IMAGE_PHP_DEVEL_BRANCH: &php-devel-image "${CI_REGISTRY_IMAGE}/php:${CI_COMMIT_REF_SLUG}-build"
#   script:
#     - kaniko_build
#         --destination $DOCKER_IMAGE_PHP_DEVEL_BRANCH
#         --dockerfile $CI_PROJECT_DIR/docker/images/php/Dockerfile
#         --target devel
function kaniko_build
{
  kaniko_config

  echo "Kaniko cache enabled ($CI_REGISTRY_IMAGE/cache)"

  /kaniko/executor \
    --build-arg http_proxy="${HTTP_PROXY}" \
    --build-arg https_proxy="${HTTPS_PROXY}" \
    --build-arg no_proxy="${NO_PROXY}" \
    --cache --cache-repo $CI_REGISTRY_IMAGE/cache \
    --context "$CI_PROJECT_DIR" \
    --digest-file=/dev/termination-log \
    --label "ci.job.id=${CI_JOB_ID}" \
    --label "ci.pipeline.id=${CI_PIPELINE_ID}" \
    --verbosity info \
    "$@"

  [ -r /dev/termination-log ] && \
    echo "Manifest digest: $(cat /dev/termination-log)"
}
With these functions a new image can be built with:
stages:
  - build

build app:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:v1.7.0-debug
    entrypoint: [""]
  variables:
    DOCKER_CONFIG: "$CI_PROJECT_DIR/app/.docker"
    DOCKER_IMAGE_APP_RELEASE_BRANCH: &app-devel-image "${CI_REGISTRY_IMAGE}/phelps:${CI_COMMIT_REF_SLUG}"
    GIT_SUBMODULE_STRATEGY: recursive
  before_script:
    - source ci/libkaniko.sh
  script:
    - kaniko_build
        --destination $DOCKER_IMAGE_APP_RELEASE_BRANCH
        --digest-file $CI_PROJECT_DIR/docker-content-digest-app
        --dockerfile $CI_PROJECT_DIR/docker/Dockerfile
  artifacts:
    paths:
      - docker-content-digest-app
  tags:
    - k8s-runner
Note that you have to use the debug version of the kaniko executor, because this image tag provides a shell (and other busybox-based binaries).

Using kfp.dsl.ContainerOp() to run multiple scripts on Kubeflow Pipelines

I have been using the Kubeflow dsl ContainerOp command to run a Python script in a custom container image for my Kubeflow pipeline. My configuration looks something like this:
def test_container_op():
    input_path = '/home/jovyan/'
    return dsl.ContainerOp(
        name='test container',
        image="<image name>",
        command=[
            'python', '/home/jovyan/test.py'
        ],
        file_outputs={
            'module-logs': input_path + 'output.log'
        }
    )
Now, I also want to run a bash script called deploy.sh within the same container. I haven't seen examples of that. Is there something like
command = [
    '/bin/bash', '/home/jovyan/deploy.sh',
    'python', '/home/jovyan/test.py'
]
Not sure if it's possible. Would appreciate the help.
A Kubeflow job is just a Kubernetes job, so you are limited to the Kubernetes job entrypoint being a single command.
However you can still chain multiple commands into a single sh command:
sh -c "echo 'my first job' && echo 'my second job'"
So your Kubeflow command can be:
command = [
    '/bin/sh', '-c', '/home/jovyan/deploy.sh && python /home/jovyan/test.py'
]

Ignore pubsub topic if it is already created

I have a simple script to deploy a pubsub application.
This script will run on every deploy of my Cloud Run service and I have a line with:
gcloud pubsub topics create some-topic
I want to improve my script to handle the case where the topic already exists; currently, if I run my script, the output is:
ERROR: Failed to create topic [projects/project-id/topics/some-topic]: Resource already exists in the project (resource=some-topic).
ERROR: (gcloud.pubsub.topics.create) Failed to create the following: [some-topic].
I tried the flag --no-user-output-enabled, but without success.
Is there a way to ignore the error if the resource already exists, or a way to check before creating it?
Yes.
You can repeat the operation knowing that, if the topic didn't exist beforehand, it will exist once the command succeeds.
You can swallow stderr (with 2>/dev/null) and then check whether the previous command ($?) succeeded (0):
TOPIC="some-topic"
gcloud pubsub topics create ${TOPIC} 2>/dev/null
if [ $? -eq 0 ]
then
  # Command succeeded, topic did not exist
  echo "Topic ${TOPIC} did not exist, created."
else
  # Command did not succeed; the topic may (!) already exist, or something else failed
  echo "Failure"
fi
NOTE This approach misses the case where the command fails even though the topic didn't exist (i.e. some other issue caused the failure).
Alternatively (more accurately and more expensively!) you can enumerate the topics first and then try (!) to create it if it doesn't exist:
TOPIC="some-topic"

RESULT=$(\
  gcloud pubsub topics list \
    --filter="name.scope(topics)=${TOPIC}" \
    --format="value(name)" 2>/dev/null)

if [ "${RESULT}" == "" ]
then
  echo "Topic ${TOPIC} does not exist, creating..."
  gcloud pubsub topics create ${TOPIC}
  if [ $? -eq 0 ]
  then
    # Command succeeded, topic created
    echo "Topic ${TOPIC} created."
  else
    # Command did not succeed, topic was not created
    echo "Failed to create topic ${TOPIC}."
  fi
fi
Depending on the complexity of your needs, you can automate using:
- any of Google's (Pub/Sub) libraries which provide better error-handling and retry capabilities
- Terraform, e.g. google_pubsub_topic
I had this same issue so I thought I'd try to give a full-fledged function to address it. Building on what @DazWilkin posted, below is a bash function that takes 2 inputs:
1. The project you want to point to
2. The topic/subscription name (in this example the topic and subscription names are the same; however, it's quite straightforward to add another input for the subscription name)
The function will:
1. Check whether the current working project is the specified one; if not, it will set it
2. Check whether the topic exists in the project; if not, it will attempt to create it and wait for the response
3. Check whether the subscription exists in the project; if not, it will also attempt to create it and wait for the response
function create_pubsub() {
  # Get Current Project
  current_project=$(gcloud config get-value project)
  echo "Current Project is: ${current_project}"

  # Check if Current project matches the specified project
  if [[ "$current_project" != "$1" ]]; then
    gcloud config set project $1
  else
    echo "The project provided matches the current working project"
  fi

  # Check if topic exists in project
  _topic=$(gcloud pubsub topics list \
    --filter="name.scope(topics)=$2" \
    --format="value(name)" 2>/dev/null)

  # React accordingly
  if [[ "${_topic}" != "" ]]; then
    echo "Topic $2 already exists in project ${current_project}"
  else
    echo "The topic '$2' does not exist in project ${current_project}. Creating it now..."
    gcloud pubsub topics create $2
    # Check if command executed successfully
    if [ $? -eq 0 ]; then
      echo "Topic $2 was created successfully"
    else
      echo "An error occurred. Topic was NOT created"
    fi
  fi

  # Check if subscription exists in project
  _subscription=$(gcloud pubsub subscriptions list \
    --filter="name=projects/$1/subscriptions/$2" \
    --format="value(name)" 2>/dev/null)

  # React accordingly
  if [[ "${_subscription}" != "" ]]; then
    echo "Subscription $2 already exists in project ${current_project}"
  else
    echo "The subscription '$2' does not exist in project ${current_project}. Creating it now..."
    gcloud pubsub subscriptions create $2 --topic=$2
    # Check if command executed successfully
    if [ $? -eq 0 ]; then
      echo "Subscription $2 was created successfully"
    else
      echo "An error occurred. Subscription was NOT created"
    fi
  fi
}
After adding this function to your .bashrc or .zshrc file, you would call it in the terminal as create_pubsub <PROJECT_ID> <TOPIC_ID>.
Hope this is helpful.

Node - Run last bash command

What is happening:
1. The user starts the local React server via any variation of npm [run] start[:mod]
2. My prestart script runs and kills the local webserver if found
3. Once pkill node is run, that kills the npm start script as well, so I want to run the starting command again
My current solution is to do
history 1 | awk '/some-regex/'
to get the name of the last command, which I can then run with:
exec(`bash -c 'sleep 1 ; pkill node && ${previousCommand}' &`);
This is starting to get pretty hacky so I'm thinking there has to be a better way to do this.
My node script so far:
const execSync = require("child_process").execSync;
const exec = require("child_process").exec;
const netcat = execSync('netcat -z 127.0.0.1 3000; echo $?') == 1 ? true : false; // true when :3000 is available #jkr
if (netcat == false) {
  exec(`bash -c 'sleep 1 ; pkill node' &`);
  console.warn('\x1b[32m%s\x1b[0m', `\nKilling all local webservers, please run 'npm start' again.\n`);
}
There seems to be an npm package which does this: kill-port
const kill = require('kill-port')
kill(port, 'tcp').then(console.log).catch(console.log)
Source: https://www.npmjs.com/package/kill-port
I understand this might not answer the question of running the last command, but it should solve OP's problem.

Stop Oozie workflow execution

Yesterday I kicked off an oozie workflow. It started two jobs that stalled all day. I killed them this morning, having made a change that I now want to test. After killing the two jobs it's like the workflow became unstuck and is now proceeding. I would like to kill the workflow so it doesn't keep starting new jobs to replace the ones I kill. How can I do that in the oozie command line?
Oozie commands
--------------
Note: Replace the Oozie server and port with your cluster-specific values.

1) Submit a job:
$ oozie job -oozie http://localhost:11000/oozie -config oozieProject/workflowHdfsAndEmailActions/job.properties -submit
job: 0000001-130712212133144-oozie-oozi-W

2) Run a job:
$ oozie job -oozie http://localhost:11000/oozie -start 0000001-130712212133144-oozie-oozi-W

3) Check the status:
$ oozie job -oozie http://localhost:11000/oozie -info 0000001-130712212133144-oozie-oozi-W

4) Suspend a workflow:
$ oozie job -oozie http://localhost:11000/oozie -suspend 0000001-130712212133144-oozie-oozi-W

5) Resume a workflow:
$ oozie job -oozie http://localhost:11000/oozie -resume 0000001-130712212133144-oozie-oozi-W

6) Re-run a workflow:
$ oozie job -oozie http://localhost:11000/oozie -config oozieProject/workflowHdfsAndEmailActions/job.properties -rerun 0000001-130712212133144-oozie-oozi-W

7) Should you need to kill the job:
$ oozie job -oozie http://localhost:11000/oozie -kill 0000001-130712212133144-oozie-oozi-W

8) View server logs:
$ oozie job -oozie http://localhost:11000/oozie -logs 0000001-130712212133144-oozie-oozi-W

Logs are available at:
/var/log/oozie on the Oozie server.
You can view your running jobs with:
oozie jobs
or if it's a coordinator, not a workflow:
oozie jobs -jobtype coordinator
And get the Job ID from there, then do:
oozie job -kill [id]
Here's the command line tool reference page: http://incubator.apache.org/oozie/docs/3.1.3/docs/DG_CommandLineTool.html
In addition to the post about Oozie commands: sometimes we don't have access to the respective workflow id to suspend/kill it, etc., and we get the error below:
Error: E0508 : E0508: User [?] not authorized for WF job [0001304-190209190348229-oozie-mapr-W]
To perform any operation like kill/suspend etc. in that case, we need to regenerate the authentication token for our user id. First, clear the existing cached token with the command below, and then perform the suspend/kill etc. action on the given workflow id:
rm .oozie-auth-token
From Apache Oozie docs:
Once authentication is performed successfully the received authentication token is cached in the user home directory in the .oozie-auth-token file with owner-only permissions. Subsequent requests reuse the cached token while valid.
For more details, see the Apache Oozie docs (refer to the Authentication section):
Official Documentation
You may find it helpful to know how to kill, rerun, etc. multiple (for example, 200) jobs at the same time using bash.
In one single line:
$ for jobid in `oozie jobs -filter status=SUSPENDED | cut -d" " -f1`; do oozie job -kill ${jobid}; echo "Killed job ${jobid}"; done