Yesterday I kicked off an Oozie workflow. It started two jobs that stalled all day. I killed them this morning, having made a change that I now want to test. After killing the two jobs, it's as if the workflow became unstuck and is now proceeding. I would like to kill the workflow so it doesn't keep starting new jobs to replace the ones I kill. How can I do that from the Oozie command line?
Oozie commands
--------------
Note: Replace the Oozie server and port with your cluster-specific values.
1) Submit job:
$ oozie job -oozie http://localhost:11000/oozie -config oozieProject/workflowHdfsAndEmailActions/job.properties -submit
job: 0000001-130712212133144-oozie-oozi-W
2) Run job:
$ oozie job -oozie http://localhost:11000/oozie -start 0000001-130712212133144-oozie-oozi-W
3) Check the status:
$ oozie job -oozie http://localhost:11000/oozie -info 0000001-130712212133144-oozie-oozi-W
4) Suspend workflow:
$ oozie job -oozie http://localhost:11000/oozie -suspend 0000001-130712212133144-oozie-oozi-W
5) Resume workflow:
$ oozie job -oozie http://localhost:11000/oozie -resume 0000001-130712212133144-oozie-oozi-W
6) Re-run workflow:
$ oozie job -oozie http://localhost:11000/oozie -config oozieProject/workflowHdfsAndEmailActions/job.properties -rerun 0000001-130712212133144-oozie-oozi-W
7) Should you need to kill the job:
$ oozie job -oozie http://localhost:11000/oozie -kill 0000001-130712212133144-oozie-oozi-W
8) View server logs:
$ oozie job -oozie http://localhost:11000/oozie -logs 0000001-130712212133144-oozie-oozi-W
Logs are available at:
/var/log/oozie on the Oozie server.
You can view your running jobs with:
oozie jobs
or if it's a coordinator, not a workflow:
oozie jobs -jobtype coordinator
And get the Job ID from there, then do:
oozie job -kill [id]
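For example, putting those two steps together (the job ID below is the sample one from earlier; substitute your own server, port and ID):
$ oozie jobs -oozie http://localhost:11000/oozie -filter status=RUNNING
$ oozie job -oozie http://localhost:11000/oozie -kill 0000001-130712212133144-oozie-oozi-W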
Here's the command line tool reference page: http://incubator.apache.org/oozie/docs/3.1.3/docs/DG_CommandLineTool.html
In addition to the Oozie commands above, sometimes we don't have access to the respective workflow ID to suspend/kill it, and we get the error below:
Error: E0508 : E0508: User [?] not authorized for WF job [0001304-190209190348229-oozie-mapr-W]
To perform any operation like kill/suspend in this case, we need to generate an authentication token for our user ID. To do that, first clear the existing token using the command below, then perform the suspend/kill action on the given workflow ID:
rm .oozie-auth-token
From Apache Oozie docs:
Once authentication is performed successfully the received authentication token is cached in the user home directory in the .oozie-auth-token file with owner-only permissions. Subsequent requests reuse the cached token while valid.
For more details, see the Authentication section of the official Apache Oozie documentation.
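Putting it together, the typical sequence is to remove the cached token and then retry the operation, so that a fresh token is negotiated for your user (a sketch; substitute your own Oozie server URL, and the job ID is the one from the error above):
rm ~/.oozie-auth-token
oozie job -oozie http://<oozie-server>:11000/oozie -kill 0001304-190209190348229-oozie-mapr-W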
You may also find it helpful to know how to kill, rerun, etc. multiple jobs (for example 200) at the same time using bash.
In a single line:
$ for jobid in $(oozie jobs -filter status=SUSPENDED | cut -d" " -f1); do echo "Killing job ${jobid}"; oozie job -kill ${jobid}; done
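The same approach works for other statuses. For example, a slightly more defensive sketch that kills every RUNNING workflow and passes the server URL explicitly; the grep keeps only the lines that actually start with a workflow ID, skipping the header lines of the oozie jobs output:
for jobid in $(oozie jobs -oozie http://localhost:11000/oozie -filter status=RUNNING | grep -oE '^[0-9]+-[0-9]+-oozie-[^ ]+-W'); do
  echo "Killing job ${jobid}"
  oozie job -oozie http://localhost:11000/oozie -kill "${jobid}"
done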
Related
I have a test pipeline on Concourse with one job that runs a set of luigi tasks. My problem is: failures in the luigi tasks do not propagate up to the Concourse job. In other words, if a luigi task fails, Concourse will not register that failure and states that the Concourse job completed successfully. I will first post the code I am running, then the solutions I have tried.
luigi-tasks.py
import luigi

# tasks.py (below) defines Task1, Task2 and Task3
from tasks import Task1, Task2, Task3

class Pipeline1(luigi.WrapperTask):
    def requires(self):
        yield Task1()
        yield Task2()
        yield Task3()
tasks.py
import luigi
import pandas as pd

class Task1(luigi.Task):
    def requires(self):
        return None

    def output(self):
        return luigi.LocalTarget('stuff/task1.csv')

    def run(self):
        # uncomment the line below to generate a task failure
        # assert(True==False)
        print('task 1 complete...')
        t = pd.DataFrame()
        with self.output().open('w') as outtie:
            outtie.write('complete')
# Tasks 2 and 3 are duplicates of this, but with 1s replaced with 2s or 3s.
config file
[retcode]
# codes are in increasing level of severity (for most applications)
already_running=10
missing_data=20
not_run=25
task_failed=30
scheduling_error=35
unhandled_exception=40
begin.sh
#!/bin/sh
set -e
export PYTHONPATH='.'
luigi --module luigi-tasks Pipeline1 --local-scheduler
echo $?
pipeline.yml
# <resources, resource types, and docker image build job defined here>
# job of interest
- name: run-docker-image
  plan:
    - get: timer
      trigger: true
    - get: docker-image-ecr
      passed: [build-docker-image]
    - get: run-git
    - task: run-script
      image: docker-image-ecr
      config:
        inputs:
          - name: run-git
        platform: linux
        run:
          dir: ./run-git
          path: /bin/bash
          args: ["begin.sh"]
I've introduced errors in a few ways: assertions/raising an exception (ValueError) within an individual task's run() method and within the wrapper, and sys.exit(luigi.retcodes.retcode().unhandled_exception). I also tried failing all tasks. I did this in case the error needed to be generated in a specific manner/location. Though they all produced a failed task, none of them produced an error in the concourse server.
At first, I thought Concourse just gives a success if it can run the file or command tasked to it. I'm not sure it's that simple, though. Interestingly, when I run the pipeline on my local computer (luigi --module luigi-tasks Pipeline1 --local-scheduler) I get an appropriate return code (e.g. 30), but when I run the pipeline within the Concourse server, I get a return code of 0 after the luigi tasks complete (from echo $? in the bash script).
Would appreciate any insight into this problem.
My suspicion is that luigi doesn't see your config file with return codes. Its default behavior is to return 0, whether tasks fail or succeed.
This experiment should help to debug that:
1) Force a failed job: add an exit 1 at the end of begin.sh.
2) Hijack the job: fly -t <target> i -j <pipeline>/<job> -> select run-script, then cd ./run-git; /bin/bash begin.sh
3) Ensure the luigi config is present and named appropriately, e.g. luigi.cfg.
4) Re-run the command: LUIGI_CONFIG_PATH=luigi.cfg bash ./begin.sh
5) Check the output: echo $?
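If the config turns out to be the problem, one simple fix is to point luigi at the file explicitly from begin.sh. A minimal sketch, assuming luigi.cfg (containing the [retcode] section above) is committed at the root of run-git:
#!/bin/sh
set -e
export PYTHONPATH='.'
# Assumption: luigi.cfg lives next to begin.sh; LUIGI_CONFIG_PATH tells luigi
# exactly which config file to load, regardless of the task's working directory.
export LUIGI_CONFIG_PATH="$(pwd)/luigi.cfg"
luigi --module luigi-tasks Pipeline1 --local-scheduler
echo $?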
I am using Dataproc with 1 job on 1 cluster.
I would like to start my job as soon as the cluster is created. I found that the best way to achieve this is to submit a job using an initialization script like below.
function submit_job() {
  echo "Submitting job..."
  gcloud dataproc jobs submit pyspark ...
}
export -f submit_job

function check_running() {
  echo "checking..."
  gcloud dataproc clusters list --region='asia-northeast1' --filter='clusterName = {{ cluster_name }}' |
    tail -n 1 |
    while read name platform worker_count preemptive_worker_count status others
    do
      if [ "$status" = "RUNNING" ]; then
        return 0
      fi
    done
}
export -f check_running

function after_initialization() {
  local role
  role=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
  if [[ "${role}" == 'Master' ]]; then
    echo "monitoring the cluster..."
    while true; do
      if check_running; then
        submit_job
        break
      fi
      sleep 5
    done
  fi
}
export -f after_initialization

echo "start monitoring..."
bash -c after_initialization & disown -h
Is it possible? When I ran this on Dataproc, the job was not submitted...
Thank you!
Consider using a Dataproc Workflow Template; it is designed for multi-step workflows: creating a cluster, submitting jobs, and deleting the cluster. It is better than init actions because it is a first-class Dataproc feature: there will be a Dataproc job resource, and you can view the job history.
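For example, a managed-cluster workflow template can be wired up roughly like this (a sketch; the template name, cluster name, region, bucket path and step id are placeholders):
gcloud dataproc workflow-templates create my-template --region=asia-northeast1
gcloud dataproc workflow-templates set-managed-cluster my-template \
    --region=asia-northeast1 --cluster-name=my-cluster --num-workers=2
gcloud dataproc workflow-templates add-job pyspark gs://my-bucket/my_job.py \
    --region=asia-northeast1 --workflow-template=my-template --step-id=run-my-job
# Instantiation creates the cluster, runs the job, then deletes the cluster.
gcloud dataproc workflow-templates instantiate my-template --region=asia-northeast1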
Please consider using Cloud Composer - then you can write a single script that creates the cluster, runs the job and terminates the cluster.
I found a way.
Put a shell script named await_cluster_and_run_command.sh on GCS. Then, add the following lines to the initialization script:
gsutil cp gs://...../await_cluster_and_run_command.sh /usr/local/bin/
chmod 750 /usr/local/bin/await_cluster_and_run_command.sh
nohup /usr/local/bin/await_cluster_and_run_command.sh &>>/var/log/master-post-init.log &
reference: https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/post-init/master-post-init.sh
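For reference, the awaited script can be as simple as polling the cluster state and submitting the job once it is RUNNING. A rough sketch (cluster name, region and job path are placeholders; the master-post-init.sh linked above is the authoritative version):
#!/usr/bin/env bash
set -euo pipefail
CLUSTER="my-cluster"
REGION="asia-northeast1"
# Wait until the cluster reports RUNNING, then submit the job once.
until [[ "$(gcloud dataproc clusters describe "${CLUSTER}" --region="${REGION}" --format='value(status.state)')" == "RUNNING" ]]; do
  sleep 5
done
gcloud dataproc jobs submit pyspark gs://my-bucket/my_job.py --cluster="${CLUSTER}" --region="${REGION}"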
How can a Job with multiple steps run some steps on Node 1 and others on Node 2?
For example:
On Node 1, I have to copy a file to a folder cp file.txt /var/www/htm/
On Node 2, I have to download this file wget https://www.mywebsite.com/file.txt
I have tried creating three jobs:
JOB 1: the workflow has an "Execute Command on remote" step running cp file.txt /var/www/htm/, with the node filter set to my NODE 1.
JOB 2: the workflow has an "Execute Command on remote" step running wget https://www.mywebsite.com/file.txt, with the node filter set to NODE 2.
JOB 3: in the workflow, step 1 is a Job Reference with the UUID of JOB 1, step 2 is a Job Reference with the UUID of JOB 2, and in the node filter I wrote .* to get all nodes.
For now I tried to only run the command ls (in JOB 1 and JOB 2), but when I run JOB 3 the output shows the command three times for each job, for example:
// Run Job 3
// Output from Job 1
test-folder
test.text
test-folder
test.text
test-folder
test.text
And the same happens for JOB 2.
How can I implement this?
Using the Job Reference step is the right way to solve that, but instead of setting .* to get all nodes, you can use the node1 name in the first Job Reference step call and the node2 name in the second Job Reference call, in the "Override node filters?" section. Alternatively, you can define the node filter in each job and just call them from JOB 3 using Job Reference steps.
I have a release pipeline with a QA/Smoke Test stage, that generates XML files containing test results.
If I run this manually on my machine, I obviously have access to the XML files and can see the details, but on the agent I cannot, since we don't have access to the Microsoft-hosted agents to view the files.
Is there a way to pipe the files "out" of the task for viewing? Maybe there's a third-party marketplace task that can achieve that?
Here's the deployment result:
2021-06-06T23:34:19.1260519Z Results File: D:\a\r1\a\qa-automation\TestResults\CurrentReport\Logs\junit.xml
2021-06-06T23:34:19.2448029Z Results File: D:\a\r1\a\qa-automation\TestResults\.\CurrentReport\Logs\detailedLogs.xml
2021-06-06T23:34:19.2533810Z
2021-06-06T23:34:19.2596243Z Failed! - Failed: 22, Passed: 2, Skipped: 0, Total: 24, Duration: 52 m 11 s - EED.dll (netcoreapp3.1)
Here's the stage YAML:
steps:
- script: |
git clone https://.../qa-automation.git -b master
cd qa-automation
testrun.bat --cat "EDSmoke" --env dev
displayName: 'Clone qa-automation repo'
Is there a way to pipe the files "out" of the task for viewing? Maybe there's a third-party marketplace task that can achieve that?
You can try the following command in your task:
Write-host "##vso[task.uploadfile]<PathOfTheFiles>\<filename>"
Like:
Write-host "##vso[task.uploadfile]$(System.DefaultWorkingDirectory)\qa-automation\TestResults\CurrentReport\Logs\junit.xml"
View and download attachments associated with releases
Would you like to upload additional logs or diagnostics or images when running tasks in a release? This feature enables users to upload additional files during deployments. To upload a new file, use the following agent command in your script:
Write-host "##vso[task.uploadfile]"
The file is then available as part of the release logs. When you download all the logs associated with the release, you will be able to retrieve this file as well.
You can also add a PowerShell script task in your release definition to read the smoke test output and write it to the console. Then you will be able to see the content of the log files in the PowerShell script step on the "Logs" tab. You can also click "Download all logs as zip" to download the smoke test result files.
I am new to Concourse and set up the environment on my CentOS 7.6 machine as below.
$ wget https://concourse-ci.org/docker-compose.yml
$ docker-compose up -d
Then I logged in with `fly --target example login --team-name main --concourse-url http://192.168.77.140:8080/ -u test -p test`.
I can see the following:
[root@centostest ~]# fly targets
name url team expiry
example http://192.168.77.140:8080 main Sun, 16 Jun 2019 02:23:48 UTC
I used the pipeline YAML below, named 2.yaml:
---
resources:
  - name: my-git-repo
    type: git
    source:
      uri: https://github.com/ruanbekker/concourse-test
      branch: basic-helloworld

jobs:
  - name: hello-world-job
    public: true
    plan:
      - get: my-git-repo
      - task: task_print-hello-world
        file: my-git-repo/ci/task-hello-world.yml
Then I ran the commands below step by step.
fly -t example sp -c 2.yaml -p pipeline-01
fly -t example up -p pipeline-01
fly -t example tj -j pipeline-01/hello-world-job --watch
But it just hangs there with no useful response, as shown below.
[root@centostest ~]# fly -t example tj -j pipeline-01/hello-world-job --watch
started pipeline-01/hello-world-job #3
Theoretically, it should print something like below.
Cloning into '/tmp/build/get'...
Fetching HEAD
292c84b change task name
initializing
running echo hello world
hello world
succeeded
Where did I go wrong? Thanks.
Welcome to Concourse!
One thing that can be confusing when starting with Concourse is understanding when Concourse detects that the pipeline has changed and what happens if the pipeline is one file or multiple files.
Your pipeline (like the majority of real-world pipelines) is "nested": the main pipeline file 2.yaml refers to a task file named my-git-repo/ci/task-hello-world.yml.
What sets Concourse apart from other CI systems is that:
1) the main pipeline file (2.yaml) can reside anywhere, even in a different repository.
2) due to 1, Concourse is unable to detect a change to the main pipeline file; you have to tell Concourse that the file has changed, either with fly set-pipeline or with automated means such as the concourse-pipeline-resource.
So the following errors happen often:
1) Changing the main pipeline file, committing and pushing, and expecting Concourse to pick up the change. Missing: you have to run fly set-pipeline.
2) Once fly set-pipeline becomes second nature, you can stumble upon the opposite error: changing both the main pipeline file and the nested task file, running set-pipeline, but not pushing. In this case, the only changes picked up by Concourse are the ones to the main pipeline file, not to the task file. Missing: commit and push.
From the description of your problem, I have the feeling that it is a mixture of the gotchas I mentioned.
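In practice the update loop looks roughly like this (a sketch using the target and pipeline names from above; the git steps assume the task file lives in a repository you can push to):
# After editing the main pipeline file (2.yaml): tell Concourse about it.
fly -t example set-pipeline -p pipeline-01 -c 2.yaml
# After editing the nested task file (ci/task-hello-world.yml): commit and push,
# because Concourse fetches it from the my-git-repo resource at build time.
git add ci/task-hello-world.yml
git commit -m "update hello-world task"
git push
# Then trigger and watch the job again.
fly -t example trigger-job -j pipeline-01/hello-world-job --watch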