Apache Airflow run sub DAG - workflow

I've recently started using Apache Airflow because I would like to use it to manage a fairly long DAG of tasks. Given its complexity, and the fact that some tasks can take a very long time, it would sometimes be very useful to run just one task or some subgraph, for test purposes or because new data only affects those particular tasks, ideally by specifying only the starting and ending tasks to be run (e.g. given a DAG [a >> b >> c >> d >> e], I would like to run from 'b' to 'd' only). Is it possible to accomplish this with Apache Airflow?
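For illustration, a minimal sketch of such a DAG with the standard Airflow Python API (the dag_id, the start_date and the DummyOperator placeholders are just for the example):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('linear_example', start_date=datetime(2016, 1, 1), schedule_interval=None)

# placeholder tasks standing in for the real, long-running ones
a, b, c, d, e = [DummyOperator(task_id=t, dag=dag) for t in 'abcde']

a >> b >> c >> d >> e

# a single task can be exercised in isolation from the CLI, e.g.:
#   airflow test linear_example b 2016-01-01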
Thanks in advance

Related

How to load balance jobs using Spring Batch when different nodes have different times?

We have so many batch jobs to handle.
Now the problem is that we have 7 different nodes which have the same application deployed (we use JBoss AS 7.1.1 as the application server), and we use Spring Batch with the Quartz scheduler to schedule jobs. And it works just fine.
But one of our nodes has a different time than the others (e.g. suppose we have 3 nodes A, B and C; when it's 12:00:00 on C, it's 11:58:00 on A and B), and all these nodes are maintained by the client.
So when any trigger fires (we use a cron trigger), the job runs on a single node only.
Now, when we need to fire more than one job at a specific time (take 12:00), all of them run on a single node, because all of that node's triggers fired earlier than the other nodes' (as 12 o'clock happens on C before A and B).
I was wondering, do we have any mechanism where we take a centralized time as the reference for triggering all batch processes (i.e. do not trigger a batch process when it's 12 o'clock on C, but run the batch job when it's 12 o'clock in the DB)?
Thanks in advance :).
Spring Batch provides facilities to launch jobs via messages in the spring-batch-integration module. I'd recommend managing the scheduling from a central point and having it send messages to the servers to be picked up based on the server's availability to run the job. This would also address the issue of time synchronization as the scheduling piece would be handled in a central point.
Ask your client to synchronize the servers using NTP. All of your servers should have the same time, period. You will have a bunch of other problems if you allow your servers to stay out of sync with each other.

Running parallel jobs in Jenkins

I'm using Jenkins for my builds, and I wrote some test scripts that I need to run after the compilation of the build.
I want to save some time, so I have to run the test scripts in parallel. How can I do that?
EDIT: OK, I understand now that I need a separate job for each test (for 4 tests I need 4 jobs, right?).
So I did that, and I ran these jobs via the parent job (using the "build other projects" plugin).
But I didn't manage to aggregate the results (using "aggregate downstream test results"): the parent job exits before the downstream jobs are finished.
What shall I do?
Thanks.
You can use the Multijob plugin. It allows you to run multiple jobs in parallel, and the parent job waits for the sub-jobs to complete. The parent job's status can be determined from the sub-jobs' statuses.
https://wiki.jenkins-ci.org/display/JENKINS/Multijob+Plugin
Jenkins doesn't really allow you to run things in parallel. You can however split your build into different jobs to achieve this. It would look like this.
Job to compile the source runs.
Jobs that run the tests are triggered by the completion of the compilation and start running. They copy compilation results from the previous job into their workspaces.
This is a bit kludgy, though. The better alternative would be to parallelise within the scripts that run the tests, i.e. you run a single script and this then runs the tests in parallel. If this is not possible, you'll have to split into different jobs, though.
Have you looked at the Jenkins Join Plugin? I have not used it, but I believe it does what you are attempting to accomplish.
- Mike
Actually you can, but you will need some coding behind it.
In my case, I have parallel test execution on jenkins.
1) Create a small job with parameters that is supposed to do a test run with a small suite
2) Edit this job to run on a list of slaves (where you have the proper environment)
3) Edit this build to allow concurrent builds
And now the hard part.
4) Create a small Java program that computes the list of parameters for each job to run.
5) Iterate through the list and launch a new Jenkins job on a new thread.
Put a Thread.sleep(5000) between runs in order to avoid communication errors.
6) Join the threads.
At the end of each job, I send the results to a shared location in order to perform some reporting at the end of all tests.
To start a Jenkins job with parameters, use the CLI (see the sketch below).
I intend to make my code as generic as possible and publish it if anyone else needs it.
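A rough sketch of steps 4) to 6), written in Python rather than Java for brevity; the Jenkins URL, job name and parameter list are made-up assumptions, and the standard jenkins-cli.jar is assumed to be in the working directory:

import subprocess
import threading
import time

JENKINS = 'http://jenkins.example.com:8080'  # assumed server URL
CLI = ['java', '-jar', 'jenkins-cli.jar', '-s', JENKINS]

def run_suite(suite):
    # 'build -s' blocks until the triggered run completes, so each
    # thread lives for as long as its job runs
    subprocess.check_call(CLI + ['build', 'param-test-job', '-s', '-p', 'SUITE=' + suite])

suites = ['login', 'checkout', 'search']  # step 4: the computed parameter list
threads = []
for suite in suites:  # step 5: one launch per parameter set, on its own thread
    t = threading.Thread(target=run_suite, args=(suite,))
    t.start()
    threads.append(t)
    time.sleep(5)  # pause between launches to avoid communication errors
for t in threads:  # step 6: join the threads
    t.join()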
You can use https://wiki.jenkins-ci.org/display/JENKINS/Build+Flow+Plugin with code like this
parallel (
    // jobs 1, 2 and 3 will be scheduled in parallel
    { build("job1") },
    { build("job2") },
    { build("job3") }
)
You can use any one of the following:
Multijob plugin
Build Flow plugin

How to make sure a task is executed before starting different services

We have the following scenario in upstart:
We have some task called T, and some services, A and B with the following requirements:
T must run completely isolated from the services A and B
Both A and B can only run if task T has completed
A and B can be started independently
In simple words, T is a requirement for both A and B, but running T doesn't necessarily mean that either A or B should be started.
How can we enforce these requirements in upstart? Adding other "helper" jobs is fine, of course.
We tried the following, that doesn't work:
# T.conf
task
start on (starting A or starting B)
The problem is that if T is already running when starting B, e.g. because A is already about to start, then B will just start without waiting for T to finish. This violates the first two requirements above.
Another option is to explicitly start T from the pre-start sections of the services. However, that causes a service to fail to start, instead of waiting, if T is already being executed.
There is a workaround using this extra helper task (better suggestions are still welcome):
# T-helper.conf: one instance per service that is about to start
start on (starting A or starting B)
task
instance $JOB
script
    # retry until a start of T succeeds and runs to completion
    until start T; do sleep 1; done
end script
This helper job is started just about when either A or B is about to start, blocking those services. There will be one instance of this task for each service. It will block until T is successfully completed.

matlab parallel processing on several nodes

I have studied pages and discussions on MATLAB parallel processing, but I still don't know how to distribute my program over several nodes (not cores). In the cluster I am using, there are 10 nodes available, and inside each node there are 8 cores. When using "parfor" inside each node (locally, between the 8 cores), the parallelization works fine. But when using several nodes, I think (not sure how to verify this) that it doesn't work well. Here is a piece of the program which I run on the cluster:
function testPool2()
    disp('This is a comment')
    disp(['matlab number of cores : ' num2str(feature('numCores'))])
    matlabpool('open',5);
    disp('This is another comment!!')
    tic;
    for i=1:10000
        b = rand(1,1000);
    end;
    toc
    tic;
    parfor i=1:10000
        b = rand(1,1000);
    end;
    toc
end
And the output is:
This is a comment
matlab number of cores : 8
Starting matlabpool using the 'local' profile ... connected to 5 labs.
This is another comment!!
Elapsed time is 0.165569 seconds.
Elapsed time is 0.649951 seconds.
{Warning: Objects of distcomp.abstractstorage class exist - not clearing this class or any of its super-classes}
{Warning: Objects of distcomp.filestorage class exist - not clearing this class or any of its super-classes}
{Warning: Objects of distcomp.serializer class exist - not clearing this class or any of its super-classes}
{Warning: Objects of distcomp.fileserializer class exist - not clearing this class or any of its super-classes}
The program is first compiled using "mcc -o out testPool2.m" and then transferred to a scratch drive of a server. Then I submit the job using Microsoft HPC Pack 2008 R2. Also note that I don't have access to the graphical interface of the MATLAB installed on each of the nodes; I can only submit jobs using the MS HPC Job Manager (see this: http://blogs.technet.com/b/hpc_and_azure_observations_and_hints/archive/2011/12/12/running-matlab-in-parallel-on-a-windows-cluster-using-compiled-matlab-code-and-the-matlab-compiler-runtime-mcr.aspx).
Based on the above output, we can see that the number of available cores is 8; so I infer that "matlabpool" only works for the local cores in a machine, not between nodes (separate computers connected to each other).
So, any ideas how I can generalize my for loop ("parfor") to several nodes?
PS. I have no idea what the warnings at the end of the output are!
In order to run MATLAB on multiple nodes, the MATLAB Distributed Computing Server is needed in addition to the Parallel Computing Toolbox. The Distributed Computing Server must be installed and correctly configured on all of the nodes in the cluster. Normally the MATLAB Distributed Computing Server comes with shell scripts for launching parallel MATLAB jobs on multiple nodes, depending on the scheduler and cluster setup.
Without access to the Distributed Computing Server, MATLAB can only be run on a single node. It would be worthwhile to verify with the cluster administrator that the Distributed Computing Server is set up and running correctly; in some cases the administrators of such clusters even have example scripts for launching and running the jobs common to their user base, e.g. MATLAB jobs.
Here is a link to documentation on the Distributed Computing Server:
http://www.mathworks.com/help/mdce/index.html?searchHighlight=distributed%20computing%20server
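Once the server is configured, the pool is opened against a cluster profile instead of the default 'local' one. A minimal sketch; the profile name and worker count here are assumptions, the real values come from your cluster configuration:

matlabpool('open', 'myClusterProfile', 40);  % profile supplied by the cluster admin
parfor i = 1:10000
    b = rand(1,1000);  % iterations now spread across workers on all nodes
end
matlabpool('close');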

Running certain Hadoop jobs only on a chosen node and not on the others, managing the process with Oozie

Is that even possible? I've searched quite a bit and I'd say it's not possible, but I find it strange that such a basic piece of functionality has not been foreseen.
If I have a cluster of 3 machines and one of them is dedicated to a part of the bigger process (let's say an action in Oozie), can't I tell Oozie to run that job only on node X and not on the other nodes?
I do not think you can force the Oozie launcher mapper to run on a certain node.