Run a single job in parallel - Talend

I need to know how we can run a single job in parallel with different parameters in Talend.

The answer depends on what you want, and on whether you are using the free or the commercial edition of Talend.
As far as parameters go, make sure that your jobs are using context variables - this is the preferred way of passing parameters in.
As for running in parallel, there are a few options.
Talend Studio is a Java code generator, so you can export your job (it's just Java code) and run it wherever you want. How you invoke it is up to you - schedule it, invoke it N times manually, your call. Obviously, if your job touches shared resources then making it safe to run in parallel is up to you - the usual concurrency issues apply.
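For example, if you go the export route, a small Java driver can launch the exported build several times concurrently, once per parameter value. This is only a sketch: the launcher path (./MyJob/MyJob_run.sh) and the context variable name (fileId) are made-up placeholders, but passing context values as --context_param name=value arguments is the standard way exported Talend jobs receive them.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch: run the exported Talend job several times in parallel, each with a
    // different value for the (assumed) context variable "fileId". The launcher
    // path is a placeholder for whatever your export actually produced.
    public class ParallelJobLauncher {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<Future<Integer>> results = new ArrayList<>();

            for (int i = 1; i <= 10; i++) {
                final int id = i;
                Callable<Integer> oneRun = () -> {
                    Process p = new ProcessBuilder(
                            "./MyJob/MyJob_run.sh",
                            "--context_param", "fileId=" + id)  // context value for this instance
                            .inheritIO()
                            .start();
                    return p.waitFor();                          // exit code of that job instance
                };
                results.add(pool.submit(oneRun));
            }

            for (Future<Integer> result : results) {
                System.out.println("Job finished with exit code " + result.get());
            }
            pool.shutdown();
        }
    }

Nothing here is Talend-specific beyond the --context_param convention; the same driver would work for any command-line job, and the usual caveats about shared resources still apply.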
If you have the commercial product, then you can use the Talend Administration Center (TAC). The TAC allows you to schedule a job more than once with different contexts. Or, if you want to keep the parallelization logic inside your job, consider using the tParallelize component in one job to run another job N times.

Related

Talend job batch processing

I am exploring Talend at work, and I was asked whether Talend supports batch processing, as in running a job in multiple threads. After going through the user guide I understood that threading is possible with subjobs. I would like to know if it is possible to run a job with a single action in parallel.
Talend has excellent multi-threading support. There are two basic methods for this. One method gives you more control and is implemented using components. The other is implemented as a job setting.
For the first method, see my screenshot. I use tParallelize to load three files into three tables at the same time. Then, once all three files are successfully loaded, I use the same tParallelize to set the values of a control table. tParallelize can be connected to a tRunJob just as easily as to a subjob.
The other method is described very well in Talend Help: Run Jobs in Parallel.
Generally I recommend the first method because of the control it gives you, but if your job follows the simple pattern described in the help link, that method works as well.
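For what it's worth, the pattern tParallelize implements (run several subjobs concurrently, wait for all of them, then continue) maps onto plain Java quite directly; a rough analogy, with placeholder method names standing in for the actual loads, might look like this:

    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Analogy only: load three files in parallel, wait for all three, then update
    // the control table - the same synchronize-then-continue shape as tParallelize.
    public class ParallelizeAnalogy {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(3);

            List<Callable<Void>> loads = List.of(
                    asCallable(() -> loadFile("file1.csv", "table1")),
                    asCallable(() -> loadFile("file2.csv", "table2")),
                    asCallable(() -> loadFile("file3.csv", "table3")));

            // invokeAll blocks until every submitted task has finished
            for (Future<Void> f : pool.invokeAll(loads)) {
                f.get();                      // re-throws if any load failed
            }
            updateControlTable();             // runs only after all loads succeeded
            pool.shutdown();
        }

        private static Callable<Void> asCallable(Runnable r) {
            return () -> { r.run(); return null; };
        }

        private static void loadFile(String file, String table) {
            System.out.println("loading " + file + " into " + table);   // placeholder
        }

        private static void updateControlTable() {
            System.out.println("updating control table");               // placeholder
        }
    }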

Running Parallel Jobs and getting the aggregated results

I had a quick question about the workflow plugin. I'm trying to see if the plugin will be able to satisfy my use case:
1. We have a Jenkins job that will build our app.
2. We want to spin off a suite of test jobs that will perform various tests on the newly built app (unit, integration, etc.). These need to run in parallel, and we want to run them on more than one Jenkins node for performance reasons.
3. We'll take the aggregated output from all the test processes in step 2 and decide whether to deploy (everything passed) or not.
I was curious whether I'd be able to accomplish this with the plugin and, if so, whether you had any tips or pointers to get started.
Thanks!
You can certainly run nodes inside parallel branches. If one branch fails, the parallel step as a whole fails. If you want the build to succeed, but behave differently depending on test results, you can capture them directly as Groovy variables in various ways.
If you are using JUnitArchiver, it currently does not provide a simple means of exposing the test results directly to the Pipeline script (JENKINS-26276), though if you just want to tell whether there are some failures or none, you can inspect currentBuild.result.
If you have JUnit-format test results and wish to automatically split them amongst various nodes (especially helpful in case you have a large pool of machines and it would be unmaintainable to manually divide your tests), see this demo of the Parallel Test Executor plugin’s splitTests step.

How to connect separate processes under the same project (jBPM)

My team is new to developing these things, and I came into a project that defines an over-arching workflow using separate processes that are all defined under the same project. So right now the processes are all discrete units, and the plan was to connect these units together using inputs and outputs.
Based on the documentation, it looks like the best-practice way of doing this would be to define the entire over-arching workflow using sub-process tasks.
So I wonder:
Is the implementation we've started workable?
or
Should I have only one process per workflow, which defines sub-processes if the workflow is too complicated and has discrete parts?
It's fine to separate out certain parts of the process into their own processes, and then call those from some sort of parent process. The task you should use in the parent process is called a reusable sub-process, or call activity. It's absolutely fine to have multiple processes in the same project.
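For completeness, here's a hedged sketch of what the runtime side can look like when you start the parent process through the KIE API; the call activity nodes inside the parent are what invoke the reusable sub-processes. The process id "com.example.ParentWorkflow" and the variable name are placeholders, and using the default classpath container/session is an assumption about how your project is set up.

    import java.util.HashMap;
    import java.util.Map;

    import org.kie.api.KieServices;
    import org.kie.api.runtime.KieContainer;
    import org.kie.api.runtime.KieSession;
    import org.kie.api.runtime.process.ProcessInstance;

    // Sketch: start the over-arching (parent) process; its call activity nodes
    // invoke the reusable sub-processes defined in the same project.
    public class StartParentProcess {
        public static void main(String[] args) {
            KieServices ks = KieServices.Factory.get();
            KieContainer container = ks.getKieClasspathContainer();
            KieSession session = container.newKieSession();

            Map<String, Object> params = new HashMap<>();
            params.put("orderId", "1234");   // passed into the parent, and from there to sub-processes

            ProcessInstance pi = session.startProcess("com.example.ParentWorkflow", params);
            System.out.println("Started process instance " + pi.getId());

            session.dispose();
        }
    }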

Quartz Scheduler: dependent jobs

I'm working on a project with Quartz and have run into a problem with dependencies between jobs.
We have a setup where A and B aren't dependent on each other, but C depends on both:
A and B can run at the same time, but C can only run when both A and B are complete.
Is there a way to set this kind of scenario up in Quartz, so that C will only trigger when A and B finish?
Not directly, AFAIK, but it should not be too hard to use a TriggerListener to implement such functionality (a TriggerListener runs at both the start and the end of jobs, and you can set listeners up for individual triggers or for trigger groups).
EDIT: there is even a specific FAQ Topic about this problem:
There currently is no "direct" or "free" way to chain triggers with Quartz. However there are several ways you can accomplish it without much effort. Below is an outline of a couple approaches:

One way is to use a listener (i.e. a TriggerListener, JobListener or SchedulerListener) that can notice the completion of a job/trigger and then immediately schedule a new trigger to fire. This approach can get a bit involved, since you'll have to inform the listener which job follows which - and you may need to worry about persistence of this information. See the listener org.quartz.listeners.JobChainingJobListener which ships with Quartz - as it already has some of this functionality.

Another way is to build a Job that contains within its JobDataMap the name of the next job to fire, and as the job completes (the last step in its execute() method) have the job schedule the next job. Several people are doing this and have had good luck. Most have made a base (abstract) class that is a Job that knows how to get the job name and group out of the JobDataMap using pre-defined keys (constants) and contains code to schedule the identified job. This abstract Job's implementation of execute() delegates to an abstract template method such as "doWork()" (where the extending Job class's real work goes) and then it contains the code for scheduling the follow-up job. Then they simply make extensions of this class that included the work the job should do. The usage of 'durable' jobs, or the overloaded addJob(JobDetail, boolean, boolean) method (added in Quartz 2.2) helps the application define all the jobs at once with their proper data, without yet creating triggers to fire them (other than one trigger to fire the first job in the chain).

In the future, Quartz will provide a much cleaner way to do this, but until then, you'll have to use one of the above approaches, or think of yet another that works better for you.
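To make the listener approach concrete, here is a rough sketch of a JobListener that triggers C once both A and B have reported success. The job names are assumed, the completion state is kept in memory (so it will not survive a scheduler restart), and C is assumed to be stored as a durable job with no trigger of its own.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    import org.quartz.JobExecutionContext;
    import org.quartz.JobExecutionException;
    import org.quartz.JobKey;
    import org.quartz.Scheduler;
    import org.quartz.SchedulerException;
    import org.quartz.listeners.JobListenerSupport;

    // Rough sketch only: fires job "C" once jobs "A" and "B" have both completed.
    // In-memory state means a persistent variant would need to record completions
    // in the JobStore/database instead.
    public class JoinListener extends JobListenerSupport {

        private final Set<String> pending = ConcurrentHashMap.newKeySet();

        public JoinListener() {
            pending.add("A");
            pending.add("B");
        }

        @Override
        public String getName() {
            return "join-A-and-B-then-C";
        }

        @Override
        public void jobWasExecuted(JobExecutionContext context, JobExecutionException jobException) {
            if (jobException != null) {
                return; // don't count failed runs
            }
            pending.remove(context.getJobDetail().getKey().getName());
            if (pending.isEmpty()) {
                try {
                    Scheduler scheduler = context.getScheduler();
                    scheduler.triggerJob(JobKey.jobKey("C")); // "C" stored as a durable job
                } catch (SchedulerException e) {
                    getLog().error("Could not trigger job C", e);
                }
            }
        }
    }

You would register it with scheduler.getListenerManager().addJobListener(...), matched against the A and B jobs (for example with a key or group matcher).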

MATLAB distributed computing with SGE (qsub)

Recently I got access to run my code on a cluster. My code is totally parallelizable, but I don't know how best to exploit its parallel nature. I have to compute the elements of a big matrix, and each of them is independent of the others. I want to submit the job to run on several machines (say 100) to speed up the computation of the matrix.
Right now, I have a script that submits multiple jobs, each responsible for computing a part of the matrix and saving it in a .mat file. At the end I merge them to get the whole matrix. To submit each individual job, I create a new .m file (run1.m, run2.m, ...) that sets a variable and then runs the function to compute the associated part of the matrix. So basically run1.m is
id=1;compute_dists_matrix
and compute_dists_matrix uses id to find the part it is going to compute. I then wrote a script to create run1.m through run60.m and qsub them to the cluster.
I wonder if there is a better way to do this, for example using built-in MATLAB features, because this seems to be a very typical task.
Yes, it works, but it is not ideal, and as you say this is a common problem. MATLAB has the Parallel Computing Toolbox.
Does your cluster have it? If so, distributed arrays are worth a look. If you don't have access to it, then what you are doing is the only other way. You can wrap your run1.m, run2.m, ... in a controlling script to automate it for you.
I believe you could use command line arguments for the id and submit jobs with a range of values for this id. Command line arguments can be processed by launching MATLAB from the command line without the IDE and providing the name of the script to be executed and the list of arguments. I would think you can set up dependencies in your job manager and create a "reduce" script to merge the partial results (from files). The whole process could be managed from a single script that would generate the id & other necessary arguments and submit the processing & postprocessing jobs with dependencies.