I am exploring Talend at work, I was asked if Talend supports batch processing as in running the job in multiple threads. After going through the user guide I understood threading is possible with sub jobs. I would like to know if it is possible to run the a job with a single action in parallel
Talend has excellent multi threading support. There are two basic methods for this. One method gives you more control and is implemented using components. The other method is implemented as job setting.
For the first method see my screenshot. I use tParallelize to load three files into three tables at the same time. Then when all three files are successfully loaded I use the same tParallelize to set the values of a control table. tParallelize can also be connected to tRunJob as easily as a subjob.
The other method is described very well here in Talend Help: Talend Help- Run Jobs in Parallel
Generally I recommend the first method because of the control it gives you, but if your job follows the simple pattern described in the help link, that method works as well.
Related
I am currently looking into SAS Viya 3.4 to replace SAS 9.4.
Now I was curious to see the possibilities of the Environment Manager in scheduling Jobs and mantaining and creating Job flows. However, I noticed that I could only Drag and Drop Jobs in a flow and connect them with very few configurable options. Also as a trigger to start a Jobflow I was only able to select a time event. I am wondering if there are other trigger types to choose from. Like a Job will be triggered if a specific table exists or a file exists [or ...]. Neither did I see the possibility to trigger/start a job based on the return code of the previous job.
Also it does not seem to be smart enough to make sure two jobs don't access a library with write access at the same time.
I can't see how SAS Viya could replace a Job Orchestration Tool. However, I feel like the tool was built to replace such an Orchestration Tool. Did I miss something or is it just not possible to do so with the Environment Manager in SAS Viya?
Any help/insights is highly appreciated. I already searched through the documentation but could not find anything.. Maybe I was just looking at the wrong place?
Why 3.4 and not 3.5 (or Viya 4)?
If you want to use Viya with your own Job Orchestration software you can consider this tool (built by my team): https://cli.sasjs.io/job/
We deployed it on Jenkins for this customer: https://www.sas.com/en_us/news/press-releases/2021/july/sas-partnership-with-lloyds-list-intelligence.html
We would like to migrate the scheduling and sequence control of some Kettle import jobs from a proprietary implementation to a Spring Batch flavour, good practice implementation.
I intend to use Spring Cloud Data Flow (SCDF) server to implement and run a configurable sequence of the existing external import jobs.
The SCDF console Task editor UI seems promising to assemble a flow. So one Task wraps one Spring Batch, which in a single step only executes a Tasklet starting and polling the Carte REST API. Does this make sense so far?
Would you suggest a better implementation?
Constraints and Requirements:
The external Kettle jobs are triggered and polled using Carte REST API. Actually, it's one single Kettle job implementation, called with individual parameters for each entity to be imported.
There is a configurable, directed graph of import jobs for several entities, some of them being dependent on a correct import of the previous entity type. (e.g. Department, then Employee, then Role assignments...)
With the upcoming implementation, we would like to get
monitoring and controlling (start, abort, pause, resume)
restartability
easy reconfigurability of the sequence in production (possibly by GUI, or external editor)
possibly some reporting and statistics.
As my current understanding, this could be achieved by using Spring Cloud Data Flow (SCDF) server, and some Task / Batch implementation / combination.
Correct me if I'm wrong, but a single Spring Batch job with its hardwired flow seems not very suitable to me. Or is there an easy way to edit and redeploy a Spring Batch with changed flow in production? I couldn't find anything, not even an easy to use editor for the XML representation of a batch.
Yes, I believe you can achieve your design goals using Spring Cloud Data Flow along with the Spring Cloud Task/Spring Batch.
The flow of multiple Spring Batch Jobs (using the Composed Task) can be managed using Spring Cloud Data Flow as you pointed from the other SO thread.
The external Kettle jobs are triggered and polled using Carte REST API. Actually, it's one single Kettle job implementation, called with individual parameters for each entity to be imported.
There is a configurable, directed graph of import jobs for several entities, some of them being dependent on a correct import of the previous entity type. (e.g. Department, then Employee, then Role assignments...)
Again, both the above can be managed as a Composed Task (with the composed task consisting of a regular task as well as Spring Batch based applications).
You can manage the parameters passed to each task/batch upon invocation via batch job parameters or task/batch application properties or simply command-line arguments.
With the upcoming implementation, we would like to get
monitoring and controlling (start, abort, pause, resume)
restartability
easy reconfigurability of the sequence in production (possibly by GUI, or external editor)
possibly some reporting and statistics.
Spring Cloud Data Flow helps you achieve these goads. You can visit the Task Developer Guide and the Task Monitoring Guide for more info.
You can also check the Batch developer guide from the site as well.
can we generate a Talend Job through its Java code? For the sake of simplicity consider a simple Talend Job without any context variables for different environments and other complications.
Java code for a job is generated by the Studio, but there is no reverse operation consisting in generating a job from its corresponding Java code.
So, the short answer is "No, you can't".
You may consider to generate other files such as job.properties and job.item which contain the configuration for each component used by a job, but I'm afraid it could be very hazardous and at least a very long trip to try to do that.
My team is new to developing these things and I came into a project that is defining an over-arching workflow using separate processes that are all defined under the same project. So it appears that right now the processes defined are all discrete units, and the plan was to connect these units together using inputs and outputs.
Based on the documentation it looks like the best-practicey way of doing this would be to define the entire, over-arching workflow using sub-process tasks.
So I wonder:
Is the implementation we've started workable?
or
Should I only have one process unit per one workflow, which defines sub-processes if the workflow is too complicated and has discrete parts?
It's fine to separate out certain parts of the process into its own process, and then call those from some sort of parent process. The task you should use in the parent process is called reusable sub-process, or call activity. It's absolutely fine to have multiple processes in the same project.
I need to know that how can we run a single job in parallel with different parameters in talend.
The answer is straightforward, but rather depends on what you want, and whether you are using free Talend or commercial.
As far as parameters go, make sure that your jobs are using context variables - this is the preferred way of passing parameters in.
As for running in parallel, there are a few options.
Talend's studio is a java code generator, so you can export your job (it's just java code) and run it wherever you want. How you invoke it is up to you - schedule it, invoke it N times manually, your call. Obviously, if your job touches shared resources then making it safe to run in parallel is up to you - the usual concurrency issues apply.
If you have the commercial product, then you can use the Talend admin centre (TAC). The TAC allows you to schedule a job more than once with different contexts. Or, if you want to keep the parallelization logic inside your job, then consider using the tParallelize component in one job to run another job N times.