Need solution to schedule Spark jobs - Scala

I am new to Spark. In our project, we have converted seven PL/SQL scripts into Scala-Spark.
The existing PL/SQL scripts are scheduled as jobs on Talend. Each script is scheduled as a separate job, and these seven jobs run in sequence: the second job starts only after the first completes successfully, and so on through the seventh.
My team is exploring other ways to schedule the Scala-Spark programs as jobs. One suggestion was to rewrite the orchestration job that currently runs on Talend in Scala. I have no idea if that is possible.
So, could anyone let me know whether the same can be done in Scala?

You can submit your Spark job from Talend using the tSystem or tSSH component and get the return code (exit code) from that component. If the exit code is 0 (success), you can submit the next Spark job. We did the same in our project. If you want to do the chaining without Talend at all, the same exit-code logic can live in Scala, as sketched below.
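A minimal sketch of that exit-code chaining in plain Scala, assuming the seven converted scripts are packaged as JARs and spark-submit is on the PATH; the jar names and main classes here are hypothetical placeholders:

import scala.sys.process._

object JobChain {
  // Hypothetical (jar, main class) pairs for the seven converted scripts, in run order.
  val jobs = Seq(
    ("job1.jar", "com.example.Job1"),
    ("job2.jar", "com.example.Job2")
    // ... jobs 3 through 7
  )

  def main(args: Array[String]): Unit = {
    for ((jar, mainClass) <- jobs) {
      // ! runs the command and returns its exit code; spark-submit returns 0 on success.
      val exitCode = Seq("spark-submit", "--class", mainClass, jar).!
      if (exitCode != 0)
        sys.error(s"$mainClass failed with exit code $exitCode; aborting the chain.")
    }
  }
}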

Related

Talend Automation Job taking too much time

I developed a job in Talend, built it, and automated it to run from the Windows batch file produced by the build.
When the batch file is executed, it invokes the dimtableinsert job and, after that finishes, invokes fact_dim_combine. This takes just minutes to run in Talend Open Studio, but when I invoke the batch file via the Task Scheduler it takes hours to finish.
Time taken:
Manual -- 5 minutes
Automation -- 4 hours (on invoking the Windows batch file)
Can someone please tell me what is wrong with this automation process?
The delay in execution is likely a latency issue. Talend might be installed on the same server as the database instance, so whenever you execute the job from Talend it completes as expected. But the scheduler might be installed on another server, and when you call the job through the scheduler it takes more time to insert the data.
Make sure your scheduler and database instance are on the same server.
Execute the batch file directly in the Windows terminal and check whether you have the same issue.
The easiest way to know what is taking so much time is to add some logs to your job.
First, add tWarn components at the start and end of each subjob (dimtableinsert and fact_dim_combine) to find out which one takes longest.
Then add more logs before/after the components inside the jobs.
This way you should have a better idea of what is responsible for the slowdown (DB access, writing of some files, etc.).

Launch spark job on-demand from code

What is the recommended way to launch a Spark job on-demand from within an enterprise application (in Java or Scala)? There is a processing step which currently takes several minutes to complete. I would like to use a Spark cluster to reduce the processing down to, let's say, less than 15 seconds:
Rewrite the time-consuming process in Spark and Scala.
The parameters would be passed to the JAR as command-line arguments. The Spark job then acquires the source data from a database, does the processing, and saves the output in a location readable by the enterprise application.
Question 1: How to launch the Spark job on-demand from within the enterprise application? The Spark cluster (standalone) is on the same LAN but separate from the servers on which the enterprise app is running.
Question 2: What is the recommended way to transmit the processing results back to the caller code?
Question 3: How to notify the caller code about job completion (or failure, such as the Spark cluster being down, a job timeout, or an exception in the Spark code)?
You could try spark-jobserver. Upload your spark.jar to the server, then call the job in your spark.jar from your application using the REST interface. To know whether your job has completed, you can keep polling the REST interface. When your job completes, if the result is very small you can get it from the REST interface itself, but if the result is huge it is better to save it to some DB. A rough sketch of the submit-and-poll flow is below.
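A rough sketch of that flow, assuming spark-jobserver's documented /jobs endpoints on its default port 8090; the host, app name, and class path are placeholders, and the jobId is pulled out of the response JSON with a crude regex to keep the sketch dependency-free (use a real JSON library in production):

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object JobServerClient {
  val client = HttpClient.newHttpClient()
  val base = "http://jobserver-host:8090" // placeholder host

  def send(req: HttpRequest): String =
    client.send(req, HttpResponse.BodyHandlers.ofString()).body()

  def main(args: Array[String]): Unit = {
    // Submit: appName/classPath must match the jar you uploaded beforehand.
    val submit = HttpRequest.newBuilder(URI.create(s"$base/jobs?appName=myApp&classPath=com.example.MyJob"))
      .POST(HttpRequest.BodyPublishers.noBody()).build()
    val submitted = send(submit)
    val jobId = "\"jobId\"\\s*:\\s*\"([^\"]+)\"".r
      .findFirstMatchIn(submitted).map(_.group(1))
      .getOrElse(sys.error(s"no jobId in response: $submitted"))

    // Poll until the job leaves the RUNNING state.
    def status() = send(HttpRequest.newBuilder(URI.create(s"$base/jobs/$jobId")).GET().build())
    var s = status()
    while (s.contains("RUNNING")) { Thread.sleep(2000); s = status() }
    println(s) // a FINISHED response carries the (small) result payload
  }
}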

how to monitor a job from another job in talend open studio 5.3.1 version

Hi, I am a beginner with Talend Open Studio 5.3.1.
I am currently facing an issue in my project: I need to schedule a job to run every 10 seconds that monitors another job and outputs its status, i.e. whether that job is running or idle.
Is this possible with Talend Open Studio 5.3.1?
Please explain how to schedule a job to run every 10 seconds and output the status of another job.
Can anyone suggest a way to solve my problem?
We should think a bit out of the box here. I'd solve this by using project-level logging: https://help.talend.com/display/TalendOpenStudioforBigDataUserGuide520EN/2.6+Customizing+project+settings
You'll have the job statuses stored in a database table; you just have to check whether the last execution of the job is still running or not (self-join the stats table, as in the sketch below).
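A minimal sketch of that check over JDBC, assuming a stats table where each run writes a 'begin' row and, on completion, an 'end' row sharing the same pid; the table and column names here (stats, job, pid, message_type) follow Talend's usual project-logging layout but should be treated as assumptions and adjusted to your project settings:

import java.sql.DriverManager

object JobStatusCheck {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details; the stats DB is whatever you configured in project settings.
    val conn = DriverManager.getConnection("jdbc:mysql://dbhost/talend_logs", "user", "pass")
    // A run is still "running" if its 'begin' row has no matching 'end' row for the same pid.
    val sql =
      """SELECT b.pid FROM stats b
        |LEFT JOIN stats e ON e.pid = b.pid AND e.message_type = 'end'
        |WHERE b.job = ? AND b.message_type = 'begin' AND e.pid IS NULL""".stripMargin
    val st = conn.prepareStatement(sql)
    st.setString(1, "monitored_job") // hypothetical job name
    val rs = st.executeQuery()
    println(if (rs.next()) "running" else "idle")
    conn.close()
  }
}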
Monitoring jobs is not supported in Talend Open Studio, but there are some workarounds:
Use a master job that launches the job to be monitored via the tRunJob component; the master job will then have an idea of what's going on.
Use empty files to synchronize your jobs: each monitored job creates an empty file with a distinctive name, and the master job checks for those files to learn the other jobs' states.
Much easier is to use Quartz; a minimal scheduling sketch follows.
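A minimal sketch of the Quartz approach in Scala, assuming the Quartz library is on the classpath; checkOtherJob() is a hypothetical placeholder for whichever status check you pick (stats table, marker files, ...):

import org.quartz.{Job, JobBuilder, JobExecutionContext, SimpleScheduleBuilder, TriggerBuilder}
import org.quartz.impl.StdSchedulerFactory

// Quartz instantiates this class for every firing; it runs every 10 seconds.
class MonitorJob extends Job {
  override def execute(ctx: JobExecutionContext): Unit =
    println(s"monitored job status: ${MonitorJob.checkOtherJob()}")
}

object MonitorJob {
  def checkOtherJob(): String = "running" // stub: replace with a real status check

  def main(args: Array[String]): Unit = {
    val scheduler = StdSchedulerFactory.getDefaultScheduler
    val job = JobBuilder.newJob(classOf[MonitorJob]).withIdentity("monitor").build()
    val trigger = TriggerBuilder.newTrigger()
      .startNow()
      .withSchedule(SimpleScheduleBuilder.simpleSchedule().withIntervalInSeconds(10).repeatForever())
      .build()
    scheduler.scheduleJob(job, trigger)
    scheduler.start()
  }
}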

How to trigger a spark job without using "spark-submit"? real-time instead of batch

I have a Spark job which I normally run with spark-submit, with the input file name as the argument. Now I want to make the job available to the team, so people can submit an input file (probably through some web API), the Spark job will be triggered, and the result file will be returned to the user (probably also through the web API). (I am using Java/Scala.)
What do I need to build in order to trigger the Spark job in such a scenario? Is there a tutorial somewhere? Should I use Spark Streaming for such a case? Thanks!
One way to go is to have a web server listening for jobs, with each web request potentially triggering an execution of spark-submit.
You can do this using Java's ProcessBuilder; a rough sketch is below.
To the best of my knowledge, there is no good way of invoking Spark jobs other than through spark-submit.
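A rough sketch of that approach, assuming spark-submit is on the PATH; the main class, master URL, jar path, and log location are placeholders you would wire up to your web framework's request handler:

import java.io.File

object SparkRunner {
  // Called from your web layer with the path of the input file the user uploaded.
  def runSparkJob(inputFile: String): Int = {
    val pb = new ProcessBuilder(
      "spark-submit",
      "--class", "com.example.MyJob",          // placeholder main class
      "--master", "spark://master-host:7077",  // placeholder cluster URL
      "/opt/jobs/my-job.jar",                  // placeholder jar path
      inputFile)
    pb.redirectErrorStream(true) // merge stderr into stdout for simpler logging
    pb.redirectOutput(new File(s"/tmp/spark-job-${System.currentTimeMillis}.log"))
    pb.start().waitFor()         // blocks; returns spark-submit's exit code
  }
}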
You can use Livy.
Livy is an open source REST interface for using Spark from anywhere.
Livy is a new open source Spark REST server for submitting and interacting with your Spark jobs from anywhere. Livy is conceptually based on the incredibly popular IPython/Jupyter, but implemented to better integrate into the Hadoop ecosystem with multiple users. Spark can now be offered as a service to anyone in a simple way: Spark shells in Python or Scala can be run by Livy in the cluster while the end user manipulates them at their convenience through a REST API. Regular non-interactive applications can also be submitted.
The output of the jobs can be introspected and returned in a tabular format, which makes it visualizable in charts. Livy can point to a unique Spark cluster and create several contexts per user. With YARN impersonation, jobs will be executed with the actual permissions of the users submitting them.
Please check this URL for more info.
https://github.com/cloudera/livy
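For non-interactive applications, Livy's batch API amounts to POSTing a small JSON document; a minimal sketch, assuming Livy's documented /batches endpoint on its default port 8998, with a placeholder host, jar path, and class name:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LivySubmit {
  def main(args: Array[String]): Unit = {
    val client = HttpClient.newHttpClient()
    // The jar must be readable by the cluster (e.g. an HDFS path); values are placeholders.
    val body = """{"file": "hdfs:///jobs/my-job.jar", "className": "com.example.MyJob"}"""
    val req = HttpRequest.newBuilder(URI.create("http://livy-host:8998/batches"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()
    val resp = client.send(req, HttpResponse.BodyHandlers.ofString())
    // The response JSON carries the batch id; poll GET /batches/<id>/state for progress.
    println(resp.body())
  }
}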
You can use the SparkLauncher class to do this. You will need a REST API that takes the file from the user and then triggers the Spark job using SparkLauncher.
Process spark = new SparkLauncher()
    .setAppResource(job.getJarPath())    // path to the application jar
    .setMainClass(job.getMainClass())    // fully qualified main class
    .setMaster("spark://" + this.serverHost + ":" + this.port) // cluster master URL
    .launch();
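Note that launch() returns a plain java.lang.Process, so the caller can waitFor() it to learn when the job ends. On Spark 1.6+, SparkLauncher also offers startApplication(), which returns a SparkAppHandle whose listener is called back on state changes (running, finished, failed); that is a cleaner fit for notifying the caller than polling the process.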

Running parallel jobs in Jenkins

I'm using Jenkins for my builds, and I wrote some test scripts that I need to run after the build is compiled.
I want to save some time, so I have to run the test scripts in parallel. How can I do that?
EDIT: OK, I understand now that I need a separate job for each test (for 4 tests I need 4 jobs, right?).
So I did that, and ran these jobs from the parent job (using the "build other projects" plugin).
But I didn't manage to aggregate the results (using "aggregate downstream test results"): the parent job exits before the downstream jobs have finished.
What shall I do?
Thanks.
You can use the Multijob plugin. It allows you to run multiple jobs in parallel, and the parent job waits for the sub-jobs to complete. The parent job's status can be determined from the sub-jobs' statuses.
https://wiki.jenkins-ci.org/display/JENKINS/Multijob+Plugin
Jenkins doesn't really allow you to run things in parallel. You can however split your build into different jobs to achieve this. It would look like this.
Job to compile the source runs.
Jobs that run the tests are triggered by the completion of the compilation and start running. They copy compilation results from the previous job into their workspaces.
This is a bit kludgy, though. The better alternative would be to parallelise within the scripts that run the tests, i.e. you run a single script which then runs the tests in parallel. If this is not possible, you'll have to split into different jobs.
Have you looked at the Jenkins Join plugin? I have not used it, but I believe it does what you are attempting to accomplish.
Actually you can, but you will need some coding behind it.
In my case, I have parallel test execution on jenkins.
1) Create a small job with parameters that is supposed to do a test run with a small suite
2) Edit this job to run on a list of slaves (where you have the proper environment)
3) Edit this build to allow concurrent builds
And now the hard part.
4) Create a small java program for computing the list of parameters for each job to run.
5) Iterate through the list and launch a new Jenkins job on a new thread.
Put a Thread.sleep(5000) between runs in order to avoid communication errors.
6) Join the threads.
At the end of each job, I send the results to a shared location in order to perform some reporting at the end of all tests.
To start a Jenkins job with parameters, use the Jenkins CLI.
I intend to make my code as generic as possible and publish it in case anyone else needs it. A rough sketch of steps 5 and 6 is below.
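A rough sketch of steps 5 and 6 (in Scala rather than Java, but the idea is identical), assuming jenkins-cli.jar is available locally and the test job accepts a SUITE parameter; the job name, server URL, and parameter are hypothetical:

import scala.sys.process._

object ParallelRuns {
  def main(args: Array[String]): Unit = {
    val suites = Seq("suiteA", "suiteB", "suiteC") // the computed parameter list (step 4)
    val threads = suites.map { suite =>
      new Thread(() => {
        // The second -s makes the CLI wait until the build finishes.
        Seq("java", "-jar", "jenkins-cli.jar",
            "-s", "http://jenkins-host:8080",
            "build", "test-job", "-s", "-p", s"SUITE=$suite").!
      })
    }
    threads.foreach { t => t.start(); Thread.sleep(5000) } // stagger starts (step 5)
    threads.foreach(_.join())                              // wait for all runs (step 6)
  }
}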
You can use https://wiki.jenkins-ci.org/display/JENKINS/Build+Flow+Plugin with code like this:
parallel (
    // jobs 1, 2 and 3 will be scheduled in parallel
    { build("job1") },
    { build("job2") },
    { build("job3") }
)
You can use any one of the following:
Multijob plugin
Build Flow plugin