Is there any way to schedule non-Hadoop jobs in Talend - scheduler

Can anyone suggest a method to schedule non-Hadoop jobs in Talend Open Studio for Big Data?
I have seen a scheduler using Oozie, but it works for Hadoop-related jobs only.

You should be able to export your job as a JAR and then use your system's scheduler to run the JAR as usual.
To do this on Windows, schedule java.exe with the location of your exported job JAR as the argument.
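For example, a rough sketch using the Windows Task Scheduler command line; the task name, start time, and JAR path below are hypothetical and depend on where you exported the job:
schtasks /create /tn "TalendDailyJob" /sc daily /st 06:00 /tr "java -jar C:\talend\jobs\MyJob.jar"
On Linux the equivalent would be a cron entry invoking java -jar against the exported JAR.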

Related

Application kill of Spark on YARN via Zeppelin

Is there a recommended way to kill a Spark application on YARN from inside Zeppelin (using Scala)? In the spark-shell I use
:q
and it cleanly exits the shell, kills the application on yarn, and unreserves the cores I was using.
I've found that I can use
sys.exit
which does kill the application on yarn successfully, but it also throws an error and requires that I restart the interpreter if I want to start a new session. If I'm actively running another notebook with a separate instance of the same interpreter then sys.exit isn't ideal because I can't restart the interpreter until I've finished the work in the second notebook.
You probably want to go to the YARN UI and kill the application there. The YARN ResourceManager UI, which typically runs on the master node, should be reachable on port 8088. However, this will still require a restart of the interpreter as well.
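If you prefer the command line, the standard YARN CLI does the same thing as the UI's kill button; the application ID below is just a placeholder that you would look up first:
yarn application -list
yarn application -kill application_1476712345678_0001
The first command lists the running applications so you can find the ID of the Spark app that Zeppelin started.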
Ideally you let YARN deal with this, though. Just because Zeppelin starts Spark with a specified number of executors and cores doesn't mean these are "reserved" in the way you think: those cores are still available to other containers. YARN manages these resources very well. Unless you have a limited cluster and/or are doing something that requires squeezing every last drop of resources out of YARN, you should be fine to leave the Spark application that Zeppelin is using alone.
You could try restarting the Zeppelin Spark interpreter (which can be done from the interpreter settings page). This should kill the Zeppelin app; the interpreter (and hence the Zeppelin app) will only be restarted when you try executing a paragraph again.

Installing dependencies/libraries for EMR for spark-shell

I am trying to add extra libraries to the Scala used through spark-shell on an Elastic MapReduce instance, but I am unsure how to go about this. Is there a build tool that is used when spark-shell runs?
All I need to do is install a Scala library and have it available in the spark-shell version of Scala. I'm not sure how to go about this, since I'm not sure how the EMR instance installs Scala and Spark.
I think this answer will evolve with the information you give. For now, assuming that you have an AWS EMR cluster deployed on which you wish to use spark-shell, there are several options:
Option 1: You can copy your libraries to the cluster with the scp command and add them to your spark-shell session with the --jars option, e.g.:
From your local machine:
scp -i awskey.pem /path/to/jar/lib.jar hadoop@emr-cluster-address:/path/to/destination
On your EMR cluster:
spark-shell --master yarn --jars lib.jar
Spark uses the following URL scheme to allow different strategies for disseminating jars:
file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
Option 2: You can copy your libraries from S3 to the cluster and add them with the --jars option.
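As a hedged sketch, assuming a hypothetical bucket name and that the AWS CLI is available on the master node (paths are illustrative):
aws s3 cp s3://your-bucket/libs/lib.jar /home/hadoop/lib.jar
spark-shell --master yarn --jars /home/hadoop/lib.jar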
Option 3: You can use the --packages option to load libraries from a remote repository. You can include any other dependencies by supplying a comma-delimited list of Maven coordinates; all transitive dependencies will be handled when using this option. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the --repositories flag. These options can be used with pyspark, spark-shell, and spark-submit to include Spark Packages.
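For example, the following pulls a library by its Maven coordinates; the coordinates shown are only illustrative and should be replaced with whatever library you actually need:
spark-shell --master yarn --packages com.databricks:spark-csv_2.10:1.5.0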
For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.

Talend jobs deployment

I am new to Talend Open Studio and I'd like to develop a job on a MacBook or a Windows PC, then export the job and run it on a Linux server as a scheduled job (i.e. via cron).
The job will involve extracting data from two Oracle databases on different servers, getting data from a CSV file on another server, and then inserting the extracted data into another Oracle database server.
Can this be achieved?
Do I need to install the same Talend release on the Linux server?
Please advise what software I need to install on the Linux server for it to work.
Thanks in advance
- R
All you need on the Linux box is a JRE, preferably 1.7, although 1.6 can work if you build your jobs for it.
Then you build your job in Talend. This creates a zip file including all the dependencies; you extract these zips on Linux. They can live in the same folder so the dependencies are not stored twice.
Then schedule the generated .sh script in cron.
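Purely as an illustration, a crontab entry along these lines would run the exported launcher every day at 6 am; the script name and paths depend on your job name and where you extracted the zip:
0 6 * * * /opt/talend/jobs/MyJob/MyJob_run.sh >> /var/log/talend/MyJob.log 2>&1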
I'd also suggest using the built-in project-level logging, so you would know:
when the job started,
what the error messages were,
and, if you use tFlowMeter, how many records were loaded.

Running SBT (Scala) on several (cluster) machines at the same time

So I've been playing with Akka Actors for a while now, and have written some code that can distribute computation across several machines in a cluster. Before I run the "main" code, I need to have an ActorSystem waiting on each machine I will be deploying over, and I usually do this via a Python script that SSH's into all the machines and starts the process by doing something like cd /into/the/proper/folder/ and then sbt 'run-main ActorSystemCode'.
I run this Python script on one of the machines (call it "Machine X"), so I see the output of SSHing into all the other machines in my Machine X SSH session. Whenever I run the script, it seems all the machines re-compile the entire codebase before actually running it, making me sit there for a few minutes before anything useful happens.
My question is this:
Why do they need to re-compile at all? The same JVM is available on all machines, so shouldn't it just run immediately?
How do I get around this problem of making each machine compile "its own copy"?
sbt is a build tool, not an application runner. Use sbt-assembly to build an all-in-one JAR, put that JAR on each machine, and run it with the scala or java command.
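A rough sketch of that workflow, assuming you have added the sbt-assembly plugin to the project; the JAR name and paths below are hypothetical (they depend on your project name and Scala version), and ActorSystemCode is the main class from the question:
sbt assembly
scp target/scala-2.11/myproject-assembly-1.0.jar user@machine:/opt/actors/
java -cp /opt/actors/myproject-assembly-1.0.jar ActorSystemCode
The first command builds the fat JAR once on your development machine, the second copies it out (or you can drop it on a shared mount, as described below), and the third runs it on each node with no compilation step.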
It's usual for a cluster to have a single partition mounted on every node (via NFS or Samba). You just need to copy the artifact onto that partition and it will be directly accessible on each node. If that's not the case, you should ask your sysadmin to set one up.
Then you will need to launch the application. Again, most clusters come with MPI. The tools mpirun (or mpiexec) are not restricted to real MPI applications and will launch any script you want on several nodes.
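Purely as an illustration, assuming a hypothetical host file and a wrapper script that starts the assembled JAR on each node:
mpirun -np 4 -hostfile machines.txt /shared/actors/start-actor-system.sh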

Scheduling script execution in Eclipse

Does Eclipse provide a way to schedule the execution of a Java script? (e.g., if I need to run a script at 6 am every day...)
Thanks.
I'm not familiar with any built-in service. Nevertheless, Eclipse is written in Java, so there's no problem writing your own Java code to do this. You can use the Timer class to schedule your code's execution.
You "script" should be enclosed in an Eclipse plug-in. Use the Activator.start to schedule your process when the plugin loads.