Quartz scheduler RAM vs JDBC for job store - quartz-scheduler

Can both RAM and JDBC job stores be used together? I wasn't able to find an answer to this in the official documentation.

No, they cannot be used together.
Usually a scheduler is obtained from the StdSchedulerFactory.
This factory uses a java.util.Properties instance to determine the scheduler's settings.
You could set the key org.quartz.jobStore.class to the RAMJobStore class and then to a JDBC job store class, but since a Properties object holds only one value per key, only the last value set takes effect.
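A minimal sketch of that behaviour, assuming a scheduler built from an explicit Properties object (the class names are the standard Quartz job stores; the rest of the setup is illustrative, and a real JDBC store would also need data-source properties):
import java.util.Properties
import org.quartz.impl.StdSchedulerFactory

val props = new Properties()
props.setProperty("org.quartz.scheduler.instanceName", "MyScheduler")
props.setProperty("org.quartz.threadPool.threadCount", "3")

// First assignment: in-memory store.
props.setProperty("org.quartz.jobStore.class", "org.quartz.simpl.RAMJobStore")
// Second assignment overwrites the first -- only the JDBC store remains visible to the factory.
props.setProperty("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX")

// The factory therefore sees exactly one job store class.
val scheduler = new StdSchedulerFactory(props).getScheduler()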

Related

How should I slice and orchestrate a configurable batch network using Spring Batch and Spring Cloud Data Flow?

We would like to migrate the scheduling and sequence control of some Kettle import jobs from a proprietary implementation to a good-practice, Spring Batch-flavoured implementation.
I intend to use Spring Cloud Data Flow (SCDF) server to implement and run a configurable sequence of the existing external import jobs.
The SCDF console's Task editor UI seems promising for assembling a flow. So one Task wraps one Spring Batch job, which in a single step executes only a Tasklet that starts and polls the Carte REST API. Does this make sense so far?
Would you suggest a better implementation?
Constraints and Requirements:
The external Kettle jobs are triggered and polled using Carte REST API. Actually, it's one single Kettle job implementation, called with individual parameters for each entity to be imported.
There is a configurable, directed graph of import jobs for several entities, some of them being dependent on a correct import of the previous entity type. (e.g. Department, then Employee, then Role assignments...)
With the upcoming implementation, we would like to get
monitoring and controlling (start, abort, pause, resume)
restartability
easy reconfigurability of the sequence in production (possibly by GUI, or external editor)
possibly some reporting and statistics.
As I currently understand it, this could be achieved by using the Spring Cloud Data Flow (SCDF) server and some Task / Batch implementation or combination.
Correct me if I'm wrong, but a single Spring Batch job with its hardwired flow does not seem very suitable to me. Or is there an easy way to edit and redeploy a Spring Batch job with a changed flow in production? I couldn't find anything, not even an easy-to-use editor for the XML representation of a batch job.
Yes, I believe you can achieve your design goals using Spring Cloud Data Flow along with Spring Cloud Task and Spring Batch.
The flow of multiple Spring Batch jobs can be managed as a Composed Task in Spring Cloud Data Flow, as you pointed out from the other SO thread.
The external Kettle jobs are triggered and polled using Carte REST API. Actually, it's one single Kettle job implementation, called with individual parameters for each entity to be imported.
There is a configurable, directed graph of import jobs for several entities, some of them being dependent on a correct import of the previous entity type. (e.g. Department, then Employee, then Role assignments...)
Again, both the above can be managed as a Composed Task (with the composed task consisting of a regular task as well as Spring Batch based applications).
You can manage the parameters passed to each task/batch upon invocation via batch job parameters, task/batch application properties, or simply command-line arguments.
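As a rough sketch of the single-step design proposed in the question, each batch application could run one Tasklet that triggers a Kettle job via the Carte REST API and polls it until it finishes. The endpoint paths, parameter names, and status parsing below are assumptions, not the exact Carte contract:
import org.springframework.batch.core.StepContribution
import org.springframework.batch.core.scope.context.ChunkContext
import org.springframework.batch.core.step.tasklet.Tasklet
import org.springframework.batch.repeat.RepeatStatus
import org.springframework.web.client.RestTemplate

// Hypothetical Tasklet: starts one Kettle job on Carte and polls its status.
class CarteJobTasklet(carteBaseUrl: String, jobName: String, entity: String) extends Tasklet {

  private val rest = new RestTemplate()

  override def execute(contribution: StepContribution, chunkContext: ChunkContext): RepeatStatus = {
    // Trigger the job; the query parameters are placeholders.
    rest.getForObject(s"$carteBaseUrl/kettle/executeJob/?job=$jobName&entity=$entity", classOf[String])

    // Poll until Carte reports the job as finished.
    var finished = false
    while (!finished) {
      Thread.sleep(5000)
      val status = rest.getForObject(s"$carteBaseUrl/kettle/jobStatus/?name=$jobName&xml=Y", classOf[String])
      finished = status != null && status.contains("Finished")
    }
    RepeatStatus.FINISHED
  }
}
One such Tasklet per entity, each wrapped in its own single-step Spring Batch job, could then be composed in SCDF according to the dependency graph.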
With the upcoming implementation, we would like to get
monitoring and controlling (start, abort, pause, resume)
restartability
easy reconfigurability of the sequence in production (possibly by GUI, or external editor)
possibly some reporting and statistics.
Spring Cloud Data Flow helps you achieve these goals. You can visit the Task Developer Guide and the Task Monitoring Guide for more info.
You can also check the Batch developer guide on the site as well.

How to get all the Spark config along with default config?

I'm working on a project for which I need to collect all the Spark config.
The problem is that if a parameter is not explicitly set, I would need the default value. Is there a way to get all the config, including all defaults?
I tried:
sc.getConf.getAll, but this way I do not get the defaults.
a SparkListener, in order to get some specific information (for example the number of executors), but I couldn't find a way to get other needed information such as the number of cores per executor, memory per executor, etc.
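For reference, a sketch of the two attempts described above (the session setup and names are illustrative): getAll only returns explicitly set entries, and a listener only surfaces a few details such as executor cores.
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("conf-dump").getOrCreate()
val sc = spark.sparkContext

// 1) Only explicitly set configuration is returned; defaults are missing.
sc.getConf.getAll.sorted.foreach { case (k, v) => println(s"$k=$v") }

// 2) A listener that observes executors as they register.
sc.addSparkListener(new SparkListener {
  override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
    println(s"executor ${e.executorId} with ${e.executorInfo.totalCores} cores")
})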

Spark local variable broadcast to executor

var countryMap = Map("Amy" -> "Canada", "Sam" -> "US", "Bob" -> "Canada")
val names = List("Amy", "Sam", "Eric")
sc.parallelize(names).flatMap(countryMap.get).collect.foreach(println)
// output:
Canada
US
I'm running this Spark job in YARN mode, and I'm sure that the driver and executors are not on the same node/JVM (see the attached pic). Since countryMap is not a broadcast variable, the executor should not be able to see it, and this code should not print anything. However, it printed Canada and US.
My question is: does Spark ship local variables to the executors automatically if they are serializable? If not, how does the executor see the driver's local variables?
Hey Edwards,
when you invoke collect, the result set is brought back to the driver, and that is where the mapping is performed; that is why you see the mappings being generated.
Cheers,
Broadcast variables: the driver and each executor hold a single copy; no per-task serialization is required; the copy is shared within the executor/driver.
Ordinary (driver-local) variables: the driver and each task get their own isolated copy; they need to be serialized and shipped with the task.
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
reference https://spark.apache.org/docs/2.3.0/rdd-programming-guide.html#broadcast-variables
Basically, local variables in the driver will be broadcast to executors automatically. However, you need to create broadcast variables when you need them across different stages.
To have the countryMap.get function running on a cluster, Spark needs to serialize countryMap and send it to every executor, so the function arrives with its data already attached to it in the form of an object instance. If the map's class were not serializable, you would not be able to run this code at all.
So, Spark doesn't populate local variables to executors; rather, by referencing countryMap inside the closure you explicitly tell Spark to serialize that object and run one of its methods in a distributed fashion.
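A small sketch contrasting closure capture with an explicit broadcast variable, assuming an existing SparkContext sc:
val countryMap = Map("Amy" -> "Canada", "Sam" -> "US", "Bob" -> "Canada")
val names = List("Amy", "Sam", "Eric")

// 1) Closure capture: countryMap is serialized into every task that references it.
sc.parallelize(names).flatMap(countryMap.get).collect().foreach(println)

// 2) Explicit broadcast: one copy is shipped to each executor and reused
//    across tasks and stages.
val bcMap = sc.broadcast(countryMap)
sc.parallelize(names).flatMap(name => bcMap.value.get(name)).collect().foreach(println)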

Hive configuration for Spark integration tests

I am looking for a way to configure Hive for Spark SQL integration testing such that tables are written either in a temporary directory or somewhere under the test root. My investigation suggests that this requires setting both fs.defaultFS and hive.metastore.warehouse.dir before HiveContext is created.
Just setting the latter, as mentioned in this answer, does not work on Spark 1.6.1.
val sqlc = new HiveContext(sparkContext)
sqlc.setConf("hive.metastore.warehouse.dir", hiveWarehouseDir)
The table metadata goes in the right place but the written files go to /user/hive/warehouse.
If a dataframe is saved without an explicit path, e.g.,
df.write.saveAsTable("tbl")
the location to write files to is determined via a call to HiveMetastoreCatalog.hiveDefaultTableFilePath, which uses the location of the default database, which seems to be cached during the HiveContext construction, thus setting fs.defaultFS after HiveContext construction has no effect.
As an aside, but very relevant for integration testing, this also means that DROP TABLE tbl only removes the table metadata but leaves the table files, which wreaks havoc with expectations. This is a known problem--see here & here--and the solution may be to ensure that hive.metastore.warehouse.dir == fs.defaultFS + user/hive/warehouse.
In short, how can configuration properties such as fs.defaultFS and hive.metastore.warehouse.dir be set programmatically before the HiveContext constructor runs?
In Spark 2.0 you can set "spark.sql.warehouse.dir" on the SparkSession's builder, before creating a SparkSession. It should propagate correctly.
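A minimal sketch of that Spark 2.x approach, assuming the Hive classes are on the test classpath; the temporary-directory handling is just one way to isolate test runs:
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Point the warehouse at a throwaway directory before the session is created.
val warehouseDir = Files.createTempDirectory("spark-warehouse").toString

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("hive-integration-test")
  .config("spark.sql.warehouse.dir", warehouseDir)
  .enableHiveSupport()
  .getOrCreate()

// Tables saved without an explicit path now land under warehouseDir.
spark.range(3).write.saveAsTable("tbl")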
For Spark 1.6, I think your best bet might be to programmatically create a hive-site.xml.
The spark-testing-base library has a TestHiveContext configured as part of the setup for DataFrameSuiteBaseLike. Even if you're unable to use spark-testing-base directly for some reason, you can see how they make the configuration work.

Talend job batch processing

I am exploring Talend at work, and I was asked whether Talend supports batch processing, as in running a job in multiple threads. After going through the user guide I understood that threading is possible with subjobs. I would like to know whether it is possible to run a job with a single action in parallel.
Talend has excellent multi-threading support. There are two basic methods for this. One method gives you more control and is implemented using components; the other is implemented as a job setting.
For the first method, see my screenshot. I use tParallelize to load three files into three tables at the same time. Then, when all three files are successfully loaded, I use the same tParallelize to set the values of a control table. tParallelize can be connected to a tRunJob just as easily as to a subjob.
The other method is described very well in Talend Help: Talend Help - Run Jobs in Parallel
Generally I recommend the first method because of the control it gives you, but if your job follows the simple pattern described in the help link, that method works as well.