Is there a way to run the Quartz scheduler without conflicts in a clustered environment, without the JDBC job store? I do not want to maintain the state in user-defined tables either.
Take a look at Clustering Quartz Scheduler with Terracotta.
You can find an example in the Quartz distribution (look for Example 15 - TC Clustered Quartz).
Related
I have production and sandbox servers, each running in their own Tomcat cluster. The production applications include a Quartz app that handles scheduling.
My question is: when setting up Quartz for the sandbox, it seems I have two options:
Configure the Quartz running on the sandbox to use the same Quartz MySQL db as the one used by production, but assign a different org.quartz.scheduler.instanceName so that the two schedulers are kept separate.
Create a completely separate MySQL db schema for the sandbox Quartz (and possibly also use a different instanceName as well).
Note: I do want the Quartz applications to be running separately, even if they are pointing to the same db (with the same or different schemas).
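For option 1, the only change the sandbox would strictly need is a different scheduler name. A minimal sketch of the relevant quartz.properties keys (the names and datasource key are illustrative, and this assumes Quartz 2.x, whose job store tables key every row by scheduler name):
# Option 1: same MySQL db and tables as production, different scheduler name
org.quartz.scheduler.instanceName=SandboxScheduler
org.quartz.scheduler.instanceId=AUTO
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
# points at the same MySQL db as production
org.quartz.jobStore.dataSource=quartzDS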
Here is a high-level picture of what I am trying to achieve: I want to train a LightGBM model with Spark as a compute backend, all in SageMaker using their Training Job API.
To clarify:
I have to use LightGBM; there is no other option here.
The reason I need a Spark compute backend is that training with the current dataset no longer fits in memory.
I want to use the SageMaker Training Job setting so I could use an SM Hyperparameter Optimisation Job to find the best hyperparameters for LightGBM. While the LightGBM Spark interface itself does offer some hyperparameter tuning capabilities, it does not offer Bayesian HP tuning.
Now, I know the general approach to running custom training in SM: build a container in a certain way, then pull it from ECR and kick off a training job/hyperparameter tuning job through the sagemaker.Estimator API. In this case, SM handles resource provisioning for you, creates an instance, and so on. What I am confused about is that, essentially, to use a Spark compute backend I would need to have an EMR cluster running, so the SDK would have to handle that as well. However, I do not see how this is possible with the API above.
Now, there is also that thing called the SageMaker PySpark SDK. However, the SageMakerEstimator API provided by that package does not support on-the-fly cluster configuration either.
Does anyone know a way to run a SageMaker training job that would use an EMR cluster, so that later the same job could be used for hyperparameter tuning activities?
One way I see is to run an EMR cluster in the background, and then just create a regular SM Estimator job that connects to the EMR cluster and does the training, essentially running a Spark driver program in the SM Estimator job.
Has anyone done anything similar in the past?
Thanks
Thanks for your questions. Here are answers:
The SageMaker PySpark SDK (https://sagemaker-pyspark.readthedocs.io/en/latest/) does the opposite of what you want: it lets you call a (Spark or non-Spark) SageMaker job from a Spark environment. Not sure that's what you need here.
Running Spark in SageMaker jobs. While you can use SageMaker Notebooks to connect to a remote EMR cluster for interactive coding, you do not need EMR to run Spark in SageMaker jobs (Training and Processing). You have 2 options:
SageMaker Processing has a built-in Spark container, which is easy to use but unfortunately not connected to SageMaker Model Tuning (that works with Training only). If you use this, you will have to find and use a third-party, external parameter search library; for example Syne Tune from AWS itself (which supports Bayesian optimization).
SageMaker Training can run custom docker-based jobs, on one or multiple machines. If you can fit your Spark code within the SageMaker Training spec, then you will be able to use SageMaker Model Tuning to tune your Spark code. However, there is no framework container for Spark on SageMaker Training, so you would have to build your own, and I am not aware of any examples. Maybe you could get inspiration from the Processing container code here to build a custom Training container.
Your idea of using the Training job as a client to launch an EMR cluster is good and should work (if SM has the right permissions), and it will indeed allow you to use SM Model Tuning. I'd recommend (see the sketch after this list):
having each SM job create a new transient cluster (auto-terminating after the step) to keep costs low and to avoid tuning results being polluted by the inter-job contention that could arise if everything ran on the same cluster.
using the cheapest possible instance type for the SM estimator, because it will need to stay up for the whole duration of your EMR experiment to collect and print your final metric (accuracy, duration, cost...)
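To make the pattern concrete, here is a minimal sketch of such a Training-job entrypoint using the AWS SDK for Java v2. The bucket, script path, cluster sizing, and metric name are hypothetical placeholders, and the IAM role names are just the EMR defaults; treat it as an outline under those assumptions, not a drop-in implementation:

import software.amazon.awssdk.services.emr.EmrClient;
import software.amazon.awssdk.services.emr.model.*;

public class EmrTrial {
    public static void main(String[] args) throws InterruptedException {
        EmrClient emr = EmrClient.create();

        // Spark step that runs the (hypothetical) LightGBM training script
        StepConfig step = StepConfig.builder()
                .name("lightgbm-training")
                .actionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
                .hadoopJarStep(HadoopJarStepConfig.builder()
                        .jar("command-runner.jar")
                        .args("spark-submit", "s3://my-bucket/train_lightgbm.py")
                        .build())
                .build();

        // Transient cluster: terminates on its own once the step completes
        String clusterId = emr.runJobFlow(RunJobFlowRequest.builder()
                .name("sm-tuning-trial")
                .releaseLabel("emr-6.10.0")
                .applications(Application.builder().name("Spark").build())
                .serviceRole("EMR_DefaultRole")
                .jobFlowRole("EMR_EC2_DefaultRole")
                .instances(JobFlowInstancesConfig.builder()
                        .instanceCount(3)
                        .masterInstanceType("m5.xlarge")
                        .slaveInstanceType("m5.xlarge")
                        .keepJobFlowAliveWhenNoSteps(false)
                        .build())
                .steps(step)
                .build()).jobFlowId();

        // The cheap SM instance just polls until the cluster is done...
        ClusterState state;
        do {
            Thread.sleep(60_000);
            state = emr.describeCluster(DescribeClusterRequest.builder()
                    .clusterId(clusterId).build()).cluster().status().state();
        } while (state != ClusterState.TERMINATED
                && state != ClusterState.TERMINATED_WITH_ERRORS);

        // ...then prints the final metric (assumed to have been written to S3
        // by the Spark job) so SageMaker Model Tuning's metric regex can read it.
        System.out.println("validation-auc: " + readMetricFromS3());
    }

    // Hypothetical helper: fetch the value the Spark job persisted to S3.
    private static double readMetricFromS3() { return 0.0; }
}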
In the same spirit, I once used SageMaker Training myself to launch Batch Transform jobs, for the sole purpose of leveraging the Bayesian search API to find an inference configuration that minimizes cost.
Is there any Redis jobStore able to support a Quartz cluster?
Has anybody been able to build that?
On the other hand, what exactly is a Quartz cluster? I mean, can two services run with the same quartz.properties file pointing to the same Redis?
EDIT
I've tried with this Redis job store, but it seems it doesn't support Quartz clustering:
JobStore class 'net.joelinn.quartz.jobstore.RedisJobStore' props could not be configured. [See nested exception: java.lang.NoSuchMethodException: No setter for property 'isClustered']
quartz.properties:
org.quartz.scheduler.instanceName=office-scheduler-service
org.quartz.scheduler.instanceId=AUTO
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000
# thread-pool
org.quartz.threadPool.class=org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount=2
org.quartz.threadPool.threadsInheritContextClassLoaderOfInitializingThread=true
org.quartz.jobStore.class = net.joelinn.quartz.jobstore.RedisJobStore
org.quartz.jobStore.host = redisbo
org.quartz.jobStore.misfireThreshold = 60000
You don't need to configure clustering; please check the source code, it is already clustered. Just remove the org.quartz.jobStore.isClustered line (the property the exception says RedisJobStore has no setter for) from your quartz.properties.
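For reference, a trimmed version of the properties file from the question that should start cleanly. The only change is dropping org.quartz.jobStore.isClustered (the property the exception names) and, to be safe, clusterCheckinInterval, which is also a JDBC-jobstore setting:
org.quartz.scheduler.instanceName=office-scheduler-service
org.quartz.scheduler.instanceId=AUTO
# no isClustered key: RedisJobStore works in cluster mode by default
org.quartz.threadPool.class=org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount=2
org.quartz.threadPool.threadsInheritContextClassLoaderOfInitializingThread=true
org.quartz.jobStore.class=net.joelinn.quartz.jobstore.RedisJobStore
org.quartz.jobStore.host=redisbo
org.quartz.jobStore.misfireThreshold=60000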
The Quartz JDBC documentation explains how Quartz handles executing jobs in a cluster of application nodes. RedisJobStore extended that to utilize Redis storage, and it works in cluster mode (a Quartz cluster, not a Redis cluster) by default, without requiring you to enable it.
Basically, Quartz uses a shared database to record which scheduler instance is currently working on a job, as opposed to direct node communication among application schedulers. When a scheduler instance picks up a job, it safely registers its instance id with the running job and persists it in the database. This support by the job store is evident in the schema used by RedisJobStore, indicated by the blocked_by fields.
I am working to migrate from Quartz 1.6 to 2.1 and use a JDBCJobStore. Previously, the jobs were loaded via an XML file when the webapp started. The scheduler is now running using the JDBCJobStore, but I don't understand how to add the jobs that need to run on an ongoing basis (not one-off jobs) to the database.
My first thought is to create a servlet which runs on startup and adds the jobs to the database. But my concern is that this will execute every time I restart the app, and the jobs will get duplicated.
Thanks,
steve
The jobs won't disappear from the database when you do a restart. So within your servlet, when it starts up, check whether the jobs already exist before adding any. When you create your jobs you can give them identities; using those identities and some Quartz methods you can check whether they already exist (see the sketch below).
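A minimal sketch of that check with the Quartz 2.x API; the job class, identities, and cron expression are placeholders, and 'scheduler' is the instance you already obtain from StdSchedulerFactory at startup:

import static org.quartz.CronScheduleBuilder.cronSchedule;
import static org.quartz.JobBuilder.newJob;
import static org.quartz.TriggerBuilder.newTrigger;
import org.quartz.JobDetail;
import org.quartz.JobKey;
import org.quartz.Trigger;

// Identity under which the recurring job is (or will be) stored
JobKey jobKey = JobKey.jobKey("nightlyReport", "reports");

// Register the job only if the JDBCJobStore doesn't already hold it,
// so restarting the webapp never creates duplicates
if (!scheduler.checkExists(jobKey)) {
    JobDetail job = newJob(ReportJob.class)  // ReportJob is a placeholder Job implementation
            .withIdentity(jobKey)
            .build();
    Trigger trigger = newTrigger()
            .withIdentity("nightlyReportTrigger", "reports")
            .withSchedule(cronSchedule("0 0 2 * * ?"))  // every night at 02:00
            .build();
    scheduler.scheduleJob(job, trigger);
}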
It sounds like the memory-based scheduler is a better fit for these fixed jobs. You can create more than one scheduler, one memory and one JDBC, if that makes sense for your application.
I am using
// Build and start a scheduler from the default configuration
SchedulerFactory schedulerFactory = new StdSchedulerFactory();
scheduler = schedulerFactory.getScheduler();
scheduler.start();

// Schedule a job to run as soon as possible
Trigger asapTrigger = getAsapTrigger();
JobDetail asapJob = getAsapJobDetails();
scheduler.scheduleJob(asapJob, asapTrigger);
This works, but when I move to a clustered environment, two threads run the same job.
I am using annotations, not a properties file. I want the job to run only once. Can someone help me configure this?
my code almost look like : http://k2java.blogspot.com/2011/04/quartz.html
You have to configure Quartz to run in a clustered environment. Clustering currently only works with the JDBC job store, and it works by having each node of the cluster share the same database.
Set the org.quartz.jobStore.isClustered property to true if you have multiple instances of Quartz that use the same set of database tables. This property is used to turn on the clustering features.
Set the org.quartz.jobStore.clusterCheckinInterval property (milliseconds) which is the frequency at which this instance checks in with the other instances of the cluster.
Set the org.quartz.scheduler.instanceId to AUTO so that each node in the cluster will have a unique instanceId.
Please note that each instance in the cluster should use the same copy of the quartz.properties file. Furthermore, if you run the clustered nodes on separate machines, ensure that their clocks are synchronized.
For more information, check the official documentation, which contains a sample properties file for a clustered scheduler; a minimal sketch of the cluster-related keys follows.
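A minimal sketch of the cluster-related quartz.properties keys (datasource definition omitted; the scheduler name, datasource name, and checkin interval are illustrative):
org.quartz.scheduler.instanceName=MyClusteredScheduler
org.quartz.scheduler.instanceId=AUTO
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.dataSource=myDS
# turn on clustering; all nodes share these database tables
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000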