I run a simple controller with Spring to test Quartz capabilities.
@PostMapping(path = ["/api/v1/start/{jobKey}/{jobGroup}"])
fun start(@PathVariable jobKey: String, @PathVariable jobGroup: String): ResponseEntity<String> {
    val simpleJob = JobBuilder
        .newJob(SampleJob::class.java)
        .requestRecovery(true)
        .withIdentity(JobKey.jobKey(jobKey, jobGroup))
        .build()
    val sampleTrigger = TriggerBuilder
        .newTrigger()
        .withIdentity(jobKey, jobGroup)
        .withSchedule(
            SimpleScheduleBuilder
                .repeatSecondlyForever(5)
                .withMisfireHandlingInstructionIgnoreMisfires())
        .build()
    val scheduler = factory.scheduler
    if (scheduler.jobGroupNames.contains(jobGroup)) {
        return ResponseEntity.ok("Scheduler exists.")
    }
    scheduler.scheduleJob(simpleJob, sampleTrigger)
    scheduler.start()
    return ResponseEntity.ok("Scheduler started.")
}
@PostMapping(path = ["/api/v1/stop/{jobKey}/{jobGroup}"])
fun stop(@PathVariable jobKey: String, @PathVariable jobGroup: String): String {
    val scheduler = factory.scheduler
    scheduler.interrupt(JobKey.jobKey(jobKey, jobGroup))
    val jobGroupNames = scheduler.jobGroupNames
    logger.info("Existing jobGroup names: {}", jobGroupNames)
    return scheduler.deleteJob(JobKey.jobKey(jobKey, jobGroup)).toString()
}
Then I start two applications with the same code on different ports and start playing with them. Let's call them APP1 and APP2.
I use PostgreSQL as the JobStore.
So I run several scenarios:
1) Create the job with group1 and key1 in APP1.
2) Try to create a job with group1 and key1 in APP2. It gives an error that the job has already started, which is the behavior I expected.
3) Stop APP1. I expect the job to be executed in APP2, since it still exists in the JobStore, but it isn't. Do I need to provide some additional configuration?
4) Start APP1 again; still nothing happens. Furthermore, the record for group1 and key1 is still present in the DB and can't be started.
Do I need to modify the shutdown behavior to remove the job on application shutdown and start it in another application, or do I just need to configure the trigger in some other, correct way?
My bad, that was a silly problem. I forgot to start the scheduler in my application:
@Bean
open fun schedulerFactory(): SchedulerFactory {
    val factory = StdSchedulerFactory()
    factory.initialize(ClassPathResource("quartz.properties").inputStream)
    factory.scheduler.start() // this line was missing
    return factory
}
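As a side note on scenarios 3 and 4: for a second instance to pick up jobs from the shared PostgreSQL JobStore after the first one stops, the JDBC JobStore typically also has to run in clustered mode. A rough quartz.properties sketch for that setup; the instance name, datasource name and credentials are illustrative assumptions, not values from the original project:
# clustered JDBC JobStore sketch (illustrative values)
org.quartz.scheduler.instanceName = ClusteredScheduler
org.quartz.scheduler.instanceId = AUTO
org.quartz.threadPool.threadCount = 5
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
org.quartz.jobStore.dataSource = quartzDS
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
org.quartz.dataSource.quartzDS.driver = org.postgresql.Driver
org.quartz.dataSource.quartzDS.URL = jdbc:postgresql://localhost:5432/quartz
org.quartz.dataSource.quartzDS.user = quartz
org.quartz.dataSource.quartzDS.password = quartz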
I'm following the docs: Spring Batch Integration combined with Spring Integration AWS for polling an AWS S3 bucket.
But the batch execution per file is not working in some situations.
The AWS S3 polling is working correctly: when I put a new file, or when I start the application and there are already files in the bucket, the application syncs them to the local directory:
@Bean
public S3SessionFactory s3SessionFactory(AmazonS3 pAmazonS3) {
    return new S3SessionFactory(pAmazonS3);
}

@Bean
public S3InboundFileSynchronizer s3InboundFileSynchronizer(S3SessionFactory pS3SessionFactory) {
    S3InboundFileSynchronizer synchronizer = new S3InboundFileSynchronizer(pS3SessionFactory);
    synchronizer.setPreserveTimestamp(true);
    synchronizer.setDeleteRemoteFiles(false);
    synchronizer.setRemoteDirectory("remote-bucket");
    //synchronizer.setFilter(new S3PersistentAcceptOnceFileListFilter(new SimpleMetadataStore(), "simpleMetadataStore"));
    return synchronizer;
}

@Bean
@InboundChannelAdapter(value = IN_CHANNEL_NAME, poller = @Poller(fixedDelay = "30"))
public S3InboundFileSynchronizingMessageSource s3InboundFileSynchronizingMessageSource(
        S3InboundFileSynchronizer pS3InboundFileSynchronizer) {
    S3InboundFileSynchronizingMessageSource messageSource = new S3InboundFileSynchronizingMessageSource(pS3InboundFileSynchronizer);
    messageSource.setAutoCreateLocalDirectory(true);
    messageSource.setLocalDirectory(new FileSystemResource("files").getFile());
    //messageSource.setLocalFilter(new FileSystemPersistentAcceptOnceFileListFilter(new SimpleMetadataStore(), "fsSimpleMetadataStore"));
    return messageSource;
}

@Bean("s3filesChannel")
public PollableChannel s3FilesChannel() {
    return new QueueChannel();
}
I followed the tutorial and created the FileMessageToJobRequest. I won't put the code here because it's the same as in the docs.
So I created the IntegrationFlow and FileMessageToJobRequest beans:
@Bean
public IntegrationFlow integrationFlow(
        S3InboundFileSynchronizingMessageSource pS3InboundFileSynchronizingMessageSource) {
    return IntegrationFlows.from(pS3InboundFileSynchronizingMessageSource,
            c -> c.poller(Pollers.fixedRate(1000).maxMessagesPerPoll(1)))
            .transform(fileMessageToJobRequest())
            .handle(jobLaunchingGateway())
            .log(LoggingHandler.Level.WARN, "headers.id + ': ' + payload")
            .get();
}

@Bean
public FileMessageToJobRequest fileMessageToJobRequest() {
    FileMessageToJobRequest fileMessageToJobRequest = new FileMessageToJobRequest();
    fileMessageToJobRequest.setFileParameterName("input.file.name");
    fileMessageToJobRequest.setJob(delimitedFileJob);
    return fileMessageToJobRequest;
}
So I think the problem is in the JobLaunchingGateway.
If I create it like this:
@Bean
public JobLaunchingGateway jobLaunchingGateway() {
    SimpleJobLauncher simpleJobLauncher = new SimpleJobLauncher();
    simpleJobLauncher.setJobRepository(jobRepository);
    simpleJobLauncher.setTaskExecutor(new SyncTaskExecutor());
    JobLaunchingGateway jobLaunchingGateway = new JobLaunchingGateway(simpleJobLauncher);
    return jobLaunchingGateway;
}
Case 1 (Bucket is empty when the application starts):
I upload a new file to AWS S3;
The polling works and the file appears in the local directory;
But the transform/job isn't fired.
Case 2 (Bucket already has one file when application starts):
The job is launched:
2021-01-12 13:32:34.451 INFO 1955 --- [ask-scheduler-1] o.s.b.c.l.support.SimpleJobLauncher : Job: [SimpleJob: [name=arquivoDelimitadoJob]] launched with the following parameters: [{input.file.name=files/FILE1.csv}]
2021-01-12 13:32:34.524 INFO 1955 --- [ask-scheduler-1] o.s.batch.core.job.SimpleStepHandler : Executing step: [delimitedFileJob]
If I add a second file to S3, the job isn't launched, just as in case 1.
Case 3 (Bucket has more than one file):
The files are synchronized correctly to the local directory;
But the job is only executed once, for the last file.
So, following the docs, I changed my gateway to:
@Bean
@ServiceActivator(inputChannel = IN_CHANNEL_NAME, poller = @Poller(fixedRate = "1000"))
public JobLaunchingGateway jobLaunchingGateway() {
    SimpleJobLauncher simpleJobLauncher = new SimpleJobLauncher();
    simpleJobLauncher.setJobRepository(jobRepository);
    simpleJobLauncher.setTaskExecutor(new SyncTaskExecutor());
    //JobLaunchingGateway jobLaunchingGateway = new JobLaunchingGateway(jobLauncher());
    JobLaunchingGateway jobLaunchingGateway = new JobLaunchingGateway(simpleJobLauncher);
    //jobLaunchingGateway.setOutputChannel(replyChannel());
    jobLaunchingGateway.setOutputChannel(s3FilesChannel());
    return jobLaunchingGateway;
}
With this new gateway implementation, if I put a new file in S3 the application reacts but doesn't transform it, giving this error:
Caused by: java.lang.IllegalArgumentException: The payload must be of type JobLaunchRequest. Object of class [java.io.File] must be an instance of class org.springframework.batch.integration.launch.JobLaunchRequest
And if there are two files in the bucket when the app starts, FILE1.csv and FILE2.csv, the job runs correctly for FILE1.csv but gives the error above for FILE2.csv.
What's the correct way to implement something like this?
Just to be clear: I want to receive thousands of CSV files in this bucket and read and process them with Spring Batch, but I also need to pick up every new file from S3 as soon as possible.
Thanks in advance.
The JobLaunchingGateway indeed expects only a JobLaunchRequest as the payload.
Since you have that @InboundChannelAdapter(value = IN_CHANNEL_NAME, poller = @Poller(fixedDelay = "30")) on the S3InboundFileSynchronizingMessageSource bean definition, it is really wrong to then have @ServiceActivator(inputChannel = IN_CHANNEL_NAME) for that JobLaunchingGateway without the FileMessageToJobRequest transformer in between.
Your integrationFlow looks OK to me, but then you really need to remove that @InboundChannelAdapter from the S3InboundFileSynchronizingMessageSource bean and rely fully on the c.poller() configuration.
Another way is to keep that @InboundChannelAdapter, but then start the IntegrationFlow from IN_CHANNEL_NAME rather than from the MessageSource.
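A rough sketch of that second option (assuming IN_CHANNEL_NAME is the channel the adapter writes to, that the @ServiceActivator is removed from the JobLaunchingGateway bean, and noting that a pollable QueueChannel would also need a poller on the consuming endpoint):
@Bean
public IntegrationFlow integrationFlow(FileMessageToJobRequest fileMessageToJobRequest,
                                       JobLaunchingGateway jobLaunchingGateway) {
    // Start the flow from the channel already fed by the @InboundChannelAdapter,
    // so only that adapter's poller pulls files from the S3 message source.
    return IntegrationFlows.from(IN_CHANNEL_NAME)
            .transform(fileMessageToJobRequest)
            .handle(jobLaunchingGateway)
            .log(LoggingHandler.Level.WARN, "headers.id + ': ' + payload")
            .get();
}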
Since you have several pollers against the same S3 source, and both of them are based on the same local directory, it is no surprise to see so many unexpected situations.
I'm trying to connect two streams; the first is persisted in a MapState.
RocksDB saves data in the checkpoint folder, but after a new run the state is empty. I run it both locally and on a Flink cluster, cancelling the submitted job on the cluster and then simply rerunning it locally:
env.setStateBackend(new RocksDBStateBackend(..))
env.enableCheckpointing(1000)
...
val productDescriptionStream: KeyedStream[ProductDescription, String] = env.addSource(..)
.keyBy(_.id)
val productStockStream: KeyedStream[ProductStock, String] = env.addSource(..)
.keyBy(_.id)
and
productDescriptionStream
.connect(productStockStream)
.process(ProductProcessor())
.setParallelism(1)
env.execute("Product aggregator")
ProductProcessor
case class ProductProcessor() extends CoProcessFunction[ProductDescription, ProductStock, Product] {

  private[this] lazy val stateDescriptor: MapStateDescriptor[String, ProductDescription] =
    new MapStateDescriptor[String, ProductDescription](
      "productDescription",
      createTypeInformation[String],
      createTypeInformation[ProductDescription]
    )

  private[this] lazy val states: MapState[String, ProductDescription] =
    getRuntimeContext.getMapState(stateDescriptor)

  override def processElement1(value: ProductDescription,
                               ctx: CoProcessFunction[ProductDescription, ProductStock, Product]#Context,
                               out: Collector[Product]): Unit = {
    states.put(value.id, value)
  }

  override def processElement2(value: ProductStock,
                               ctx: CoProcessFunction[ProductDescription, ProductStock, Product]#Context,
                               out: Collector[Product]): Unit = {
    if (states.contains(value.id)) {
      val product = Product(
        id = value.id,
        description = Some(states.get(value.id).description),
        stock = Some(value.stock),
        updatedAt = value.updatedAt)
      out.collect(product)
    }
  }
}
Checkpoints are created by Flink for recovering from failures, not for resuming after a manual shutdown. When a job is canceled, the default behavior is for Flink to delete the checkpoints. Since the job can no longer fail, it won't need to recover.
You have several options:
(1) Configure your checkpointing to retain checkpoints when a job is cancelled:
CheckpointConfig config = env.getCheckpointConfig();
config.enableExternalizedCheckpoints(
CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
Then when you restart the job you'll need to indicate that you want it restarted from a specific checkpoint:
flink run -s <checkpoint-path> ...
Otherwise, whenever you start a job it will begin with an empty state backend.
(2) Instead of canceling the job, use stop with savepoint:
flink stop [-p targetDirectory] [-d] <jobID>
after which you'll again need to use flink run -s ... to resume from the savepoint.
Stop with a savepoint is a cleaner approach than relying on there being a recent checkpoint to fall back to.
(3) Or you could use Ververica Platform Community Edition, which raises the level of abstraction to the point where you don't have to manage these details yourself.
I am accessing a state store to query it and have had to wrap the store() statement with a try/catch block to retry it because sometimes I am getting this exception:
org.apache.kafka.streams.errors.InvalidStateStoreException: Cannot get state store customers-store because the stream thread is PARTITIONS_REVOKED, not RUNNING
at org.apache.kafka.streams.state.internals.StreamThreadStateStoreProvider.stores(StreamThreadStateStoreProvider.java:49)
at org.apache.kafka.streams.state.internals.QueryableStoreProvider.getStore(QueryableStoreProvider.java:57)
at org.apache.kafka.streams.KafkaStreams.store(KafkaStreams.java:1053)
at com.codependent.kafkastreams.customer.service.CustomerService.getCustomer(CustomerService.kt:75)
at com.codependent.kafkastreams.customer.service.CustomerServiceKt.main(CustomerService.kt:108)
This is the code used to retrieve the store (the full code is on a github repo):
fun getCustomer(id: String): Customer? {
    var keyValueStore: ReadOnlyKeyValueStore<String, Customer>? = null
    while (keyValueStore == null) {
        try {
            keyValueStore = streams.store(CUSTOMERS_STORE, QueryableStoreTypes.keyValueStore<String, Customer>())
        } catch (ex: InvalidStateStoreException) {
            ex.printStackTrace()
        }
    }
    val customer = keyValueStore.get(id)
    return customer
}
And this is the main program:
fun main(args: Array<String>) {
val customerService = CustomerService("main", "localhost:9092")
customerService.initializeStreams()
customerService.createCustomer(Customer("53", "Joey"))
val customer = customerService.getCustomer("53")
println(customer)
customerService.stopStreams()
}
The exception happens randomly when running the program several times, after the previous executions have finished. Note: I don't do anything to the running Kafka cluster and use its default config.
At the time you are accessing the store, the Kafka Streams application is going through a rebalance, and state stores aren't accessible at that time. You want to make sure you only query the stores when the application state is RUNNING and not REBALANCING.
What you could do is check the state of the application before attempting to read from the store like this:
if (streams.state() == State.RUNNING) {
    keyValueStore = streams.store(...)
    val customer = keyValueStore.get(id)
    return customer
}
There is also a KafkaStreams.setStateListener method you can use to register a KafkaStreams.StateListener implementation. The StateListener.onChange method is called each time the application changes its state.
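For example, a minimal sketch of that listener approach in plain Java (the question's code is Kotlin, but the Streams API is the same; the latch is just one way to wait and is not part of the original code). The listener is registered before streams.start():
final CountDownLatch running = new CountDownLatch(1); // java.util.concurrent.CountDownLatch
streams.setStateListener((newState, oldState) -> {
    // release the latch once the application has finished rebalancing
    if (newState == KafkaStreams.State.RUNNING) {
        running.countDown();
    }
});
streams.start();
running.await(); // queries via streams.store(...) are now safe, at least until the next rebalance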
I don't know how to pass SparkSession parameters programmatically when submitting a Spark job to Apache Livy.
This is the Test Spark job:
class Test extends Job[Int]{
override def call(jc: JobContext): Int = {
val spark = jc.sparkSession()
// ...
}
}
This is how this Spark job is submitted to Livy:
val client = new LivyClientBuilder()
.setURI(new URI(livyUrl))
.build()
try {
client.uploadJar(new File(testJarPath)).get()
client.submit(new Test())
} finally {
client.stop(true)
}
How can I pass the following configuration parameters to SparkSession?
.config("es.nodes","1localhost")
.config("es.port",9200)
.config("es.nodes.wan.only","true")
.config("es.index.auto.create","true")
You can do that easily through the LivyClientBuilder like this:
val client = new LivyClientBuilder()
.setURI(new URI(livyUrl))
.setConf("es.nodes","1localhost")
.setConf("key", "value")
.build()
Configuration parameters can be set on the LivyClientBuilder using
public LivyClientBuilder setConf(String key, String value)
so that your code starts with:
val client = new LivyClientBuilder()
.setURI(new URI(livyUrl))
.setConf("es.nodes","1localhost")
.setConf("es.port",9200)
.setConf("es.nodes.wan.only","true")
.setConf("es.index.auto.create","true")
.build()
LivyClientBuilder.setConf will not work, I think, because Livy modifies all configs that do not start with spark., and Spark cannot read the modified config. See here:
private static File writeConfToFile(RSCConf conf) throws IOException {
Properties confView = new Properties();
for (Map.Entry<String, String> e : conf) {
String key = e.getKey();
if (!key.startsWith(RSCConf.SPARK_CONF_PREFIX)) {
key = RSCConf.LIVY_SPARK_PREFIX + key;
}
confView.setProperty(key, e.getValue());
}
...
}
So the answer is quite simple: add the spark. prefix to all the es configs, like this:
.config("spark.es.nodes","1localhost")
.config("spark.es.port",9200)
.config("spark.es.nodes.wan.only","true")
.config("spark.es.index.auto.create","true")
I don't know whether it is elastic-spark or Spark that does the compatibility job; it just works.
PS: I've tried this with the REST API and it works, but not with the programmatic API.
I am running a Spark application (version 1.6.0) on a Hadoop cluster with Yarn (version 2.6.0) in client mode. I have a piece of code that runs a long computation, and I want to kill it if it takes too long (and then run some other function instead).
Here is an example:
val conf = new SparkConf().setAppName("TIMEOUT_TEST")
val sc = new SparkContext(conf)
val lst = List(1, 2, 3)
// setting up an infinite action
val future = sc.parallelize(lst).map { x => while (true) {}; x }.collectAsync()
try {
  Await.result(future, Duration(30, TimeUnit.SECONDS))
  println("success!")
} catch {
  case _: Throwable =>
    future.cancel()
    println("timeout")
}
// sleep for 1 hour to allow inspecting the application in YARN
Thread.sleep(60 * 60 * 1000)
sc.stop()
The timeout is set to 30 seconds, but of course the computation is infinite, so awaiting the result of the future will throw an exception, which will be caught; the future will then be canceled and the backup function will execute.
This all works perfectly well, except that the canceled job doesn't terminate completely: when looking at the web UI for the application, the job is marked as failed, but I can see there are still running tasks inside.
The same thing happens when I use SparkContext.cancelAllJobs or SparkContext.cancelJobGroup. The problem is that even though I manage to get on with my program, the running tasks of the canceled job are still hogging valuable resources (which will eventually slow me down to a near stop).
To sum things up: How do I kill a Spark job in a way that will also terminate all running tasks of that job? (as opposed to what happens now, which is stopping the job from running new tasks, but letting the currently running tasks finish)
UPDATE:
After a long time ignoring this problem, we found a messy but efficient little workaround. Instead of trying to kill the appropriate Spark Job/Stage from within the Spark application, we simply logged the stage ID of all active stages when the timeout occurred, and issued an HTTP GET request to the URL presented by the Spark Web UI used for killing said stages.
I don't know if this answers your question.
My need was to kill jobs that hang for too long (my jobs extract data from Oracle tables, but for some unknown reason the connection occasionally hangs forever).
After some study, I came to this solution:
val MAX_JOB_SECONDS = 100
val statusTracker = sc.statusTracker

// requires an implicit ExecutionContext in scope for the Future, e.g.
// import scala.concurrent.ExecutionContext.Implicits.global
val sparkListener = new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val jobId = jobStart.jobId
    val f = Future {
      var c = MAX_JOB_SECONDS
      var mustCancel = false
      var running = true
      while (!mustCancel && running) {
        Thread.sleep(1000)
        c = c - 1
        mustCancel = c <= 0
        val jobInfo = statusTracker.getJobInfo(jobId)
        if (jobInfo.isDefined) {
          running = jobInfo.get.status() == JobExecutionStatus.RUNNING
        } else {
          running = false
        }
      }
      if (mustCancel) {
        sc.cancelJob(jobId)
      }
    }
  }
}
sc.addSparkListener(sparkListener)
try {
  val df = spark.sql("SELECT * FROM VERY_BIG_TABLE") // just an example of a long-running job
  println(df.count)
} catch {
  case exc: org.apache.spark.SparkException =>
    if (exc.getMessage.contains("cancelled"))
      throw new Exception("Job forcibly cancelled")
    else
      throw exc
  case ex: Throwable =>
    println(s"Another exception: $ex")
} finally {
  sc.removeSparkListener(sparkListener)
}
For the sake of future visitors, Spark introduced the task reaper in version 2.0.3, which addresses this scenario (more or less) and is a built-in solution.
Note that it can eventually kill an executor if the task is not responsive.
Moreover, some built-in Spark data sources have been refactored to be more responsive to cancellation.
For the 1.6.0 version, Zohar's solution is a "messy but efficient" one.
According to setJobGroup:
"If interruptOnCancel is set to true for the job group, then job cancellation will result in Thread.interrupt() being called on the job's executor threads."
So the anonymous function in your map must be interruptible, like this:
val future = sc.parallelize(lst).map { x => while (!Thread.interrupted) {}; x }.collectAsync()
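For completeness, a rough sketch of the full pattern through the Java API: set an explicit job group with interruptOnCancel before running the action, then cancel that group on timeout. The group id and the 30-second limit are illustrative, not taken from the original code.
// Assumes an existing JavaSparkContext `jsc`.
jsc.setJobGroup("timeout-group", "long computation with a timeout", true); // interruptOnCancel = true
JavaFutureAction<List<Integer>> future = jsc.parallelize(Arrays.asList(1, 2, 3))
        .map(x -> { while (!Thread.currentThread().isInterrupted()) { } return x; })
        .collectAsync();
try {
    future.get(30, TimeUnit.SECONDS); // java.util.concurrent.TimeUnit
} catch (Exception e) {
    // Because interruptOnCancel is true, this interrupts the task threads that are
    // already running instead of only preventing new tasks from being scheduled.
    jsc.cancelJobGroup("timeout-group");
}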