Using Cosmos DB for Spring Batch job repository - spring-batch

Is it possible to use CosmosDB as a job repository for Spring Batch?
If that is not possible, can we go with an in-memory DB to handle our Spring batch jobs?
The job itself is triggered on message arrival in a remote queue. We use a variation of the process-indicator pattern in our current Spring Batch job to keep track of the "chunks" being processed. The saveState attribute on our readers is also disabled. The reader always uses a DB query to avoid picking up the same chunks and to prevent duplicate processing.
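For context, the reader/writer side of our process-indicator variant looks roughly like the sketch below (table and column names, and the Record type, are simplified placeholders):

```java
import javax.sql.DataSource;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;

@Configuration
public class ProcessIndicatorConfig {

    // Placeholder domain type for the rows being processed.
    public static class Record {
        final long id;
        final String payload;
        Record(long id, String payload) { this.id = id; this.payload = payload; }
    }

    // The reader only picks up rows not yet marked processed, so a redelivered
    // queue message never causes already-finished chunks to be re-read.
    @Bean
    public JdbcCursorItemReader<Record> unprocessedRecordsReader(DataSource dataSource) {
        return new JdbcCursorItemReaderBuilder<Record>()
                .name("unprocessedRecordsReader")
                .dataSource(dataSource)
                .sql("SELECT id, payload FROM records WHERE processed = 'N'")
                .rowMapper((rs, rowNum) -> new Record(rs.getLong("id"), rs.getString("payload")))
                .saveState(false) // restart state comes from the flag column, not the execution context
                .build();
    }

    // The writer flips the flag in the same chunk transaction after the business write.
    @Bean
    public ItemWriter<Record> markProcessedWriter(JdbcTemplate jdbcTemplate) {
        return items -> {
            for (Record item : items) {
                jdbcTemplate.update("UPDATE records SET processed = 'Y' WHERE id = ?", item.id);
            }
        };
    }
}
```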
We don't commit the message on the queue until all records for that job are processed. So if the node dies and comes back up in the middle of processing, the same message is redelivered, which takes care of job restarts. Given all this, we have a choice of either coming up with a way to implement a Cosmos DB job repository, or simply using the in-memory repository and plugging in an "afterJob" listener to clean up the in-memory job data so that Java memory is not consumed in production. Any recommendations?
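If we go the in-memory route, the afterJob cleanup we have in mind looks roughly like the sketch below (assuming the map-based repository from Spring Batch 4, where MapJobRepositoryFactoryBean exposes a clear() method):

```java
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean;

// Sketch of a listener that wipes in-memory job metadata once a job finishes,
// so the map-based repository does not grow unbounded in production.
public class InMemoryRepositoryCleanupListener implements JobExecutionListener {

    private final MapJobRepositoryFactoryBean repositoryFactory;

    public InMemoryRepositoryCleanupListener(MapJobRepositoryFactoryBean repositoryFactory) {
        this.repositoryFactory = repositoryFactory;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // nothing to do before the job starts
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // Drop all in-memory job/step execution data once the job has finished.
        // Safe here because restartability comes from queue redelivery, not from the repository.
        repositoryFactory.clear();
    }
}
```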

Wanted to provide information that Azure Cosmos DB just released v3 of the Spring Data connector for the SQL API:
The Spring on Azure team, in partnership with the Azure Cosmos DB team, is proud to have just made Spring Data Azure Cosmos DB v3 generally available. This is the latest version of Azure Cosmos DB's SQL API Spring Data connector.
Also, Spring.io has an example microservices solution (Spring Cloud Data Flow) based on Spring Batch that could be used as an example for your solution.
Additional Information:
Spring Data Azure Cosmos DB v3 for Core (SQL) API: Release notes and resources (link)
A well-written third-party blog post that is very helpful:
Introduction to Spring Data Azure Cosmos DB (link)

Related

Spring Batch and Azure Cosmos DB

I am planning to run Spring Batch on Azure as a serverless workload and am looking to explore Cosmos DB to store and manipulate the data.
Can I still use the Spring Batch metadata tables with Cosmos DB? If not, where should the Spring Batch metadata be stored?
How can we schedule a batch job on Azure? Is there any complete working example?
Cosmos DB is not a supported database in Spring Batch, but you may be able to use one of the supported types if its SQL variant is close enough.
Please refer to the Non-standard Database Types in a Repository section of the documentation for more details and a sample.
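As a rough illustration of that approach (not specific to Cosmos DB; the "db2" choice below is only an example of picking the nearest supported dialect), the idea is to point JobRepositoryFactoryBean at your DataSource and force the databaseType:

```java
import javax.sql.DataSource;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.repository.support.JobRepositoryFactoryBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

// Sketch of the "non-standard database type" approach from the docs: whether this
// works for a given database depends on how close its SQL dialect really is.
@Configuration
public class NearestDialectJobRepositoryConfig {

    @Bean
    public JobRepository jobRepository(DataSource dataSource,
                                       PlatformTransactionManager transactionManager) throws Exception {
        JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
        factory.setDataSource(dataSource);                // DataSource pointing at the non-standard database
        factory.setTransactionManager(transactionManager);
        factory.setDatabaseType("db2");                   // closest supported type -- an assumption, pick what matches best
        factory.afterPropertiesSet();
        return factory.getObject();
    }
}
```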

ETL with Spring Batch and Spring Cloud Data Flow (SCDF)

We have a use case where data can be sourced from different sources (DB, file, etc.), transformed, and stored in various sinks (Cassandra, DB, or file). We would want the ability to split the jobs and do parallel loads - it looks like Spring Batch remote chunking provides that ability.
I am new to SCDF and Spring Batch and am wondering about the best way to use them.
Is there a way to provide configuration for these jobs (source connection details, table and query), and can this be done through a UI (the SCDF Server UI)? Is it possible to compose the flow?
This will run on Kubernetes, and our applications are deployed through a Jenkins pipeline.
We would want the ability to split the jobs and do parallel loads - it looks like Spring Batch remote chunking provides that ability.
I don't think you need remote chunking; you can instead run parallel jobs, where each job handles an ETL process (for a particular file or DB table) - see the sketch below.
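As a rough sketch of what I mean by parallel jobs (assuming Spring Batch 4.x's SimpleJobLauncher; bean and thread names are illustrative), you can back the launcher with an async task executor so each ETL job runs on its own thread:

```java
import org.springframework.batch.core.launch.support.SimpleJobLauncher;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class AsyncJobLauncherConfig {

    // Each launched job (one per file or table) runs on its own thread
    // instead of blocking the caller, so independent loads proceed in parallel.
    @Bean
    public SimpleJobLauncher asyncJobLauncher(JobRepository jobRepository) throws Exception {
        SimpleJobLauncher launcher = new SimpleJobLauncher();
        launcher.setJobRepository(jobRepository);
        launcher.setTaskExecutor(new SimpleAsyncTaskExecutor("etl-job-"));
        launcher.afterPropertiesSet();
        return launcher;
    }
}
```

Each job can then be launched with its own JobParameters (file name, table, etc.) and they will run concurrently.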
Is there a way to provide configuration for these jobs (source connection details, table and query)
Yes, those can be configured like any regular Spring Batch job is configured.
and can this be done through a UI (the SCDF Server UI)?
If you make them configurable through properties of your job, you can specify them through the UI when you run the task.
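For example, a minimal sketch of exposing those settings as bindable properties (the etl.* names are made up for illustration); anything bound this way can be overridden as application properties or arguments when launching the task from the SCDF UI:

```java
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.boot.context.properties.EnableConfigurationProperties;
import org.springframework.context.annotation.Configuration;

// Hypothetical property holder: etl.source-url, etl.table and etl.query become
// configurable per launch instead of being hard-coded in the job.
@ConfigurationProperties(prefix = "etl")
public class EtlJobProperties {

    private String sourceUrl; // source connection details
    private String table;     // table to read from
    private String query;     // extraction query

    public String getSourceUrl() { return sourceUrl; }
    public void setSourceUrl(String sourceUrl) { this.sourceUrl = sourceUrl; }
    public String getTable() { return table; }
    public void setTable(String table) { this.table = table; }
    public String getQuery() { return query; }
    public void setQuery(String query) { this.query = query; }
}

// Registers the properties class with the Spring Boot binder.
@Configuration
@EnableConfigurationProperties(EtlJobProperties.class)
class EtlJobPropertiesConfig { }
```

When launching the task, arguments such as --etl.table=customers or --etl.query=... can then be supplied through the launch screen.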
Is it possible to compose the flow?
Yes, this is possible with Composed Task.

How to use Spring Cloud Dataflow to get Spring Batch Status

I have been using Spring Batch and my metadata is in DB2. I have been using the Spring Batch Admin API (jars) to look at the current status of various jobs and to get details about a job, like the number of items read, commit count, etc. Now that Spring Batch Admin has moved to Spring Cloud Data Flow, how do I look at this information? Is there a good API set I could use?
Basically, in Spring Cloud Data Flow, you first need to create a Spring Cloud Task that wraps your batch application: see the example [here][1].
With the help of Spring Cloud's @EnableTaskLauncher you can get the current status of a job, run the job, stop the job, etc.
You need to send a TaskLaunchRequest for it.
See the APIs of TaskLauncher.
Edit:
To get the Spring Batch status, you first need the task execution id of the Spring Cloud Task, then use the Set<Long> getJobExecutionIdsByTaskExecutionId(long taskExecutionId) method of [TaskExplorer][3].
See TaskExplorer for all the APIs. With the job execution ids, use JobExplorer to get the status of the jobs.
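Putting the two explorers together, a hedged sketch (class and method names here are illustrative, not a provided API):

```java
import java.util.Set;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.cloud.task.repository.TaskExplorer;

// Resolves the batch status of a Spring Cloud Task execution: TaskExplorer maps the
// task execution to its job execution ids, and JobExplorer loads the details.
public class TaskJobStatusReader {

    private final TaskExplorer taskExplorer;
    private final JobExplorer jobExplorer;

    public TaskJobStatusReader(TaskExplorer taskExplorer, JobExplorer jobExplorer) {
        this.taskExplorer = taskExplorer;
        this.jobExplorer = jobExplorer;
    }

    public void printJobStatuses(long taskExecutionId) {
        Set<Long> jobExecutionIds = taskExplorer.getJobExecutionIdsByTaskExecutionId(taskExecutionId);
        for (Long jobExecutionId : jobExecutionIds) {
            JobExecution jobExecution = jobExplorer.getJobExecution(jobExecutionId);
            System.out.println(jobExecution.getJobInstance().getJobName()
                    + " -> " + jobExecution.getStatus());
            // Step-level details: items read, commit count, etc.
            jobExecution.getStepExecutions().forEach(step ->
                    System.out.println("  " + step.getStepName()
                            + ": read=" + step.getReadCount()
                            + ", commits=" + step.getCommitCount()));
        }
    }
}
```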

Spring Batch metadata tables in Mongo database

I tried to have the Spring Batch metadata tables in a Mongo database, but it is not working correctly. I referred to and used the GitHub project mentioned below to configure the JobRepository to store job data in MongoDB. The GitHub project was last updated 3 years ago and looks discontinued.
https://github.com/vfouzdar/springbatch-mongoDao
https://jbaruch.wordpress.com/2010/04/27/integrating-mongodb-with-spring-batch/
Currently my application uses in-memory tables for Spring Batch and the functional part is done, but I want the job data to be stored in MongoDB.
I have already used MySQL for Spring Batch job data, but I don't want MySQL in the current application.
If anybody has any other solution/link which can help me, please share.

How to continuously write MongoDB data into a running HDInsight cluster

I want to keep a Windows Azure HDInsight cluster always running so that I can periodically write updates from my master data store (which is MongoDB) and have it process map-reduce jobs on demand.
How can I periodically sync data from MongoDB to the HDInsight service? I'm trying to avoid uploading all the data whenever a new query is submitted (which can happen at any time), and instead have it somehow pre-warmed.
Is that possible on HDInsight? Is it even possible with Hadoop?
Thanks,
It is certainly possible to have that data pushed from Mongo into Hadoop.
Unfortunately, HDInsight does not support HBase (yet); otherwise you could use something like ZeroWing, a solution from Stripe that reads the MongoDB oplog used by Mongo for replication and then writes it out to HBase.
Another solution might be to write out documents from your Mongo to Azure Blob storage; this means you wouldn't have to keep the cluster up all the time, but you would still be able to use it to run periodic map-reduce analytics against the files in storage.
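A rough sketch of that blob-export idea, assuming the MongoDB Java sync driver and the azure-storage-blob v12 client (both newer than the tooling mentioned here; connection strings, database, and container names are placeholders):

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class MongoToBlobExport {

    public static void main(String[] args) {
        // Placeholder connection details.
        String mongoUri = "mongodb://localhost:27017";
        String storageConnectionString = System.getenv("AZURE_STORAGE_CONNECTION_STRING");

        // Dump the collection as newline-delimited JSON, a format Hadoop jobs can read line by line.
        StringBuilder ndjson = new StringBuilder();
        try (MongoClient mongoClient = MongoClients.create(mongoUri)) {
            MongoCollection<Document> collection =
                    mongoClient.getDatabase("masterdata").getCollection("events");
            for (Document doc : collection.find()) {
                ndjson.append(doc.toJson()).append('\n');
            }
        }

        // Upload the export to a blob container that the HDInsight cluster can read from.
        BlobContainerClient container = new BlobServiceClientBuilder()
                .connectionString(storageConnectionString)
                .buildClient()
                .getBlobContainerClient("mongo-exports");
        BlobClient blob = container.getBlobClient("events-" + System.currentTimeMillis() + ".json");
        byte[] bytes = ndjson.toString().getBytes(StandardCharsets.UTF_8);
        blob.upload(new ByteArrayInputStream(bytes), bytes.length, true);
    }
}
```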
Your best method is undoubtedly to use the Mongo Hadoop connector. This can be installed in HDInsight, but it's a bit fiddly. I've blogged a method here.