Tried to find if this was asked before but couldn't.
Here is the problem: the following has to be achieved via Spring Batch.
There is one file to be read and processed. The item reader is not thread safe.
The plan is to have multithreaded homogeneous processors and multithreaded homogeneous writers ingest items read by a single-threaded reader.
Kind of like below:
         ----------> Processor #1 ----------> Writer #1
         |
Reader --+--------> Processor #2 ----------> Writer #2
         |
         ----------> Processor #3 ----------> Writer #3
I tried AsyncItemProcessor and AsyncItemWriter, but holding a debug breakpoint in the processor meant the reader was not executed until the breakpoint was released, i.e. single-threaded processing.
A task executor was tried, like below:
<tasklet task-executor="taskExecutor" throttle-limit="20">
This launched multiple threads on the reader.
Synchronising the reader also didn't work.
I tried reading about partitioners, but it seemed complex.
Is there an annotation to mark the reader as single-threaded? Would pushing the read data into a global context be a good idea?
Please guide towards a solution.
I don't think anything is built into the Spring Batch API for the pattern you are looking for; you would need to write some code of your own to achieve it.
The ItemWriter.write method already receives a List of processed items, sized by your chunk size, so you can divide that List across as many threads as you like: spawn your own threads and pass a segment of the list to each thread to write.
The problem is with ItemProcessor.process(), since it processes item by item; you are limited to a single item at a time, and there isn't much threading you can do within a single item.
So the challenge is to write your own reader that hands a list of items to the processor instead of a single item, so those items can be processed in parallel; the writer will then work on a list of lists.
In all of this setup, remember that the threads you spawn are outside the read-process-write transaction boundary of Spring Batch, so you will have to take care of that on your own: merging the processed output from all threads, waiting until all threads are complete, and handling any errors. All in all, it's quite risky.
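As a rough illustration of splitting up the write list across your own threads (a sketch only, assuming a Spring Batch 4.x-style ItemWriter signature; the item type, pool size, and delegate writer are placeholders, and the transaction caveat above still applies):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.springframework.batch.item.ItemWriter;

public class ForkingItemWriter implements ItemWriter<String> {

    // Number of writer threads per chunk; a placeholder value
    private static final int PARTS = 3;

    private final ExecutorService pool = Executors.newFixedThreadPool(PARTS);
    private final ItemWriter<String> delegate; // the real writer; must itself be thread-safe

    public ForkingItemWriter(ItemWriter<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends String> items) throws Exception {
        int sliceSize = (items.size() + PARTS - 1) / PARTS;
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < items.size(); i += sliceSize) {
            final List<? extends String> slice =
                    items.subList(i, Math.min(i + sliceSize, items.size()));
            // Each slice is written on its own thread, outside the step's transaction
            futures.add(pool.submit(() -> {
                delegate.write(slice);
                return null;
            }));
        }
        for (Future<?> future : futures) {
            future.get(); // wait for every slice and propagate any failure
        }
    }
}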
Making a item reader to return a list instead single object - Spring batch
Came across this with a similar problem at hand.
Here's how I am doing it at the moment: as @mminella suggested, a synchronized ItemReader with the FlatFileItemReader as the delegate. This works with decent performance. The code writes about ~4K records per second at the moment, though the speed doesn't depend entirely on this design; other factors contribute as well.
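For reference, a minimal sketch of that wiring, assuming the SynchronizedItemStreamReader that ships with Spring Batch (the file path and line mapper are placeholders):

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;
import org.springframework.core.io.FileSystemResource;

public class ReaderConfig {

    public SynchronizedItemStreamReader<String> synchronizedReader() {
        // Delegate: the non-thread-safe flat file reader (file path is a placeholder)
        FlatFileItemReader<String> delegate = new FlatFileItemReader<>();
        delegate.setResource(new FileSystemResource("input/records.txt"));
        delegate.setLineMapper(new PassThroughLineMapper());

        // Wrapper: read() is synchronized, so a multi-threaded step can share one reader instance
        SynchronizedItemStreamReader<String> reader = new SynchronizedItemStreamReader<>();
        reader.setDelegate(delegate);
        return reader;
    }
}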
I tried two other approaches to increase performance; both more or less failed.
A custom synchronized ItemReader that aggregates items, with FlatFileItemReader as the delegate, but I ended up maintaining a lot of state, which caused a performance drop. Maybe the code needed optimization, or maybe plain synchronization is simply faster.
Firing each insert PreparedStatement batch in a different thread. This didn't improve performance much, but I am still counting on it in case I run into an environment where individual threads per batch would give a significant boost.
I've written a CustomListener (deriving from SparkListener, etc.) and it works fine; I can intercept the metrics.
The question is about using DataFrames within the listener itself, since that assumes using the same SparkContext; however, as of 2.1.x there is only one context per JVM.
Suppose I want to write some metrics to disk as JSON. Doing it at applicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible/feasible?
I'm trying to measure the performance of jobs/stages/tasks, record that, and then analyze it programmatically. Maybe that is not the best way? The web UI is good, but I need to make things presentable.
I can force the creation of DataFrames upon the jobEnd event, but a few errors are thrown (basically saying events could not be propagated to the listener), and in general I would like to avoid unnecessary manipulation. I want a clean set of measurements that I can record and write to disk.
SparkListeners should be as fast as possible, because a slow SparkListener blocks the others from receiving events. You could use separate threads to release the main event dispatcher thread, but you're still bound by the limitation of having a single SparkContext per JVM.
That limitation is, however, easy to overcome, since you can ask for the current SparkContext using SparkContext.getOrCreate.
I would not, however, recommend that architecture. It puts too much pressure on the driver's JVM, which should rather "focus" on the application processing (not on collecting events, which it probably already does for the web UI and/or the Spark History Server).
I'd rather use Kafka, Cassandra, or some other persistent storage to store the events in, and have some other processing application consume them (just like the Spark History Server works).
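If you do keep the writing inside a listener, here is a minimal sketch of the "separate thread" idea mentioned above (the output path and JSON formatting are placeholders, and it deliberately avoids touching DataFrames inside the listener):

import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerApplicationEnd;
import org.apache.spark.scheduler.SparkListenerJobEnd;

public class OffloadingMetricsListener extends SparkListener {

    // A single worker thread keeps the event dispatcher thread free
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    @Override
    public void onJobEnd(SparkListenerJobEnd jobEnd) {
        final int jobId = jobEnd.jobId();
        final long time = jobEnd.time();
        worker.submit(() -> {
            // placeholder path and format; append one JSON line per finished job
            try (FileWriter out = new FileWriter("/tmp/job-metrics.json", true)) {
                out.write(String.format("{\"jobId\": %d, \"endTime\": %d}%n", jobId, time));
            } catch (IOException e) {
                e.printStackTrace(); // don't let metrics failures affect the job
            }
        });
    }

    @Override
    public void onApplicationEnd(SparkListenerApplicationEnd applicationEnd) {
        worker.shutdown(); // flush pending writes before the driver exits
        try {
            worker.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}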
We have scheduled jobs that run daily. Each job looks for matching documents for that day, takes each document, does a minimal transform, and sends it to a queue for downstream processing. Typically we have 4 million documents to process per day, and our aim is to complete the processing within one hour. I am looking for suggestions on best practices for reading 4 million documents from MongoDB quickly.
The MongoDB Async driver is the first stop for low-overhead querying. There's a good example of using the SingleResultCallback on that page:
Block<Document> printDocumentBlock = new Block<Document>() {
    @Override
    public void apply(final Document document) {
        System.out.println(document.toJson());
    }
};
SingleResultCallback<Void> callbackWhenFinished = new SingleResultCallback<Void>() {
    @Override
    public void onResult(final Void result, final Throwable t) {
        System.out.println("Operation Finished!");
    }
};
collection.find().forEach(printDocumentBlock, callbackWhenFinished);
It is a common pattern in asynchronous database drivers to allow results to be passed on for processing as soon as they are available. The use of OS-level async I/O keeps CPU overhead low, which brings up the next problem: how to get the data out.
Without seeing the specifics of your work, you probably want to place the results into an in-memory queue to be picked up by another thread at this point, so the reader thread can keep reading results. An ArrayBlockingQueue is probably appropriate; put is more appropriate than add because it will block the reader thread if the workers aren't able to keep up (keeping things balanced). Ideally you don't want the queue to back up, which is where multiple worker threads become necessary. If the order of the results is important, use a single worker thread; otherwise use a ThreadPoolExecutor with the queue passed into the constructor. Using the in-memory queue does open up the possibility of data loss if the results are somehow being discarded as they are read (i.e. if you were immediately sending off another query to delete them) and the reader process crashed.
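A rough sketch of that hand-off (queue capacity, worker count, and the transform are placeholders; shutdown and error handling are trimmed for brevity):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.bson.Document;

import com.mongodb.Block;
import com.mongodb.async.SingleResultCallback;
import com.mongodb.async.client.MongoCollection;

public class ReaderWorkerHandoff {

    private static final int WORKERS = 4;                 // placeholder worker count

    // Bounded buffer between the driver's callback thread and the workers
    private final BlockingQueue<Document> buffer = new ArrayBlockingQueue<>(10_000);
    private final ExecutorService workers = Executors.newFixedThreadPool(WORKERS);

    public void start(MongoCollection<Document> collection) {
        // Workers: pull documents off the buffer, transform, and forward them downstream
        for (int i = 0; i < WORKERS; i++) {
            workers.submit(() -> {
                try {
                    while (true) {
                        transformAndForward(buffer.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Reader callback: put() blocks if the workers fall behind, keeping things balanced
        Block<Document> enqueue = document -> {
            try {
                buffer.put(document);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        SingleResultCallback<Void> done = (result, t) -> System.out.println("Read finished");
        collection.find().forEach(enqueue, done);
    }

    private void transformAndForward(Document document) {
        // placeholder: apply the 'minimal transform' and push to the downstream queue
    }
}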
At this point, either do the 'minimal transforms' on the worker thread(s), or serialize the items in the workers and put them on a real queue (e.g. RabbitMQ, ZeroMQ). Putting them onto a real queue makes it trivial to divide the work amongst multiple machines, provides optional persistence allowing recovery of work, and those queues have great clustering options for scalability. Those machines can then put the results into the queue you mentioned in the question (assuming it has the capacity).
The bottleneck in a system like that is how quickly one machine can get through a single Mongo query and how many results the final queue can handle. All the other parts (MongoDB, queues, number of worker machines) are individually scalable. By doing as little work as possible on the querying machine and pushing that work onto other machines, that impact can be greatly reduced. It sounds like your destination queue is out of your control.
When trying to work out where bottlenecks are, measurements are critical. Adding metrics to your application up front will let you know which areas need improvement when things aren't going well.
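Even something as simple as per-stage counters dumped once a second will show where the pipeline stalls; a minimal sketch using only the JDK (the stage names are placeholders):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ThroughputMeter {

    private final AtomicLong read = new AtomicLong();
    private final AtomicLong transformed = new AtomicLong();

    public void onRead()        { read.incrementAndGet(); }
    public void onTransformed() { transformed.incrementAndGet(); }

    // Log per-second counts so the slow stage is obvious
    public void start() {
        ScheduledExecutorService ticker = Executors.newSingleThreadScheduledExecutor();
        ticker.scheduleAtFixedRate(() ->
                System.out.printf("read=%d transformed=%d%n",
                        read.getAndSet(0), transformed.getAndSet(0)),
                1, 1, TimeUnit.SECONDS);
    }
}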
That set-up can build a pretty scalable system. I've built many similar systems before. Beyond that, you'll want to investigate getting your data into something like Apache Storm.
While learning the Writer monad, I found that it collects the log value of each step and then combines them together.
I have a question about performance: when can I actually log them? If a method runs thousands of times, it will hold many, many strings in memory for a long time, and I can only log them to a file after it returns.
How do we use the Writer monad for logging in the real world? Is there any way to write the logs out just in time?
Generally, no. Writer is intended to make logging pure and introspectable; flushing parts of it would violate that.
In practice you could probably make a Monoid that violates the laws by e.g. writing out the list whenever it gets too long. But that would sacrifice most of the benefits of using Writer in the first place.
I'd recommend benchmarking your requirements and explicitly running and logging out the Writer at a granular, reasonable level, e.g. per web request or per logical transaction.
I am writing a project that will generate reports. It reads all requests from a database by making a REST call; based on the type of request, it makes a REST call to an endpoint, and after getting the response it saves the response in an object and writes it back to the database by calling another endpoint.
I am using Spring Batch to handle the batch work. So far what I came up with is a single job (reader, processor, writer) that does the whole thing. I am not sure if this is the correct design, considering:
I do not want to queue up requests if some request is taking a long time to get a response back. [not sure yet]
I do not want to hold up saving responses until all the responses are received. [using commit-interval will help]
If the job crashes for some reason, how can I restart the job? [maybe using batch-admin will help, but what are my other options]
With chunk-oriented processing, the Reader, Processor, and Writer get executed in order until the Reader has nothing left to return.
If you can read one item at a time, process it, and send it to the endpoint that handles persistence, this approach is handy.
If you must read ALL the information at once, the reader gets one big collection with all the items and passes it to the processor. The processor processes all the items and sends the result to the writer. You cannot send just a few items at a time to the writer, so you would have to do the persistence directly from the processor, and that would be against the design.
So, as I understand this, you have two options:
Design a reader that can read one item at a time. Use the chunk-oriented processing you already started: read one item, process it, and send it on for persistence. Have a look at how other readers are implemented (like JdbcCursorItemReader); a sketch of such a reader is shown after this list.
Create a tasklet that reads the whole collection of items, processes it, and sends the results on. You can break this up into different tasklets.
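For the first option, a minimal sketch of a one-at-a-time reader backed by a REST call (the endpoint URL is a placeholder and the item type is kept generic as a Map; in real code you would map to your own domain class):

import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.springframework.batch.item.ItemReader;
import org.springframework.core.ParameterizedTypeReference;
import org.springframework.http.HttpMethod;
import org.springframework.web.client.RestTemplate;

// Fetches the day's requests once, then hands them to the step one at a time
public class RequestItemReader implements ItemReader<Map<String, Object>> {

    private final RestTemplate restTemplate = new RestTemplate();
    private Iterator<Map<String, Object>> requests;

    @Override
    public Map<String, Object> read() {
        if (requests == null) {
            List<Map<String, Object>> all = restTemplate.exchange(
                    "http://example.org/api/requests",   // placeholder endpoint
                    HttpMethod.GET,
                    null,
                    new ParameterizedTypeReference<List<Map<String, Object>>>() {})
                .getBody();
            requests = (all != null ? all : Collections.<Map<String, Object>>emptyList()).iterator();
        }
        // Returning null tells Spring Batch there is nothing left to read
        return requests.hasNext() ? requests.next() : null;
    }
}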
commit-interval only controls after how many items the transaction is committed, so it will not help you here, as all the processing and persistence is done by calling REST services.
I have figured out a design and I think it will work fine.
As for the questions I asked, here are the answers:
Using asynchronous processors will help avoid queuing up requests; see the sketch after this list.
http://docs.spring.io/spring-batch/trunk/reference/html/springBatchIntegration.html#asynchronous-processors
Using commit-interval will solve it.
This thread has the answer - Spring batch :Restart a job and then start next job automatically
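The asynchronous-processors link above covers AsyncItemProcessor and AsyncItemWriter from spring-batch-integration. A minimal sketch of the wiring (the task executor and generic types are placeholders; plugging these into the step definition is omitted):

import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

public class AsyncStepConfig {

    // Each item is processed on its own thread; the step's chunk then carries
    // Futures, which the AsyncItemWriter unwraps before delegating the write.
    public <I, O> AsyncItemProcessor<I, O> asyncProcessor(ItemProcessor<I, O> delegate) {
        AsyncItemProcessor<I, O> async = new AsyncItemProcessor<>();
        async.setDelegate(delegate);
        async.setTaskExecutor(new SimpleAsyncTaskExecutor("async-step-"));
        return async;
    }

    public <O> AsyncItemWriter<O> asyncWriter(ItemWriter<O> delegate) {
        AsyncItemWriter<O> async = new AsyncItemWriter<>();
        async.setDelegate(delegate);
        return async;
    }
}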
I'm coming from Java, where I'd submit Runnables to an ExecutorService backed by a thread pool. It's very clear in Java how to set limits to the size of the thread pool.
I'm interested in using Scala actors, but I'm unclear on how to limit concurrency.
Let's just say, hypothetically, that I'm creating a web service which accepts "jobs". A job is submitted with POST requests, and I want my service to enqueue the job then immediately return 202 Accepted — i.e. the jobs are handled asynchronously.
If I'm using actors to process the jobs in the queue, how can I limit the number of simultaneous jobs that are processed?
I can think of a few different ways to approach this; I'm wondering if there's a community best practice, or at least, some clearly established approaches that are somewhat standard in the Scala world.
One approach I've thought of is having a single coordinator actor which would manage the job queue and the job-processing actors; I suppose it could use a simple int field to track how many jobs are currently being processed. I'm sure there'd be some gotchas with that approach, however, such as making sure to track when an error occurs so as to decrement the number. That's why I'm wondering whether Scala already provides a simpler or more encapsulated approach to this.
BTW I tried to ask this question a while ago but I asked it badly.
Thanks!
I'd really encourage you to have a look at Akka, an alternative Actor implementation for Scala.
http://www.akkasource.org
Akka already has a JAX-RS[1] integration, and you could use that in concert with a LoadBalancer[2] to throttle how many actions can be done in parallel:
[1] http://doc.akkasource.org/rest
[2] http://github.com/jboner/akka/blob/master/akka-patterns/src/main/scala/Patterns.scala
You can override the system properties actors.maxPoolSize and actors.corePoolSize, which limit the size of the actor thread pool, and then throw as many jobs at the pool as your actors can handle. Why do you think you need to throttle your reactions?
You really have two problems here.
The first is keeping the thread pool used by actors under control. That can be done by setting the system property actors.maxPoolSize.
The second is runaway growth in the number of tasks that have been submitted to the pool. You may or may not be concerned with this one; however, it is entirely possible to trigger failure conditions such as out-of-memory errors, and in some cases potentially more subtle problems, by generating too many tasks too fast.
Each worker thread maintains a deque of tasks. The deque is implemented as an array that the worker thread will dynamically enlarge up to some maximum size. In 2.7.x the deque can grow quite large, and I've seen that trigger out-of-memory errors when combined with lots of concurrent threads. The maximum deque size is smaller in 2.8. The deque can also fill up.
Addressing this problem requires controlling how many tasks you generate, which probably means some sort of coordinator, as you've outlined. I've encountered this problem when the actors that initiate a data-processing pipeline are much faster than the ones later in the pipeline. To control the process, I usually have the actors later in the chain ping back to the actors earlier in the chain every X messages, and have the ones earlier in the chain stop after X messages and wait for the ping back. You could also do it with a more centralized coordinator.