We process lots of files(around 500) overnight and those files comes every few min. But when they come , it is in group of 30-50. Is it a good idea to launch job for each file or group them and process it using multithreaded step?
Instead of going multithreaded directly or job per file, I'd recommend using partitioning. Using the MultiResourcePartitioner, you can create a partition per file which means each file get's its own step. By doing this, you can avoid some of the threading complexities (step scope stateful components), and still maintain things like restartability and the independent execution of each file with in the "batch" (run of the job). You can read more about partitioning in the documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/scalability.html
It looks like the order in which files are processed does not matter.
I would use an instance of the batch job per file and not a multi-threaded step. Some advantages of using separate job instances are
It is easier to implement that a multi-threaded step.
Errors in one file will not affect the processing of other files.
If your files are very large, you can implement a multi-threaded step to process records of one file in parallel. This is something I would consider only if performance is not to expectations.
Multi-threaded programming in general is hard. Spring batch does a good job of abstracting the complexities of parallel processing but I have found that there are usually nuances to deal with, so it is best to avoid multi-threaded steps if you can.
Related
Hi I am new to Azure data factory and not all familiar with the back-end processing that run behind the scenes. I am wondering if there is a performance impact to running couple of data flows in parallel when compared to having all the transformations in one data flow.
I am trying to stage some data with a not exists transformation. i have to do it for multiple tables. when i test ran two data flows in parallel the clusters were brought up together for both the data flows simultaneously. But I am not sure if this the best approach to distribute the loading of tables across couple of data flows or to have all the transformations in one data flow
1: If you execute data flows in a pipeline in parallel, ADF will spin-up separate Spark clusters for each based on the settings in your Azure Integration Runtime attached to each activity.
2: If you put all of your logic inside a single data flow, then it will all execute in that same job execution context on a single Spark cluster instance.
3: Another option is to execute the activities in serial in the pipeline. If you have set a TTL on the Azure IR configuration, then ADF will reuse the compute resources (VMs) but you will still a brand-new Spark context for each execution.
All are valid practices and which one you choose should be driven by your requirements for your ETL process.
No. 3 will likely take the longest time to execute end-to-end. But it does provide a clean separation of operations in each data flow step.
No. 2 could be more difficult to follow logically and doesn't give you much re-usability.
No. 1 is really similar to #3, but you run them all in parallel. Of course, not every end-to-end process can run in parallel. You may require a data flow to finish before starting the next, in which case you're back in #3 serial mode.
How does Datastage Parallelism help with Performance improvement? What is the relationship between Parallelism and Performance?
Thanks & Regards,
Subhasree
This question is very broad - please try to be nore specific next time.
There are several differnt parallel approaches in DataStage:
Pipeline Parallelism: Imagine a job where data get read from a database is being transformed and written to another database. While data is still read from the database some rows get transformed and some (have already been transformed) and are already written to the target.
Because you do not have to wait for a single step to finish this provides performamce.
Partitioning Parallelism: Data get read i.e. from a Sequential file and then will be split up into different data partitions (number of partitions is determined by the configuration file). Parallel stages also designed once will be instanciated one per partition and therefore extra threads will be spawned. These thread will be running in parallel and again provide a better performance (throughput).
Hope this helps.
What is the better way of making full use of multiple cores for parallel processing in a Scala/Hadoop system?
Let's say I need to process 100 million documents. Documents are not very large, but processing them is computationally intensive. If I have a Hadoop cluster with 100 machines with 10 cores each, I could either:
A) send 1000 documents to each machine and let Hadoop start a map on each of the 10 cores (or as many as are available)
or
B) send 1000 documents to each machine (still using Hadoop) and use Scala's parallel collections to make full use of the multiple cores. (I would put all documents in a parallel collection, and then call map on the collection). In other words, use Hadoop for distribution at cluster level, and use parallel collections to manage the distribution to cores within each machine.
Hadoop is going to offer a lot more than just parallelization. It offers a platform to distribute work, a scheduler for handling concurrent jobs, a distributed filesystem, the ability to perform a distributed reduce, and fault tolerance. That said, it is a complicated system and can sometimes be difficult to work with.
If you plan to have multiple users submitting many different jobs, Hadoop is the way to go (out of the two options). However, if you are devoting a cluster to be always be processing documents through the same function, you could, without too much trouble, develop a system with Scala parallel collections and actors for inter-machine communication. The Scala solution would give you more control, the system could respond in real time, and you wouldn't have to deal with a lot of Hadoop configuration that doesn't pertain to your task.
If you need to run varied jobs over large amounts of data (larger than would fit on a single node), then use Hadoop. I can give you more information if you describe your requirements in more detail.
Update: one million is a fairly small number. You might want to do some calculations and see how long it would take on a single machine with parallel collections. The advantage here is that the development time is minimal!
The answer depends on the following question - does your Scala code capable to fully utilize all cores available. Probabbly if you have good intrinsic synchronization between parts of the document to be processed or some other way to parralelyze algorithm without lock contention - then the "B"" is the way. If so - configure one mapper per node and let your mapper to utilize cores in a best way.
If your gain from the parralelization is not that good, and adding more threads (cores) to the processing does not improve performance in a linear way - then the "A" can be better way. Efficiency of "A" also depends on the size of your RAM - you will need enough ram for 10 mappers per node.
I can suspect that ideal solution can be somewhere in between. So my suggestion is to develop mapper which takes number of threads used as a parameter and then do a few tests increasing number of threads per mapper and decreasing number of mappers per node.
Hadoop is not very good for processing a lot of small files, but for processing a small amount of very large files. Is there any way you can merge the files before processing them, or are they all totally different? Hadoop takes care of distribution and parallelism itself, so there is no need to explicitly send X docs to Y machines. And also i don't think you should use hadoop only as a distribution mechanism, that is not what it's made for. You should either use a real map/reduce, or build your own system for whatever you are trying to do, but not try to bend hadoop to your will.
The Celery docs section Performance and Strategies suggests that tasks with multiple 'steps' should be divided into subtasks for more efficient parallelization. It then mentions that (of course) there will be more message passing overhead, so dividing into subtasks may not be worth the overhead.
In my case, I have an overall task of retrieving a small image (150px x 115px) from a third party API, then uploading via HTTP to my site's REST API. I can either implement this as a single task, or divide up the steps of retrieving the image and then uploading it into two seperate tasks. If I go with seperate tasks, I assume I will have to pass the image as part of the message to the second task.
My question is, which approach should be better in this case, and how can I measure the performance in order to know for sure?
Since your jobs are I/O-constrained, dividing the task may increase the number of operations that can be done in parallel. The message-passing overhead is likely to be tiny since any capable broker should be able to handle lots of messages/second with only a few ms of latency.
In your case, uploading the image will probably take longer than downloading it. With separate tasks, the download jobs needn't wait for uploads to finish (so long as there are available workers). Another advantage of separation is that you can put each job on different queue and dedicate more workers as backed-up queues reveal themselves.
If I were to try to benchmark this, I would compare execution times using same number of workers for each of the two strategies. For instance 2 workers on the combined task vs 2 workers on the divided one. Then do 4 workers on each and so on. My inclination is that the separated task will show itself to be faster; especially when the worker count is increased.
Just wanted to ask whether it's true that parallel processing is faster than sequentially processing.
I've always thought that parallel processing is faster, so therefore, I did an experiment.
I benchmarked my scripts and found out that after doing a bunch of
sub add{
for ($x=0; $x<=200000; $x++){
$data[$x] = $x/($x+2);
}
}
threading seems to be slower by about 0.5 CPU secs on average. Is this normal or is it really true that sequentially processing is faster?
Whether parallel vs. sequential processing is better is highly task-dependent and you've already done the right thing: You benchmarked both and determined for your task (the one you benchmarked, not necessarily the one you actually want to do) which one is faster.
As a general rule, on a single processor, sequential processing tends to be better for tasks which are CPU-bound, because if you have two tasks each needing five seconds of CPU time to complete, then you'll need ten seconds of CPU time regardless of whether you do them sequentially or in parallel. Setting up multiple threads/processes will, therefore, provide no benefit, but it will create additional task-switching overhead while also preventing you from having any results until all results are available.
CPU-bound tasks on a multi-processor system tend to do better when run in parallel, provided that they can run independently of each other. If not, or if you're using a language/threading model/IPC model/etc. which forces all tasks to run on the same processor, then see "on a single processor" above.
Parallel processing is generally better for tasks which are I/O-bound, regardless of the number of processors available, because CPUs are fast and I/O is slow, so working in parallel allows one task to process its data while the other is waiting for I/O operations to complete. (This is why make -j2 tends to be significantly faster than a plain make, even on single-processor machines.)
But, again, these are all generalities and all have cases where they'll be incorrect. Only benchmarking will reveal the truth with certainty.
Perl threads are an extreme suck. You are better off in every case forking several processes.
When you create a new thread in perl, it does the following:
Make a copy - yes, a real copy - of every single perl data structure in scope, including those belonging to modules you didn't write
Start up what is almost a new, independent instance of perl in a new OS thread
If you then want to share anything (as it has now copied everything) you have to use the share function in the threads module. This is incredibly sucky, as it replaces your variable, with some tie() nonsense, which adds much-too-fine-grained locking around it to prevent concurrent access. Accessing a shared variable then causes a massive amount of implicit locking, and is incredibly slow.
So in short, perl threads:
Take a long time to start
waste loads of memory
Cannot share data efficiently anyway.
You are much better off with fork(), which does not copy every variable (the kernel does copy-on-write) unless you're on Windows.
There's no reason to assume that in a single CPU core system, parallel processing will be faster.
Consider this png example:
The red and blue lines at the top represent two tasks running sequentially on a single core.
The alternate red and blue lines at the bottom represent two task running in parallel on a single core.