Perl Distributed parallel computing - perl

I would like to know if there are any perl modules available to enable distributed parallel computation similar to apache hadoop.
Example,
A perl script to be executed in many machines parallely when submitted to a client node.

I'm the author of the Many-core Engine for Perl.
During the next several weekends, I will take MCE for a spin with Gearman::XS. MCE is good at maximizing available cores on a given node. Gearman is good at job distribution and includes a lot of features such as load balancing. Combining the two together was my thought for scaling MCE horizontally across many nodes. :) I did not share this news with anybody until just now.
Why are the two modules a good fit (my humble opinion):
For distribution, one needs some sort of chunking engine. MCE is a chunking engine -- so breaking up input is natural to MCE. Essentially MCE can be used at both sides, the job submission host as well as on the worker node for maximizing available cores.
For worker nodes, MCE follows a bank-queuing model when processing input data. This helps guarantee that all CPUs remain busy from the start of the job till the very end. As workers being to idle down, the remaining "working" are processing their very last chunk.
One's imagination is the limit here -- there are so many possibilities with these 2 modules working together. When writing MCE, I first focused on the node side. Job distribution is next obviously and I did a search and came across Gearman::XS. The 2 modules can chunk happily together :) Job distribution side (bigger chunks), once on node (smaller chunks). All the networking stuff is handled by Gearman.
Basically, there's no need for me to write the job distribution aspect when Gearman::XS is already quite nice. This has been my plan. I will write about Gearman::XS + MCE soon.
BTW: Folks can do similar things with GRID-Machine + MCE I imagine. MCE's beauty is on maximizing all available cores on any given node.
Another magical thing about MCE is that one may not want 200 nodes * 16 workers all reading/writing from/to the NFS server for example. That will impact the NFS server greatly. BTW: RHEL 6.4 will include pNFS (parallel NFS). With MCE, workers can call the "do" method to serialize writes/reads from NFS. So instead of 200 * 16 = 3200 attacking NFS, it becomes just 200 maximum requests against the NFS server at any given time (1 per physical node).
When writing MCE, grace can be applied for many scenarios. I need to add more wikis to MCE's home page MCE at code.google.com. In addition, MCE eats really big log files for breakfast :) Check out egrep.pl and wc.pl under the examples dir. It even beats the wide finder project with sequential IO (powerful slurp IO among many workers).
Check out the images included with the MCE distribution. Oh, do not forget to check out the main Gearman site as well.
What's left after this? Humm, the web piece. One idea which comes to mind is to use Mojo. There are many options. This is just one:
Gearman::XS + MCE + Mojolicious
Again, one can use GRID-Machine instead of Gearman::XS if wanting to communicate through SSH.
Anyway, that was my plan to use an already available job distribution module. For MCE, my focus was on maximizing performance on a single node -- to include chunking, serializing, bank-queuing model, user tasks (allows for many roles), number sequencing among workers, and sequential slurp IO.
-- mario

You might look into something as simple as a message queue like ZeroMQ. I'm sure a CPAN search could help with some other suggestions.
Recently there has been some talk of the Many Core Engine MCE module, which you might want to investigate, I don't know for sure that it lets you parallelize off the host computer, but it seems like it wouldn't be a big step given its stated purpose.

Argon may provide what you are looking for (disclaimer - I'm the author). It allows you to set up an arbitrary network of workers, each of which runs a process pool (using Coro::ProcessPool).
Creating a task is pretty simple:
use Argon::Client;
my $client = Argon::Client->new(host => "somehost", port => 8000);
my $result = $client->queue(sub {
use My::Work::Module qw(do_work);
my $task_id = shift;
do_work($task_id);
});

The GRID module on CPAN is designed for working with distributed computing.
https://metacpan.org/pod/distribution/GRID-Machine/lib/GRID/Machine/perlparintro.pod

Related

what are the main differences between TORQUE, HTCondor and Apache Mesos

http://www.adaptivecomputing.com/products/open-source/torque/
https://research.cs.wisc.edu/htcondor/
I am looking for a program to perform distributed computing (no parallel computing needed though) which has:
a scheduler
a queue management (FIFO, or preferably something more advanced)
a good statistics report
ability to run on a heterogeneous cluster (a set of machines with different characteristics such as cpu and memory)
and very important: a good responsivness (a few seconds maximum between the trigger of the task and the actual start of the execution: I have heard that this may be tricky to achieve with HTCondor and TORQUE? What about Apache Mesos?)
There is a quite large wikipedia page with comparisons, but you will hardly find large differences. My guess would be that most things could theoretically be done in either framework. The things you list all depend on perspective (people e.g. commonly write their own sophisticated statistics from HTCondor logs). Regarding responsiveness: HTCondor works fine to schedule interactive notebooks if there are enough ressources for the workers to pick up the job. Few seconds is often no problem, but there are hardly guarantees. These are High Throughput Systems, but not low-latency systems. You should preallocate workers and scale them up and down if you care for latency (here supports for other frameworks on top helps much more than native latency).
I try my best to highlight the main foci of each Project from my perspective, that are important for a practical decision:
Target audience
Mesos:
PaaS/IaaS targeted to run other schedulers (you can run Torque on top of Mesos)
particularly interop with big data frameworks such as Spark & Kafka
vs.
Both HTCondor & Torque:
fair-share batch processing particularly in scientific clusters (High Throughput Computing)
Eco-system
Mesos:
Apache open source project with community
vs.
HTCondor:
Open Source maintained by UW-Madison with classical user mailing-list
vs.
TORQUE:
Proprietary, Commercial support
Ease of use
(partially this is statistics, but more the dashboard style)
Mesos & TORQUE:
Web UI
commonly integrations with other frameworks available (for TORQUE look for PBS)
HTCondor:
new, developing REST and python interaces but no common GUI
lagging behind a tiny bit in framework support (R batchtools, lately is has had dask support)

Why is Akka good for scaling "up" and "out"?

If you Google "what does Akka do", the typical sales pitches you get is that it helps your program scale "up" and/or scale "out". But just like the buzzword "cloud" does nothing to explain the virtualization technologies that comprise a cloud service, I see "scale up/out" as equally-vague buzzwords that probably don't do Akka any real justice.
So let's say I've got a batch processing system full of 100 different types of tasks. Task 1 - 100 are kicking off all day long, doing their thing, whatever it is that they do. How exactly might Akka help me batch system scale "up"? How might it help my system scale "out"?
It scales "out" because it allows you to design and organize cluster of servers. Being message-passing-based, it is pretty much a one-to-one representation of the actual world (machines connected via the network and sending messages to each other). No magic here, it's just that the paradigm of the framework makes it easier to reason about your infrastructure.
It scales "up" because if you buy better hardware it will transparently take advantage of the newly added cores/cpus without you having to change anything.
(When it comes to the Typesafe stack, get used to the buzzword! :) )
Edit after first comment:
You could organize your cluster the way you want :)
Dividing by type/responsibility seems like a good option yes. You could have VM1 with Task1Actor instances, VM2 with Task2Actor instances and if you notice that task 1 is the bottleneck start VM1-bis to add more instances for example.
Since Akka abstracts the whole process of sending/receiving message you can have several JVMs on the same machine, several VMs on the same physical machine, several actual machines, several actual machines with several VMs with several JVMs. You get the idea.
For the Typesafe stack: http://typesafe.com/platform

Distributing work to multiple cores: Hadoop or Scala's parallel collections?

What is the better way of making full use of multiple cores for parallel processing in a Scala/Hadoop system?
Let's say I need to process 100 million documents. Documents are not very large, but processing them is computationally intensive. If I have a Hadoop cluster with 100 machines with 10 cores each, I could either:
A) send 1000 documents to each machine and let Hadoop start a map on each of the 10 cores (or as many as are available)
or
B) send 1000 documents to each machine (still using Hadoop) and use Scala's parallel collections to make full use of the multiple cores. (I would put all documents in a parallel collection, and then call map on the collection). In other words, use Hadoop for distribution at cluster level, and use parallel collections to manage the distribution to cores within each machine.
Hadoop is going to offer a lot more than just parallelization. It offers a platform to distribute work, a scheduler for handling concurrent jobs, a distributed filesystem, the ability to perform a distributed reduce, and fault tolerance. That said, it is a complicated system and can sometimes be difficult to work with.
If you plan to have multiple users submitting many different jobs, Hadoop is the way to go (out of the two options). However, if you are devoting a cluster to be always be processing documents through the same function, you could, without too much trouble, develop a system with Scala parallel collections and actors for inter-machine communication. The Scala solution would give you more control, the system could respond in real time, and you wouldn't have to deal with a lot of Hadoop configuration that doesn't pertain to your task.
If you need to run varied jobs over large amounts of data (larger than would fit on a single node), then use Hadoop. I can give you more information if you describe your requirements in more detail.
Update: one million is a fairly small number. You might want to do some calculations and see how long it would take on a single machine with parallel collections. The advantage here is that the development time is minimal!
The answer depends on the following question - does your Scala code capable to fully utilize all cores available. Probabbly if you have good intrinsic synchronization between parts of the document to be processed or some other way to parralelyze algorithm without lock contention - then the "B"" is the way. If so - configure one mapper per node and let your mapper to utilize cores in a best way.
If your gain from the parralelization is not that good, and adding more threads (cores) to the processing does not improve performance in a linear way - then the "A" can be better way. Efficiency of "A" also depends on the size of your RAM - you will need enough ram for 10 mappers per node.
I can suspect that ideal solution can be somewhere in between. So my suggestion is to develop mapper which takes number of threads used as a parameter and then do a few tests increasing number of threads per mapper and decreasing number of mappers per node.
Hadoop is not very good for processing a lot of small files, but for processing a small amount of very large files. Is there any way you can merge the files before processing them, or are they all totally different? Hadoop takes care of distribution and parallelism itself, so there is no need to explicitly send X docs to Y machines. And also i don't think you should use hadoop only as a distribution mechanism, that is not what it's made for. You should either use a real map/reduce, or build your own system for whatever you are trying to do, but not try to bend hadoop to your will.

What is the fastest way to read 10 GB file from the disk?

We need to read and count different types of messages/run
some statistics on a 10 GB text file, e.g a FIX engine
log. We use Linux, 32-bit, 4 CPUs, Intel, coding in Perl but
the language doesn't really matter.
I have found some interesting tips in Tim Bray's
WideFinder project. However, we've found that using memory mapping
is inherently limited by the 32 bit architecture.
We tried using multiple processes, which seems to work
faster if we process the file in parallel using 4 processes
on 4 CPUs. Adding multi-threading slows it down, maybe
because of the cost of context switching. We tried changing
the size of thread pool, but that is still slower than
simple multi-process version.
The memory mapping part is not very stable, sometimes it
takes 80 sec and sometimes 7 sec on a 2 GB file, maybe from
page faults or something related to virtual memory usage.
Anyway, Mmap cannot scale beyond 4 GB on a 32 bit
architecture.
We tried Perl's IPC::Mmap and Sys::Mmap. Looked
into Map-Reduce as well, but the problem is really I/O
bound, the processing itself is sufficiently fast.
So we decided to try optimize the basic I/O by tuning
buffering size, type, etc.
Can anyone who is aware of an existing project where this
problem was efficiently solved in any language/platform
point to a useful link or suggest a direction?
Most of the time you will be I/O bound not CPU bound, thus just read this file through normal Perl I/O and process it in single thread. Unless you prove that you can do more I/O than your single CPU work, don't waste your time with anything more. Anyway, you should ask: Why on Earth is this in one huge file? Why on Earth don't they split it in a reasonable way when they generate it? It would be magnitude more worth work. Then you can put it in separate I/O channels and use more CPU's (if you don't use some sort of RAID 0 or NAS or ...).
Measure, don't assume. Don't forget to flush caches before each test. Remember that serialized I/O is a magnitude faster than random.
This all depends on what kind of preprocessing you can do and and when.
On some of systems we have, we gzip such large text files, reducing them to 1/5 to 1/7 of their original size. Part of what makes this possible is we don't need to process these files
until hours after they're created, and at creation time we don't really have any other load on the machines.
Processing them is done more or less in the fashion of zcat thosefiles | ourprocessing.(well it's done over unix sockets though with a custom made zcat). It trades cpu time for disk i/o time, and for our system that has been well worth it. There's ofcourse a lot of variables that can make this a very poor design for a particular system.
Perhaps you've already read this forum thread, but if not:
http://www.perlmonks.org/?node_id=512221
It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.
Oh, is it possible to process the file from a RAID array? If you have several mirrored disks, then the read speed can be improved. Competition for disk resources may be what makes your multiple-threads attempt not work.
Best of luck.
I wish I knew more about the content of your file, but not knowing other than that it is text, this sounds like an excellent MapReduce kind of problem.
PS, the fastest read of any file is a linear read. cat file > /dev/null should be the speed that the file can be read.
Have you thought of streaming the file and filtering out to a secondary file any interesting results? (Repeat until you have a manageble size file).
Basically need to "Divide and conquer", if you have a network of computers, then copy the 10G file to as many client PCs as possible, get each client PC to read an offset of the file. For added bonus, get EACH pc to implement multi threading in addition to distributed reading.
Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.
Realize that manipulating a 10 Gb file, transferring it across the (even if local) network, exploring complicated solutions etc all take time.
I have a co-worker who sped up his FIX reading by going to 64-bit Linux. If it's something worthwhile, drop a little cash to get some fancier hardware.
hmmm, but what's wrong with the read() command in C? Usually has a 2GB limit,
so just call it 5 times in sequence. That should be fairly fast.
If you are I/O bound and your file is on a single disk, then there isn't much to do. A straightforward single-threaded linear scan across the whole file is the fastest way to get the data off of the disk. Using large buffer sizes might help a bit.
If you can convince the writer of the file to stripe it across multiple disks / machines, then you could think about multithreading the reader (one thread per read head, each thread reading the data from a single stripe).
Since you said platform and language doesn't matter...
If you want a stable performance that is as fast as the source medium allows for, the only way I am aware that this can be done on Windows is by overlapped non-OS-buffered aligned sequential reads. You can probably get to some GB/s with two or three buffers, beyond that, at some point you need a ring buffer (one writer, 1+ readers) to avoid any copying. The exact implementation depends on the driver/APIs. If there's any memory copying going on the thread (both in kernel and usermode) dealing with the IO, obviously the larger buffer is to copy, the more time is wasted on that rather than doing the IO. So the optimal buffer size depends on the firmware and driver. On windows good values to try are multiples of 32 KB for disk IO. Windows file buffering, memory mapping and all that stuff adds overhead. Only good if doing either (or both) multiple reads of same data in random access manner. So for reading a large file sequentially a single time, you don't want the OS to buffer anything or do any memcpy's. If using C# there's also penalties for calling into the OS due to marshaling, so the interop code may need bit of optimization unless you use C++/CLI.
Some people prefer throwing hardware at problems but if you have more time than money, in some scenarios it's possible to optimize things to perform 100-1000x better on a single consumer level computer than a 1000 enterprise priced computers. The reason is that if the processing is also latency sensitive, going beyond using two cores is probably adding latency. This is why drivers can push gigabytes/s while enterprise software is ends stuck at megabytes/s by the time it's all done. Whatever reporting, business logic and such the enterprise software do can probably also be done at gigabytes/s on two core consumer CPU, if written like you were back in the 80's writing a game. The most famous example I've heard of approaching their entire business logic in this manner is the LMAX forex exchange, which published some of their ring buffer based code, which was said to be inspired by network card drivers.
Forgetting all the theory, if you are happy with < 1 GB/s, one possible starting point on Windows I've found is looking at readfile source from winimage, unless you want to dig into sdk/driver samples. It may need some source code fixes to calculate perf correctly at SSD speeds. Experiment with buffer sizes also.
The switches /h multi-threaded and /o overlapped (completion port) IO with optimal buffer size (try 32,64,128 KB etc) using no windows file buffering in my experience give best perf when reading from SSD (cold data) while simultaneously processing (use the /a for Adler processing as otherwise it's too CPU-bound).
I seem to recall a project in which we were reading big files, Our implementation used multithreading - basically n * worker_threads were starting at incrementing offsets of the file (0, chunk_size, 2xchunk_size, 3x chunk_size ... n-1x chunk_size) and was reading smaller chunks of information. I can't exactly recall our reasoning for this as someone else was desining the whole thing - the workers weren't the only thing to it, but that's roughly how we did it.
Hope it helps
Its not stated in the problem that sequence matters really or not. So,
divide the file into equal parts say 1GB each, and since you are using multiple CPUs, then multiple threads wont be a problem, so read each file using separate thread, and use RAM of capacity > 10 GB, then all your contents would be stored in RAM read by multiple threads.

Benefits of multiple memcached instances

Is there any difference between having 4 .5GB memcache servers running or one 2GB instance?
Does running multiple instances offer any benifits?
If one instance fails, you're still get advantages of using the cache. This is especially true if you are using the Consistenthashing that will bring the same data to the same instance, rather than spreading new reads/writes among the machines that are still up.
You may also elect to run servers on 32 bit operating systems, that cannot address more than around 3GB of memory.
Check the FAQ: http://www.socialtext.net/memcached/ and http://www.danga.com/memcached/
High availability is nice, and memcached will automatically distribute your cache across the 4 servers. If one of those servers dies for some reason, you can handle that error by either just continuing as if the cache was blank, redirecting to a different server, or any sort of custom error handling you want. If your 1x 2gb server dies, then your options are pretty limited.
The important thing to remember is that you do not have 4 copies of your cache, it is 1 cache, split amongst the 4 servers.
The only downside is that it's easier to run out of 4x .5 than it is to run out of 1x 2gb memory.
I would also add that theoretically, in case of several machines, it might save you some performance, as if you have a lot of frontends doing a lot of heavy reads, it's much better to split them into different machines: you know, network capabilities and processing power of one machine can become an upper bound for you.
This advantage is highly dependent on memcache utilization, however (sometimes it might be ways faster to fetch everything from one machine).