What are the main differences between the TORQUE, HTCondor and Apache Mesos schedulers?

http://www.adaptivecomputing.com/products/open-source/torque/
https://research.cs.wisc.edu/htcondor/
I am looking for a program to perform distributed computing (no parallel computing needed though) which has:
a scheduler
queue management (FIFO, or preferably something more advanced)
good statistics reporting
the ability to run on a heterogeneous cluster (a set of machines with different characteristics such as CPU and memory)
and, very importantly, good responsiveness (a few seconds at most between the triggering of a task and the actual start of execution; I have heard this may be tricky to achieve with HTCondor and TORQUE. What about Apache Mesos?)

There is a fairly large Wikipedia page with comparisons, but you will hardly find large differences: my guess is that most things could in theory be done in either framework. The things you list all depend on perspective (people commonly write their own sophisticated statistics from HTCondor logs, for example). Regarding responsiveness: HTCondor works fine for scheduling interactive notebooks if there are enough resources for the workers to pick up the job. A few seconds is often not a problem, but there are hardly any guarantees; these are high-throughput systems, not low-latency systems. If you care about latency, you should preallocate workers and scale them up and down (here, support for running other frameworks on top helps much more than native latency).
I'll try my best to highlight the main focus of each project from my perspective, as far as it matters for a practical decision:
Target audience
Mesos:
PaaS/IaaS targeted at running other schedulers (you can run TORQUE on top of Mesos)
particularly interop with big data frameworks such as Spark & Kafka
vs.
Both HTCondor & Torque:
fair-share batch processing particularly in scientific clusters (High Throughput Computing)
Eco-system
Mesos:
Apache open source project with community
vs.
HTCondor:
open source, maintained by UW-Madison, with a classic user mailing list
vs.
TORQUE:
proprietary, with commercial support
Ease of use
(this is partly about statistics, but more about dashboard style)
Mesos & TORQUE:
Web UI
integrations with other frameworks are commonly available (for TORQUE, look for PBS)
HTCondor:
newer, still-developing REST and Python interfaces, but no common GUI
lagging behind a tiny bit in framework support (R batchtools; lately it has gained Dask support)


What is meant by Distributed System?

I am reading about distributed systems and getting confused about what it really means.
At a high level, I understand it to mean a set of different machines that work together to achieve a single goal.
But this definition seems too broad and loose. I would like to give some points to explain the reasons for my confusion:
I see a lot of people referring to microservices as a distributed system, where functionalities like Order, Payment, etc. are distributed across different services, whereas others mean multiple instances of the Order service serving customers and possibly using some consensus algorithm to agree on shared state (e.g. the current inventory level).
When talking about distributed databases, I see a lot of people talk about different nodes used to store/serve a part of the data, e.g. records with primary keys from 'A-C' on the first node, 'D-F' on the second node, etc. At a high level this looks like sharding.
When talking about distributed rate limiting, some refer to multiple application nodes (so-called distributed application nodes) using a single rate limiter, while others mean that the rate limiter itself has multiple nodes with a shared cache (like Redis).
It feels like people use "distributed systems" to refer to microservices architecture, horizontal scaling, partitioning (sharding), and anything in between.
I am reading about distributed systems and getting confused about what it really means?
As commented by @ReinhardMänner, a good general definition of a distributed system (DS) can be found at https://en.wikipedia.org/wiki/Distributed_computing:
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal.
Anything that fits the above definition can be referred to as a DS. All the mentioned examples, such as microservices, distributed databases, etc., are specific applications of the concept or implementation details.
The statement "X is a distributed system" does not inherently imply any of those details; they must be specified explicitly for each DS. For example, a distributed database does not necessarily mean that sharding is used.
I'll also draw from Wikipedia, but I think that the second part of the quote is more important:
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal. Three significant challenges of distributed systems are: maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components. When a component of one system fails, the entire system does not fail.
A system that constantly has to overcome these problems, even if all services are on the same node, or if they communicate via pipes/streams/files, is effectively a distributed system.
Now, trying to clear up your confusion:
Horizontal scaling existed with monoliths before microservices. Horizontal scaling is basically achieved by dividing compute resources.
Dividing compute requires dealing with synchronization, node failure, and multiple clocks, but that is still cheaper than scaling vertically. That's where consensus comes in: you might implement it in the application, use a dedicated service such as ZooKeeper, or abuse a DB table for that purpose (a sketch of the last option is below).
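To make the "abusing a DB table" option concrete, here is a minimal, hypothetical sketch of lease-based leader election over a shared table. SQLite only stands in for a shared database server here, and the table and column names are made up:

    # Hypothetical lease-based coordination over a shared table. SQLite is
    # only a stand-in; in practice the table would live in a database that
    # all nodes can reach.
    import sqlite3, time, uuid

    NODE_ID = str(uuid.uuid4())
    LEASE_SECONDS = 10

    db = sqlite3.connect("coordination.db", isolation_level=None)  # autocommit
    db.execute("""CREATE TABLE IF NOT EXISTS leader_lease (
                      name TEXT PRIMARY KEY, holder TEXT, expires_at REAL)""")
    db.execute("INSERT OR IGNORE INTO leader_lease VALUES ('main', '', 0)")

    def try_acquire_lease():
        now = time.time()
        # Atomic: succeeds only if we already hold the lease or it expired.
        cur = db.execute(
            """UPDATE leader_lease SET holder = ?, expires_at = ?
               WHERE name = 'main' AND (holder = ? OR expires_at < ?)""",
            (NODE_ID, now + LEASE_SECONDS, NODE_ID, now))
        return cur.rowcount == 1

    if try_acquire_lease():
        print("this node is the leader until the lease expires")

The trick is that the UPDATE either claims the row or leaves it untouched, so whichever node's statement commits first wins; everyone else sees rowcount 0.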
Monoliths present 2 problems that microservices solve: address-space dependency (i.e. someone's component may crash the whole process and thus your component) and long startup times.
While microservices solve these problems, these problems aren't what makes them into a "distributed system". It doesn't matter if the different processes/nodes run the same software (monolith) or not (microservices), it matters that they are different processes that can't easily communicate directly (e.g. via function calls that promise not to fail).
In databases, scaling horizontally is also cheaper than scaling vertically. The two components of horizontal DB scaling are division of compute (effectively, a distributed system) and division of storage, i.e. sharding, as you mentioned (e.g. A-C, D-F, etc.).
Sharding of storage does not define distributed systems - a single compute node can handle multiple storage nodes. It's just that it's much more useful for a database that divides compute to also shard its storage, so you often see them together.
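As a toy illustration of the range-based sharding described above (the A-C / D-F split is taken from the question; node names are made up), the routing logic is just a lookup from key range to storage node, and it says nothing about whether compute is distributed:

    # Toy range-based shard routing: keys starting with A-C go to node1,
    # D-F to node2. Purely illustrative; real systems keep this metadata
    # inside the database itself.
    SHARD_RANGES = [(("A", "C"), "node1"),
                    (("D", "F"), "node2")]

    def shard_for(key):
        first = key[0].upper()
        for (lo, hi), node in SHARD_RANGES:
            if lo <= first <= hi:
                return node
        raise KeyError("no shard covers key %r" % key)

    print(shard_for("Bob"))  # node1
    print(shard_for("Eve"))  # node2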
Distributed rate limiting falls under "maintaining concurrency of components". If every node does its own rate limiting, and they don't communicate, then the system-wide rate cannot be enforced. If they wait for each other to coordinate enforcement, they aren't concurrent.
Usually the solution is "approximate" rate limiting where components synchronize "occasionally".
If your components can't easily (= without latency) agree on a global rate limit, that's usually because they can't easily agree on a global anything. In that case, you're effectively dealing with a distributed system, even if all the components are just threads in the same process.
(That could happen, e.g., if you plan to scale out but haven't done so yet, so you don't allow your threads to communicate directly.)
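For concreteness, here is a minimal sketch of the "approximate" approach, assuming a shared Redis instance; the key scheme, limits, and batch size are all made up. Each node reserves a batch of permits and spends them locally, so nodes only synchronize occasionally:

    # Approximate distributed rate limiting with a shared Redis counter.
    # Each node reserves LOCAL_BATCH permits at a time and spends them
    # locally, trading a little precision for far fewer round trips.
    import time
    import redis  # assumes the redis-py package and a reachable server

    r = redis.Redis(host="localhost", port=6379)

    GLOBAL_LIMIT = 1000     # total requests allowed per window, all nodes
    WINDOW_SECONDS = 60
    LOCAL_BATCH = 50        # permits a node reserves per round trip

    local_budget = 0

    def allow_request():
        global local_budget
        if local_budget == 0:
            window = int(time.time() // WINDOW_SECONDS)
            key = "ratelimit:%d" % window
            used = r.incrby(key, LOCAL_BATCH)   # atomic reservation
            r.expire(key, WINDOW_SECONDS * 2)   # let old windows die off
            if used > GLOBAL_LIMIT:
                return False  # global budget exhausted for this window
            local_budget = LOCAL_BATCH
        local_budget -= 1
        return True

The enforced limit is only approximate: a node may deny a request while still-unreserved permits exist elsewhere, which is exactly the trade-off described above.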

Real-time processing: Storm / Flink vs standard application (Java, C#...)

I am wondering about the choice of implementing an application processing events coming from Kafka; I have in mind two architecture patterns:
an application developed using the Apache Storm or Apache Flink framework that would process events consumed from Kafka
a Java application (or python, C#...), deployed X times (scalable depending on traffic), which would process events coming from Kafka
I find it difficult to see which of the two scenarios is more attractive.
Could someone help me on this topic?
It's hard to give definitive advice with so little information available, so I'll keep my response vague until you provide more specific information:
Choosing a processing framework over a native implementation gives you the following advantages:
Parallel processing with (in theory) infinite scalability: If you ever expect that you cannot process all events in a single thread in a timely manner, you first need to scale up (more threads) and eventually scale out (more machines). A framework takes care of all synchronization between threads and machines, so you just need to write sequential code glued together with some high-level primitives (similar to LINQ in C#); see the sketch after this list.
Fault tolerance: What happens when your code screws up (some edge case not implemented)? When you run out of resources? When the network (to Kafka or other machines) temporarily breaks? A framework takes care of all these nasty little details.
In case of failure, when you restart the application, most frameworks give you some form of exactly-once processing: How do you avoid losing data? How do you avoid duplicates when reprocessing old data?
Managed state: If your application needs to remember things for a certain time (calculating sums/averages or joining data), how do you ensure that the state is kept in sync with the data in case of failure?
Advanced features: time triggers, complex event processing (= pattern matching on events), writing to different sinks (Kafka for low latency, S3 for batch processing)
Flexibility of storage: If you want to try out a different storage system, it's much easier to change the source/sink in an application written in a framework.
Integration with deployment platforms: If you want to scale to several machines, it's usually much easier to scale on a platform that already offers the relevant integration (at the time of writing, that is mostly Kubernetes). But all frameworks also support simple local setups where you just scale up on one (bigger) machine.
Low-level optimizations: When using newer engines with higher abstractions, the framework may generate code that is much more efficient than what you could implement yourself (with a specific memory layout or serialized data processing).
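To illustrate the "sequential code glued together with high-level primitives" point, here is a minimal sketch using PyFlink's DataStream API. An in-memory collection stands in for the Kafka source to keep it self-contained, and the event shape and numbers are made up; the framework decides how to parallelize each step across threads and machines:

    # Sequential-looking pipeline; Flink handles partitioning, threading
    # and fault tolerance. Requires the apache-flink Python package.
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()

    events = env.from_collection([("user1", 3), ("user2", 5), ("user1", 2)])

    (events
        .map(lambda e: (e[0], e[1] * 2))           # per-event transform
        .key_by(lambda e: e[0])                    # repartition by key
        .reduce(lambda a, b: (a[0], a[1] + b[1]))  # running sum per key
        .print())                                  # sink (stdout here)

    env.execute("event-processing-sketch")

In the native-application scenario, the key_by step alone would require you to hand-roll repartitioning between your X deployed instances.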
The big downsides are usually:
Complexity of the framework: you need to understand how the framework works from a user's perspective. However, you usually save time by not going into the details of writing a custom consumer/producer, so it's not as bad as it initially seems.
Flexibility in code: you cannot write arbitrary code anymore. Since the framework handles parallelism for you, you need to think in terms of chunks of data and adjust your algorithms accordingly. Standard SQL operations are usually directly supported though in one form or another.
Less control over resource usage: Since the platform schedules the task across machines, you may end up with unfortunate assignments, and the platform may give you too few options to fix them. Note that most applications are intrinsically prone to bad resource utilization because of data skew and suboptimal algorithms anyway.

Perl Distributed parallel computing

I would like to know if there are any Perl modules available that enable distributed parallel computation, similar to Apache Hadoop.
Example,
A Perl script to be executed on many machines in parallel when submitted to a client node.
I'm the author of the Many-core Engine for Perl.
During the next several weekends, I will take MCE for a spin with Gearman::XS. MCE is good at maximizing available cores on a given node. Gearman is good at job distribution and includes a lot of features, such as load balancing. Combining the two was my idea for scaling MCE horizontally across many nodes. :) I did not share this news with anybody until just now.
Why the two modules are a good fit (in my humble opinion):
For distribution, one needs some sort of chunking engine. MCE is a chunking engine -- so breaking up input is natural to MCE. Essentially, MCE can be used on both sides: on the job submission host as well as on the worker node, to maximize available cores.
For worker nodes, MCE follows a bank-queuing model when processing input data. This helps guarantee that all CPUs remain busy from the start of the job until the very end: as workers begin to idle down, the remaining workers are processing their very last chunks.
One's imagination is the limit here -- there are so many possibilities with these 2 modules working together. When writing MCE, I first focused on the node side. Job distribution was obviously next, so I did a search and came across Gearman::XS. The 2 modules can chunk happily together :) -- bigger chunks on the job distribution side, smaller chunks once on the node. All the networking stuff is handled by Gearman.
Basically, there's no need for me to write the job distribution aspect when Gearman::XS is already quite nice. This has been my plan. I will write about Gearman::XS + MCE soon.
BTW: Folks can do similar things with GRID-Machine + MCE, I imagine. MCE's beauty is in maximizing all available cores on any given node.
Another nice thing about MCE: one may not want 200 nodes * 16 workers all reading/writing from/to the NFS server, for example -- that would impact the NFS server greatly. (BTW: RHEL 6.4 will include pNFS, parallel NFS.) With MCE, workers can call the "do" method to serialize reads/writes against NFS. So instead of 200 * 16 = 3200 workers hitting NFS, there are at most 200 requests against the NFS server at any given time (1 per physical node).
MCE was written to handle many scenarios gracefully. I need to add more wikis to MCE's home page at code.google.com. In addition, MCE eats really big log files for breakfast :) Check out egrep.pl and wc.pl under the examples dir. It even beats the Wide Finder project with sequential IO (powerful slurp IO among many workers).
Check out the images included with the MCE distribution. Oh, do not forget to check out the main Gearman site as well.
What's left after this? Hmm, the web piece. One idea which comes to mind is to use Mojo. There are many options; this is just one:
Gearman::XS + MCE + Mojolicious
Again, one can use GRID-Machine instead of Gearman::XS if you want to communicate through SSH.
Anyway, that was my plan: use an already-available job distribution module. For MCE, my focus was on maximizing performance on a single node -- including chunking, serializing, the bank-queuing model, user tasks (allowing for many roles), number sequencing among workers, and sequential slurp IO.
-- mario
You might look into something as simple as a message queue like ZeroMQ. I'm sure a CPAN search could turn up some other suggestions.
Recently there has been some talk of the Many-core Engine (MCE) module, which you might want to investigate. I don't know for sure that it lets you parallelize off the host computer, but it seems like it wouldn't be a big step given its stated purpose.
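To make the ZeroMQ suggestion above concrete, here is a minimal sketch of PUSH/PULL task distribution, written with pyzmq for brevity (Perl bindings on CPAN, e.g. ZMQ::FFI, expose the same socket types; host, port, and task format are made up):

    # ZeroMQ PUSH/PULL task distribution: one PUSH socket fans tasks out
    # to however many PULL workers connect, with fair load balancing.
    import zmq

    def submit_tasks(task_ids):
        ctx = zmq.Context()
        sender = ctx.socket(zmq.PUSH)
        sender.bind("tcp://*:5557")
        for tid in task_ids:
            sender.send_json({"task_id": tid})  # delivered to one worker

    def worker(do_work):
        ctx = zmq.Context()
        receiver = ctx.socket(zmq.PULL)
        receiver.connect("tcp://submit-host:5557")
        while True:
            task = receiver.recv_json()
            do_work(task["task_id"])  # do_work is your processing function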
Argon may provide what you are looking for (disclaimer - I'm the author). It allows you to set up an arbitrary network of workers, each of which runs a process pool (using Coro::ProcessPool).
Creating a task is pretty simple:
    use Argon::Client;

    my $client = Argon::Client->new(host => "somehost", port => 8000);

    my $result = $client->queue(sub {
        use My::Work::Module qw(do_work);
        my $task_id = shift;
        do_work($task_id);
    });
The GRID module on CPAN is designed for working with distributed computing.
https://metacpan.org/pod/distribution/GRID-Machine/lib/GRID/Machine/perlparintro.pod

Distributing work to multiple cores: Hadoop or Scala's parallel collections?

What is the better way of making full use of multiple cores for parallel processing in a Scala/Hadoop system?
Let's say I need to process 100 million documents. Documents are not very large, but processing them is computationally intensive. If I have a Hadoop cluster with 100 machines with 10 cores each, I could either:
A) send 1000 documents to each machine and let Hadoop start a map on each of the 10 cores (or as many as are available)
or
B) send 1000 documents to each machine (still using Hadoop) and use Scala's parallel collections to make full use of the multiple cores. (I would put all documents in a parallel collection, and then call map on the collection). In other words, use Hadoop for distribution at cluster level, and use parallel collections to manage the distribution to cores within each machine.
Hadoop is going to offer a lot more than just parallelization. It offers a platform to distribute work, a scheduler for handling concurrent jobs, a distributed filesystem, the ability to perform a distributed reduce, and fault tolerance. That said, it is a complicated system and can sometimes be difficult to work with.
If you plan to have multiple users submitting many different jobs, Hadoop is the way to go (out of the two options). However, if you are devoting a cluster to always be processing documents through the same function, you could, without too much trouble, develop a system with Scala parallel collections and actors for inter-machine communication. The Scala solution would give you more control, the system could respond in real time, and you wouldn't have to deal with a lot of Hadoop configuration that doesn't pertain to your task.
If you need to run varied jobs over large amounts of data (larger than would fit on a single node), then use Hadoop. I can give you more information if you describe your requirements in more detail.
Update: one million is a fairly small number. You might want to do some calculations and see how long it would take on a single machine with parallel collections. The advantage here is that the development time is minimal!
The answer depends on the following question: is your Scala code capable of fully utilizing all the cores available? If you have good intrinsic synchronization between the parts of a document to be processed, or some other way to parallelize the algorithm without lock contention, then "B" is the way to go. If so, configure one mapper per node and let your mapper utilize the cores as best it can.
If your gain from parallelization is not that good, and adding more threads (cores) does not improve performance linearly, then "A" may be the better way. The efficiency of "A" also depends on the size of your RAM: you will need enough RAM for 10 mappers per node.
I suspect the ideal solution is somewhere in between. So my suggestion is to develop a mapper that takes the number of threads to use as a parameter, then run a few tests, increasing the number of threads per mapper while decreasing the number of mappers per node; a sketch of such a tunable mapper follows.
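Here is a minimal sketch of that tunable per-node mapper, using Python's multiprocessing for the worker pool (process_document is a hypothetical stand-in for the real CPU-bound work):

    # The worker count is a parameter, so you can benchmark e.g. 1 mapper x
    # 10 processes against 10 mappers x 1 process on the same hardware.
    import multiprocessing as mp

    def process_document(doc):
        return len(doc)  # placeholder for the expensive computation

    def run_mapper(docs, num_workers):
        with mp.Pool(processes=num_workers) as pool:
            return pool.map(process_document, docs, chunksize=64)

    if __name__ == "__main__":
        docs = ["document-%d" % i for i in range(10000)]
        for workers in (1, 2, 4, 8, 10):
            results = run_mapper(docs, workers)  # time each configuration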
Hadoop is not very good at processing a lot of small files; it is better at processing a small number of very large files. Is there any way you can merge the files before processing them, or are they all totally different? Hadoop takes care of distribution and parallelism itself, so there is no need to explicitly send X docs to Y machines. I also don't think you should use Hadoop only as a distribution mechanism; that is not what it's made for. You should either use a real map/reduce, or build your own system for whatever you are trying to do, but don't try to bend Hadoop to your will.

Can a shared ready queue limit the scalability of a multiprocessor system?

Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put, most definitely. Read on for some discussion.
Tuning a service is an art form and requires benchmarking (and the space of concepts you need to benchmark is huge). I believe it depends on factors such as the following (this list is not exhaustive):
how much time an item picked up from the ready queue takes to process
how many worker threads there are
how many producers there are, and how often they produce
what type of wait primitive you are using: spin-locks or kernel waits (the latter being slower)
So, if items are produced often, the number of threads is large, and the per-item processing time is low, the data structure could be locked for large windows, causing thrashing.
Other factors include the data structure used and how long it is locked for. E.g., if you use a linked list to manage such a queue, the add and remove operations take constant time; a priority queue (heap) takes a few more operations on average when items are added.
If your system is for business processing, you could take this question out of the picture by just using:
a process-based architecture, spawning multiple producer/consumer processes and using the file system for communication, or
a non-preemptive, cooperative threading language or runtime such as Stackless Python, Lua, or Erlang.
Also note: synchronization primitives cause inter-processor cache-coherence traffic, which is expensive, and should therefore be used sparingly.
The discussion could go on to fill a Ph.D dissertation :D
A per-CPU ready queue is a natural choice for this data structure, because most operating systems try to keep a process on the same CPU (for many reasons you can google, cache affinity chief among them). What does that imply? If a thread is ready and another CPU is idling, the OS will not quickly migrate the thread to the other CPU; load balancing only kicks in over the long run.
Had the situation been different, that is, if keeping thread-CPU affinity were not a design goal and thread migration were frequent, then keeping separate per-CPU run queues would be costly.
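To illustrate the per-CPU layout, here is a minimal sketch of per-worker ready queues with occasional work stealing (all names and sizes are made up; this shows the data-structure shape, not a real scheduler). The fast path takes only the local lock; only a worker that runs dry touches another worker's lock:

    # Per-worker ready queues with work stealing. Contention arises only
    # when a queue runs dry and its worker steals from a random victim.
    import collections, random, threading

    NUM_WORKERS = 4
    queues = [collections.deque() for _ in range(NUM_WORKERS)]
    locks = [threading.Lock() for _ in range(NUM_WORKERS)]

    def get_task(worker_id):
        with locks[worker_id]:                  # fast path: local queue
            if queues[worker_id]:
                return queues[worker_id].popleft()
        victim = random.randrange(NUM_WORKERS)  # slow path: steal
        with locks[victim]:
            if queues[victim]:
                return queues[victim].pop()     # steal from the far end
        return None

Compare this with a single shared queue guarded by one lock: there, every get_task from every worker serializes on the same lock, which is exactly the scalability limit the question asks about.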