Multi-Level Serving Tree vs Directed Acyclic Graph - distributed-computing

At "Big Data Concepts, Theories, and Applications", I found a brief decription of MLST and DAG. Still the description doesn't provide an answer to WHY one distributed systems use MLST while other use DAG? What does that mean in terms of performance/latency/multi-user/etc.?
I noticed that DAG is used by distributed computational engines (i.e. Apache Spark, Storm, Tez) while MLST is used by interactional distributed databases (i.e. Google Dremel, Apache Impala, Presto). Is this always true?

Related

Real-time processing: Storm / flink vs standard application (java, c#...)

I am wondering about the choice of implementing an application processing events coming from Kafka, I have in mind two architecture patterns:
an application developed using the Apache Storm or Apache Flink framework that would process events consumed from Kafka
a Java application (or python, C#...), deployed X times (scalable depending on traffic), which would process events coming from Kafka
I find it difficult to see which of the scenarios is the most interesting.
Someone could help me on this topic ?
It's hard to give some definitive advice with so little information available. So I leave my response vague until you provide more specific information:
Choosing a processing framework over a native implementation gives you the following advantages:
Parallel processing with (in theory) infinite scalability: If you ever expect that you cannot process all events in a single thread in a timely manner, you first need to scale up (more threads) and eventually scale out (more machines). A frameworks takes care of all synchronization between threads and machines, so you just need to write sequential code glued together with some high-level primitives (similar to LINQ in C#).
Fault tolerance: What happens when your code screws up (some edge case not implemented)? When you run out of resources? When network (to Kinesis or other machines) temporarily breaks? A framework takes care of all these nasty little details.
In case of failure, when you restart application, most frameworks give you some form of exactly once processing: How do you avoid losing data? How do you avoid duplicates when reprocessing old data?
Managed state: If your application needs to remember things for a certain time (calculating sums/average or joining data), how do you ensure that the state is kept in sync with data in case of failure?
Advanced features: time triggers, complex event processing (=pattern matching on events), writing to different sinks (Kafka for low latency, s3 for batch processing)
Flexibility of storage: if you want to try out a different storage system, it's much easier to change source/sink in an application writing in a framework.
Integration in deployment platforms: If you want to scale to several machines, it's usually much easier to scale a platform that already offers related integration (at the time of writing that should be mostly Kubernetes). But all frameworks also support simple local setups where you just scale-up on one (bigger) machine.
Low-level optimizations: When using new engines with higher abstractions, it's possible that the frameworks generate code that is much more efficient than what you can implement yourself (with specific memory layout or serialized data processing).
The big downsides are usually:
Complexity of the framework: you need to understand how the framework works from a user's perspective. However, you usually save time by not going into the details of writing a custom consumer/producer, so it's not as bad as it initially seems.
Flexibility in code: you cannot write arbitrary code anymore. Since the framework handles parallelism for you, you need to think in terms of chunks of data and adjust your algorithms accordingly. Standard SQL operations are usually directly supported though in one form or another.
Less control over resource usage: since the platform schedules the task across machines, you may end up with unfortunate assignments and the platform may give you too little options to fix it. Note that most applications are more intrinsically bound to bad resource utilization because of data skew and suboptimal algorithms though.

Distributed Streaming Applications (Graph and Transaction) In Reallife

I am currently studying distributed stream processing systems, e.g. Storm, Flink and Spark Streaming. I want to implement some applications in these systems and briefly compare them. I wonder if there is any company using these systems for processing the following situations and what's the scale of the data stream.
Graph where a large graph may be distributed to multiple machines and we handles some updates (add or remove vertices or edges) and queries on the graph. So far I can only find some streaming graph algorithms on a single machine.
Transaction where exactly-once message delivery is a must. There exists a Leaderboard Maintenance Benchmark in S-Store (Meehan, John, et al. "S-store: Streaming meets transaction processing." Proceedings of the VLDB Endowment 8.13 (2015): 2134-2145) but I cannot find how they generated the input data.
have a look at yahoo's benchmark for comparing Apache Storm, Apache Spark, Apache Flink and Apache Apex etc

How good are ZooKeeper and Etcd?

Disclaimer: I'm quite new for the etcd project and ZooKeeper project.
I'm recently getting interested in the distributed open source products.
I found they seems to require configuration(coordination?) systems such as ZooKeeper for Presto DB, Hive and Etcd for kubernetes and I think that understanding the role of etcd and ZooKeeper is the first step to understand the distributed systems.
But now, I feel like getting lost... I could not yet understand what is the good and unique points of the etcd and ZooKeeper. They look for me a well-distributed key-value storage or file systems.
Here is the impression that I have for the products. I know the impressions don't reflect the feature of the products. but I don't know what is the remaining feature that I should know.
ZooKeeper: According to the overview page of ZooKeeper, it guarantees the following things.
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
The sequential consistency and atomicity are the unique features which is not supported by most file systems but others are common among other file systems.
Etcd: According to the README of etcd. it focuses on
Simple: curl'able user-facing API (HTTP+JSON)
Secure: optional SSL client cert authentication
Fast: benchmarked 1000s of writes/s per instance
Reliable: properly distributed using Raft
Most of them seems common with Amazon S3 (S3 doesn't support such a fast access.)
I know those products are very good ones because most of the distributed open source products depend on them. but what is the key, unique feature that the distributed open source product choose them?
I think you're confusing the file-system-like interface with an actual file system. The systems you are mentioning are well suited for cluster coordination, in particular ZooKeeper. What they are not designed for is storing large amounts of data like a file system would. You should think of them more as suited for coordinating a file system. That is, one could imagine a file system storing paths to files in a consistent store like ZooKeeper or etcd, but not the files themselves. That they expose a file system-like interface does not correlate to any ability to store files. Indeed, these systems are designed to store small amounts of data that can be held in memory. By using a consistent store like ZooKeeper for storing file information in a distributed file system, the file system would ensure that clients see changes in the file system in sequential order.
ZooKeeper is really a set of primitives with which distributed systems can be coordinated. Particularly relevant to coordinating distributed systems with ZooKeeper are its session events (watches) which allow clients to listen for changes to the cluster state. Distributed systems typically use watches in ZooKeeper for things like locks, and the strong consistency guarantees of ZooKeeper make it perfectly suitable for that use case.
If you want a good idea of what systems like ZooKeeper and etcd are used for, you should check out the Apache Curator recipes. Atomix also implements similar types of APIs for coordinating distributed systems on top of a consensus algorithm. All of these tools are demonstrative of typical use cases for consensus-based distributed systems.
What's important to note is that these types of systems are built on top of consensus algorithms and usually store state in memory. They're suitable for operations that involve a small amount of data but require a high level of consistency, and that's why they're frequently used for things like distributed locking, configuration management, and group membership.

what are the main differences between TORQUE, HTCondor and Apache Mesos

http://www.adaptivecomputing.com/products/open-source/torque/
https://research.cs.wisc.edu/htcondor/
I am looking for a program to perform distributed computing (no parallel computing needed though) which has:
a scheduler
a queue management (FIFO, or preferably something more advanced)
a good statistics report
ability to run on a heterogeneous cluster (a set of machines with different characteristics such as cpu and memory)
and very important: a good responsivness (a few seconds maximum between the trigger of the task and the actual start of the execution: I have heard that this may be tricky to achieve with HTCondor and TORQUE? What about Apache Mesos?)
There is a quite large wikipedia page with comparisons, but you will hardly find large differences. My guess would be that most things could theoretically be done in either framework. The things you list all depend on perspective (people e.g. commonly write their own sophisticated statistics from HTCondor logs). Regarding responsiveness: HTCondor works fine to schedule interactive notebooks if there are enough ressources for the workers to pick up the job. Few seconds is often no problem, but there are hardly guarantees. These are High Throughput Systems, but not low-latency systems. You should preallocate workers and scale them up and down if you care for latency (here supports for other frameworks on top helps much more than native latency).
I try my best to highlight the main foci of each Project from my perspective, that are important for a practical decision:
Target audience
Mesos:
PaaS/IaaS targeted to run other schedulers (you can run Torque on top of Mesos)
particularly interop with big data frameworks such as Spark & Kafka
vs.
Both HTCondor & Torque:
fair-share batch processing particularly in scientific clusters (High Throughput Computing)
Eco-system
Mesos:
Apache open source project with community
vs.
HTCondor:
Open Source maintained by UW-Madison with classical user mailing-list
vs.
TORQUE:
Proprietary, Commercial support
Ease of use
(partially this is statistics, but more the dashboard style)
Mesos & TORQUE:
Web UI
commonly integrations with other frameworks available (for TORQUE look for PBS)
HTCondor:
new, developing REST and python interaces but no common GUI
lagging behind a tiny bit in framework support (R batchtools, lately is has had dask support)

I am looking for a key-value datastore which has following (preferable) properties

I am trying to build a distributed task queue, and I am wondering if there is any data store, which has some or all of the following properties. I am looking to have a completely decentralized, multinode/multi-master self replicating datastore cluster to avoid any single point of failure.
Essential
Supports Python pickled object as Value.
Persistent.
More, the better, In decreasing order of importance (I do not expect any datastore to meet all the criteria. :-))
Distributed.
Synchronous Replication across multiple nodes supported.
Runs/Can run on multiple nodes, in multi-master configuration.
Datastore cluster exposed as a single server.
Round-robin access to/selection of a node for read/write action.
Decent python client.
Support for Atomicity in get/put and replication.
Automatic failover
Decent documentation and/or Active/helpful community
Significantly mature
Decent read/write performance
Any suggestions would be much appreciated.
Cassandra (open-sourced by facebook) has pretty much all of these properties. There are several Python clients, including pycassa.
Edited to add:
Cassandra is fully distributed, multi-node P2P, with tunable consistency levels (i.e. your replication can be synchronous or asynchronous or a mixture of both). Clients can connect to any server. Failover is automatic, and new servers can be added on-the-fly for load balancing. Cassandra is in production use by companies such as Facebook. There is an O'Reilly book. Write performance is extremely high, read performance is also high.