How to properly define and differentiate between nodes, processes, transactions & operations? - distributed-computing

As part of my research I need to provide the reader with a comprehensive introduction to distributed systems. I am currently struggling with properly defining a number of the concepts that are recurring in literature on distributed systems and transactions. These are (a) nodes, (b) processes, (c) transactions and, (d) operations. I could really use some help in understanding their correlation, as I seem to continuously mix up nodes with processes and transaction with operations. Any input is appreciated!
I have already tried to grasp these concepts by researching the following literature:
Distributed Systems: Concepts and Design (G. Coulouris et al.)
A brief introduction to distributed systems (A.S. Tanenbaum)

I'm not sure exactly what ambiguity you perceive in these terms, so it's hard to give the right answer. These terms have the same meaning in distributed systems terminology as in any other part of information technology.
To be more concrete.
A node is usually "a machine" which runs one or more processes. A process executes operations. Operations may be grouped into a transaction (a transaction is composed of operations).
I quickly searched the resources you referred to, and they say:
A computing element, which we will generally refer to as a node, can be either a hardware device or a software process.
The node runs processes. But the node itself can be real hardware (a machine), or it can be a virtual machine, which is itself a process running on some real hardware.
From the distributed systems perspective you don't care what the node really is (real hardware or virtual software); it's a "container" for running processes.
Process is "a runtime". It processes something. It can process numbers, data, messages... The chunks of the work that is processed inside of the process are operations. E.g. you save data to a database and you do it as an operation.
A transaction defines a unit of work that consists of several operations. The transaction gives you guarantees over those operations; which guarantees depends on the model you use. If you think about ACID transactions (as defined in the 1983 paper Principles of Transaction-Oriented Database Recovery), then you are guaranteed that either all operations are successfully processed or none of them are (Atomicity), consistency is maintained (Consistency), parallel transactions do not interfere with each other (Isolation), and the transaction's outcome is persistent (Durability).
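To make the hierarchy concrete, here is a minimal sketch using Python's built-in sqlite3 module: the process (the Python interpreter) runs on some node, issues individual operations, and groups two of them into one transaction so they either both succeed or both fail. The table and the transfer scenario are invented for illustration.

```python
# Minimal sketch: node > process > transaction > operations (illustrative names).
import sqlite3

# The process (this Python interpreter) runs on a node (a machine or a VM).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")

# A transaction groups several operations into one unit of work.
try:
    with conn:  # commits on success, rolls back on exception (atomicity)
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")  # operation 1
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")    # operation 2
except sqlite3.Error:
    pass  # on failure, neither operation takes effect

print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'alice': 70, 'bob': 80}
```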

Related

What is meant by Distributed System?

I am reading about distributed systems and getting confused about what it really means.
I understand at a high level that it means a set of different machines that work together to achieve a single goal.
But this definition seems too broad and loose. I would like to give some points to explain the reasons for my confusion:
I see a lot of people referring to microservices as a distributed system, where functionalities like Order, Payment, etc. are distributed across different services, whereas others refer to multiple instances of the Order service trying to serve customers and possibly using some consensus algorithm to agree on shared state (e.g. the current inventory level).
When talking about distributed databases, I see a lot of people talk about different nodes, each used to store/serve part of the data, e.g. records with primary keys from 'A-C' on the first node, 'D-F' on the second node, etc. At a high level it looks like sharding.
When talking about distributed rate limiting, some refer to multiple application nodes (so-called distributed application nodes) using a single rate limiter, while others mention that the rate limiter itself has multiple nodes with a shared cache (like Redis).
It feels like people use "distributed systems" to mean microservices architecture, horizontal scaling, partitioning (sharding), and anything in between.
I am reading about distributed systems and getting confused about what it really means.
As commented by @ReinhardMänner, a good general definition of a distributed system (DS) is at https://en.wikipedia.org/wiki/Distributed_computing
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal.
Anything that fits the above definition can be referred to as a DS. All the mentioned examples, such as microservices, distributed databases, etc., are specific applications of the concept or implementation details.
The statement "X being a distributed system" does not inherently imply any of such details and for each DS must be explicitly specified, eg. distributed database does not necessarily meaning usage of sharding.
I'll also draw from Wikipedia, but I think that the second part of the quote is more important:
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal. Three significant challenges of distributed systems are: maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components. When a component of one system fails, the entire system does not fail.
A system that constantly has to overcome these problems, even if all services are on the same node, or if they communicate via pipes/streams/files, is effectively a distributed system.
Now, trying to clear up your confusion:
Horizontal scaling already existed with monoliths before microservices. Horizontal scaling is basically achieved by dividing compute resources.
Dividing compute requires dealing with synchronization, node failure, and multiple clocks. But that is still cheaper than scaling vertically. That's where you might turn to consensus: implementing it in the application, using a dedicated service such as ZooKeeper, or abusing a DB table for that purpose.
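As a rough illustration of the "abusing a DB table" option, here is a sketch where whichever process first inserts a single lock row becomes the leader. The table name, the use of a local SQLite file, and the absence of lease expiry are all simplifications for illustration, not a production recipe (a real setup would use a shared database server and time-limited leases).

```python
# Hedged sketch: coarse leader election via a uniqueness constraint on a DB table.
import sqlite3

def try_become_leader(conn: sqlite3.Connection, me: str) -> bool:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS leader_lock (id INTEGER PRIMARY KEY CHECK (id = 1), holder TEXT)"
    )
    try:
        with conn:
            conn.execute("INSERT INTO leader_lock (id, holder) VALUES (1, ?)", (me,))
        return True            # we inserted the single lock row first: we lead
    except sqlite3.IntegrityError:
        return False           # somebody else already holds the lock

conn = sqlite3.connect("coordination.db")  # hypothetical shared database file
print("leader" if try_become_leader(conn, "worker-42") else "follower")
```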
Monoliths present 2 problems that microservices solve: address-space dependency (i.e. someone's component may crash the whole process and thus your component) and long startup times.
While microservices solve these problems, these problems aren't what makes them into a "distributed system". It doesn't matter if the different processes/nodes run the same software (monolith) or not (microservices), it matters that they are different processes that can't easily communicate directly (e.g. via function calls that promise not to fail).
In databases, scaling horizontally is also cheaper than scaling vertically. The two components of horizontal DB scaling are division of compute (effectively, a distributed system) and division of storage (sharding), as you mentioned, e.g. A-C, D-F, etc.
Sharding of storage does not define distributed systems - a single compute node can handle multiple storage nodes. It's just that it's much more useful for a database that divides compute to also shard its storage, so you often see them together.
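For illustration, here is a tiny sketch of the key-range sharding described above (A-C, D-F, ...). The node names and ranges are invented; the point is that a single router/compute process can front many storage shards, which is why sharding alone does not make the compute side distributed.

```python
# Sketch: route a record to the storage node owning its key range.
import bisect

RANGE_UPPER_BOUNDS = ["C", "F", "Z"]          # inclusive upper bound of each range
NODES = ["node-1", "node-2", "node-3"]        # hypothetical storage nodes

def node_for_key(primary_key: str) -> str:
    first_letter = primary_key[0].upper()
    index = bisect.bisect_left(RANGE_UPPER_BOUNDS, first_letter)
    return NODES[index]

print(node_for_key("Alice"))    # node-1 (A-C)
print(node_for_key("Eve"))      # node-2 (D-F)
print(node_for_key("Mallory"))  # node-3 (G-Z)
```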
Distributed rate limiting falls under "maintaining concurrency of components". If every node does its own rate limiting, and they don't communicate, then the system-wide rate cannot be enforced. If they wait for each other to coordinate enforcement, they aren't concurrent.
Usually the solution is "approximate" rate limiting where components synchronize "occasionally".
If your components can't easily (= with no latency) agree on a global rate limit, that's usually because they can't easily agree on a global anything. In that case, you're effectively dealing with a distributed system, even if all components are just threads in the same process.
(that could happen e.g. if you plan to scale out but haven't done so yet, so you don't allow your threads to communicate directly.)
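As a toy illustration of the "approximate" rate limiting mentioned above, here is a sketch where each node spends from a locally granted allowance and only contacts the shared budget (an in-process stand-in for something like Redis) when that allowance runs out. All class names and numbers are invented.

```python
# Sketch: approximate distributed rate limiting with occasional synchronization.
import threading

class SharedBudget:
    """Stand-in for a shared store (e.g. Redis) holding the global budget."""
    def __init__(self, total: int):
        self._lock = threading.Lock()
        self._remaining = total

    def grab(self, batch: int) -> int:
        with self._lock:
            granted = min(batch, self._remaining)
            self._remaining -= granted
            return granted

class LocalRateLimiter:
    """Per-node limiter that synchronizes with the shared budget only occasionally."""
    def __init__(self, shared: SharedBudget, batch: int = 10):
        self.shared = shared
        self.batch = batch
        self.local_allowance = 0

    def allow(self) -> bool:
        if self.local_allowance == 0:
            self.local_allowance = self.shared.grab(self.batch)  # occasional sync
        if self.local_allowance > 0:
            self.local_allowance -= 1
            return True
        return False

budget = SharedBudget(total=25)
nodes = [LocalRateLimiter(budget) for _ in range(3)]
accepted = sum(node.allow() for node in nodes for _ in range(20))
print(accepted)  # at most 25 requests accepted across all nodes, enforced approximately
```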

Are all distributed databases designed to process data in parallel?

I am learning about the characteristics of distributed databases and I came across this website that describes some of their advantages:
https://www.atlantic.net/cloud-hosting/about-distributed-databases-and-distributed-data-systems/
According to that site, the advantages of distributed database are listed below:
Reliability – Building an infrastructure is similar to investing: diversify to reduce your chances of loss. Specifically, if a failure occurs in one area of the distribution, the entire database does not experience a setback.
Security – You can give permissions to single sections of the overall database, for better internal and external protection.
Cost-effective – Bandwidth prices go down because users are accessing remote data less frequently.
Local access – Similarly to #1 above, if there is a failure in the umbrella network, you can still get access to your portion of the database.
Growth – If you add a new location to your business, it’s simple to create an additional node within the database, making distribution highly scalable.
Speed & resource efficiency – Most requests and other interactivity with the database are performed at a local level, also decreasing remote traffic.
Responsibility & containment – Because any glitches or failures occur locally, the issue is contained and can potentially be handled by the IT staff designated to handle that piece of the company.
However, parallelism (I mean not concurrent writes, but processing data in parallel on each node) is not on the list. This makes me wonder: are all distributed databases (e.g. MongoDB, Cassandra, HBase) designed to process data in parallel? If this is false, which distributed databases support parallel processing and which ones don't?
To find out what I mean by Parallel Processing (not concurrent write), please see: https://softwareengineering.stackexchange.com/questions/190719/the-difference-between-concurrent-and-parallel-execution
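For what it's worth, the kind of parallelism being asked about usually looks like scatter-gather: the same query is sent to every shard, each shard scans its own portion of the data concurrently, and the partial results are combined. A toy sketch (shard contents and node names invented):

```python
# Sketch: scatter a query to every shard, gather and combine the partial results.
from concurrent.futures import ThreadPoolExecutor

SHARDS = {
    "node-1": [{"user": "alice", "amount": 30}, {"user": "ann", "amount": 5}],
    "node-2": [{"user": "dave", "amount": 12}],
    "node-3": [{"user": "zoe", "amount": 7}, {"user": "zed", "amount": 1}],
}

def partial_sum(shard_rows):
    """Work each shard performs locally, in parallel with the others."""
    return sum(row["amount"] for row in shard_rows)

with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
    partials = list(pool.map(partial_sum, SHARDS.values()))

print(sum(partials))  # 55: the gathered, combined result
```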

How can communication middleware support soft real-time applications?

Nowadays the concept of "real-time" has a lot of different interpretations. In this question two definitions are provided:
The hard real-time definition considers any missed deadline to be a system failure. This scheduling is used extensively in mission critical systems where failure to conform to timing constraints results in a loss of life or property.
and
The soft real-time definition allows for frequently missed deadlines, and as long as tasks are timely executed their results continue to have value. Completed tasks may have increasing value up to the deadline and decreasing value past it.
In my research I came to the following conclusions:
The middleware supports hard real-time if it provides predictable and efficient end-to-end control over system resources. Like setting the thread-priority of all the threads created by the middleware.
It appears to me that good performance is the most relevant factor to support soft real-time applications.
Is this true? Are there other relevant features of communication middleware that support soft real-time applications?
First, for precise definitions of real-time principles and terms, based on first principles and mental models, I refer you to real-time.org.
The real-time practitioner computing community uses a variety of inconsistent and incomplete "definitions" of "real-time," "hard real-time," and "soft real-time." The real-time computing research community has a consensus on "hard real-time" but is confused about "soft real-time."
The core of the research community's "hard" real-time computing model is that tasks have hard deadlines, and all these deadlines must not be missed, else the system has failed. Meeting the deadlines is the "timeliness" criterion, and guaranteeing that all deadlines will be met is the "predictability" criterion--that predictability is "deterministic."
(In some of these models, tasks without deadlines are allowed in the background if they do not interfere with the hard real-time tasks; they usually also are prevented from being starved.)
This model requires everything related to the hard real-time tasks to be static (known in advance)--i.e., it requires that the time evolution of the system is known in advance. This requirement is very strong, and in most cases, it is not feasible. There are important hard real-time systems in which this requirement is (at least presumed) to be satisfied. Well-known examples include digital avionics flight control, certain medical devices, power plant control, railroad crossing control, etc. These examples are safety-critical, but not all hard real-time systems are (and we will see below that most safety-critical systems are not and cannot be hard real-time, although some may include simple low level hard real-time components).
Soft real-time refers to a class of real-time systems which are generalizations of hard real-time ones. The generalizations include weaker timeliness criteria and/or weaker predictability criteria.
For example, consider a model with tasks having deadlines, as hard real-time ones do. In this particular model, the timeliness criterion is that any number of tasks are allowed to be up to 15% tardy, and the predictability criterion is that this must be guaranteed (i.e., deterministic), just like for hard real-time systems. If one or more of these tasks is more than 15% tardy, the system has failed. This model is not a conventional hard real-time one, although it may be a safety-critical one.
Consider another model: the timeliness criterion is that no more than 20% of the tasks can be more than 5% tardy, and the predictability criterion is that this is guaranteed to be satisfied with at least probability 0.9. Violation of the timeliness and/or predictability criteria means the system has failed. This is not a hard real-time system, although it may be a safety-critical one.
But consider: what if the utility of that system degrades according to not meeting one or any of those criteria--say, 23% were more than 5% tardy, or less than 20% of the tasks were tardy but 10% of those were more than 5% tardy, or all of the criteria were met except that the predictability is only 0.8. There are many real-time systems having such dynamic properties.
We need to specify how that system degradation (say, the system's "utility" or "value") is related to how many and to what degree any of those timeliness and predictability criteria were or were not met. In fact, this model is a notional example of many actual existing real-time systems that are as safety-critical as possible--for example, for doing defense against nuclear armed hostile missiles (and numerous other military combat systems, because they all have various inherent dynamic uncertainties).
Now we return to that need to specify how a real-time system's timeliness and predictability are related to the system's utility. A successfully used solution to that is called "time/utility functions," (or "time/value functions") and is described in great detail on real-time.org. The functions for each task are derived from the physical nature of the system application(s). The system's timeliness and predictability of timeliness are based on those of the tasks--for example, by weighted accrual of their individual utilities.
Soft real-time systems (in the precisely defined sense described on real-time.org) are the general case, and hard real-time systems are a special case that applies to a much smaller domain of real-time problems. All hard and soft real-time systems can be specified and created with time/utility functions.
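To make the time/utility function idea a bit more tangible, here is a small sketch in which each task's value depends on when it completes and the system-wide utility is a weighted accrual of the individual task utilities. The function shape (full value up to the deadline, linear decay after it), the weights, and the numbers are invented for illustration; they are not taken from real-time.org.

```python
# Sketch: a time/utility function and weighted accrual of task utilities.
def time_utility(completion_time: float, deadline: float,
                 max_value: float, decay: float) -> float:
    """Full value up to the deadline, linearly decaying value after it."""
    if completion_time <= deadline:
        return max_value
    return max(0.0, max_value - decay * (completion_time - deadline))

# Hypothetical tasks: (completion time, deadline, max value, decay rate, weight)
tasks = [
    (0.9, 1.0, 10.0, 5.0, 1.0),   # on time: full value
    (1.2, 1.0, 10.0, 5.0, 2.0),   # slightly tardy: degraded value
    (3.0, 1.0, 10.0, 5.0, 1.0),   # very tardy: no remaining value
]

accrued = sum(w * time_utility(t, d, v, k) for t, d, v, k, w in tasks)
print(accrued)  # weighted accrual of the individual utilities
```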
All that clarified, now we can address your question about real-time middleware.
One obvious source for an answer is The Open Group Real-Time CORBA (RTC) standard (Google, there is a GREAT deal of detailed information available).
RTC can be implemented as a fixed priority infrastructure, with a 15-bit system-wide priority that is mapped onto the node priorities. In that case, the minimum requirements are: respecting thread priorities between client and server for resolving resource contention during the processing of CORBA invocations; bounding the duration of thread priority inversions during end-to-end processing; bounding the latencies of operation invocations. It is possible to build hard real-time RTC distributed systems according to those three requirements (and many exist)--but obviously the underlying network QoS affects the real-time behavior. So RTC provides for pluggable application-specific networking, such as those having deterministic QoS (so hard real-time is possible at and below the RTC layers), and those having non-deterministic QoS (but still the RTC layers have the three essential fixed priority real-time properties).
More generally, RTC provides for soft real-time (in the technical sense defined on real-time.org) at the CORBA layers. It does that by providing a first order scheduling abstraction called "distributed threads." And it provides a scheduling framework that supports not only fixed priorities but also time/utility functions, which are general enough to express a very general class of "utility accrual" soft real-time scheduling algorithms. Such algorithms (or usually heuristics) are needed for distributed systems consisting of application-specific soft real-time system models such as I described above.
What if you don't want to use RTC? The good news is that RTC's principles first appeared publicly in a different distributed real-time system (described on real-time.org), and can be (and have been) transplanted to other real-time middleware for both hard and soft real-time systems.
For soft real-time (again, in the precisely defined sense from real-time.org) middleware, the principles of dynamic timeliness and predictability of timeliness must be applied to resource management at each node of the middleware's system, including scheduling the middleware's network (e.g., access, routing, etc.). Instances of this approach appear in several Ph.D. theses and have also been implemented in a number of military combat distributed real-time systems.

Difference between centralized and distributed computing

Can anyone tell me the differences between centralized and distributed computing?
Centralized
A system with a centralized multiprocessor parallel architecture. In the late 1980s, centralized systems were progressively replaced by distributed systems.
Characteristics of a centralized system:
Non-autonomous components
Usually homogeneous technology
Multiple users share the same resources at all times
Single point of control
Single point of failure
Distributed
A set of tightly coupled programs executing on one or more computers that are interconnected through a network and coordinate their actions. These programs know about one another and carry out tasks that none could carry out in isolation.
Characteristics of a distributed system:
Autonomous components
Mostly built using heterogeneous technology
System components may be used exclusively
Concurrent processes can execute
Multiple points of failure
Requirements of a distributed system:
Scalability - possibility of adding new hosts
Openness - easily extended and modified
Heterogeneity - supports various hardware and software platforms
Resource sharing - hardware, software and data
Fault tolerance - ability to function correctly even if faults occur
Centralized: all calculations are done on one particular computer (system). Example: you have a dedicated server for calculating data.
Distributed: the calculation is distributed across multiple computers. Example: when you have a large amount of data, you can divide it and send each part to a particular computer, which will make the calculations for its part.
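A tiny sketch of that "divide the data and send each part to a computer" idea, using Python's multiprocessing pool as a stand-in for separate machines. The workload (summing squares) is invented for illustration.

```python
# Sketch: divide the data, compute partial results in parallel, combine them.
from multiprocessing import Pool

def partial_work(chunk):
    """The calculation each worker performs on its part of the data."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunk_size = len(data) // 4
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(processes=4) as pool:
        partials = pool.map(partial_work, chunks)

    print(sum(partials))  # same answer as computing it centrally on one machine
```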
Main basic differences are:
distrib-systems have no global state
no shared memory
no shared variables
distrib-systems have no shared time clock
therefore the order of events is difficult to establish (see the Lamport clock sketch below)
distrib-systems can have race conditions
race conditions see http://en.wikipedia.org/wiki/Race_condition
So "computing" in a distrubuted environment is very difficult. Do you have concret question about programing models or whatever?
Centralized Systems
"In Centralized Systems,several jobs are done on a particular central processing unit(CPU)"
Distributed Systems
"In Distributed Systems,jobs are distributed among several processor.The Processor are interconnected by a computer network"
(Sheheryar ,NUML)
Briefly, centralized computing, as the name itself suggests, is concerned with just a single server. The particular operation is performed at this server location and nowhere else.
Distributed computing is used where the system requirements are quite large; the job is distributed to several processors and the solutions are then combined, keeping in mind that the processors are interconnected by a computer network.
Centralized system: a system in which computing is done at a central location using terminals attached to a central computer; in brief, a mainframe and dumb terminals, where all computation is done on the mainframe through the terminals.
Distributed system: a collection of independent computers that appears to its users as a single coherent system. The hardware is distributed, consisting of n processing elements (each with its own processor and memory), and the software is distributed as well: there is no centralized OS (each processing element has its own), no physically centralized file system, and inter-process communication happens via message passing at the lowest level.
Big note: the main difference is reliability. In a distributed system, if one machine crashes, the system as a whole can still survive.
METHOD OF ARBITRATION
In all but the simplest systems, more than one module may need control of the bus.
In a centralized scheme, a single hardware device, referred to as a bus controller or arbiter, is responsible for allocating time on the bus.
In a distributed scheme, there is no central controller. Rather, each module contains access control logic and the modules act together to share the bus.
In a centralized system, if the server fails it affects the whole system, because the server controls the whole operation.
In a distributed system, if one machine fails it doesn't affect the operations of the other computers, because they are independent and distributed in their operations.
Let us try to understand this with an example.
Say you are carrying a large amount of money. You are in a crowded train, where your pocket may be picked and you might lose money. What is the ideal strategy for carrying money?
Put all money in a single pocket: In this case, it is easy for you to just put the money in the pocket and be done. When you go back home, you can simply take out money from the pocket and count it. But wait. What if your pocket is picked? You lose ALL the money (bankrupt? eh!). Seems like it is not the best idea to store all the money in a SINGLE pocket. Let us think what else we can do
Divide your money: Put some of it in the left pocket, put some in the right pocket and maybe put some in your bag (which has a limited capacity). You need to devise a strategy to divide the money you carry. Also, when you go back home, you will have to spend time collecting money from the different pockets and gathering it in one place. However, we are in a better situation now. If one of our pockets (or the bag) is picked, we do not lose ALL the money. The chances of your bag, left pocket and right pocket all being picked are fairly low. With a little overhead of dividing the money, you can now avoid losing all of it. Isn't that better?
This is how distributed systems work. They divide the information (money in your case) and keep it on different machines (pockets and bags for us). This way, if one of the machines goes down, we are not at a big loss. That is, we do not have a single point of failure.
Another important thing that distributed systems implement is data replication. They put replicas of the same information on multiple machines. This way, if one of the machines goes down, we do not lose the information. So we now have something called fault tolerance.
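A rough sketch of that replication idea, with plain dictionaries standing in for machines: every write is copied to all replicas, so a read still succeeds after one "machine" is lost. Real systems add quorums, failure detection and consistency protocols on top of this; the key and the replica count here are invented.

```python
# Sketch: replicate every write so the loss of one node does not lose the data.
replicas = [dict() for _ in range(3)]      # three replica "nodes"

def replicated_write(key, value):
    for replica in replicas:
        replica[key] = value               # copy the data to every replica

def fault_tolerant_read(key):
    for replica in replicas:
        if key in replica:                 # first replica that still has the key
            return replica[key]
    raise KeyError(key)

replicated_write("inventory:widget", 41)
replicas[0].clear()                        # simulate one machine going down
print(fault_tolerant_read("inventory:widget"))  # 41 -- still served
```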

Can a shared ready queue limit the scalability of a multiprocessor system?

Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put, most definitely. Read on for some discussion.
Tuning a service is an art form, or at least requires benchmarking (and the space of things you need to benchmark is huge). I believe that it depends on factors such as the following (this is not exhaustive):
how much time an item picked up from the ready queue takes to process,
how many worker threads there are,
how many producers there are, and how often they produce,
what type of wait primitives you are using: spin-locks or kernel waits (the latter being slower).
So, if items are produced often, the number of threads is large, and the processing time is low, the data structure could be locked for large windows, thus causing thrashing.
Other factors may include the data structure used and how long the data structure is locked for - e.g., if you use a linked list to manage such a queue, the add and remove operations take constant time, while a priority queue (heap) takes a few more operations on average when items are added.
If your system is for business processing, you could take this question out of the picture by just using:
a process-based architecture, spawning multiple producer/consumer processes and using the file system for communication, or
a non-preemptive, collaborative threading model such as Stackless Python, Lua or Erlang.
Also note: synchronization primitives cause inter-processor cache-coherence traffic, which is not good and should therefore be used sparingly.
The discussion could go on to fill a Ph.D dissertation :D
A per-CPU ready queue is a natural choice for the data structure. This is because most operating systems will try to keep a process on the same CPU, for many reasons you can google for. What does that imply? If a thread is ready and another CPU is idling, the OS will not quickly migrate the thread to the other CPU; load balancing only kicks in over the long run.
Had the situation been different, that is, if keeping thread-CPU affinity were not a design goal and thread migration were frequent, then keeping separate per-CPU run queues would be costly.
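To make the contrast concrete, here is a toy sketch of a single shared ready queue versus per-worker (per-CPU) queues. Real schedulers add locking, work stealing and periodic load balancing, which this sketch omits; the task IDs and worker counts are invented.

```python
# Sketch: one shared ready queue (single contention point) vs per-worker queues.
from collections import deque
from itertools import cycle

class SharedReadyQueue:
    """All workers pop from one structure: every dispatch touches the same queue."""
    def __init__(self, tasks):
        self.queue = deque(tasks)

    def next_task(self, worker_id):
        return self.queue.popleft() if self.queue else None

class PerWorkerReadyQueues:
    """Each worker has its own queue; tasks stay affine to a worker."""
    def __init__(self, tasks, workers):
        self.queues = {w: deque() for w in range(workers)}
        for task, w in zip(tasks, cycle(range(workers))):
            self.queues[w].append(task)

    def next_task(self, worker_id):
        mine = self.queues[worker_id]
        return mine.popleft() if mine else None   # no cross-worker contention

shared = SharedReadyQueue(range(8))
private = PerWorkerReadyQueues(range(8), workers=4)
print(shared.next_task(0), private.next_task(0), private.next_task(1))  # 0 0 1
```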