I am reading Spring Cloud and NetFlix APIs. Many places, I read Fault Tolerance and Fault Resilience keyword.
Please explain the difference.
Fault tolerance: User does not see any impact except for some delay during which failover occurs.
Fault resilience: Failure is observed in some services. But rest of system continues to function normally.
The Fault Tolerant means the ability of an architecture to survive (tolerate) when an environment misbehaves by taking corrective actions, e.g, surviving a server crash or preventing a misbehaving API from bringing down the whole system, etc. The Fault Resilience is probably the capacity to recover from these type of scenarios quickly.
After further reading of Netflix blogs and wikis, it seemed the terms Fault Resilience and Fault Tolerant were used interchangeably.
Fault Tolerance: any user of the service does not observe any fault (observing delays is normal).
Fault Resilience: a fault may be observed, but only in uncommitted data (like the database may respond with an error to the attempt to commit a transaction, etc.).
[Reference]
Related
Cadence is a fault tolerant stateful code platform. How does cadence handle fault in various failure condition?
There are al kinds of failures in distributed systems and Cadence provides various options to them.
Here is the list from myself. It may not be complete. But I will try add more if I can think of.
activity
Activity failure and retry. See https://cadenceworkflow.io/docs/concepts/activities/#timeouts
Also note that long running activity can recover from checkpoints via “heartbeat “
workflow
By design of event sourcing models, a workflow can recover from any point left when a worker crashed. See https://cadenceworkflow.io/docs/concepts/workflows/#state-recovery-and-determinism
Workflow can also have retry policy like activity to retry on failure automatically https://cadenceworkflow.io/docs/concepts/workflows/#workflow-retries
On certain scenarios the failure is caused by bad code change which leads to wrong states. Cadence provides “reset” tool to reset workflow to any point of time.
See https://cadenceworkflow.io/docs/cli/#reset-and-restart
On top of reset, Cadence also allows you to reset by deployment. This is useful to reset a big number of workflow(eg millions of).
Cadence server cluster
Both activity and workflow workers are stateless.
Cadence server is a highly available and scalable service provides the durability.
The durability is from underlying design and persistence storage ( by either Cassandra, MySQL or Postgres)
In a single cluster setup, Cadence service is running with different independent shards. The whole cluster consists of different hosts. Any failed host can be replaced by another.
Cadence provides Cross data center replication to provide much higher availability https://cadenceworkflow.io/docs/concepts/cross-dc-replication/#global-domains-architecture
I was trying to wrap my brain around the CAP theorem. I understand that Network partitions can occur (eventually leading to the nodes in the cluster not able to sync up with the WRITE operations happening on the other nodes.)
In this case, either the Cluster could still be up and the load-balancer in front of the cluster could route the request to any of the nodes and after a WRITE operation on one of the nodes, the other nodes who can't sync with that data, still have STALE data and any subsequent READS to these nodes will serve STALE data.
[So we are Loosing CONSISTENCY as we choose AVAILABILITY (i.e., we have choose the cluster to give STALE responses back.)]
Or we could SHUTDOWN the cluster whenever a network partition occurs! (There by loosing AVAILABILITY as we don't want to hamper consistency among the nodes.)
I have 2 things I would like to know the answer for it:
In Reality, When would anyone choose to be AVAILABLE and still trade off CONSISTENCY? Who on this earth (practically) would be interested in STALE data?
Please help me understand by listing more than one scenarios.
In case, we would like to choose CONSISTENCY over AVAILABILITY,
the cluster is down. Who on earth (real-time scenarios) practically would accept to design their system to be DOWN in order to preserve CONSISTENCY.
Please list some scenarios.
Won't majority of us look for High availability no matter what? what are our options? please enlighten.
If I send you a message on FB and you send one to me, I'd rather prefer to see messages in an incorrect order(message sent at 1pm comes before message sent at 2pm) rather than not seeing them at all(example of AVAILABILITY of messages prefered over read-after-write CONSISTENCY of messages). Another example, If I gather web site metrics, I'd rather skip or drop some signal rather then force my users to wait for a page load while my consistent transaction is stuck.
Keep in mind that consistency doesn't mean STALE data, also data can be inconsistent in different ways(https://aphyr.com/posts/313-strong-consistency-models)
Financial transactions are a classic example of data that requires consistency over availability. As a bank, I'd rather decline user request for money transfer, than accept it and lose customer's money due to DB being down.
I'd like to point out that CAP theorem is a high-level concept. There are a lot of ways you can treat terms consistency, availability or even partitioning, and different businesses have different requirements. Software engineering as a whole and distributed systems engineering, in particular, is about making trade-offs.
An example where you may choose Availability over Consistency is collaborative editing (e.g. Google Docs). It may be perfectly acceptable (and in fact desirable) to allow users to make local modifications to the documents and deal with conflict resolution once network is restored.
A bank ATM is an example where you'd choose Consistency over Availability. Once ATM is disconnected from the network you would not want to allow withdrawals (thus, no Availability). Or, you could pick partial Availability, and allow deposits or read-only access to your bank statements.
Why is Kafka pull-based instead of push-based? I agree Kafka gives high throughput as I had experienced it, but I don't see how Kafka throughput would go down if it were to pushed based. Any ideas on how push-based can degrade performance?
Scalability was the major driving factor when we design such systems (pull vs push). Kafka is very scalable. One of the key benefits of Kafka is that it is very easy to add large number of consumers without affecting performance and without down time.
Kafka can handle events at 100k+ per second rate coming from producers. Because Kafka consumers pull data from the topic, different consumers can consume the messages at different pace. Kafka also supports different consumption models. You can have one consumer processing the messages at real-time and another consumer processing the messages in batch mode.
The other reason could be that Kafka was designed not only for single consumers like Hadoop. Different consumers can have diverse needs and capabilities.
Pull-based systems have some deficiencies like resources wasting due to polling regularly. Kafka supports a 'long polling' waiting mode until real data comes through to alleviate this drawback.
Refer to the Kafka documentation which details the particular design decision: Push vs pull
Major points that were in favor of pull are:
Pull is better in dealing with diversified consumers (without a broker determining the data transfer rate for all);
Consumers can more effectively control the rate of their individual consumption;
Easier and more optimal batch processing implementation.
The drawback of a pull-based systems (consumers polling for data while there's no data available for them) is alleviated somewhat by a 'long poll' waiting mode until data arrives.
Others have provided answers based on Kafka's documentation but sometimes product documentation should be taken with a grain of salt as an absolute technical reference. For example:
Numerous push-based messaging systems support consumption at
different rates, usually through their session management primitives.
You establish/resume an active application layer session when you
want to consume and suspend the session (e.g. by simply not
responding for less than the keepalive window and greater than the in-flight windows...or with an explicit message) when you want to
stop/pause. MQTT and AMQP, for example both provide this capability
(in MQTT's case, since the late 90's). Given that no actions are
required to pause consumption (by definition), and less traffic is
required under steady stable state (no request), it is difficult to
see how Kafka's pull-based model is more efficient.
One critical advantage push messaging has vs. pull messaging is that
there is no request traffic to scale as the number of potentially
active topics increases. If you have a million potentially active
topics, you have to issue queries for all those topics. This
concern becomes especially relevant at scale.
The critical advantage pull messaging has vs push messaging is replayability. This factors a great deal into whether downstream systems can offer guarantees around processing (e.g. they might fail before doing so and have to restart or e.g. fail to write messages recoverably).
Another critical advantage for pull messaging vs push messaging is buffer allocation. A consuming process can explicitly request as much data as they can accommodate in a pre-allocated buffer, rather than having to allocate buffers over and over again. This gains back some of the goodput losses vs push messaging from query scaling (but not much). The impact here is measurable, however, if your message sizes vary wildly (e.g. a few KB->a few hundred MB).
It is a fallacy to suggest that pull messaging has structural scalability advantages over push messaging. Partitioning is what is usually used to provide scale in messaging applications, regardless of the consumption model. There are push messaging systems operating well in excess of 300M msgs/sec on hard wired local clusters...125K msgs/sec doesn't even buy admission to the show. In fact, pull messaging has inferior goodput by definition and systems like Kafka usually end up with more hardware to reach the same performance level. The benefits noted above may often make it worth the cost. I am unaware of anyone using Kafka for messaging in high frequency trading, for example, where microseconds matter.
It may be interesting to note that various push-pull messaging systems were developed in the late 1990s as a way to optimize the goodput. The results were never staggering and the system complexity and other factors often outweigh this kind of optimization. I believe this is Jay's point overall about practical performance over real data center networks, not to mention things like the open Internet.
Pushing is just extra work for the broker. With Kafka, the responsibility of fetching messages is on consumers. Consumers can decide at what rate they want to process the messages.
If a broker is pushing messages and if some of the consumers are down, the broker will retry certain times to push the messages till they decide not to push anymore. This decreases performance. Imagine the workload of pushing messages to multiple consumers.
I am trying to profile an Akka app that is constantly at or near 100% CPU usage. I took a CPU sample using visualvm. The sample indicates that there are 2 threads that make up 98.9% of CPU usage. 79% of the cpu time was spent on a method called sun.misc.Unsafe. Other answers on SO say that it just means that a thread is waiting but in the native implementation layer (outside of the jvm).
In questions similar to mine, people have been told to look elsewhere without being given specifics. Where should I look to figure out what's causing the cpu spike?
The application is a server that primarily uses Akka IO to listen for TCP socket connections.
Without seeing any of the source code, or even knowing what IO channel you are talking about (sockets, files, etc), there is very little insight that anyone here can give you.
I do have some rather general suggestions though.
First, you should be using reactive techniques and reactive IO in your application. This issue could be occurring because you are polling the status of some resource in a tight loop, or using a blocking call when you should be using a reactive one. This tends to be an anti-pattern and a performance drain exactly because you can spend CPU cycles doing nothing but "actively waiting". I recommend double checking for:
resource polling
blocking calls
system calls
disk flushes
waiting on a Future when it would be appropriate to map it instead
Second, you should NOT be using Mutexes or other thread synchronization in your application. If so, then you might be suffering from a live-lock. Unlike dead-locks, live-locks manifest with symptoms like 100% CPU usage as threads constantly lock and unlock concurrency primitives in an attempt to "catch them all". Wikipedia has a nice technical description of what a live lock looks like. With Akka in place you shouldn't have any need to use Mutexes or any thread synchronization primitives. If you are then you probably need to re-design your application.
Third, you should be throttling IO (as well as error handling like reconnection attempts). This issue could be occurring because your system lacks effective throttling. Often with data channels we leave their bandwidth unconstrained. However this can become an issue when that channel reaches 100% saturation and begins to steal resources from other parts of the system. This can happen, for example, if you are moving large files around without a reasonable limit.
Alternatively, you also need to throttle connection retries when you encounter any errors, rather than retrying immediately. Lots of systems will attempt to reconnect to a server if they lose their connection. While normally desirable, this can lead to problematic behavior if you use a naive reconnection strategy. For example, imagine a network client that was written this way:
class MyClient extends Client {
... other code...
def onDisconnect() = {
reconnect()
}
}
Every time the Client disconnects for ANY reason it will attempt to reconnect. You can see how this would cause a tight loop between the error handling code and the Client if the Wifi cut-out or a network cable was unplugged.
Fourth, your application should have well defined data sources and sinks. Your issue could be caused by a "data loop", that is some set of Akka actors that are just sending messages to the next actor in the chain, with the last actor sending the message back to the first actor in the chain. Make sure you have a clear and definite way for messages to enter and exit your system.
Fifth, use appropriate profiling and instrumentation for your application. Instrument your application using Kamon or Coda Hale's Metrics library.
Finding an appropriate profiler will be more difficult, since we as a community have far to go to develop mature tools for reactive applications. Personally I have found visualvm useful, but not always overwhelmingly helpful for detecting code paths that are CPU bound. The issue is that sampling profilers are only able to collect data when the JVM reaches a safepoint. This has the potential to bias certain code paths. The fix is to use a profiler that supports AsyncGetStackTrace.
Best of luck! And please add more context if you can.
I have an existing system and am wondering if MSMQueue can retain value of queue if it restarts. It clears the value when I restart.
As paxdiablo writes MSMQ is a persistent queueing solution, but not by default! The default is to store messages in RAM and to have MSMQ to persist messages to disk so they are not lost in case of a server crash you have to specify it on EACH message.
More information on this can be found if you take a look at the property Message.Recoverable.
As #Kjell-Åke Gafvelin already said, you may configure each message, but the IMHO more convenient way would be to set it on the Queue itself.
MessageQueue msgQ = new MessageQueue(#".\private$\Orders");
msgQ.DefaultPropertiesToSend.Recoverable = true;
msgQ.Send("This message will be marked as Recoverable");
msgQ.Close();
From the article above (highlights by me):
By default, MSMQ stores some messages in memory for increased
performance, and a message may be sent and received from a queue
without ever having been written to disk.
Aditionally, you should make the queue transactional to guarantee the correct shipment and receiving of a message.
(Edit 2020-10-27: Removed link to external Microsoft post "Reliable messaging with MSMQ and .NET" as it is not available anymore.)
Yes, MSMQ is a persistent queueing solution. It stores messages securely on backing storage that will not be affected by loss of power (unless you experience things like the disk blowing apart from a truly massive power surge of course).
Its whole point is to provide reliable queueing of messages in a potentially unreliable environment. To that end, losing messages when a particular server went down would be a considerable disadvantage.
From Microsoft's own pages (and apologies for the sales-pitch-like language):
Message Queuing applications can use the Message Queuing infrastructure to communicate across heterogeneous networks and with computers that may be offline. Message Queuing provides guaranteed message delivery, efficient routing, security, transaction support, and priority-based messaging.