How to monitor (micro)services? - triggers

I have a set of services. Every service contains some components.
Some of them are stateless, some of them are stateful, some are synchronous, some are asynchronous.
I used different approaches to monitoring and alerting.
Log-based alerting and metrics gathering. New Relic based. Own bicycle.
Basically, atm I am looking for a way, how to generalize and aggregate important metrics for all services in single place. One of things, I want is that we monitor more products, than separate services.
As an end result I see it as a single dashboard with small amount of widgets, but looking at those widgets I would be able to say for sure, if services are usable to end-customer.
Probably someone can recommend me some approach/methodology. Or give a reference to some best practices.

I like what you're trying to achieve! A service is not production-ready unless it's thoroughly monitored.
I believe what your're describing goes into the topics of health-checking and metrics.
... I would be able to say for sure, if services are usable to end-customer.
That however will require a little of both ;-) To ensure you're currently fulfilling your SLA, you have to make sure, that your services are all a) running and b) perform as requested. With both problems I suggest to look at the StatsD toolchain. Initially developed by Etsy, it has become the de-facto standard for gathering metrics.
To ensure all your services are running, we're relaying Kubernetes. It takes our description for what should run, be reachable from outside etc. and hosts that on our infrastructure. It also makes sure, that should things die - that they will be restarted. It helps with things like auto-scaling etc. as well! Awesome tooling and kudos to Google!
The way it ensures that is with health-checks. There are multiple ways how you can ensure your service node booted by Kubernetes is alive and kicking (namely HTTP calls and CLI scripts but this should be a modular thing should you need anything else!) If Kubernetes detects unhealthy nodes it will immediately phase them out and start another node instead.
Now, making sure, all your services perform as expected you'll need to gather some metrics. For all of our services (and all individual endpoints), we gather a few metrics via StatsD like:
Requests/sec
number of errors returned (404, etc...)
Response times (Average, Median, Percentiles depending on the services SLA)
Payload size (Average)
sometimes the number of concurrent requests per endpoint, the number of instances currently running
general metrics like the hosts current CPU and memory usage and uptime.
We gather a lot more metrics but that's about the bottom line. Since StatsD has become more of a "protocol specification" than a concrete product there are a myriad of collector, front- and backends to choose from. They help you visualize your systems state and many of them feature alerts of something or some combination of metrics go beyond their thresholds.
Let me know, if this was helpfull!

There's at least 3 types of things you will need to monitor: the host where the service is deployed, the component itself and the SLAs and some of them depend on the software stack you're using as well as the architecture.
With that said, you could for example use Nagios to monitor the hardware where the services are deployed, Splunk for the services metrics/SLAs as well as for any errors that might occur. You can also use SNMP packages in case something goes wrong and you have a more sophisticated support structure, this would be yours triggers. Without knowing how your infrastructure/services are set up it is complicated to go into deeper details.

Related

What happens if Dapr fail?

I would like to know what happens if Dapr fails. For example, if my service's sidecar or even the Control Plane fails, what is the expected behavior of my application?
Oh, and would there be any way for me to simulate these error cases?
Context:
In my application I have a service that uses Dapr, but in a non-critical way. Therefore, I would like to ensure that it continues to run normally even if your sidecar or Dapr fails.
Very good question without a straight-forward answer, but I'll share how I look at it.
Dapr can be used with monolithic, legacy applications, for migration and modernization purposes for example, but it is more commonly used with distributed applications. In a distributed application, there are many more components that can fail: database, transparent proxy (envoy/), ingress proxy, message broker, producer, consumer... In that regard, Dapr is no different, and it can fail, but there are a few reasons why that is less likely to happen:
Dapr is like a technical microservice, it has no business logic, and your app interacts with it over explicit APIs. It is harder for a failure in the sidecar to spread to your app.
If the sidecar is exploited, it is harder to get control of the application, acts as a security boundary.
As a popular open source project, Dapr has many eyes and users on it. You are more likely to get new bugs found and fixed early.
If that happens, upgrading Dapr is much easier than a library upgrade. You can upgrade Dapr control plane with little to no disruptions to your app, and then upgrade select sidecars (a canary release if you want) - I've also done many middleware/library patching/upgrades and I know how much work the latter is in comparison.
Each sidecar lives co-located with its app. Any hardware or network failure is likely to impact both the app and sidecar, rather than sidecar only.
With Dapr, you get many resiliency and observability benefits OOTB. See my blog on this topic here. It is more likely to improve the reliability of your app than reduce it.
When you follow the best practices, and enable k8s health checks, resource constraints, Kubernetes will deal with it. Dapr can even detect the health-status of your app, and stop interacting with it until it recovers.
In the end, if there is a bug in Dapr, it may fail. But that can happen wit a library implementing Dapr-like features too. With Dapr, you can isolate the failure, and upgrade faster, w/o a single line of code change, building, testing of the application, that is the difference from perspective of this question.
Disclaimer: I work for a company building products for running Dapr, and I'm highly biassed on this topic.

Is using deployments to isolate clients in Kubernetes a good idea?

We’re in the process of migrating our aging monolith to a more robust solution and landed on Kubernetes as the most appropriate platform to achieve what we’re looking for. At the same time, we’re looking to split out and isolate our client data for security and improved privacy.
What we’re considering is ultimately having one database per customer, and embedding those connection details into a deployment for each of them. We’d then build a routing service of some kind that would link a client’s request to their respective deployment/service.
Because our individual clients vary wildly in size (we have some that generate thousands of requests per minute, and others that are hundreds per day), we like the option of having the ability to scale them independently through ReplicaSets on the deployments.
However, I have some concerns regarding upper limits of how many deployments can exist/be successfully managed within a cluster, as we’d be looking at potentially hundreds of different clients, which will continue to grow. I also have concerns of costs, and how having dedicated resources (essentially an entire VM) for our smaller clients might impact our budgets.
So my questions are:
is this a good idea at all? Why or why not, and if not, are there alternative architectures we could look at to achieve the same thing?
is this solution more expensive than it needs to be?
I’d appreciate any insights you could offer, thank you!
I can think of a couple options for this situations:
Deploying separate clusters for each customer. This also allows you to size your clusters properly for each customers and configure autoscaling accordingly for each of them. The drawback is that each cluster has a management fee of 0.10$ per hour, but you get full guarantee that everything is isolated, and you can use the cluster autoscaler to make sure that only the VMs that are actually needed for each customer are running. For smaller customers, you may wanna use this with small (and cheap) machine types.
Another option would be to, as mentioned in the comments, use namespaces. However you would have to configure the cluster properly as there exist ways of accessing services in different namespaces.
Implement customer isolation in your own software running on a cluster. This would imply forcing your software to access only the database for a given customer, but I would not recommend to go this route.

Microservice, amqp and service registry / discovery

I m studying Microservices architecture and I m actually wondering something.
I m quite okay with the fact of using (back) service discovery to make request able on REST based microservices. I need to know where's the service (or at least the front of the server cluster) to make requests. So it make sense to be able to discover an ip:port in that case.
But I was wondering what could be the aim of using service registry / discovery when dealing with AMQP (based only, without HTTP possible calls) ?
I mean, using AMQP is just like "I need that, and I expect somebody to answer me", I dont have to know who's the server that sent me back the response.
So what is the aim of using service registry / discovery with AMQP based microservice ?
Thanks for your help
AMQP (any MOM, actually) provides a way for processes to communicate without having to mind about actual IP addresses, communication security, routing, among other concerns. That does not necessarily means that any process can trust or even has any information about the processes it communicates with.
Message queues do solve half of the process: how to reach the remote service. But they do not solve the other half: which service is the right one for me. In other words, which service:
has the resources I need
can be trusted (is hosted on a reliable server, has a satisfactory service implementation, is located in a country where the local laws are compatible with your requirements, etc)
charges what you want to pay (although people rarely discuss cost when it comes to microservices)
will be there during the whole time window needed to process your service -- keep in mind that servers are becoming more and more volatile. Some servers are actually containers that can last for a couple minutes.
Those two problems are almost linearly independent. To solve the second kind of problems, you have resource brokers in Grid computing. There is also resource allocation in order to make sure that the last item above is correctly managed.
There are some alternative strategies such as multicasting the intention to use a service and waiting for replies with offers. You may have reverse auction in such a case, for instance.
In short, the rule of thumb is that if you do not have an a priori knowledge about which service you are going to use (hardcoded or in some configuration file), your agent will have to negotiate, which includes dynamic service discovery.

Why bother with service discovery when message oriented middleware does the job?

I get the problem that etcd/consul/$whatever are trying to solve. Service consumers need to talk to service providers, a hugely fluid distributed system needs a mechanism to marry the two.
However, the problem of "where do service consumers go with their requests?" is old and IMO has been solved with MOM -- message oriented middleware.
In MOM, the idea is that service consumers do not care where the service providers live. They simply send a message and have the messaging bus take care of routing the message to the appropriate consumer. There can be multiple providers all doing the same thing (queue-based round-robin) or versioned providers (/v1/request goes to one, /v2/request goes to another).
This is a simple, powerful integration pattern that completely decouples a service interface from its implementation.
And yet I see this bizarre obsession with discovering service providers, which appears to create tight coupling between consumers and providers (in addition to a few other anti-patterns as well.)
So, what am I missing here? TIA.
In MOM, everything flows through the bus, so it might become a bottleneck. With service discovery, a consumer looks up a producer "once" (ok it might have to check back again after a while), and then "directly" (ok could be through a proxy) talks to it.
Or if you prefer catchy phrases: smart endpoints & dumb pipes vs (i guess) dumb endpoints & smart pipes.
Personally I don't see the two as either or for this type of architecture. You could use the service discovery to see what services are available at the moment and subscribe to the MOM for the events you then know will be there. If you can't find services you depend on you can raise an alert. Not all MOM's let you know when there is no publisher for a channel.
You can also combine them in the way that the service discovery is where you find the services you want to contact directly, for example a data store that does no job, and still use the MOM to subscribe to events for changes that other systems do. Not all use cases fit well with job queuing either, as some tasks must be solved synchronously, and then the service discovery is a great way to have a dynamic environment.
I do prefer the asynchronous MQ myself, and I think that if you do it right, with load balancing, redundancy, clustering with separate readers and writers etc you can easily have great stability, scalability and a standardized way for all your components to communicate.

Scala + Akka: How to develop a Multi-Machine Highly Available Cluster

We're developing a server system in Scala + Akka for a game that will serve clients in Android, iPhone, and Second Life. There are parts of this server that need to be highly available, running on multiple machines. If one of those servers dies (of, say, hardware failure), the system needs to keep running. I think I want the clients to have a list of machines they will try to connect with, similar to how Cassandra works.
The multi-node examples I've seen so far with Akka seem to me to be centered around the idea of scalability, rather than high availability (at least with regard to hardware). The multi-node examples seem to always have a single point of failure. For example there are load balancers, but if I need to reboot one of the machines that have load balancers, my system will suffer some downtime.
Are there any examples that show this type of hardware fault tolerance for Akka? Or, do you have any thoughts on good ways to make this happen?
So far, the best answer I've been able to come up with is to study the Erlang OTP docs, meditate on them, and try to figure out how to put my system together using the building blocks available in Akka.
But if there are resources, examples, or ideas on how to share state between multiple machines in a way that if one of them goes down things keep running, I'd sure appreciate them, because I'm concerned I might be re-inventing the wheel here. Maybe there is a multi-node STM container that automatically keeps the shared state in sync across multiple nodes? Or maybe this is so easy to make that the documentation doesn't bother showing examples of how to do it, or perhaps I haven't been thorough enough in my research and experimentation yet. Any thoughts or ideas will be appreciated.
HA and load management is a very important aspect of scalability and is available as a part of the AkkaSource commercial offering.
If you're listing multiple potential hosts in your clients already, then those can effectively become load balancers.
You could offer a host suggestion service and recommends to the client which machine they should connect to (based on current load, or whatever), then the client can pin to that until the connection fails.
If the host suggestion service is not there, then the client can simply pick a random host from it internal list, trying them until it connects.
Ideally on first time start up, the client will connect to the host suggestion service and not only get directed to an appropriate host, but a list of other potential hosts as well. This list can routinely be updated every time the client connects.
If the host suggestion service is down on the clients first attempt (unlikely, but...) then you can pre-deploy a list of hosts in the client install so it can start immediately randomly selecting hosts from the very beginning if it has too.
Make sure that your list of hosts is actual host names, and not IPs, that give you more flexibility long term (i.e. you'll "always have" host1.example.com, host2.example.com... etc. even if you move infrastructure and change IPs).
You could take a look how RedDwarf and it's fork DimDwarf are built. They are both horizontally scalable crash-only game app servers and DimDwarf is partly written in Scala (new messaging functionality). Their approach and architecture should match your needs quite well :)
2 cents..
"how to share state between multiple machines in a way that if one of them goes down things keep running"
Don't share state between machines, instead partition state across machines. I don't know your domain so I don't know if this will work. But essentially if you assign certain aggregates ( in DDD terms ) to certain nodes, you can keep those aggregates in memory ( actor, agent, etc ) when they are being used. In order to do this you will need to use something like zookeeper to coordinate which nodes handle which aggregates. In the event of failure you can bring the aggregate up on a different node.
Further more, if you use an event sourcing model to build your aggregates, it becomes almost trivial to have real-time copies ( slaves ) of your aggregate on other nodes by those nodes listening for events and maintaining their own copies.
By using Akka, we get remoting between nodes almost for free. This means that which ever node handles a request that might need to interact with an Aggregate/Entity on another nodes can do so with RemoteActors.
What I have outlined here is very general but gives an approach to distributed fault-tolerance with Akka and ZooKeeper. It may or may not help. I hope it does.
All the best,
Andy