Custom cloudwatch metrics EKS CloudWatch Agent - kubernetes

I have set up Container Insights as described in the documentation.
Is there a way to remove some of the metrics sent to CloudWatch?
Details :
I have a small cluster ( 3 client facing namespaces, ~ 8 services per namespace ) with some custom monitoring, logging, etc in their own separate namespaces, and I just want to use CloudWatch for critical client facing metrics.
The problem I am having is that the Agent sends over 500 metrics to CloudWatch, where I am really only interested in a few of the important ones, especially as AWS bills per metric.
Is there any way to limit which metrics get sent to CloudWatch?
It would be especially helpful if I could send metrics only from certain namespaces, for example by excluding the kube-system namespace.
My configmap is:
cwagentconfig.json: |
  {
    "logs": {
      "metrics_collected": {
        "kubernetes": {
          "cluster_name": "*****",
          "metrics_collection_interval": 60
        }
      },
      "force_flush_interval": 5
    }
  }
I have searched for a while now, but couldn't really find anything on the options available under:
  "metrics_collected": {
    "kubernetes": {

I've looked as best I can and you're right, there's little or nothing to find on this topic. Before I make the obvious-but-unhelpful suggestions of either using Prometheus or asking on the AWS forums, a quick look at what the CloudWatch agent actually does.
The CloudWatch agent gets container metrics either from cAdvisor, which runs as part of kubelet on each node, or from the Kubernetes metrics-server API (which also gets its metrics from kubelet and cAdvisor). cAdvisor is well documented, and it's likely that the CloudWatch agent uses the Prometheus-format metrics cAdvisor produces to construct its own list of metrics.
That's just a guess though, unfortunately, since the CloudWatch agent doesn't seem to be open source. That also means it may be possible to just set a 'measurement' option within the kubernetes section and select metrics based on Prometheus metric names, but that's probably not supported. (If you do ask AWS, the Premium Support team should keep an eye on the forums, so you might get lucky and get an answer without paying for support.)
So, if you can't cut down the metrics created by Container Insights, what are your other options? Prometheus is easy to deploy, and you can set up recording rules to cut down on the number of metrics it actually saves. It doesn't push to CloudWatch by default, but you can keep the metrics locally if you have some space on your node for it, or use a remote storage service like MetricFire (the company I work for, to be clear!), which provides Grafana to go along with it. You can also export metrics from CloudWatch and use Prometheus as your single source of truth, but that means more storage on your cluster.
If you prefer to view your metrics in CloudWatch, there are tools like Prometheus-to-cloudwatch which scrape Prometheus endpoints and send the data to CloudWatch, much like (I'm guessing) the CloudWatch agent does. This service has include and exclude settings for deciding which metrics are sent to CloudWatch.
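To make the include/exclude idea concrete, here is a minimal sketch of that kind of filter in Python. It is not the implementation of Prometheus-to-cloudwatch or of the CloudWatch agent; the function name should_send, the pattern lists, and the sample values are all assumptions made up for illustration. Only the metric names are real cAdvisor metric names.

```python
from fnmatch import fnmatchcase


def should_send(name, namespace, include, exclude, excluded_namespaces):
    """Return True if a metric passes the namespace and name filters.

    Hypothetical helper: exclusion wins over inclusion, and a metric
    must match at least one include pattern to be forwarded.
    """
    if namespace in excluded_namespaces:
        return False
    if any(fnmatchcase(name, pattern) for pattern in exclude):
        return False
    return any(fnmatchcase(name, pattern) for pattern in include)


# Made-up samples: (metric name, namespace, value)
samples = [
    ("container_cpu_usage_seconds_total", "shop", 12.5),
    ("container_memory_working_set_bytes", "shop", 2.1e8),
    ("container_cpu_usage_seconds_total", "kube-system", 40.0),
    ("container_fs_reads_total", "shop", 900.0),
]

kept = [
    sample
    for sample in samples
    if should_send(
        sample[0],
        sample[1],
        include=["container_cpu_*", "container_memory_*"],
        exclude=["*_fs_*"],
        excluded_namespaces={"kube-system"},
    )
]

for name, namespace, value in kept:
    print(namespace, name, value)
```

With these assumed patterns, only the two metrics from the shop namespace that match the CPU and memory patterns survive, so everything from kube-system is dropped before it could be billed.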
I've written a blog post on EKS Architecture and Monitoring in case that's of any help to you. Good luck, and let us know which option you go for!

Related

GKE is built by default in Anthos solution ? Getting Anthos Metrics

I have a cluster with 7 nodes and a lot of services, nodes, etc. in Google Cloud Platform. I'm trying to get some metrics with Stackdriver Legacy. In Google Cloud Console -> Stackdriver -> Metrics Explorer, the full set of Anthos metrics is listed, but when I try to create a chart based on those metrics it doesn't show any data. The only response I get in the panel is "no data is available for the selected time frame", even after changing the time frame and other settings.
Is it right to think that with Anthos metrics I can retrieve information about my cronjobs, pods, and services, such as failed initializations and job failures? And if so, can I do it with Stackdriver Legacy, or do I need to upgrade to Stackdriver Kubernetes Engine Monitoring?
The Anthos solution includes what’s called GKE On-Prem. I’d take a look at the instructions for using logging and monitoring on GKE On-Prem. Stackdriver monitors GKE On-Prem clusters in a similar way to cloud-based GKE clusters.
However, there’s a note saying that Stackdriver currently only collects cluster logs and system component metrics. The full Kubernetes Monitoring experience will be available in a future release.
You can also check that you’ve met all the configuration requirements.

Live monitoring of container, nodes and cluster

We are using a k8s cluster for one of our applications. The cluster is owned by another team and we don't have full control over it. We are trying to find metrics on resource utilization (CPU and memory) and details about running containers, pods, nodes, etc. We also need to find out how many containers are running in parallel. The problem is that they have exposed cluster monitoring via Prometheus, but with Prometheus we are not getting live data, and it does not have information about running containers.
My question is: which API is available by default in a k8s cluster and can give us all of this? We don't want to read data from another client like Prometheus or anything else; we want to read metrics directly from the cluster so that the data is not stale. Any suggestions?
As you mentioned, you will need metrics-server (or heapster) to get that information.
You can confirm that your metrics server is running with kubectl top nodes or kubectl top pods, or just by checking whether there is a heapster or metrics-server pod in the kube-system namespace.
Those commands will also show you the information you are looking for. I won't go into details, as here you can find a lot of clues and ways of looking at cluster resource usage. You should probably take a look at cAdvisor too, which should already be present in the cluster. It exposes a web UI with live information about all the containers on the machine.
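If you want to work with that information programmatically, one option is to parse the text that kubectl top pods prints. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the sample text is made up in the usual three-column NAME / CPU(cores) / MEMORY(bytes) layout, and nothing here talks to a real cluster. In practice you would feed it the captured output of the command.

```python
# Binary suffixes that can appear in the memory column.
MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_cpu_millicores(text):
    """Convert a CPU value such as '250m' or '2' to millicores."""
    if text.endswith("m"):
        return int(text[:-1])
    return int(float(text) * 1000)


def parse_memory_bytes(text):
    """Convert a memory value such as '34Mi' to bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def parse_top_pods(output):
    """Parse 'kubectl top pods' style text into a list of dicts."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        name, cpu, memory = line.split()
        rows.append(
            {
                "name": name,
                "cpu_millicores": parse_cpu_millicores(cpu),
                "memory_bytes": parse_memory_bytes(memory),
            }
        )
    return rows


# Made-up sample standing in for real command output.
SAMPLE_OUTPUT = """\
NAME          CPU(cores)   MEMORY(bytes)
api-7d9f      12m          34Mi
worker-5c2a   250m         512Mi
"""

pods = parse_top_pods(SAMPLE_OUTPUT)
print(len(pods), "pods reported")
print(sum(pod["cpu_millicores"] for pod in pods), "millicores in use")
```

Counting the parsed rows gives the number of pods that currently report usage, which is one way to approximate "how many are running in parallel" without going through Prometheus.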
Other than that, there are probably commercial ways of achieving what you are looking for, for example SignalFx and similar products, but these will probably require the cluster administrator's involvement.

Get request count from Kubernetes service

Is there any way to get statistics such as service / endpoint access for services defined in Kubernetes cluster?
I've read about Heapster, but it doesn't seem to provide these statistics. Plus, the whole setup is tremendously complicated and relies on a ton of third-party components. I'd really like something much, much simpler than that.
I've been looking into what may be available in the kube-system namespace, and there's a bunch of containers and services there, Heapster included, but they are effectively inaccessible because they require authentication I cannot provide, and kubectl doesn't seem to have any API to access them (or does it?).
Heapster is the agent that collects the data, but then you need a monitoring agent to interpret that data. On GCP, for example, it is fluentd that gets these metrics and sends them to Stackdriver.
Prometheus is an excellent monitoring tool. I would recommend it if you are not on GCP.
If you are on GCP, then as mentioned above you have Stackdriver Monitoring, which is configured by default for K8s clusters. All you have to do is create a Stackdriver account (this is done with one click from the GCP Console), and you are good to go.

Kubernetes - monitor number requests

I have an app running on Google Container Engine.
I would like to monitor the number of requests per second my API is receiving.
How can I do this?
Is there a way to monitor this from historical metrics on Stackdriver, given that I have opted for Stackdriver Premium?
Looking at https://kubernetes.io/docs/tasks/debug-application-cluster/logging-stackdriver/ I see that Stackdriver is deployed in the cluster by default. It looks like that, according to https://cloud.google.com/monitoring/api/metrics, this includes the metrics you are looking for.
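Whichever backend stores the metric, requests per second is usually derived from a cumulative request counter by dividing the increase between two samples by the elapsed time. The following is a minimal sketch of that arithmetic only; the function name and the sample values are assumptions for illustration, and it does not call the Stackdriver API.

```python
def per_second_rates(samples):
    """Turn (timestamp_seconds, cumulative_count) samples into rates.

    Samples must be sorted by timestamp. A counter that goes down is
    treated as a process restart, so the new value is counted from zero.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # ignore duplicate or out-of-order samples
        delta = c1 - c0 if c1 >= c0 else c1
        rates.append((t1, delta / elapsed))
    return rates


# Made-up counter readings taken every 60 seconds; the last one
# shows the counter after the serving process restarted.
samples = [(0, 100), (60, 700), (120, 1300), (180, 200)]

rates = per_second_rates(samples)
for timestamp, rate in rates:
    print(f"t={timestamp}s: {rate:.2f} requests/second")
```

The reset handling matters for containers in particular, since a restarted pod starts its counter again from zero and a naive subtraction would produce a negative rate.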

Monitoring and alerting on pod status or restart with Google Container Engine (GKE) and Stackdriver

Is there a way to monitor the pod status and restart count of pods running in a GKE cluster with Stackdriver?
While I can see CPU, memory and disk usage metrics for all pods in Stackdriver there seems to be no way of getting metrics about crashing pods or pods in a replica set being restarted due to crashes.
I'm using a Kubernetes replica set to manage the pods, hence they are respawned and created with a new name when they crash. As far as I can tell the metrics in Stackdriver appear by pod-name (which is unique for the lifetime of the pod) which doesn't sound really sensible.
Alerting upon pod failures sounds like such a natural thing that it sounds hard to believe that this is not supported at the moment. The monitoring and alerting capabilities that I get from Stackdriver for Google Container Engine as they stand seem to be rather useless as they are all bound to pods whose lifetime can be very short.
So if this doesn't work out of the box are there known workarounds or best practices on how to monitor for continuously crashing pods?
You can achieve this manually as follows.
In Logs Viewer, create the following filter:
resource.labels.project_id="<PROJECT_ID>"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.namespace_name="<NAMESPACE, or default>"
jsonPayload.message:"failed liveness probe"
Create a metric by clicking the Create Metric button above the filter input and filling in the details.
You may now track this metric in Stackdriver.
Would be happy to be informed of a built-in metric instead of this.
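For anyone wondering what such a logs-based counter metric effectively measures, here is a rough local sketch of the same filter applied to already-exported log entries. It is an illustration only, not how Stackdriver implements it: the helper names, the project, cluster, and namespace values, and the sample messages are all made up, and treating the ":" operator as a case-insensitive substring match is an assumption of this sketch.

```python
def matches_filter(entry, project_id, cluster_name, namespace_name, needle):
    """Mimic the four-line filter above against one log entry dict."""
    labels = entry.get("resource", {}).get("labels", {})
    message = entry.get("jsonPayload", {}).get("message", "")
    return (
        labels.get("project_id") == project_id
        and labels.get("cluster_name") == cluster_name
        and labels.get("namespace_name") == namespace_name
        and needle.lower() in message.lower()
    )


def make_entry(namespace_name, message):
    """Build a made-up log entry with the fields the filter reads."""
    return {
        "resource": {
            "labels": {
                "project_id": "demo-project",
                "cluster_name": "demo-cluster",
                "namespace_name": namespace_name,
            }
        },
        "jsonPayload": {"message": message},
    }


entries = [
    make_entry("default", "Container api failed liveness probe, will be restarted"),
    make_entry("default", "Started container api"),
    make_entry("kube-system", "Container dns failed liveness probe, will be restarted"),
    make_entry("default", "Container worker failed liveness probe, will be restarted"),
]

count = sum(
    matches_filter(entry, "demo-project", "demo-cluster", "default", "failed liveness probe")
    for entry in entries
)
print("matching entries:", count)
```

The counter metric is simply this count tracked over time, which is why it keeps working even though the pods behind it are replaced and renamed.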
There is a built-in metric now, so it's easy to dashboard and/or alert on it without setting up custom metrics:
Metric: kubernetes.io/container/restart_count
Resource type: k8s_container
In my cluster (a bare-metal k8s cluster), I use kube-state-metrics https://github.com/kubernetes/kube-state-metrics to do what you want. This project belongs to the kubernetes organization and it is quite easy to use. Once it is deployed, you can use the kube_pod_container_status_restarts metric to know whether a container has restarted.
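Because pod names change every time a replica set respawns a pod, it can help to aggregate the restart counter by namespace and container name before comparing two scrapes. The sketch below shows that idea on made-up data; the function names and sample values are hypothetical, and it assumes the samples carry namespace, pod, and container labels.

```python
from collections import defaultdict


def restarts_by_container(samples):
    """Sum restart counts per (namespace, container), ignoring pod names."""
    totals = defaultdict(int)
    for labels, value in samples:
        totals[(labels["namespace"], labels["container"])] += int(value)
    return dict(totals)


def new_restarts(before, after):
    """Return only the keys whose restart total grew between two scrapes."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] > before.get(key, 0)
    }


# Made-up scrapes of the restart metric, one tuple per pod container.
earlier = [
    ({"namespace": "shop", "pod": "api-aaa", "container": "api"}, 1),
    ({"namespace": "shop", "pod": "api-bbb", "container": "api"}, 0),
    ({"namespace": "shop", "pod": "worker-ccc", "container": "worker"}, 2),
]
later = [
    ({"namespace": "shop", "pod": "api-aaa", "container": "api"}, 3),
    ({"namespace": "shop", "pod": "api-bbb", "container": "api"}, 0),
    ({"namespace": "shop", "pod": "worker-ccc", "container": "worker"}, 2),
]

changes = new_restarts(restarts_by_container(earlier), restarts_by_container(later))
print(changes)
```

Note that a replaced pod starts again from zero, so the aggregated total can also drop; this sketch only reports increases and leaves such drops unreported.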
Others have commented on how to do this with metrics, which is the right solution if you have a very large number of crashing pods.
An alternative approach is to treat crashing pods as discrete events or even log lines. You can do this with Robusta (disclaimer: I wrote it) with YAML like this:
triggers:
- on_pod_update: {}
actions:
- restart_loop_reporter:
    restart_reason: CrashLoopBackOff
- image_pull_backoff_reporter:
    rate_limit: 3600
sinks:
- slack
Here we're triggering an action named restart_loop_reporter whenever a pod updates. The data stream comes from the APIServer.
The restart_loop_reporter is an action which filters out non-crashing pods. Above it's configured to report only on CrashLoopBackOffs but you could remove that to report all crashes.
A benefit of doing it this way is that you can gather extra data about the crash automatically. For example, the above will fetch the pod's logs and forward them along with the crash report.
I'm sending the result here to Slack, but you could just as well send it to a structured output like Kafka (already builtin) or Stackdriver (not yet supported, but I can fix that if you like).
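For readers who want to see the filtering idea in isolation, here is a rough Python sketch of what "report only crash-looping pods, and not too often" can look like. It is not Robusta's code: the function and class names are made up, and the only real-world detail it relies on is the shape of a pod's status.containerStatuses as returned by the Kubernetes API (restartCount and state.waiting.reason).

```python
def crash_looping_containers(pod, restart_reason="CrashLoopBackOff"):
    """Return names of containers that restarted and match the reason.

    Pass restart_reason=None to report every container with restarts.
    """
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        if status.get("restartCount", 0) == 0:
            continue
        waiting = status.get("state", {}).get("waiting") or {}
        if restart_reason is None or waiting.get("reason") == restart_reason:
            names.append(status["name"])
    return names


class RateLimiter:
    """Allow one report per key within a period (hypothetical helper)."""

    def __init__(self, period_seconds):
        self.period_seconds = period_seconds
        self.last_sent = {}

    def allow(self, key, now):
        last = self.last_sent.get(key)
        if last is not None and now - last < self.period_seconds:
            return False
        self.last_sent[key] = now
        return True


# Made-up pod update event with one healthy and one crashing container.
pod = {
    "metadata": {"name": "api-aaa", "namespace": "shop"},
    "status": {
        "containerStatuses": [
            {"name": "api", "restartCount": 5,
             "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            {"name": "sidecar", "restartCount": 0,
             "state": {"running": {}}},
        ]
    },
}

limiter = RateLimiter(period_seconds=3600)
crashing = crash_looping_containers(pod)
decisions = [limiter.allow(("shop", "api-aaa"), now) for now in (0, 10, 3600)]
print(crashing, decisions)
```

The rate limiter is keyed by namespace and pod name here purely as an example; keying by the owning workload instead would suppress repeats across respawned pods.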
Remember that you can always raise a feature request if the available options are not enough.