Locality LoadBalacing not working on Istio - kubernetes

We have a kubernetes cluster with ~100 nodes with istio and want to enable the Locality LoadBalancing feature. This will save us up to 70k USD/year because our interzone data traffic is too high.
I've followed the docs and setup the istio configmap like this:
...
meshNetworks: {}
localityLbSetting:
enabled: true
distribute:
- from: us-east-1/us-east-1a/*
to:
"us-east-1/us-east-1a/*": 100
- from: us-east-1/us-east-1b/*
to:
"us-east-1/us-east-1b/*": 100
...
And then deployed 2 apps, one of them just responds with the zone where the node is deployed (we are using a VirtualService) and the another one just do the requests.
The requests that are coming from node in us-east-1a should only be replied by the nodes in the same zone, right?
But it's not happening.
We also tried to set this variable inside pilot pods:
PILOT_ENABLE_LOCALITY_LOAD_BALANCING
When I get logs from one pod that is deployed in zone "us-east-1a" it shows replies from both zones.
Istio Version: 1.2.8
Kubernetes Version: 1.14
Any help is appreciated! Thank you!

I'm afraid your configuration is invalid in case of 'Locality' weights between regions/zones in context of 'Locality Load Balancing' feature in 'distribute' mode.
The logs of your istio-pilot should give you a clue about it, in the form of warning similar to this one:
<timestamp> warn failed to read mesh configuration, using default: 1 error occurred:
* locality weight must not be in range [1, 100]
I don't think you can find it documented anywhere in Istio documentation, but the logic behind the weights' validation can be found here.

From #panicked's comment:
The pods where the requests are generated (src pods) have to belong to a K8s service themselves too, even if the service is not directly involved in the request.
As a side note, K8s recommends:
If the goal of the operator is not to distribute load across zones and regions but rather to restrict the regionality of failover to meet other operational requirements an operator can set a ‘failover’ policy instead of a ‘distribute’ policy.
but the distribute seems to work just fine.
For failover (and failoverPriority) you must also have outlierDetection defined.

Related

How to set MaxRevisionTimeoutSeconds in Knative?

I have deployed a service using Cloud run on gke which uses Knative as an abstraction over k8s. The default MaxRevisionTimeoutSeconds is set to 600s in the knative default config but according to this PR this is customizable.
I couldn't find anything in the official Knative documentation, can anybody help me out here?
UPDATE:
After digging a bit more in knative source code and documentation. It looks like that the MaxRevisionTimeoutSeconds is defined in resource=ConfigMap/config-defaults. So have to update it with custom value.
From this it looks like we can use something called as operator to modify the ConfigMap resource but it did not work probably because gcp's does not use operator to install Knative components. Anyways I went on to install the operator and then used resource=knativeserving to overwrite the config-defaults. But this also did not work when I tried re-deploying service.
The next solution is to directly edit the config-defaults using kubectl edit. I even tried doing this but encountered weird behavior. After editing the YAML file when I used kubectl describe to check the changed value, it sometimes shows the modified value, sometimes shows the old value, and sometimes doesn't even show that particular key-value pair in the YAML. Also, it doesn't work when trying to re-deploy the service after doing this edit.
If anyone can help me with this, it would be really great.
MaxRevisionTimeoutSeconds is a cluster-global setting which enforces the max value for TimeoutSeconds on each Revision. This value exists so that cluster administrators can set upper bounds on the amount of time a single HTTP request can be in the system. Knowing an upper bound can be useful when configuring graceful shutdown settings on the HTTP routing components to prevent dropped requests during upgrades.
It's possible that Cloud Run on GKE has overridden these configurations so that they can upgrade the underlying Istio and Knative components on a predictable schedule. (If you have a 10% upgrade budget and it takes 10m to drain a component, your minimum upgrade time is probably around 110m, taking into account additional scheduling / image fetch / startup time.)

One Traefik Pod in Kubernetes fails with error: 'command traefik error: field not found, node: redirect'

I'm running Traefik on a Kubernetes cluster to manage Ingress, which has been running ok for a long time.
I recently implemented Cluster-Autoscaling, which works fine except that on one Node (newly created by the Autoscaler) Traefik won't start. It sits in CrashLoopBackoff, and when I log the Pod I get: [date] [time] command traefik error: field not found, node: redirect.
Google found no relevant results, and the error itself is not very descriptive, so I'm not sure where to look.
My best guess is that it has something to do with the RedirectRegex Middleware configured in Traefik's config file:
[entryPoints.http.redirect]
regex = "^http://(.+)(:80)?/(.*)"
replacement = "https://$1/$3"
Traefik actually works still - I can still access all of my apps from their urls in my browser, even those which are on the node with the dead Traefik Pod.
The other Traefik Pods on other Nodes still run happily, and the Nodes are (at least in theory) identical.
After further googling, I found this on Reddit. Turns out Traefik updated a few days ago to v2.0, which is not backwards compatible.
Only this pod had the issue, because it was the only one for which a new (v2.0) image was pulled (being the only recently created Node).
I reverted to v1.7 until I have time to fix it properly. Had update the Daemonset to use v1.7, then kill the Pod so it could be recreated from the old image.
The devs have a Migration Guide that looks like it may help.
"redirect" is gone but now there is "RedirectScheme" and "RedirectRegex" as a new concept of "Middlewares".
It looks like they are moving to a pipeline approach, so you can define a chain of "middlewares" to apply to an "entrypoint" to decide how to direct it and what to add/remove/modify on packets in that chain. "backends" are now "providers", and they have a clearer, modular concept of configuration. It looks like it will offer better organization than earlier versions.

How Can I Monitor Persistent Volume Metrics in Kubernetes 1.13?

I have a kubernetes 1.13 cluster running on Azure and I'm using multiple persistent volumes for multiple applications.
I have setup monitoring with Prometheus, Alertmanager, Grafana.
But I'm unable to get any metrics related to the PVs.
It seems that kubelet started to expose some of the metrics from kubernetes 1.8 , but again stopped since 1.12
I have already spoken to Azure team about any workaround to collect the metrics directly from the actual FileSystem (Azure Disk in my case). But even that is not possible.
I have also heard some people using sidecars in the Pods to gather PV metrics. But I'm not getting any help on that either.
It would be great even if I get just basic details like consumed / available free space.
I'm was having the same issue and solved it by joining two metrics:
avg(label_replace(
1 - node_filesystem_free_bytes{mountpoint=~".*pvc.*"} / node_filesystem_size_bytes,
"volumename", "$1", "mountpoint", ".*(pvc-[^/]*).*")) by (volumename)
+ on(volumename) group_left(namespace, persistentvolumeclaim)
(0 * kube_persistentvolumeclaim_info)
As an explanation, I'm adding a label volumename to every time-series of
node_filesystem*, cut out of the existing mountpoint label and then joining with the other metrics containing the additional labels. Multiplying by 0 ensures this is otherwise a no-op.
Also quick warning: I or you may be using some relabeling configs making this not work out immediately without adaptation.

What is the difference between the core os projects kube-prometheus and prometheus operator?

The github repo of Prometheus Operator https://github.com/coreos/prometheus-operator/ project says that
The Prometheus Operator makes the Prometheus configuration Kubernetes native and manages and operates Prometheus and Alertmanager clusters. It is a piece of the puzzle regarding full end-to-end monitoring.
kube-prometheus combines the Prometheus Operator with a collection of manifests to help getting started with monitoring Kubernetes itself and applications running on top of it.
Can someone elaborate this?
I've always had this exact same question/repeatedly bumped into both, but tbh reading the above answer didn't clarify it for me/I needed a short explanation. I found this github issue that just made it crystal clear to me.
https://github.com/coreos/prometheus-operator/issues/2619
Quoting nicgirault of GitHub:
At last I realized that prometheus-operator chart was packaging
kube-prometheus stack but it took me around 10 hours playing around to
realize this.
**Here's my summarized explanation:
"kube-prometheus" and "Prometheus Operator Helm Chart" both do the same thing:
Basically the Ingress/Ingress Controller Concept, applied to Metrics/Prometheus Operator.
Both are a means of easily configuring, installing, and managing a huge distributed application (Kubernetes Prometheus Stack) on Kubernetes:**
What is the Entire Kube Prometheus Stack you ask? Prometheus, Grafana, AlertManager, CRDs (Custom Resource Definitions), Prometheus Operator(software bot app), IaC Alert Rules, IaC Grafana Dashboards, IaC ServiceMonitor CRDs (which auto-generate Prometheus Metric Collection Configuration and auto hot import it into Prometheus Server)
(Also when I say easily configuring I mean 1,000-10,000++ lines of easy for humans to understand config that generates and auto manage 10,000-100,000 lines of machine config + stuff with sensible defaults + monitoring configuration self-service, distributed configuration sharding with an operator/controller to combine config + generate verbose boilerplate machine-readable config from nice human-readable config.
If they achieve the same end goal, you might ask what's the difference between them?
https://github.com/coreos/kube-prometheus
https://github.com/helm/charts/tree/master/stable/prometheus-operator
Basically, CoreOS's kube-prometheus deploys the Prometheus Stack using Ksonnet.
Prometheus Operator Helm Chart wraps kube-prometheus / achieves the same end result but with Helm.
So which one to use?
Doesn't matter + they achieve the same end result + shouldn't be crazy difficult to start with 1 and switch to the other.
Helm tends to be faster to learn/develop basic mastery of.
Ksonnet is harder to learn/develop basic mastery of, but:
it's more idempotent (better for CICD automation) (but it's only a difference of 99% idempotent vs 99.99% idempotent.)
has built-in templating which means that if you have multiple clusters you need to manage / that you want to always keep consistent with each other. Then you can leverage ksonnet's templating to manage multiple instances of the Kube Prometheus Stack (for multiple envs) using a DRY code base with lots of code reuse. (If you only have a few envs and Prometheus doesn't need to change often it's not completely unreasonable to keep 4 helm values files in sync by hand. I've also seen Jinja2 templating used to template out helm values files, but if you're going to bother with that you may as well just consider ksonnet.)
Kubernetes operator are kubernetes specific application(pods) that configure, manage and optimize other Kubernetes deployments automatically. They are implemented as a custom controller.
According to official coreOS website:
Operators were introduced by CoreOS as a class of software that operates other software, putting operational knowledge collected by humans into software.
The prometheus operator provides the easy way to deploy configure and monitor your prometheus instances on kubernetes cluster. To do so, prometheus operator introduces three types of custom resource definition(CRD) in kubernetes.
Prometheus
Alertmanager
ServiceMonitor
Now, with the help of above CRD's, you can directly create a prometheus instance by providing kind: Prometheus and the prometheus instance is ready to serve, likewise you can do for AlertManager. Without this you would have to setup the deployment for prometheus with its image, configuration and many more things.
The Prometheus Operator serves to make running Prometheus on top of Kubernetes as easy as possible, while preserving Kubernetes-native configuration options.
Now, kube-prometheus implemented the prometheus operator and provides you minimum yaml files to create your basic setup of prometheus, alertmanager and grafana by running a single command.
git clone https://github.com/coreos/prometheus-operator.git
kubectl apply -f prometheus-operator/contrib/kube-prometheus/manifests/
By running above command in kube-prometheus directory, you will get a monitoring namespace which will have an instance of alertmanager, prometheus and grafana for UI. This is enough setup for most of the basic implementation and if you need any more specifics according to your application, you can add more yamls of exporter you need.
Kube-prometheus is more of a contribution to prometheus-operator project, which implements the prometheus operator functionality very well and provide you a complete monitoring setup for your kubernetes cluster. You can start with kube-prometheus and extend the functionality of your monitoring setup according to your application from there.
You can learn more about prometheus-operator here
As of today, 28-09-2020, this is the way to install Prometheus in a Kubernetes cluster
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#kube-prometheus-stack
According to official documentation, kube-prometheus-stack is a rename of prometheus-operator.
As I understood, kube-prometheus-stack also has preinstalled grafana dashboards and prometheus rules.
Note: This chart was formerly named prometheus-operator chart, now
renamed to more clearly reflect that it installs the kube-prometheus
project stack, within which Prometheus Operator is only one component.
Taken from https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
Architecturally the container runs docker
The default container logs are managed by Docker, and the default log driver uses JSON-file
log-driver": "json-file
https://docs.docker.com/config/containers/logging/configure/
If the default jSON-file is used to manage container logs, log rotation is not performed by default. Therefore, the default JSON-file log driver the log files stored by the log driver can result in a large amount of disk space for containers that generate a large amount of output, which can cause disk space to run out.
In this case, save the log to ES, store it separately, and periodically delete the index using curator kubernetes
And run a scheduled task in K8S to delete the index periodically
Another solution for disk space is to periodically delete old logs from jSON-files
Typically we set the size and number of logs
This will set up a maximum of 10 log files, each with a maximum size of 20 Mb. Therefore, the container has a maximum of 200 Mb of logs
"log-driver": "json-file", "log-opts": { "max-size": "20m", "max-file": "10" },
Note: In general, the default Docker log is placed
/var/lib/docker/containers/
But in the same case kubernetes also saves logs and creates a directory structure to help you find pods-based logs, so you can find container logs for each Pod running on a node
/var/log/pods/<namespace>_<pod_name>_<pod_id>/<container_name>/
When removing pod, / var/lib/container under the docker/containers/log and k8s created under/var/log/pods/pod log will be deleted
For example, if the POD is restarted during production, the pod log will be deleted whether it is on the original node or jumped to another node
Therefore, this log needs to be saved in ES for centralized management. Many R&D projects will check the log for troubleshooting in most cases

kubernetes petset on google cloud

I am running a kubernetes cluster on google cloud(version 1.3.5) .
I found a redis.yaml
that uses petset to create a redis cluster but when i run kubectl create -f redis.yaml i get the following error :
error validating "redis.yaml": error validating data: the server could not find the requested resource (get .apps); if you choose to ignore these errors, turn validation off with --validate=false
i cant find why i get this error or how to solve this.
PetSet is currently an alpha feature (which you can tell because the apiVersion in the linked yaml file is apps/v1alpha1). It may not be obvious, but alpha features are not supported in Google Container Engine.
As described in api_changes.md, alpha level API objects are disabled by default, have no guarantees that they will exist in future versions, can break compatibility with older versions at any time, and may destabilize the cluster.
I'm using PetSet with some success, for example https://github.com/Yolean/kubernetes-mysql-cluster, in zone europe-west1-d but when I tried europe-west1-c I got the aforementioned error.
Google just enabled Alpha Clusters for GKE as announced here: https://cloud.google.com/container-engine/docs/alpha-clusters
Now you are able (but not SLA covered) to use all alpha features within an alpha cluster, what was disable previously.