Prometheus using multiple targets - kubernetes

We need to monitor several targets with Prometheus. When we had a short list of targets it was not a problem to modify the configuration by hand, but now we need to add many targets (50-70 new ones) from different clusters.
My question: is there a more elegant way to achieve this, instead of writing it like this:
- job_name: blackbox-http # To get metrics about the exporter's targets
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - http://clusterA
        - https://clusterA
        - http://clusterB
        - http://clusterC
        - http://clusterC
        ...
Maybe I could mount additional files, one per cluster, i.e. provide a file with targets for clusterA only, another file for clusterB only, and so on. Is that possible?
And the same for jobs: mount each job from its own file.

When you have a growing or variable list of targets, the best way to manage the job definition is to use SRV records instead of static_configs.
With SRV records you only need to define a dns_sd_config with a single name that is resolved via a DNS query. You then no longer have to change the configuration every time you add a target; you only add it to the DNS record.
An example from the documentation, adapted to your question:
- job_name: 'myjob'
  metrics_path: /probe
  params:
    module: [http_2xx]
  dns_sd_configs:
    - names:
        - 'telemetry.http.srv.example.org'
        - 'telemetry.https.api.srv.example.org'
You can use an internal DNS service to generate those records. If you have a mix of http and https targets, you will probably need two records, because the SRV record defines the port to use.
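Putting it together, a minimal sketch of the blackbox job with DNS-based discovery might look like the following (the SRV name and the blackbox-exporter:9115 address are assumptions; adjust them to your setup). The relabel_configs block is the usual pattern for passing each discovered target to the exporter's /probe endpoint:
- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]
  dns_sd_configs:
    - names:
        - 'telemetry.http.srv.example.org'
  relabel_configs:
    # The discovered host:port becomes the probe target...
    - source_labels: [__address__]
      target_label: __param_target
    # ...and also the instance label, so results stay readable.
    - source_labels: [__param_target]
      target_label: instance
    # Scrape the blackbox exporter itself, not the target.
    - target_label: __address__
      replacement: blackbox-exporter:9115   # assumed exporter address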

Related

How can I make Grafana:Loki parse through HAProxy Logs and appropriately assign labels to specific line items in the log?

I'm trying to get Grafana:Loki & Promtail working in my environment. Our goal is to pull information from /var/logs/haproxy.log to track the traffic hitting each of our servers, specifically client IP addresses, so we can graph them over time. HAProxy has an exporter service that historically works with Prometheus; however, we're unable to set up the exporter due to specific security requirements on our end, and it also involves a reboot that we do not want to do at the moment. So we've discovered Loki by Grafana, which can pull the raw log, but it's up to us to design a proper regular expression configuration that extracts the information we want.
Long story long, I've managed to get Loki set up without much of an issue, same with Promtail. However, I ran into an issue trying to configure Promtail to grab the information we want from our log files. I was able to find a regular expression that somebody else had written, along with some labels; however, the labeling does not work appropriately in Grafana, so I'm kind of stuck. Below is the Promtail config file with two stages: one to parse the log data and one to label it.
I'm not sure this is the best way to approach this, but I'm stuck and don't know what to do. Is there a better way to grab the specific information I want from the HAProxy logs? Or is anyone able to help me write a regular expression and labels for the information I want?
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  # Promtail records log offsets in this file; it is updated on every collection cycle.
  # Even if the service goes down, the next restart resumes from the offsets recorded here.
  filename: /tmp/positions.yaml
clients:
  - url: http://10.10.140.53:3100/loki/api/v1/push
scrape_configs:
  # The following section is very similar to a Prometheus scrape configuration.
  - job_name: system
    static_configs:
      - targets:
          - 10.10.140.53
        labels:
          # Static labels: every log line under this job carries at least these two labels.
          job: haproxy
          # HAProxy logs do not carry a standard log level marker, so Loki cannot derive one;
          # if the logs did include one, this label would not be needed.
          level: info
          __path__: /var/log/*log.1
    pipeline_stages:
      - match:
          # This stage processes the logs; the selector filters the log streams it applies to.
          selector: '{job="haproxy"}'
          stages:
            - regex:
                # RE2-format regular expression; ?P<XXX> captures the matching part into a named variable.
                expression: '^[^[]+\s+(\w+)\[(\d+)\]:([^:]+):(\d+)\s+\[([^\]]+)\]\s+[^\s]+\s+(\w+)\/(\w+)\s+(\d+)\/(\d+)\/(\d+)\/(\d+)\/(\d*)\s+(\d+)\s+(\d+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s(\d+)\/(\d+)\/(\d+)\/(\d+)\/(\d+)\s+(\d+)\/(\d+)\s+\{([^}]*)\}\s\{([^}]*)\}\s+\"([^"]+)\"$'
            # Emit the "output" capture as the new log line, dropping the timestamp and other leading fields.
            - output:
                source: output
      - match:
          # The same selector is used again, but by now the log stream has already been processed once,
          # so the unwanted information is gone.
          selector: '{job="haproxy"}'
          stages:
            - regex:
                expression: 'frontend:(?P<frontend>\S+) haproxy:(?P<haproxy>\S+) client:(?P<client>\S+) method:(?P<method>\S+) code:(?P<code>\S+) url:(?P<url>\S+) destination:(?P<destination>\S+)}?$'
            - labels:
                # Dynamic label generation from the named captures
                frontend:
                method:
                code:
                destination:
Here's a log file example:
Dec 13 11:35:50 haproxy haproxy[8733]: frontend:app_https_frontend/haproxy/10.10.150.53:443 client:111.222.333.444:38034 GMT:13/Dec/2022:11:35:50 +0000 body:- request:GET /bower_components/modernizr/modernizr.js?v=3.30.0 HTTP/1.1
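A minimal single-pass sketch matched against the example line above (the regex and label set are illustrative only and checked just against this one line; client and url are deliberately kept out of the labels to avoid high label cardinality in Loki):
pipeline_stages:
  - match:
      selector: '{job="haproxy"}'
      stages:
        - regex:
            # Named RE2 captures pulled straight from the example line.
            expression: 'frontend:(?P<frontend>[^/\s]+)\S*\s+client:(?P<client>\S+).*request:(?P<method>\S+)\s+(?P<url>\S+)'
        - labels:
            # Only low-cardinality captures become labels; client and url stay in the log line.
            frontend:
            method: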

Request URI too large for Grafana - kubernetes dashboard

We are running nearly 100 instances in production for a Kubernetes cluster and are using the Prometheus server to build a Grafana dashboard. To monitor disk usage, the query below is used:
(sum(node_filesystem_size_bytes{instance=~"$Instance"}) - sum(node_filesystem_free_bytes{instance=~"$Instance"})) / sum(node_filesystem_size_bytes{instance=~"$Instance"})
As the instance IPs get substituted and we are using nearly 80 instances, I am getting the error "Request URI too large". Can someone help fix this issue?
You only need to specify the instances once; aggregate by instance and use the on matching operator so the other selectors are restricted to the matching series:
(sum by(instance) (node_filesystem_size_bytes{instance=~"$Instance"})
  - on(instance) sum by(instance) (node_filesystem_free_bytes))
/ on(instance) sum by(instance) (node_filesystem_size_bytes)
Note that without the by(instance) grouping the sums collapse into a single series and the on(instance) matching has nothing to match on; this version returns the usage ratio per instance.
Consider also adding a unifying label to your time series so you can do something like ...{instance_type="group-A"} instead of explicitly specifying instances.
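For illustration, with such a label in place (instance_type="group-A" here is a hypothetical label name and value), the dashboard query shrinks to something like:
(
  sum(node_filesystem_size_bytes{instance_type="group-A"})
  - sum(node_filesystem_free_bytes{instance_type="group-A"})
)
/ sum(node_filesystem_size_bytes{instance_type="group-A"})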

Helm globally controlling values of dependencies from requirements

We have a set of microservices (MS-a, MS-b, MS-c, ...), each with its own dependencies specified in requirements.yaml. Some have the same requirements, e.g. mongodb.
For deployment of the overall application we created an umbrella chart that references all the microservices as its dependencies in requirements.yaml.
We provide a single values.yaml file to the umbrella chart that contains the values for all the MS charts. That works fine until it comes to providing values to the dependency charts of the MS charts.
One prominent example is mongodb.clusterDomain.
In the values.yaml file the clusterDomain value has to be repeated for each MS section:
MS-a:
  mongodb:
    clusterDomain: cluster.local
MS-b:
  mongodb:
    clusterDomain: cluster.local
That screams for trouble when it comes to maintainability. Is there a way to move those values to some global section, so that they are only specified once? e.g.:
global:
  mongodb:
    clusterDomain: cluster.local
I have tried to use YAML anchors (https://helm.sh/docs/chart_template_guide/yaml_techniques/#yaml-anchors); it would look like this:
global:
  mongodb:
    clusterDomain: &clusterDomain cluster.local
MS-a:
  mongodb:
    clusterDomain: *clusterDomain
MS-b:
  mongodb:
    clusterDomain: *clusterDomain
It does not reduce the structural complexity, but it makes maintenance easier, because the value only needs to be set in one place.
However, it seems to have a very nasty pitfall when it comes to overriding values via --set.
According to the link above:
While there are a few cases where anchors are useful, there is one aspect of them that can cause subtle bugs: The first time the YAML is consumed, the reference is expanded and then discarded.
In practice this means it will not be possible to override the clusterDomain values by providing
--set global.mongodb.clusterDomain=cluster.local
because that only replaces the value in the global section, while all the other places keep their original value.
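To illustrate the pitfall, a hedged sketch of the effective values (assuming the anchor layout above):
# What Helm sees after YAML parsing - the anchor is already expanded:
global:
  mongodb:
    clusterDomain: cluster.local
MS-a:
  mongodb:
    clusterDomain: cluster.local   # a plain copy, no longer linked to the anchor
# After --set global.mongodb.clusterDomain=my.domain:
global:
  mongodb:
    clusterDomain: my.domain       # only this key changes
MS-a:
  mongodb:
    clusterDomain: cluster.local   # stays at the old value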
Is there any other way(s) to make it possible to set subcharts values globally in one place?

Create custom Argo artifact type

Whenever an S3 artifact is used, the following declaration is needed:
s3:
  endpoint: s3.amazonaws.com
  bucket: "{{workflow.parameters.input-s3-bucket}}"
  key: "{{workflow.parameters.input-s3-path}}/scripts/{{inputs.parameters.type}}.xml"
  accessKeySecret:
    name: s3-access-user-creds
    key: accessKeySecret
  secretKeySecret:
    name: s3-access-user-creds
    key: secretKeySecret
It would be helpful if this could be abstracted to something like:
custom-s3:
  bucket: "{{workflow.parameters.input-s3-bucket}}"
  key: "{{workflow.parameters.input-s3-path}}/scripts/{{inputs.parameters.type}}.xml"
Is there a way to make this kind of custom definition in Argo to reduce boilerplate?
For a given Argo installation, you can set a default artifact repository in the workflow controller's configmap. This allows you to specify only the key (assuming everything else is set in the default config; if not, you'll need to specify the missing fields as well).
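A hedged sketch of what that could look like in the controller's configmap (the bucket name is an assumption, the secret names are taken from the example above; see the Argo artifact repository docs for the full set of fields):
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  artifactRepository: |
    s3:
      endpoint: s3.amazonaws.com
      bucket: my-default-bucket          # assumed bucket name
      accessKeySecret:
        name: s3-access-user-creds
        key: accessKeySecret
      secretKeySecret:
        name: s3-access-user-creds
        key: secretKeySecret
With that in place, an artifact in the workflow only needs the key, e.g. s3: { key: "{{workflow.parameters.input-s3-path}}/scripts/{{inputs.parameters.type}}.xml" }.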
Unfortunately, that will only work if you're only using one S3 config. If you need multiple configurations, cutting down on boilerplate will be more difficult.
In response to your specific question: not exactly. You can't create a custom some-keyname (like custom-s3) as a member of the artifacts array. The exact format of the YAML is defined in Argo's Workflow Custom Resource Definition. If your Workflow YAML doesn't match that specification, it will be rejected.
However, you can use external templating tools to populate boilerplate before the YAML is installed in your cluster. I've used Helm before to do exactly that with a collection of S3 configs. At the simplest, you could use something like sed.
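As a rough sketch of the templating route (the chart and helper names are made up, and note that Argo's own {{...}} expressions would need escaping so Helm does not try to render them), a Helm named template could carry the boilerplate:
{{- define "mychart.s3artifact" -}}
s3:
  endpoint: s3.amazonaws.com
  bucket: {{ .bucket | quote }}
  key: {{ .key | quote }}
  accessKeySecret:
    name: s3-access-user-creds
    key: accessKeySecret
  secretKeySecret:
    name: s3-access-user-creds
    key: secretKeySecret
{{- end }}
Each artifact in the workflow template then becomes a one-liner such as {{ include "mychart.s3artifact" (dict "bucket" .Values.inputBucket "key" "scripts/foo.xml") | nindent 10 }}.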
tl;dr - for one S3 config, use default artifact config; for multiple S3 configs, use a templating tool.

How To Reduce Prometheus(Federation) Scrape Duration

I have a Prometheus federation with two Prometheus servers, one per Kubernetes cluster, and a central one to rule them all.
Over time the scrape durations increase. At some point the scrape duration exceeds the timeout, metrics get lost, and alerts fire.
I'm trying to reduce the scrape duration by dropping metrics, but this is an uphill battle and feels more like Sisyphus than Prometheus.
Does anyone know a way to reduce the scrape time without losing metrics and without having to drop more and more as time progresses?
Thanks in advance!
Per the Prometheus documentation, these settings determine the global scrape timeout and the rule evaluation frequency:
global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]
  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]
  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]
...and for each scrape job the configuration allows setting job-specific values:
# The job name assigned to scraped metrics by default.
job_name: <job_name>
# How frequently to scrape targets from this job.
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]
# Per-scrape timeout when scraping this job.
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]
Not knowing more about the number of targets and the number of metrics per target, I can suggest configuring an appropriate scrape_timeout per job and adjusting the global evaluation_interval accordingly.
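For example, a hedged sketch of a federation job with a longer per-job timeout (the target address and the match[] selector are placeholders; scrape_timeout must stay below the job's scrape_interval):
scrape_configs:
  - job_name: federate-cluster-a
    honor_labels: true
    metrics_path: /federate
    scrape_interval: 2m
    scrape_timeout: 90s          # longer than the 10s default, still below the interval
    params:
      'match[]':
        - '{job=~".+"}'          # placeholder selector; narrow this down if possible
    static_configs:
      - targets:
          - prometheus-cluster-a:9090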
Another option, in combination with the suggestion above or on its own, is to have Prometheus instances dedicated to scraping non-overlapping sets of targets. That makes it possible to scale Prometheus and to use a different evaluation_interval per set of targets: for example, a longer scrape_timeout and a less frequent evaluation_interval (a higher value) for jobs that take longer, so that they don't affect other jobs.
Also, check that no exporter is misbehaving by accumulating metrics over time instead of just providing current readings at scrape time; otherwise the payload returned to Prometheus will keep growing.
It isn't recommended to build data replication on top of Prometheus federation, since it doesn't scale with the number of active time series, as can be seen in the described case. It is better to set up data replication via the Prometheus remote_write protocol. For example, add the following lines to the Prometheus config in order to enable data replication to a VictoriaMetrics remote storage located at the given url:
remote_write:
  - url: http://victoriametrics-host:8428/api/v1/write
The following docs may be useful for further reading:
remote_write config docs
supported remote storage systems in Prometheus
remote_write tuning docs