Prometheus statsd-exporter - how to tag status code in request duration metric (histogram) - kubernetes

I have set up statsd-exporter to scrape metrics from a gunicorn web server. My goal is to filter the request duration metric to only successful requests (non-5xx), but in statsd-exporter there is no way to tag the status code in the duration metric. Can anyone suggest a way to add the status code to the request duration metric, or a way to filter only successful request durations in Prometheus?
In particular I want to extract a successful-request duration histogram from statsd-exporter into Prometheus.

To export successful-request duration histogram metrics from the gunicorn web server to Prometheus you would need to add this functionality in the gunicorn source code.
First take a look at the code that exports statsd metrics here.
You should see this piece of code:
status = resp.status
...
self.histogram("gunicorn.request.duration", duration_in_ms)
By changing the code to something like this:
self.histogram("gunicorn.request.duration.%d" % status, duration_in_ms)
from this moment you will have metric names exported with status codes, like gunicorn_request_duration_200 or gunicorn_request_duration_404 etc.
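For context, a minimal sketch of how the patched access() method in gunicorn's instrument/statsd.py could look (the exact surrounding code varies between gunicorn versions, so treat this as an illustration rather than a drop-in patch):

# Sketch of a patched access() on gunicorn's Statsd logger class.
# The duration computation and status parsing mirror the upstream
# code; only the histogram metric name changes.
def access(self, resp, req, environ, request_time):
    duration_in_ms = request_time.seconds * 1000 + float(request_time.microseconds) / 10 ** 3
    status = resp.status
    if isinstance(status, str):
        status = int(status.split(None, 1)[0])  # "200 OK" -> 200
    # Embed the status code in the metric name so a statsd_exporter
    # mapping rule can later turn it into a label.
    self.histogram("gunicorn.request.duration.%d" % status, duration_in_ms)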
You can also modify it a little bit and move the status codes to a label, by adding a configuration like the one below to your statsd_exporter:
mappings:
  - match: gunicorn.request.duration.*
    name: "gunicorn_http_request_duration"
    labels:
      status: "$1"
      job: "gunicorn_request_duration"
So your metrics will now look like this:
# HELP gunicorn_http_request_duration Metric autogenerated by statsd_exporter.
# TYPE gunicorn_http_request_duration summary
gunicorn_http_request_duration{job="gunicorn_request_duration",status="200",quantile="0.5"} 2.4610000000000002e-06
gunicorn_http_request_duration{job="gunicorn_request_duration",status="200",quantile="0.9"} 2.4610000000000002e-06
gunicorn_http_request_duration{job="gunicorn_request_duration",status="200",quantile="0.99"} 2.4610000000000002e-06
gunicorn_http_request_duration_sum{job="gunicorn_request_duration",status="200"} 2.4610000000000002e-06
gunicorn_http_request_duration_count{job="gunicorn_request_duration",status="200"} 1
gunicorn_http_request_duration{job="gunicorn_request_duration",status="404",quantile="0.5"} 3.056e-06
gunicorn_http_request_duration{job="gunicorn_request_duration",status="404",quantile="0.9"} 3.056e-06
gunicorn_http_request_duration{job="gunicorn_request_duration",status="404",quantile="0.99"} 3.056e-06
gunicorn_http_request_duration_sum{job="gunicorn_request_duration",status="404"} 3.056e-06
gunicorn_http_request_duration_count{job="gunicorn_request_duration",status="404"} 1
And now, to query all metrics except those with a 5xx status in Prometheus, you can run:
gunicorn_http_request_duration{status=~"[^5].*"}
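If you need the same filtered series programmatically, here is a minimal sketch against the Prometheus HTTP API (the localhost:9090 address is an assumption; adjust it to your Prometheus service):

import requests

# Run the same instant query through the Prometheus HTTP API and
# print every non-5xx duration series it returns.
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": 'gunicorn_http_request_duration{status=~"[^5].*"}'},
)
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])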
Let me know if it was helpful.

Related

Metrics from spring batch are not pushed to prometheus push gateway

I followed the approaches mentioned in this post. Basically I have my local Prometheus and push gateway set up using Docker, from the Spring Batch examples.
I have the dependencies below added in my build.gradle, which means the PrometheusPushGatewayManager bean is auto-configured and should push metrics to the configured gateway.
implementation("io.micrometer:micrometer-registry-prometheus:1.8.4")
implementation("io.prometheus:simpleclient_pushgateway:0.16.0")
My application.yml looks like this:
metrics:
  export:
    prometheus:
      enabled: true
      pushgateway:
        enabled: true
        base-url: http://0.0.0.0:9091
        job: main-job
        push-rate: 5s
      descriptions: true
But when I navigate to the /metrics endpoint, the metrics all have a value of 0, for example:
spring_batch_step_seconds_max{instance="",job="job",job_name="job-job-flow",name="process-5.csv",status="FAILED"} 0
spring_batch_step_seconds_max{instance="",job="job",job_name="job-job-flow",name="process-6.csv",status="COMPLETED"} 0
spring_batch_step_seconds_max{instance="",job="job",job_name="job-job-flow",name="process-7.csv",status="FAILED"} 0
spring_batch_step_seconds_max{instance="",job="job",job_name="job-job-flow",name="process-2csv",status="FAILED"} 0
spring_batch_step_seconds_max{instance="",job="job",job_name="job-job-flow",name="start-job-job",status="COMPLETED"} 0
I've checked this post, which indicates that we need to configure a registry, but if I'm using the auto-configured PrometheusPushGatewayManager by adding the simpleclient_pushgateway dependency, how do I configure a registry?
Setting a breakpoint and viewing the value of Metrics.globalRegistry.meters[1] shows values like SampleImpl{duration(seconds)=392.074203242, duration(nanos)=3.92074203242E11, startTimeNanos=1098399187818886}. So the values are captured, just not pushed properly.
Am I missing some configuration to get the metrics pushed properly to the gateway?
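As a quick way to rule out the gateway itself, a minimal push with the prometheus_client Python library (reusing the base-url and job name from the config above) can confirm whether the Pushgateway accepts and exposes pushed metrics; if this shows up under /metrics, the problem is on the Spring side:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Push a trivial metric to the same gateway the Spring app targets.
registry = CollectorRegistry()
g = Gauge('debug_push_check', 'Sanity-check metric', registry=registry)
g.set(42)
push_to_gateway('0.0.0.0:9091', job='main-job', registry=registry)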

Quarkus service endpoint always returns system info only

I can confirm that the endpoints are working in the unit tests through io.restassured.RestAssured. However, after I launch the service, every endpoint always returns a page of system info, e.g.
# HELP kafka_producer_node_request_total The total number of requests sent
# TYPE kafka_producer_node_request_total counter
kafka_producer_node_request_total{client_id="kafka-producer-metric-message-out",kafka_version="2.5.0",node_id="node--1",} 2.0
# HELP kafka_producer_connection_close_total The total number of connections closed
# TYPE kafka_producer_connection_close_total counter
kafka_producer_connection_close_total{client_id="kafka-producer-metric-message-out",kafka_version="2.5.0",} 0.0
# HELP kafka_producer_request_total The total number of requests sent
# TYPE kafka_producer_request_total counter
kafka_producer_request_total{client_id="kafka-producer-metric-message-out",kafka_version="2.5.0",} 2.0
# HELP kafka_producer_node_response_total The total number of responses received
# TYPE kafka_producer_node_response_total counter
kafka_producer_node_response_total{client_id="kafka-producer-metric-message-out",kafka_version="2.5.0",node_id="node--1",} 2.0
# HELP kafka_producer_node_response_rate The number of responses received per second
# TYPE kafka_producer_node_response_rate gauge
From the log I can see that the DBs are connected and schemas are evolved,
but where does this info come from, and why does it hijack my normal endpoints?
What a coincidence: it turned out that my application endpoint is also /metrics, and quarkus.http.non-application-root-path=/, so it kept getting hijacked by the Quarkus metrics endpoint.
Thanks to @loicmathieu.
The solution is to reconfigure the Quarkus non-application endpoints:
quarkus.http.non-application-root-path=/
quarkus.smallrye-health.root-path=/quarkus-metrics/health
quarkus.smallrye-health.ui.root-path=/quarkus-metrics/health-ui

How to wait for tekton pipelineRun conditions

I have the following code within a GitLab pipeline, which results in some kind of race condition:
kubectl apply -f pipelineRun.yaml
tkn pipelinerun logs -f pipeline-run
The tkn command immediately exits, since the pipelineRun object is not yet created. There is one very nice solution for this problem:
kubectl apply -f pipelineRun.yaml
kubectl wait --for=condition=Running --timeout=60s pipelinerun/pipeline-run
tkn pipelinerun logs -f pipeline-run
Unfortunately this is not working as expected, since Running does not seem to be a valid condition for a pipelineRun object. So my question is: what are the valid conditions of a pipelineRun object?
I didn't search too far and wide, but it looks like they only have two condition types imported from the knative.dev project?
https://github.com/tektoncd/pipeline/blob/main/vendor/knative.dev/pkg/apis/condition_types.go#L32
The link above is to the condition types imported by the pipeline source code, of which it looks like Tekton only uses "Ready" and "Succeeded".
const (
    // ConditionReady specifies that the resource is ready.
    // For long-running resources.
    ConditionReady ConditionType = "Ready"
    // ConditionSucceeded specifies that the resource has finished.
    // For resource which run to completion.
    ConditionSucceeded ConditionType = "Succeeded"
)
But there may be other imports of this nature elsewhere in the project.
Tekton TaskRuns and PipelineRuns only use a condition of type Succeeded.
Example:
conditions:
  - lastTransitionTime: "2020-05-04T02:19:14Z"
    message: "Tasks Completed: 4, Skipped: 0"
    reason: Succeeded
    status: "True"
    type: Succeeded
The different statuses and messages available for the Succeeded condition are listed in the documentation:
TaskRun: https://tekton.dev/docs/pipelines/taskruns/#monitoring-execution-status
PipelineRun: https://tekton.dev/docs/pipelines/pipelineruns/#monitoring-execution-status
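If you need to wait on that condition from a script, a minimal polling sketch with the kubernetes Python client could look like this (the default namespace, the pipeline-run name, and the tekton.dev/v1beta1 API version are assumptions taken from the question and may need adjusting):

import time
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Poll the PipelineRun until its Succeeded condition leaves Unknown,
# i.e. until the run reaches a terminal state.
while True:
    run = api.get_namespaced_custom_object(
        group="tekton.dev", version="v1beta1", namespace="default",
        plural="pipelineruns", name="pipeline-run",
    )
    conditions = run.get("status", {}).get("conditions", [])
    succeeded = next((c for c in conditions if c["type"] == "Succeeded"), None)
    if succeeded and succeeded["status"] != "Unknown":
        # "True" means the run completed, "False" means it failed.
        print(succeeded["status"], succeeded.get("reason"))
        break
    time.sleep(5)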
As a side note, there is an activity timeout available in the API. That timeout is not surfaced to the CLI options though. You could create a tkn feature request for that.

How to get YARN "Memory Total" and "VCores Total" metrics programmatically in pyspark

I've been looking around this:
https://docs.actian.com/vectorhadoop/5.0/index.html#page/User/YARN_Configuration_Settings.htm
but none of those configs are what I need.
"yarn.nodemanager.resource.memory-mb" was promising, but it's only for the node manager it seems and only gets master's mem and cpu, not the cluster's.
int(hl.spark_context()._jsc.hadoopConfiguration().get('yarn.nodemanager.resource.memory-mb'))
You can access those metrics from the YARN ResourceManager REST API.
URL: http://rm-http-address:port/ws/v1/cluster/metrics
metrics:
totalMB
totalVirtualCores
Example response (can also be XML):
{ "clusterMetrics": {
"appsSubmitted":0,
"appsCompleted":0,
"appsPending":0,
"appsRunning":0,
"appsFailed":0,
"appsKilled":0,
"reservedMB":0,
"availableMB":17408,
"allocatedMB":0,
"reservedVirtualCores":0,
"availableVirtualCores":7,
"allocatedVirtualCores":1,
"containersAllocated":0,
"containersReserved":0,
"containersPending":0,
"totalMB":17408,
"totalVirtualCores":8,
"totalNodes":1,
"lostNodes":0,
"unhealthyNodes":0,
"decommissioningNodes":0,
"decommissionedNodes":0,
"rebootedNodes":0,
"activeNodes":1,
"shutdownNodes":0 } }
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Metrics_API
All you need is to figure out your ResourceManager web address and port; check your configuration files. I can't help you with that, since I don't know where you manage YARN.
When you have the URL, access it with Python:
import requests

url = 'http://rm-http-address:port/ws/v1/cluster/metrics'
response = requests.get(url)
# Parse the JSON response and pull out the cluster-wide totals
metrics = response.json()['clusterMetrics']
total_mb = metrics['totalMB']
total_vcores = metrics['totalVirtualCores']
Of course, no Hadoop or Spark context is needed in this solution.
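If you'd rather not hardcode the ResourceManager address, it can often be read from the same Hadoop configuration the question already queries; a sketch, assuming an existing SparkContext sc (in HA setups the key may be unset or carry an rm-id suffix such as yarn.resourcemanager.webapp.address.rm1):

# Read the ResourceManager webapp address from the Hadoop configuration
# and build the cluster metrics URL from it.
rm_address = sc._jsc.hadoopConfiguration().get('yarn.resourcemanager.webapp.address')
url = 'http://%s/ws/v1/cluster/metrics' % rm_address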

Haproxy exporter unable to fetch data

I am using haproxy_exporter with Prometheus, and have added Prometheus as a datasource in Grafana, with the HAProxy plugin using Prometheus as its datasource, in order to fetch HAProxy stats and show them on the Grafana server. But I am not able to get any output from it.
When I run the command below, I get the error invalid URL port.
./haproxy_exporter --no-haproxy.ssl-verify --haproxy.scrape-uri="http://user:$(cat pwfile)192.168.1.10:10000/haproxy/stats;csv"
OUTPUT:
INFO[0000] Starting haproxy_exporter (version=0.9.0, branch=master, revision=0cae8ee3e3f3b7c517db2cc68f386672d8b1b6a7) source=haproxy_exporter.go:495
INFO[0000] Build context (go=go1.10.1, user=root@rlinux57, date=20180724-16:08:06) source=haproxy_exporter.go:496
INFO[0000] Listening on :9101 source=haproxy_exporter.go:521
ERRO[0013] Can't scrape HAProxy: Get http://admin:abEDokA("192.168.1.10:10000/haproxy/stats;csv: invalid URL port abEDokA("192.168.1.10:10000" source=haproxy_exporter.go:315
And when I placed an @ sign between the password and the IP address, such as ./haproxy_exporter --no-haproxy.ssl-verify --haproxy.scrape-uri="http://admin:abEDokA("@192.168.1.10:10000/haproxy/stats;csv"
it gives the error below:
INFO[0000] Starting haproxy_exporter (version=0.9.0, branch=master, revision=0cae8ee3e3f3b7c517db2cc68f386672d8b1b6a7) source=haproxy_exporter.go:495
INFO[0000] Build context (go=go1.10.1, user=root@rlinux57, date=20180724-16:08:06) source=haproxy_exporter.go:496
FATA[0000] parse http://admin:abEDokA("@192.168.1.10:10000/haproxy/stats;csv: net/url: invalid userinfo source=haproxy_exporter.go:500
And my prometheus settings are:
- job_name: 'haproxy'
  # metrics_path defaults to '/metrics'
  # scheme defaults to 'http'.
  static_configs:
    - targets: ['localhost:9101']
You need the @ in there, and you might need to get rid of the " in your password. Maybe simply escaping it (\") could work, but the second error message suggests haproxy_exporter correctly receives the URL as http://admin:abEDokA("@192.168.1.10:10000/haproxy/stats;csv but is then unable to parse it.
Yup: according to http://www.ietf.org/rfc/rfc1738.txt, " is not a valid character in a URL. You can get around it by using its escape, %22.
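For instance, a quick way to produce the escaped form is Python's standard urllib (the credentials are just the ones from the question):

from urllib.parse import quote

# Percent-encode every reserved character in the password:
# '(' becomes %28 and '"' becomes %22.
password = 'abEDokA("'
encoded = quote(password, safe='')
print('http://admin:%s@192.168.1.10:10000/haproxy/stats;csv' % encoded)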