mongodb_exporter exports only one MongoDB metric (--collect-all is on)

I have an issue with mongodb_exporter (the Prometheus metrics exporter for MongoDB). I think it's a configuration problem on my side, but after two days of searching I'm out of ideas :)
I run MongoDB on K8S with mongodb_exporter as a sidecar container.
The exporter starts OK (I think, because there are no errors) and displays some metrics, but my problem is that they are only "go" metrics (see below); I get just one "mongodb" metric: mongodb_up 1. This happens even if I pass the --collect-all or --collector.collstats options.
I do not get any "useful" metrics for my "config" database, such as collection sizes, etc.
The connection to the database is OK, because if I change the username, password or DB port I run into connection problems.
My user should have the correct rights, I think (the real password is replaced with "password" in my text):
Successfully added user: {
    "user" : "exporter",
    "roles" : [
        {
            "role" : "clusterMonitor",
            "db" : "admin"
        },
        {
            "role" : "read",
            "db" : "local"
        }
    ]
}
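For reference, output like the above would come from a createUser call along these lines (a sketch run in the mongo shell against the admin database; the password is a placeholder):
use admin
db.createUser({
  user: "exporter",
  pwd: "password",   // placeholder, not the real password
  roles: [
    { role: "clusterMonitor", db: "admin" },
    { role: "read", db: "local" }
  ]
})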
Here is my pod configuration:
- name: metrics
  image: docker.io/bitnami/mongodb-exporter:latest
  command:
    - /bin/bash
    - '-ec'
  args:
    - >
      /bin/mongodb_exporter --web.listen-address ":9216"
      --mongodb.uri=mongodb://exporter:password#localhost:27017/config? --log.level="debug" --collect-all
  ports:
    - name: metrics
      containerPort: 9216
      protocol: TCP
  env:
  resources:
    limits:
      cpu: 50m
      memory: 250Mi
    requests:
      cpu: 25m
      memory: 50Mi
  livenessProbe:
    httpGet:
      path: /
      port: metrics
      scheme: HTTP
    initialDelaySeconds: 15
    timeoutSeconds: 5
    periodSeconds: 5
    successThreshold: 1
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /
      port: metrics
      scheme: HTTP
    initialDelaySeconds: 5
    timeoutSeconds: 1
    periodSeconds: 5
    successThreshold: 1
    failureThreshold: 3
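For reference, one way to pull the raw metrics for inspection (a sketch; the pod name is a placeholder):
# forward the exporter port (9216, from --web.listen-address above) to the local machine
kubectl port-forward pod/<mongodb-pod> 9216:9216
# in another shell: the exporter serves its metrics on /metrics
curl -s http://localhost:9216/metrics | grep "^mongodb_"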
Logs
Exporter start log (with debug enabled):
time="2023-02-03T09:02:25Z" level=debug msg="Compatible mode: false"
time="2023-02-03T09:02:25Z" level=debug msg="Connection URI: mongodb://exporter:password#localhost:27017/config?"
level=info ts=2023-02-03T09:02:25.224Z caller=tls_config.go:195 msg="TLS is disabled." http2=false
Displayed metrics:
# HELP collector_scrape_time_ms Time taken for scrape by collector
# TYPE collector_scrape_time_ms gauge
collector_scrape_time_ms{collector="general",exporter="mongodb"} 0
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 17
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.17.13"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 3.655088e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 3.655088e+06
[....]
# HELP mongodb_up Whether MongoDB is up.
# TYPE mongodb_up gauge
mongodb_up 1
[...]
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 7.35940608e+08
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
Environment
K8S
MongoDB version: 4.2.0
Thanks in advance for any help :)

Related

Kubernetes - What happens if startupProbe runs beyond periodSeconds

I have a Deployment which runs a simple Apache server. I want to execute some commands after the service is up. I am not quite sure how much time the post-action commands are going to take. I have timeoutSeconds set to more than periodSeconds.
Kubernetes version: 1.25
apiVersion: apps/v1
kind: Deployment
metadata:
  name: readiness
spec:
  replicas: 1
  selector:
    matchLabels:
      app: readiness
  template:
    metadata:
      labels:
        app: readiness
    spec:
      containers:
        - image: sujeetkp/readiness:3.0
          name: readiness
          resources:
            limits:
              memory: "500M"
              cpu: "1"
          readinessProbe:
            httpGet:
              path: /health_monitor
              port: 80
            initialDelaySeconds: 20
            timeoutSeconds: 10
            failureThreshold: 20
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health_monitor
              port: 80
            initialDelaySeconds: 60
            timeoutSeconds: 10
            failureThreshold: 20
            periodSeconds: 10
          startupProbe:
            exec:
              command:
                - /bin/sh
                - -c
                - |-
                  OUTPUT=$(curl -s -o /dev/null -w %{http_code} http://localhost:80/health_monitor)
                  if [ $? -eq 0 ] && [ $OUTPUT -ge 200 ] && [ $OUTPUT -lt 400 ]
                  then
                    echo "Success" >> /tmp/post_action_track
                    if [ ! -f /tmp/post_action_success ]
                    then
                      # Trigger Post Action
                      sleep 60
                      echo "Success" >> /tmp/post_action_success
                    fi
                  else
                    exit 1
                  fi
            initialDelaySeconds: 20
            timeoutSeconds: 80
            failureThreshold: 20
            periodSeconds: 10
When I run this code, I see very strange results.
As periodSeconds is 10 and my script has a sleep of 60 seconds, shouldn't the startup probe trigger at least 6 times? It only triggers 2 times.
I am checking the contents of the files /tmp/post_action_success and /tmp/post_action_track to identify how many times the probe triggers (counting the number of "Success" lines inside the files).
Question: if the previous instance of the startup probe is still running, is the startupProbe triggered on top of it or not? If yes, why did it trigger only twice in my case?
Another observation:
When I set the options below
initialDelaySeconds: 20
timeoutSeconds: 5
failureThreshold: 20
periodSeconds: 10
then the content of the file /tmp/post_action_success shows sleep/timeoutSeconds (60/5) = 12 "Success" entries.
Can someone please explain how this works?
I think the reason you see the probe being triggered twice is because of timeoutSeconds: 80. See this question. Also, the official docs are quite handy in explaining the other fields.
Perhaps you can set initialDelaySeconds: 61 instead of using sleep in your script?
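For illustration, the timing part of that suggestion could look roughly like this (a sketch only; the post-action bookkeeping from the original script is omitted, and curl -sf stands in for the manual status-code check):
startupProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - curl -sf http://localhost:80/health_monitor   # probe only; no sleep here
  # the one-off wait moves here, before the first probe run
  initialDelaySeconds: 61
  timeoutSeconds: 10
  failureThreshold: 20
  periodSeconds: 10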

Kafka Mirrormaker2 config optimization

I am setting up Strimzi Kafka MirrorMaker2 in our test environment, which receives on average 100k messages per 5 minutes. We have around 25 topics and 900 partitions in total for these topics. The default configuration I set up is mirroring only 60k messages per 5 minutes to the DR cluster. I am trying to optimize this configuration for better throughput and latency.
apiVersion: v1
items:
  - apiVersion: kafka.strimzi.io/v1beta2
    kind: KafkaMirrorMaker2
    spec:
      clusters:
        - alias: source
          authentication:
            certificateAndKey:
              certificate: user.crt
              key: user.key
              secretName: mirrormaker1
            type: tls
          bootstrapServers: bootstrap1:443
          tls:
            trustedCertificates:
              - certificate: ca.crt
                secretName: cert-source
        - alias: target
          authentication:
            certificateAndKey:
              certificate: user.crt
              key: user.key
              secretName: mirrormaker-dr
            type: tls
          bootstrapServers: bootstrap2:443
          config:
            offset.flush.timeout.ms: 120000
          tls:
            trustedCertificates:
              - certificate: ca.crt
                secretName: dest-cert
      connectCluster: target
      livenessProbe:
        initialDelaySeconds: 40
        periodSeconds: 40
        timeoutSeconds: 30
      metricsConfig:
        type: jmxPrometheusExporter
        valueFrom:
          configMapKeyRef:
            key: mm2-metrics-config.yaml
            name: mm2-metrics
      mirrors:
        - checkpointConnector:
            config:
              checkpoints.topic.replication.factor: 3
            tasksMax: 10
          groupsPattern: .*
          heartbeatConnector:
            config:
              heartbeats.topic.replication.factor: 3
          sourceCluster: source
          sourceConnector:
            config:
              consumer.request.timeout.ms: 150000
              offset-syncs.topic.replication.factor: 3
              refresh.topics.interval.seconds: 60
              replication.factor: 3
              source.cluster.producer.enable.idempotence: "true"
              sync.topic.acls.enabled: "true"
              target.cluster.producer.enable.idempotence: "true"
            tasksMax: 60
          targetCluster: target
          topicsPattern: .*
      readinessProbe:
        initialDelaySeconds: 40
        periodSeconds: 40
        timeoutSeconds: 30
      replicas: 4
      resources:
        limits:
          cpu: 9
          memory: 30Gi
        requests:
          cpu: 5
          memory: 15Gi
      version: 2.8.0
With the above config I don't see any errors in the log files.
I tried to fine-tune the config for more throughput and lower latency as follows:
consumer.max.partition.fetch.bytes: 2097152
consumer.max.poll.records: 1000
consumer.receive.buffer.bytes: 131072
consumer.request.timeout.ms: 200000
consumer.send.buffer.bytes: 262144
offset-syncs.topic.replication.factor: 3
producer.acks: 0
producer.batch.size: 20000
producer.buffer.memory: 30331648
producer.linger.ms: 10
producer.max.request.size: 2097152
producer.message.max.bytes: 2097176
producer.request.timeout.ms: 150000
I am seeing the following errors in the logs now, but data is still flowing and the number of messages increased slightly to around ~65k/5 min. I also increased the tasksMax count from 60 to 800 and replicas from 4 to 8, but I don't see any difference from doing this. Also, the network bytes in is around ~20 MiB/s. Even though I further increased consumer.request.timeout.ms, the error below didn't disappear.
2022-04-26 04:09:51,223 INFO [Consumer clientId=consumer-null-1601, groupId=null] Error sending fetch request (sessionId=629190882, epoch=65) to node 4: (org.apache.kafka.clients.FetchSessionHandler) [task-thread-us-ashburn-1->us-phoenix-1-dr.MirrorSourceConnector-759]
org.apache.kafka.common.errors.DisconnectException
Is there anything I can do to increase the throughput and decrease the latency?
I haven't configured Strimzi Kafka MirrorMaker before, but at first look the producer and consumer configs seem to be the same as what is exposed by the kafka-clients library. Assuming that is the case, the producer's batch.size, which is set to 20000, is not a number of records; it is in bytes, which means that with this config the producer will transmit a maximum of only 20 kilobytes per send. Try increasing it to 65536 (64 kilobytes) or higher. If the throughput still doesn't increase, increase linger.ms to 100 or higher, so that the producer waits longer for each batch to fill up before triggering a send.
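As a rough sketch (these values are only starting points, not tuned recommendations), the producer overrides in the sourceConnector config could be changed to something like:
sourceConnector:
  config:
    # batch.size is in bytes, not records; 64 KiB per batch instead of ~20 KB
    producer.batch.size: 65536
    # let each batch fill for up to 100 ms before sending
    producer.linger.ms: 100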

Keep getting error status on Kubernetes Cron Job with connection refused?

I am trying to write a cron job which hits a REST endpoint of the application whose image it pulls.
Below is the sample code:
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: {{ .Chart.Name }}-cronjob
  labels:
    app: {{ .Release.Name }}
    chart: {{ .Chart.Name }}-{{ .Chart.Version }}
    release: {{ .Release.Name }}
spec:
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 2
  startingDeadlineSeconds: 1800
  jobTemplate:
    spec:
      template:
        metadata:
          name: {{ .Chart.Name }}-cronjob
          labels:
            app: {{ .Chart.Name }}
        spec:
          restartPolicy: OnFailure
          containers:
            - name: demo
              image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
              command: ["/bin/sh", "-c", "curl http://localhost:8080/hello"]
              readinessProbe:
                httpGet:
                  path: "/healthcheck"
                  port: 8081
                initialDelaySeconds: 300
                periodSeconds: 60
                timeoutSeconds: 30
                failureThreshold: 3
              livenessProbe:
                httpGet:
                  path: "/healthcheck"
                  port: 8081
                initialDelaySeconds: 300
                periodSeconds: 60
                timeoutSeconds: 30
                failureThreshold: 3
              resources:
                requests:
                  cpu: 200m
                  memory: 2Gi
                limits:
                  cpu: 1
                  memory: 6Gi
  schedule: "*/5 * * * *"
But I keep running into curl: (7) Failed to connect to localhost port 8080: Connection refused.
I can see from the events that it creates the container and immediately throws: Back-off restarting failed container.
I already have pods of the demo app running and they work fine; it is only when I try to point to this existing app and hit a REST endpoint that I start running into connection refused errors.
Exact output when looking at the logs:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port 8080: Connection refused
Event Logs:
Container image "wayfair/demo:728ac13-as_test_cron_job" already present on machine
9m49s Normal Created pod/demo-cronjob-1619108100-ndrnx Created container demo
6m17s Warning BackOff pod/demo-cronjob-1619108100-ndrnx Back-off restarting failed container
5m38s Normal SuccessfulDelete job/demo-cronjob-1619108100 Deleted pod: demo-cronjob-1619108100-ndrnx
5m38s Warning BackoffLimitExceeded job/demo-cronjob-1619108100 Job has reached the specified backoff limit
Being new to K8s, any pointers are helpful!
You are trying to connect to localhost:8080 with your curl, which doesn't make sense from what I understand of your CronJob definition.
From the docs (https://kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container/#define-a-command-and-arguments-when-you-create-a-pod):
The command and arguments that you define in the configuration file override the default command and arguments provided by the container image. If you define args, but do not define a command, the default command is used with your new arguments.
Note: The command field corresponds to entrypoint in some container runtimes. Refer to the Notes below.
If you define a command for the image, even if the image would start a REST application on port 8080 on localhost with its default entrypoint (or command, depending on the container type you are using), your command overrides the entrypoint and no application is started.
If you need both to start the application and then perform other operations, like curls and so on, I suggest using a .sh script or something like that, depending on what the Job's objective is.
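For example, if the demo app is already running behind a Service, a sketch of a CronJob container that only performs the call could look like this (demo-service is a placeholder for whatever Service fronts your pods, and curlimages/curl is just a small image that ships curl):
containers:
  - name: demo-curl
    # a small image that ships curl; the app image is not needed just to call it
    image: curlimages/curl:latest
    command: ["/bin/sh", "-c", "curl -sf http://demo-service:8080/hello"]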

Docker Compose health check of HTTP API using tools outside the container

I am implementing a Docker Compose health check for the Prysm Docker container. Prysm is an Ethereum 2 node.
My goal is to ensure that the RPC APIs (gRPC, JSON-RPC) of Prysm are up before starting other services in the same Docker Compose file, as those services depend on Prysm. I can use depends_on in the Docker Compose file for this, but I need to figure out how to construct a check that tells whether Prysm's HTTP ports are ready to accept traffic.
The equivalent Kubernetes health check is:
readinessProbe:
  initialDelaySeconds: 180
  timeoutSeconds: 1
  periodSeconds: 60
  failureThreshold: 3
  successThreshold: 1
  httpGet:
    path: /healthz
    port: 9090
    scheme: HTTP
livenessProbe:
  initialDelaySeconds: 60
  timeoutSeconds: 1
  periodSeconds: 60
  failureThreshold: 60
  successThreshold: 1
  httpGet:
    path: /healthz
    port: 9090
    scheme: HTTP
The problem with the Prysm image is that it lacks the normal UNIX tools (curl, netcat, /bin/sh) one usually uses to create such checks.
Is there a way to implement an HTTP health check with Docker Compose that would use features built into Compose (are there any?) or commands from the host system instead of ones within the container?
I managed to accomplish this by creating another service using the Dockerize image.
version: '3'
services:

  # Oracle connects to ETH1 and ETH2 nodes
  # oracle:
  stakewise:
    container_name: stakewise-oracle
    image: stakewiselabs/oracle:v1.0.1
    # Do not start oracle service until beacon health check succeeds
    depends_on:
      beacon_ready:
        condition: service_healthy

  # ETH2 Prysm node
  beacon:
    container_name: eth2-beacon
    image: gcr.io/prysmaticlabs/prysm/beacon-chain:latest
    restart: always
    hostname: beacon-chain

  # An external startup check tool for Prysm
  # Using https://github.com/jwilder/dockerize
  # Simply wait until the TCP port of the RPC becomes available before
  # starting the Oracle to avoid errors on startup.
  beacon_ready:
    image: jwilder/dockerize
    container_name: eth2-beacon-ready
    command: "/bin/sh -c 'while true ; do dockerize -wait tcp://beacon-chain:3500 -timeout 300s ; sleep 99 ; done'"
    depends_on:
      - beacon
    healthcheck:
      test: ["CMD", "dockerize", "-wait", "tcp://beacon-chain:3500"]
      interval: 1s
      retries: 999

What would be the OPA policy in .rego for the following examples?

I am new to OPA and Kubernetes, and I don't have much knowledge or experience in this field. I would like to have a policy in Rego code (an OPA policy) and execute it to see the result.
The examples are the following:
Always Pull Images - Ensure every container sets its ‘imagePullPolicy’ to ‘Always’
Check for Liveness Probe - Ensure every container sets a livenessProbe
Check for Readiness Probe - Ensure every container sets a readinessProbe
For the following, I would like to have an OPA policy:
1. Always Pull Images:
apiVersion: v1
kind: Pod
metadata:
  name: test-image-pull-policy
spec:
  containers:
    - name: nginx
      image: nginx:1.13
      imagePullPolicy: IfNotPresent
2. Check for Liveness Probe
3. Check for Readiness Probe
containers:
  - name: opa
    image: openpolicyagent/opa:latest
    ports:
      - name: http
        containerPort: 8181
    args:
      - "run"
      - "--ignore=.*"   # exclude hidden dirs created by Kubernetes
      - "--server"
      - "/policies"
    volumeMounts:
      - readOnly: true
        mountPath: /policies
        name: example-policy
    livenessProbe:
      httpGet:
        scheme: HTTP   # assumes OPA listens on localhost:8181
        port: 8181
      initialDelaySeconds: 5   # tune these periods for your environment
      periodSeconds: 5
    readinessProbe:
      httpGet:
        path: /health?bundle=true   # Include bundle activation in readiness
        scheme: HTTP
        port: 8181
      initialDelaySeconds: 5
      periodSeconds: 5
Is there any way to create the OPA policies for the above conditions? Could anyone help, as I am new to OPA? Thanks in advance.
For the liveness and readiness probe checks, you can simply test whether those fields are defined:
package kubernetes.admission

deny["container is missing livenessProbe"] {
    container := input_container[_]
    not container.livenessProbe
}

deny["container is missing readinessProbe"] {
    container := input_container[_]
    not container.readinessProbe
}

input_container[container] {
    container := input.request.object.spec.containers[_]
}
# Always Pull Images
package kubernetes.admission

deny[msg] {
    input.request.kind.kind = "Pod"
    container = input.request.object.spec.containers[_]
    container.imagePullPolicy != "Always"
    msg = sprintf("Forbidden imagePullPolicy value \"%v\"", [container.imagePullPolicy])
}
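To try these rules out locally, you could evaluate them with the opa CLI against a sample admission request (a sketch; the file names are placeholders, and input.json has to mimic an AdmissionReview object):
# save each policy snippet to its own .rego file and put a sample
# AdmissionReview request into input.json, then list all violations:
opa eval --data probes.rego --data pullpolicy.rego --input input.json "data.kubernetes.admission.deny"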