OpenTelemetry auto-instrumentation does not work without sidecar - Kubernetes

I work at a startup, and we recently migrated our workloads to Kubernetes; specifically, we are running a cluster on EKS (AWS).
I'm currently trying to implement an observability stack on our cluster. I'm running SigNoz on a separate EC2 instance (for tests, and because our cluster uses small machines that are not supported by its Helm chart).
In the cluster, I'm running the OpenTelemetry Operator, have deployed a Collector in deployment mode, and have validated that it is able to connect to the SigNoz instance. However, when I try to auto-instrument my applications, I'm not able to do it without using sidecars.
My manifests for the resources above are below.
apiVersion: v1
kind: Namespace
metadata:
  name: opentelemetry
  labels:
    name: opentelemetry
---
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: opentelemetry
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
      batch:
        send_batch_size: 10000
        timeout: 10s
    exporters:
      otlp:
        endpoint: obs.stg.company.domain:4317
        tls:
          insecure: true
      logging:
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp, logging]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp, logging]
        logs:
          receivers: [otlp]
          processors: []
          exporters: [otlp, logging]
---
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
  namespace: opentelemetry
spec:
  exporter:
    endpoint: http://otel-collector-collector.opentelemetry.svc.cluster.local:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
  dotnet:
  nodejs:
When I apply the annotation instrumentation.opentelemetry.io/inject-dotnet=opentelemetry/auto-instrumentation to the application's Deployment, or even to the namespace, and delete the pod (so it is recreated), I can see that the init container for the .NET auto-instrumentation runs without a problem, but I get no traces, metrics, or logs, either on the Collector or in SigNoz.
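For reference, this is roughly how I apply the annotation on the Deployment's pod template (the names, namespace, and image below are placeholders; the annotation value is the same one I describe above):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-dotnet-app            # placeholder name
  namespace: application         # placeholder namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-dotnet-app
  template:
    metadata:
      labels:
        app: my-dotnet-app
      annotations:
        # the annotation described above, applied to the pod template
        instrumentation.opentelemetry.io/inject-dotnet: "opentelemetry/auto-instrumentation"
    spec:
      containers:
        - name: app
          image: registry.example.com/my-dotnet-app:latest   # placeholder image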
If I create another Collector in sidecar mode (like the one below), point the Instrumentation at this collector, and also apply the annotation sidecar.opentelemetry.io/inject=sidecar to the namespace, everything works fine.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: sidecar
  namespace: application
spec:
  mode: sidecar
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
    exporters:
      logging:
      otlp:
        endpoint: "http://otel-collector-collector.opentelemetry.svc.cluster.local:4317"
        tls:
          insecure: true
    service:
      telemetry:
        logs:
          level: "debug"
      pipelines:
        traces:
          receivers: [otlp]
          processors: []
          exporters: [logging, otlp]
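For completeness, the namespace annotation for the sidecar injection looks roughly like this (the namespace name is a placeholder for the one our applications run in):
apiVersion: v1
kind: Namespace
metadata:
  name: application   # placeholder namespace
  annotations:
    # injects the Collector named "sidecar" (defined above) into pods created in this namespace
    sidecar.opentelemetry.io/inject: "sidecar"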
The reason I am trying to do this without sidecars is that, as I said before, we have quite a small cluster and would like to keep overhead to a minimum.
So, first, I would like to understand whether I should even be worried about sidecars, that is, whether their overhead is measurably different from not using them.
Second, I would like to understand what went wrong with my config, since I believe I followed all the instructions in SigNoz's documentation.
Thank you for any help you can provide.

Related

How to change queue.mem.events in Kibana ECK operator APM server?

I've been getting these 503 "queue is full" errors from the APM Server in Kibana:
{"request_id": "1854b2e8-2930-4895-ac96-a213e1253215", "method": "POST", "URL": "/intake/v2/events", "content_length": -1, "remote_address": "10.36.0.128", "user-agent": "elasticapm-node/3.6.1 elastic-apm-http-client/9.4.0 node/10.17.0", "response_code": 503, "error": "queue is full"}
Looking at the documentation, they suggest changing queue.mem.events and output.elasticsearch.bulk_max_size, besides scaling the server horizontally or vertically. The problem is that I haven't found a way to tune these variables in the ECK ApmServer operator YAML:
Tried this:
---
apiVersion: apm.k8s.elastic.co/v1
kind: ApmServer
metadata:
  name: apm-server
  namespace: apm
spec:
  version: 7.5.1
  count: 12
  elasticsearchRef:
    name: "elasticsearch"
  config:
    queue.mem.events: 8198
    output.elasticsearch.bulk_max_size: 8198
    apm-server:
      rum.enabled: true
      ilm.enabled: true
      rum.event_rate.limit: 1000
      rum.event_rate.lru_size: 1000
      rum.allow_origins: ['']
      rum.library_pattern: "node_modules|bower_components|~"
      rum.exclude_from_grouping: "^/webpack"
      rum.source_mapping.enabled: true
      rum.source_mapping.cache.expiration: 5m
      rum.source_mapping.index_pattern: "apm--sourcemap*"
  http:
    service:
      spec:
        type: LoadBalancer
    tls:
      selfSignedCertificate:
        disabled: true
Also, I added a few nodes to the Elasticsearch cluster, but the variables still don't seem to be set correctly, because a few minutes later the APM queue gets full again and the errors reappear.
So, how can I change these variables in the ApmServer Kubernetes CRD?
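For reference, these are the settings I am trying to reach, as they would look in a plain apm-server.yml outside of ECK (a minimal sketch; the values are just the ones I tried above):
# standalone apm-server.yml equivalent of what I am trying to set through the CRD
queue.mem.events: 8198
output.elasticsearch.bulk_max_size: 8198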

Route changes apply too slowly in Istio and make the deployment fail

I am working on a DevOps solution and trying to automate blue-green deployments on Kubernetes. However, we are facing an issue where Istio applies the route rules too slowly: when we remove VirtualServices, it takes a long time for the change to become effective. We currently wait 60s for the rules to update before destroying the old pods, but we have no idea whether 60s is enough to finish the route change, and we will have downtime if it takes longer to take effect. I would like some advice on how to check that the route (to the green deployment only) has been updated properly, and on how to make Istio apply the change faster. Thanks.
Here is the YAML file used to apply the VirtualService:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  namespace: xxx-d
  name: xxx-virtualservice
  labels:
    microservice: xxx-new
spec:
  hosts:
    - xxx.com
  gateways:
    - mesh
    - http-gateway.istio-system.svc.cluster.local
    - https-gateway.istio-system.svc.cluster.local
  http:
    - headers:
        request:
          set:
            x-forwarded-port: '443'
            x-forwarded-proto: https
      route:
        - destination:
            host: xxx-service.svc.cluster.local
            port:
              number: 8080
      retries:
        attempts: 3
        retryOn: gateway-error,connect-failure,refused-stream
      timeout: 3s

ArgoCD Helm chart - Repository not accessible

I'm trying to add a helm chart (https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) to ArgoCD.
When I do this, I get the following error:
Unable to save changes: application spec is invalid: InvalidSpecError: repository not accessible: repository not found
Can you guys help me out, please? I think I did everything right, but it seems something's wrong...
Here's the Application YAML:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prom-oper
  namespace: argocd
spec:
  project: prom-oper
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    targetRevision: "13.2.1"
    path: prometheus-community/kube-prometheus-stack
    helm:
      # Release name override (defaults to application name)
      releaseName: prom-oper
      version: v3
      values: |
        ... redacted
    directory:
      recurse: false
  destination:
    server: https://kubernetes.default.svc
    namespace: prom-oper
  syncPolicy:
    automated: # automated sync by default retries failed attempts 5 times with the following delays between attempts (5s, 10s, 20s, 40s, 80s); retry controlled using the `retry` field
      prune: false # specifies if resources should be pruned during auto-syncing (false by default)
      selfHeal: false # specifies if partial app sync should be executed when resources are changed only in the target Kubernetes cluster and no git change detected (false by default)
      allowEmpty: false # allows deleting all application resources during automatic syncing (false by default)
    syncOptions: # sync options which modify sync behavior
      - CreateNamespace=true # namespace auto-creation ensures that the namespace specified as the application destination exists in the destination cluster
    # The retry feature is available since v1.7
    retry:
      limit: 5 # number of failed sync attempt retries; unlimited number of attempts if less than 0
      backoff:
        duration: 5s # the amount to back off; default unit is seconds, but could also be a duration (e.g. "2m", "1h")
        factor: 2 # a factor to multiply the base duration after each failed retry
        maxDuration: 3m # the maximum amount of time allowed for the backoff strategy
And here is the ConfigMap where I added the Helm repo:
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
  name: argocd-cm
  namespace: argocd
data:
  admin.enabled: "false"
  repositories: |
    - type: helm
      url: https://prometheus-community.github.io/helm-charts
      name: prometheus-community
The reason you are getting this error is that, because of the way the Application is defined, Argo thinks it's a Git repository instead of a Helm repository.
Define the source object with a "chart" property instead of "path", like so:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prom-oper
  namespace: argocd
spec:
  project: prom-oper
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    targetRevision: "13.2.1"
    chart: kube-prometheus-stack
You can see it defined on line 128 in Argo's application-crd.yaml

How to specify a GKE node pool configuration in a YAML file instead of using gcloud container node-pools create?

It seems that the only way to create node pools on Google Kubernetes Engine is with the command gcloud container node-pools create. I would like to have all the configuration in a YAML file instead. What I tried is the following:
apiVersion: v1
kind: NodeConfig
metadata:
  annotations:
    cloud.google.com/gke-nodepool: ares-pool
spec:
  diskSizeGb: 30
  diskType: pd-standard
  imageType: COS
  machineType: n1-standard-1
  metadata:
    disable-legacy-endpoints: 'true'
  oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
  serviceAccount: default
But kubectl apply fails with:
error: unable to recognize "ares-pool.yaml": no matches for kind "NodeConfig" in version "v1"
I am surprised that Google yields almost no relevant results for all my searches. The only documentation that I found was the one on Google Cloud, which is quite incomplete in my opinion.
Node pools are not Kubernetes objects; they are part of the Google Cloud API. Therefore Kubernetes does not know about them, and kubectl apply will not work.
What you actually need is a solution called "infrastructure as code" (IaC): code that tells GCP what kind of node pool you want.
If you don't strictly need YAML, you can check out Terraform, which handles this use case. See: https://terraform.io/docs/providers/google/r/container_node_pool.html
You can also look into Google Deployment Manager or Ansible (it has a GCP module and uses YAML syntax); they also address your need.
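For example, a rough Ansible sketch using the google.cloud collection's gcp_container_node_pool module could look like the playbook below (project, zone, cluster reference, and credentials are placeholders; check the module documentation for the exact fields):
# playbook.yml - sketch of creating a GKE node pool with Ansible
- hosts: localhost
  tasks:
    - name: Create node pool
      google.cloud.gcp_container_node_pool:
        name: ares-pool
        initial_node_count: 1
        cluster: "{{ gke_cluster }}"            # output of a gcp_container_cluster task (placeholder)
        location: us-central1-a                 # placeholder zone
        project: my-project                     # placeholder project ID
        auth_kind: serviceaccount
        service_account_file: /path/to/sa.json  # placeholder credentials
        config:
          machine_type: n1-standard-1
          disk_size_gb: 30
          oauth_scopes:
            - https://www.googleapis.com/auth/devstorage.read_only
            - https://www.googleapis.com/auth/logging.write
            - https://www.googleapis.com/auth/monitoring
        state: present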
I don't know if it answers your needs accurately, but if you want to do IaC in general with Kubernetes, you can use Crossplane CRDs. If you already have a running cluster, you just have to install their Helm chart, and then you can provision a cluster this way:
apiVersion: container.gcp.crossplane.io/v1beta1
kind: GKECluster
metadata:
  name: gke-crossplane-cluster
spec:
  forProvider:
    initialClusterVersion: "1.19"
    network: "projects/development-labs/global/networks/opsnet"
    subnetwork: "projects/development-labs/regions/us-central1/subnetworks/opsnet"
    ipAllocationPolicy:
      useIpAliases: true
    defaultMaxPodsConstraint:
      maxPodsPerNode: 110
And then you can define an associated node pool as follows:
apiVersion: container.gcp.crossplane.io/v1alpha1
kind: NodePool
metadata:
  name: gke-crossplane-np
spec:
  forProvider:
    autoscaling:
      autoprovisioned: false
      enabled: true
      maxNodeCount: 2
      minNodeCount: 1
    clusterRef:
      name: gke-crossplane-cluster
    config:
      diskSizeGb: 100
      # diskType: pd-ssd
      imageType: cos_containerd
      labels:
        test-label: crossplane-created
      machineType: n1-standard-4
      oauthScopes:
        - "https://www.googleapis.com/auth/devstorage.read_only"
        - "https://www.googleapis.com/auth/logging.write"
        - "https://www.googleapis.com/auth/monitoring"
        - "https://www.googleapis.com/auth/servicecontrol"
        - "https://www.googleapis.com/auth/service.management.readonly"
        - "https://www.googleapis.com/auth/trace.append"
    initialNodeCount: 2
    locations:
      - us-central1-a
    management:
      autoRepair: true
      autoUpgrade: true
If you want, you can find a full example of GKE provisioning with Crossplane here.

Horizontal Pod Autoscaler (HPA) on Google Kubernetes Engine (GKE) using Backend Latency from an Ingress LoadBalancer via Stackdriver External Metric

I'm trying to configure a Horizontal Pod Autoscaler (HPA) on Google Kubernetes Engine (GKE) using External Metrics from an Ingress LoadBalancer, basing the configuration on instructions such as
https://cloud.google.com/kubernetes-engine/docs/tutorials/external-metrics-autoscaling
and
https://blog.doit-intl.com/autoscaling-k8s-hpa-with-google-http-s-load-balancer-rps-stackdriver-metric-92db0a28e1ea
With an HPA like
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: my-api
  namespace: production
spec:
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - external:
        metricName: loadbalancing.googleapis.com|https|request_count
        metricSelector:
          matchLabels:
            resource.labels.forwarding_rule_name: k8s-fws-production-lb-my-api--63e2a8ddaae70
        targetAverageValue: "1"
      type: External
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
the autoscaler does kick in when the request count rises, but placing heavy load on the service, like 100 simultaneous requests per second, doesn't increase the external request_count metric much beyond 6 RPS, while the observed backend_latencies metric in Stackdriver does increase significantly. So I'd like to utilise that metric by adding it to the HPA configuration, like so:
- external:
    metricName: loadbalancing.googleapis.com|https|backend_latencies
    metricSelector:
      matchLabels:
        resource.labels.forwarding_rule_name: k8s-fws-production-lb-my-api--63e2a8ddaae70
    targetValue: "3000"
  type: External
but that results in the error:
...unable to fetch metrics from external metrics API: googleapi: Error 400: Field aggregation.perSeriesAligner had an invalid value of "ALIGN_RATE": The aligner cannot be applied to metrics with kind DELTA and value type DISTRIBUTION., badRequest
which can be observed with the command
$ kubectl describe hpa -n production
or by visiting
http://localhost:8080/apis/external.metrics.k8s.io/v1beta1/namespaces/default/loadbalancing.googleapis.com%7Chttps%7Cbackend_latencies
after setting up a proxy with
$ kubectl proxy --port=8080
Are https/backend_latencies or https/total_latencies not supported as External Stackdriver Metrics in an HPA configuration for GKE?
Maybe someone will find this helpful, even though the question is old.
My working config looks like this:
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 95
  - type: External
    external:
      metric:
        name: loadbalancing.googleapis.com|https|backend_latencies
        selector:
          matchLabels:
            resource.labels.backend_name: frontend
            metric.labels.proxy_continent: Europe
            reducer: REDUCE_PERCENTILE_95
      target:
        type: Value
        value: "79.5"
type: Value is used because it's the only way not to divide the metric value by the number of replicas.
reducer: REDUCE_PERCENTILE_95 is used so that only a single value of the distribution is worked with (source).
Also, I edited the custom-metrics-stackdriver-adapter Deployment to look like this:
- image: gcr.io/gke-release/custom-metrics-stackdriver-adapter:v0.12.2-gke.0
  imagePullPolicy: Always
  name: pod-custom-metrics-stackdriver-adapter
  command:
    - /adapter
    - --use-new-resource-model=true
    - --fallback-for-container-metrics=true
    - --enable-distribution-support=true
The key part is the flag --enable-distribution-support=true, which enables working with distribution-type metrics.