Nginx ingress controller ModSecurity (OWASP ruleset) high latency response - Kubernetes
In our AWS EKS environment, I deployed the Nginx ingress controller with Helm, following the official install guide, and added a ConfigMap that enables the ModSecurity WAF with the OWASP CRS v3.3.0 ruleset. The controller sits behind an AWS NLB.
Requests to the environment are now being processed with high latency, but only the first request from a given IP; subsequent requests from the same IP respond quickly.
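To make the pattern measurable, a quick probe (a sketch, not part of the deployment) can time consecutive requests to the same endpoint so a first-request penalty stands out; the URL passed in would be the NLB hostname, nothing here is specific to this cluster:

```python
import time
import urllib.request


def time_requests(url, n=3):
    """Return the total wall-clock seconds for n consecutive GETs."""
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            resp.read()  # drain the body so the full response is timed
        timings.append(time.perf_counter() - start)
    return timings
```

Running `time_requests("https://<nlb-hostname>/")` from a fresh source IP should show the first element noticeably larger than the rest if the per-IP pattern holds.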
nginx-values.yaml
---
controller:
  config:
    use-proxy-protocol: "true"
    enable-modsecurity: "true"
    ssl-protocols: "TLSv1.2 TLSv1.3"
    ssl-ciphers: "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384"
  service:
    enableHttps: true
    enableHttp: false
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-type: external
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
      service.beta.kubernetes.io/aws-load-balancer-proxy-protocol: "*"
      service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true
      service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "3"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "2"
      service.beta.kubernetes.io/load-balancer-source-ranges: ${source_range}
  metrics:
    enabled: true
  extraInitContainers:
    - name: init
      image: alpine:3
      command: ["/bin/sh", "-c"]
      args: ["ls -tla /opt/modsecurity/var; chown -R 101:101 /opt/modsecurity/var; ls -tla /opt/modsecurity/var; touch /opt/modsecurity/var/log/debug.log; chown -R 101:101 /opt/modsecurity/var"]
      volumeMounts:
        - name: log
          mountPath: /opt/modsecurity/var/log
      securityContext:
        runAsGroup: 0
        runAsNonRoot: false
        runAsUser: 0
        privileged: true
  extraContainers:
    - name: promtail
      image: grafana/promtail
      args:
        - -config.file=/etc/config-waf/promtail.yaml
      volumeMounts:
        - name: config-map
          mountPath: /etc/config-waf
        - name: log
          mountPath: /opt/modsecurity/var/log
      resources:
        limits:
          cpu: 100m
          memory: 256Mi
        requests:
          cpu: 100m
          memory: 256Mi
  extraVolumeMounts:
    - name: config-map
      mountPath: /etc/nginx/modsecurity
    - name: log
      mountPath: /opt/modsecurity/var/log
  extraVolumes:
    - name: config-map
      configMap:
        name: waf-config
    - name: log
      emptyDir: {}
    - name: audit
      emptyDir: {}
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
  autoscalingTemplate:
    - type: Pods
      pods:
        metric:
          name: nginx_ingress_controller_nginx_process_requests_total
        target:
          type: AverageValue
          averageValue: 10000m
defaultBackend:
  enabled: true
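One way to isolate whether ModSecurity (rather than the NLB, proxy protocol, or the TLS handshake) is responsible for the delay: instead of enabling it globally through `enable-modsecurity` in `controller.config`, ingress-nginx also supports enabling it per-Ingress with annotations, so a single test route can carry the WAF while the rest stay clean. A sketch, with hypothetical Ingress, host, and backend names:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: waf-test          # hypothetical test route
  annotations:
    nginx.ingress.kubernetes.io/enable-modsecurity: "true"
    nginx.ingress.kubernetes.io/enable-owasp-core-rules: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: waf-test.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: echo        # hypothetical backend service
                port:
                  number: 80
```

Comparing latency against an identical Ingress without the two ModSecurity annotations would show how much of the delay the WAF itself contributes.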
waf.conf
SecRuleEngine On
SecRequestBodyAccess On
SecRule REQUEST_HEADERS:Content-Type "(?:application(?:/soap\+|/)|text/)xml" \
"id:'200000',phase:1,t:none,t:lowercase,pass,nolog,ctl:requestBodyProcessor=XML"
SecRule REQUEST_HEADERS:Content-Type "application/json" \
"id:'200001',phase:1,t:none,t:lowercase,pass,nolog,ctl:requestBodyProcessor=JSON"
SecRequestBodyLimit 13107200
SecRequestBodyNoFilesLimit 131072
SecRequestBodyLimitAction Reject
SecRule REQBODY_ERROR "!@eq 0" \
"id:'200002', phase:2,t:none,log,deny,status:400,msg:'Failed to parse request body.',logdata:'%{reqbody_error_msg}',severity:2"
SecRule MULTIPART_STRICT_ERROR "!@eq 0" \
"id:'200003',phase:2,t:none,log,deny,status:400, \
msg:'Multipart request body failed strict validation: \
PE %{REQBODY_PROCESSOR_ERROR}, \
BQ %{MULTIPART_BOUNDARY_QUOTED}, \
BW %{MULTIPART_BOUNDARY_WHITESPACE}, \
DB %{MULTIPART_DATA_BEFORE}, \
DA %{MULTIPART_DATA_AFTER}, \
HF %{MULTIPART_HEADER_FOLDING}, \
LF %{MULTIPART_LF_LINE}, \
SM %{MULTIPART_MISSING_SEMICOLON}, \
IQ %{MULTIPART_INVALID_QUOTING}, \
IP %{MULTIPART_INVALID_PART}, \
IH %{MULTIPART_INVALID_HEADER_FOLDING}, \
FL %{MULTIPART_FILE_LIMIT_EXCEEDED}'"
SecRule MULTIPART_UNMATCHED_BOUNDARY "@eq 1" \
"id:'200004',phase:2,t:none,log,deny,msg:'Multipart parser detected a possible unmatched boundary.'"
SecPcreMatchLimit 1000
SecPcreMatchLimitRecursion 1000
SecRule TX:/^MSC_/ "!@streq 0" \
"id:'200005',phase:2,t:none,deny,msg:'ModSecurity internal error flagged: %{MATCHED_VAR_NAME}'"
SecResponseBodyAccess On
SecResponseBodyMimeType text/plain text/html text/xml
SecResponseBodyLimit 524288
SecResponseBodyLimitAction ProcessPartial
SecTmpDir /tmp/
SecDataDir /tmp/
SecDebugLog /opt/modsecurity/var/log/debug.log
SecDebugLogLevel 3
SecAuditEngine Off
SecAuditLogRelevantStatus "^(?:5|4(?!04))"
SecAuditLogParts ABIJDEFHZ
SecAuditLogType Serial
SecAuditLog /opt/modsecurity/var/audit/modsec_audit.log
SecArgumentSeparator &
SecCookieFormat 0
SecUnicodeMapFile unicode.mapping 20127
SecStatusEngine On
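One detail in this waf.conf worth flagging: `SecDebugLogLevel 3` makes ModSecurity write errors, warnings, and notices to the debug log synchronously on each transaction, which adds per-request disk I/O; debug logging is normally reserved for troubleshooting. A minimal tuning sketch (this addresses general overhead, not necessarily the per-IP first-request pattern):

```
# Sketch: disable debug logging once the rules are stable.
# Level 3 logs errors/warnings/notices per transaction; 0 logs
# almost nothing. Raise it again temporarily when debugging rules.
SecDebugLogLevel 0
```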
crs-setup.conf
SecDefaultAction "phase:1,log,auditlog,pass,status:408"
SecDefaultAction "phase:2,log,auditlog,pass,status:408"
SecAction \
"id:900000,\
phase:1,\
nolog,\
pass,\
t:none,\
setvar:tx.paranoia_level=1"
SecAction \
"id:900100,\
phase:1,\
nolog,\
pass,\
t:none,\
setvar:tx.critical_anomaly_score=5,\
setvar:tx.error_anomaly_score=4,\
setvar:tx.warning_anomaly_score=3,\
setvar:tx.notice_anomaly_score=2"
SecAction \
"id:900110,\
phase:1,\
nolog,\
pass,\
t:none,\
setvar:tx.inbound_anomaly_score_threshold=10000,\
setvar:tx.outbound_anomaly_score_threshold=10000"
SecAction \
"id:900700,\
phase:1,\
nolog,\
pass,\
t:none,\
setvar:'tx.dos_burst_time_slice=30',\
setvar:'tx.dos_counter_threshold=250',\
setvar:'tx.dos_block_timeout=300'"
SecAction \
"id:900960,\
phase:1,\
nolog,\
pass,\
t:none,\
setvar:tx.do_reput_block=1"
SecAction \
"id:900970,\
phase:1,\
nolog,\
pass,\
t:none,\
setvar:tx.reput_block_duration=300"
SecCollectionTimeout 600
SecAction \
"id:900990,\
phase:1,\
nolog,\
pass,\
t:none,\
setvar:tx.crs_setup_version=330"
Any thoughts on this?
What do you mean by "high latency"? Is it affecting all requests or only specific ones? Have you tried disabling DoS protection in crs-setup.conf?
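Elaborating on that suggestion: with `tx.do_reput_block=1` and the `tx.dos_*` variables set, the CRS initialization rules create a persistent per-client-IP collection in `SecDataDir` (rule 901321 in CRS 3.3, if I recall the numbering correctly), and `SecCollectionTimeout 600` keeps each entry alive for 10 minutes. Since the collection is initialized on the first request from each IP, this would match the observed pattern. A sketch of the change in crs-setup.conf, under the assumption that collection initialization is the slow path (re-test latency after reloading to confirm):

```
# Sketch: commenting out these SecActions disables CRS DoS protection
# and IP reputation blocking, so the CRS initialization rules no longer
# create a per-IP collection in SecDataDir on the first request.
#SecAction \
#  "id:900700,phase:1,nolog,pass,t:none,\
#  setvar:'tx.dos_burst_time_slice=30',\
#  setvar:'tx.dos_counter_threshold=250',\
#  setvar:'tx.dos_block_timeout=300'"
#SecAction \
#  "id:900960,phase:1,nolog,pass,t:none,\
#  setvar:tx.do_reput_block=1"
#SecAction \
#  "id:900970,phase:1,nolog,pass,t:none,\
#  setvar:tx.reput_block_duration=300"
```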
Related
Getting error in Kubernetes cronjob while using google cloud sdk to upload data on GCS bucket
**Yaml for kubernetes that is first used to create raft backup and then upload into gas bucket** apiVersion: batch/v1beta1 kind: CronJob metadata: labels: app.kubernetes.io/component: raft-backup numenapp: raft-backup name: raft-backup namespace: raft-backup spec: concurrencyPolicy: Forbid failedJobsHistoryLimit: 3 jobTemplate: spec: template: metadata: annotations: vault.security.banzaicloud.io/vault-addr: https://vault.vault-internal.net:8200 labels: app.kubernetes.io/component: raft-backup spec: containers: - args: - | SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token); export VAULT_TOKEN=$(vault write -field=token auth/kubernetes/login jwt=$SA_TOKEN role=raft-backup); vault operator raft snapshot save /share/vault-raft.snap; echo "snapshot is success" command: ["/bin/sh", "-c"] env: - name: VAULT_ADDR value: https://vault.vault-internl.net:8200 image: vault:1.10.9 imagePullPolicy: Always name: snapshot volumeMounts: - mountPath: /share name: share - args: - -ec - sleep 500 - "until [ -f /share/vault-raft.snap ]; do sleep 5; done;\ngsutil cp /share/vault-raft.snap\ \ gs://raft-backup/vault_raft_$(date +\"\ %Y%m%d_%H%M%S\").snap;\n" command: - /bin/sh image: gcr.io/google.com/cloudsdktool/google-cloud-cli:latest imagePullPolicy: IfNotPresent name: upload securityContext: allowPrivilegeEscalation: false volumeMounts: - mountPath: /share name: share restartPolicy: OnFailure securityContext: fsGroup: 1000 runAsGroup: 1000 runAsUser: 1000 serviceAccountName: raft-backup volumes: - emptyDir: {} name: share schedule: '*/3 * * * *' startingDeadlineSeconds: 60 successfulJobsHistoryLimit: 3 suspend: false Error while running gsutil command inside the upload pod $ gsutil Traceback (most recent call last): File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/configurations/named_configs.py", line 172, in ActiveConfig return ActiveConfig(force_create=True) File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/configurations/named_configs.py", line 492, 
in ActiveConfig config_name = _CreateDefaultConfig(force_create) File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/configurations/named_configs.py", line 640, in _CreateDefaultConfig file_utils.MakeDir(paths.named_config_directory) File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 125, in MakeDir os.makedirs(path, mode=mode) File "/usr/bin/../lib/google-cloud-sdk/platform/bundledpythonunix/lib/python3.9/os.py", line 215, in makedirs makedirs(head, exist_ok=exist_ok) File "/usr/bin/../lib/google-cloud-sdk/platform/bundledpythonunix/lib/python3.9/os.py", line 215, in makedirs makedirs(head, exist_ok=exist_ok) File "/usr/bin/../lib/google-cloud-sdk/platform/bundledpythonunix/lib/python3.9/os.py", line 225, in makedirs mkdir(name, mode) OSError: [Errno 30] Read-only file system: '/home/cloudsdk/.config' $ command terminated with exit code 137
OSError: [Errno 30] Read-only file system: '/home/cloudsdk/.config' $ command terminated with exit code 137 It seems you don't give enought permission in your cronJob. Try to change : securityContext: fsGroup: 1000 runAsGroup: 1000 runAsUser: 1000 by : securityContext: privileged: true Tell me if it works or not and we can discuss about it. Edit for complete response : Use this apiVersion: batch/v1 instead of apiVersion: batch/v1beta1
cannot reach grafana loki port with http using traefik
I have been trying to find solutions to this but no luck. all services work internally. I am able to access grafana from browser with tls enabled but I am not able to reach loki port in any way(browser/postman etc.) but I can. I can access to loki api with curl localy if I open port on for the service. but as I understand you need to expose ports from traefik to do that. My compose file: version: "3" services: grafana: labels: - "traefik.http.routers.grafana.entryPoints=port80" - "traefik.http.routers.grafana.rule=host(`${DOMAIN}`)" - "traefik.http.middlewares.grafana-redirect.redirectScheme.scheme=https" - "traefik.http.middlewares.grafana-redirect.redirectScheme.permanent=true" - "traefik.http.routers.grafana.middlewares=grafana-redirect" # SSL endpoint - "traefik.http.routers.grafana-ssl.entryPoints=port443" - "traefik.http.routers.grafana-ssl.rule=host(`${DOMAIN}`)" - "traefik.http.routers.grafana-ssl.tls=true" - "traefik.http.routers.grafana-ssl.tls.certResolver=le-ssl" - "traefik.http.routers.grafana-ssl.service=grafana-ssl" - "traefik.http.services.grafana-ssl.loadBalancer.server.port=3000" image: grafana/grafana:latest # or probably any other version volumes: - grafana-data:/var/lib/grafana environment: - GF_SERVER_ROOT_URL=https://${DOMAIN} - GF_SERVER_DOMAIN=${DOMAIN} - GF_USERS_ALLOW_SIGN_UP=false - GF_SECURITY_ADMIN_USER=${GRAFANAUSER} - GF_SECURITY_ADMIN_PASSWORD=${GRAFANAPASS} networks: - traefik-net loki: image: grafana/loki labels: - "traefik.http.routers.loki-ssl.entryPoints=port3100" - "traefik.http.routers.loki-ssl.rule=host(`${DOMAIN}`)" - "traefik.http.routers.loki-ssl.tls=true" - "traefik.http.routers.loki-ssl.tls.certResolver=le-ssl" - "traefik.http.routers.loki-ssl.service=loki-ssl" - "traefik.http.services.loki-ssl.loadBalancer.server.port=3100" command: -config.file=/etc/loki/config.yaml volumes: - ./loki/config.yml:/etc/loki/config.yaml - loki:/data/loki networks: - traefik-net promtail: image: grafana/promtail:2.3.0 volumes: - 
/var/log:/var/log - ./promtail:/etc/promtail-config/ command: -config.file=/etc/promtail-config/promtail.yml networks: - traefik-net influx: image: influxdb:1.7 # or any other recent version labels: # SSL endpoint - "traefik.http.routers.influx-ssl.entryPoints=port8086" - "traefik.http.routers.influx-ssl.rule=host(`${DOMAIN}`)" - "traefik.http.routers.influx-ssl.tls=true" - "traefik.http.routers.influx-ssl.tls.certResolver=le-ssl" - "traefik.http.routers.influx-ssl.service=influx-ssl" - "traefik.http.services.influx-ssl.loadBalancer.server.port=8086" restart: always volumes: - influx-data:/var/lib/influxdb environment: - INFLUXDB_DB=grafana # set any other to create database on initialization - INFLUXDB_HTTP_ENABLED=true - INFLUXDB_HTTP_AUTH_ENABLED=true - INFLUXDB_ADMIN_USER=&{DB_USER} - INFLUXDB_ADMIN_PASSWORD=&{DB_PASS} networks: - traefik-net traefik: image: traefik:v2.9.1 ports: - "80:80" - "443:443" - "3100:3100" # expose port below only if you need access to the Traefik API - "8080:8080" command: # - "--log.level=DEBUG" - "--api=true" - "--api.dashboard=true" - "--providers.docker=true" - "--entryPoints.port443.address=:443" - "--entryPoints.port80.address=:80" - "--entryPoints.port8086.address=:8086" - "--entryPoints.port3100.address=:3100" - "--certificatesResolvers.le-ssl.acme.tlsChallenge=true" - "--certificatesResolvers.le-ssl.acme.email=${TLS_MAIL}" - "--certificatesResolvers.le-ssl.acme.storage=/letsencrypt/acme.json" volumes: - traefik-data:/letsencrypt/ - /var/run/docker.sock:/var/run/docker.sock networks: - traefik-net volumes: traefik-data: grafana-data: influx-data: loki: networks: traefik-net: loki conf # (default configuration) auth_enabled: false server: http_listen_port: 3100 ingester: lifecycler: address: 127.0.0.1 ring: kvstore: store: inmemory replication_factor: 1 final_sleep: 0s chunk_idle_period: 1h # Any chunk not receiving new logs in this time will be flushed max_chunk_age: 1h # All chunks will be flushed when they hit this age, 
default is 1h chunk_target_size: 1048576 # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first chunk_retain_period: 30s # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m) max_transfer_retries: 0 # Chunk transfers disabled wal: enabled: true dir: /loki/wal common: ring: instance_addr: 0.0.0.0 kvstore: store: inmemory schema_config: configs: - from: 2020-10-24 store: boltdb-shipper object_store: filesystem schema: v11 index: prefix: index_ period: 24h storage_config: boltdb_shipper: active_index_directory: /loki/boltdb-shipper-active cache_location: /loki/boltdb-shipper-cache cache_ttl: 24h # Can be increased for faster performance over longer query periods, uses more disk space shared_store: filesystem filesystem: directory: /loki/chunks compactor: working_directory: /loki/boltdb-shipper-compactor shared_store: filesystem limits_config: reject_old_samples: true reject_old_samples_max_age: 168h ingestion_burst_size_mb: 16 ingestion_rate_mb: 16 chunk_store_config: max_look_back_period: 0s table_manager: retention_deletes_enabled: false retention_period: 0s ruler: storage: type: local local: directory: /loki/rules rule_path: /loki/rules-temp alertmanager_url: localhost ring: kvstore: store: inmemory enable_api: true
How to replace Kubernetes YAML manifests fields with sed?
I am trying to inject an argument --insecure-port=0 into /etc/kubernetes/manifests/kube-apiserver.yaml file using sed, but I am having a trouble getting the indentation correct below the argument - --tls-private-key-file=/etc/kubernetes/pki/apiserver.key apiVersion: v1 kind: Pod metadata: annotations: kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint: 10.132.0.43:6443 creationTimestamp: null labels: component: kube-apiserver tier: control-plane name: kube-apiserver namespace: kube-system spec: containers: - command: - kube-apiserver - --advertise-address=10.132.0.43 - --allow-privileged=true - --authorization-mode=Node,RBAC - --client-ca-file=/etc/kubernetes/pki/ca.crt - --enable-admission-plugins=NodeRestriction - --enable-bootstrap-token-auth=true - --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt - --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt - --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key - --etcd-servers=https://127.0.0.1:2379 - --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt - --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname - --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt - --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key - --requestheader-allowed-names=front-proxy-client - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt - --requestheader-extra-headers-prefix=X-Remote-Extra- - --requestheader-group-headers=X-Remote-Group - --requestheader-username-headers=X-Remote-User - --secure-port=6443 - --service-account-issuer=https://kubernetes.default.svc.cluster.local - --service-account-key-file=/etc/kubernetes/pki/sa.pub - --service-account-signing-key-file=/etc/kubernetes/pki/sa.key - --service-cluster-ip-range=10.96.0.0/12 - --tls-cert-file=/etc/kubernetes/pki/apiserver.crt - --tls-private-key-file=/etc/kubernetes/pki/apiserver.key image: 
k8s.gcr.io/kube-apiserver:v1.23.4 imagePullPolicy: IfNotPresent Any help?
Single kafka pod keeps restarting
Problem Only one, single pod is failing on k8 node. Readiness and Liveness probes are indicating long resposne time, despite fact, that port is open and traffic between nodes is flowing. Termination error code is 137. It's up immediately after killing, however port 9092 is not opened yet, and whole recreation process of bringing app up is taking about 30 minutes. General environment config We have k8 cluster consists of 6 nodes (2 nodes per each of 3 racks and racks are kept in different DC). Our kafka is deployed with helm in use, and each of its nodes is deployed on different host, due to affinity/anti-affinity. Kafka config - log.cleaner.min.compaction.lag.ms=0 - offsets.topic.num.partitions=50 - log.flush.interval.messages=9223372036854775807 - controller.socket.timeout.ms=30000 - principal.builder.class=null - log.flush.interval.ms=null - min.insync.replicas=1 - num.recovery.threads.per.data.dir=1 - sasl.mechanism.inter.broker.protocol=GSSAPI - fetch.purgatory.purge.interval.requests=1000 - replica.socket.timeout.ms=30000 - message.max.bytes=1048588 - max.connection.creation.rate=2147483647 - connections.max.reauth.ms=0 - log.flush.offset.checkpoint.interval.ms=60000 - zookeeper.clientCnxnSocket=null - quota.window.num=11 - zookeeper.connect=zookeeper-service.kafka-shared-cluster.svc.cluster.local:2181/kafka - authorizer.class.name= - password.encoder.secret=null - num.replica.fetchers=1 - alter.log.dirs.replication.quota.window.size.seconds=1 - log.roll.jitter.hours=0 - password.encoder.old.secret=null - log.cleaner.delete.retention.ms=86400000 - queued.max.requests=500 - log.cleaner.threads=1 - sasl.kerberos.service.name=null - socket.request.max.bytes=104857600 - log.message.timestamp.type=CreateTime - connections.max.idle.ms=600000 - zookeeper.set.acl=false - delegation.token.expiry.time.ms=86400000 - session.timeout.ms=null - max.connections=2147483647 - transaction.state.log.num.partitions=50 - 
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,OUTSIDE:PLAINTEXT - log.retention.hours=168 - client.quota.callback.class=null - delete.records.purgatory.purge.interval.requests=1 - log.roll.ms=null - replica.high.watermark.checkpoint.interval.ms=5000 - replication.quota.window.size.seconds=1 - sasl.kerberos.ticket.renew.window.factor=0.8 - zookeeper.connection.timeout.ms=18000 - metrics.recording.level=INFO - password.encoder.cipher.algorithm=AES/CBC/PKCS5Padding - replica.selector.class=null - max.connections.per.ip=2147483647 - background.threads=10 - quota.consumer.default=9223372036854775807 - request.timeout.ms=30000 - log.message.format.version=2.8-IV1 - sasl.login.class=null - log.dir=/tmp/kafka-logs - log.segment.bytes=1073741824 - replica.fetch.response.max.bytes=10485760 - group.max.session.timeout.ms=1800000 - port=9092 - log.segment.delete.delay.ms=60000 - log.retention.minutes=null - log.dirs=/kafka - controlled.shutdown.enable=true - socket.connection.setup.timeout.max.ms=30000 - log.message.timestamp.difference.max.ms=9223372036854775807 - password.encoder.key.length=128 - sasl.login.refresh.min.period.seconds=60 - transaction.abort.timed.out.transaction.cleanup.interval.ms=10000 - sasl.kerberos.kinit.cmd=/usr/bin/kinit - log.cleaner.io.max.bytes.per.second=1.7976931348623157E308 - auto.leader.rebalance.enable=true - leader.imbalance.check.interval.seconds=300 - log.cleaner.min.cleanable.ratio=0.5 - replica.lag.time.max.ms=30000 - num.network.threads=3 - sasl.client.callback.handler.class=null - metrics.num.samples=2 - socket.send.buffer.bytes=102400 - password.encoder.keyfactory.algorithm=null - socket.receive.buffer.bytes=102400 - replica.fetch.min.bytes=1 - broker.rack=null - unclean.leader.election.enable=false - offsets.retention.check.interval.ms=600000 - producer.purgatory.purge.interval.requests=1000 - metrics.sample.window.ms=30000 - log.retention.check.interval.ms=300000 - sasl.login.refresh.window.jitter=0.05 - 
leader.imbalance.per.broker.percentage=10 - controller.quota.window.num=11 - advertised.host.name=null - metric.reporters= - quota.producer.default=9223372036854775807 - auto.create.topics.enable=false - replica.socket.receive.buffer.bytes=65536 - replica.fetch.wait.max.ms=500 - password.encoder.iterations=4096 - default.replication.factor=1 - sasl.kerberos.principal.to.local.rules=DEFAULT - log.preallocate=false - transactional.id.expiration.ms=604800000 - control.plane.listener.name=null - transaction.state.log.replication.factor=3 - num.io.threads=8 - sasl.login.refresh.buffer.seconds=300 - offsets.commit.required.acks=-1 - connection.failed.authentication.delay.ms=100 - delete.topic.enable=true - quota.window.size.seconds=1 - offsets.commit.timeout.ms=5000 - log.cleaner.max.compaction.lag.ms=9223372036854775807 - zookeeper.ssl.enabled.protocols=null - log.retention.ms=604800000 - alter.log.dirs.replication.quota.window.num=11 - log.cleaner.enable=true - offsets.load.buffer.size=5242880 - controlled.shutdown.max.retries=3 - offsets.topic.replication.factor=3 - transaction.state.log.min.isr=1 - sasl.kerberos.ticket.renew.jitter=0.05 - zookeeper.session.timeout.ms=18000 - log.retention.bytes=-1 - controller.quota.window.size.seconds=1 - sasl.jaas.config=null - sasl.kerberos.min.time.before.relogin=60000 - offsets.retention.minutes=10080 - replica.fetch.backoff.ms=1000 - inter.broker.protocol.version=2.8-IV1 - kafka.metrics.reporters= - num.partitions=1 - socket.connection.setup.timeout.ms=10000 - broker.id.generation.enable=true - listeners=PLAINTEXT://:9092,OUTSIDE://:9094 - inter.broker.listener.name=null - alter.config.policy.class.name=null - delegation.token.expiry.check.interval.ms=3600000 - log.flush.scheduler.interval.ms=9223372036854775807 - zookeeper.max.in.flight.requests=10 - log.index.size.max.bytes=10485760 - sasl.login.callback.handler.class=null - replica.fetch.max.bytes=1048576 - sasl.server.callback.handler.class=null - 
log.cleaner.dedupe.buffer.size=134217728 - advertised.port=null - log.cleaner.io.buffer.size=524288 - create.topic.policy.class.name=null - controlled.shutdown.retry.backoff.ms=5000 - security.providers=null - log.roll.hours=168 - log.cleanup.policy=delete - log.flush.start.offset.checkpoint.interval.ms=60000 - host.name= - log.roll.jitter.ms=null - transaction.state.log.segment.bytes=104857600 - offsets.topic.segment.bytes=104857600 - group.initial.rebalance.delay.ms=3000 - log.index.interval.bytes=4096 - log.cleaner.backoff.ms=15000 - ssl.truststore.location=null - offset.metadata.max.bytes=4096 - ssl.keystore.password=null - zookeeper.sync.time.ms=2000 - fetch.max.bytes=57671680 - max.poll.interval.ms=null - compression.type=producer - max.connections.per.ip.overrides= - sasl.login.refresh.window.factor=0.8 - kafka.metrics.polling.interval.secs=10 - max.incremental.fetch.session.cache.slots=1000 - delegation.token.master.key=null - reserved.broker.max.id=1000 - transaction.remove.expired.transaction.cleanup.interval.ms=3600000 - log.message.downconversion.enable=true - transaction.state.log.load.buffer.size=5242880 - sasl.enabled.mechanisms=GSSAPI - num.replica.alter.log.dirs.threads=null - group.min.session.timeout.ms=6000 - log.cleaner.io.buffer.load.factor=0.9 - transaction.max.timeout.ms=900000 - group.max.size=2147483647 - delegation.token.max.lifetime.ms=604800000 - broker.id=0 - offsets.topic.compression.codec=0 - zookeeper.ssl.endpoint.identification.algorithm=HTTPS - replication.quota.window.num=11 - advertised.listeners=PLAINTEXT://:9092,OUTSIDE://kafka-0.kafka.kafka-shared-cluster.svc.cluster.local:9094 - queued.max.request.bytes=-1 What have been verified RAID I/O - no spikes before reboot Zookeeper no logs indicating any connection problem Ping - response time is raising periodically Kafka logging level set to DEBUG - java.io.EOFException what is just DEBUG log, not WARNING or ERROR K8 node logs - nothing significant beside readiness and liveness 
probes Pods config Containers: kafka: Image: wurstmeister/kafka:latest Ports: 9092/TCP, 9094/TCP, 9999/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP State: Running Started: Thu, 10 Feb 2022 16:36:48 +0100 Last State: Terminated Reason: Error Exit Code: 137 Started: Tue, 08 Feb 2022 21:12:26 +0100 Finished: Thu, 10 Feb 2022 16:36:36 +0100 Ready: True Restart Count: 76 Limits: cpu: 24 memory: 64Gi Requests: cpu: 1 memory: 2Gi Liveness: tcp-socket :9092 delay=3600s timeout=5s period=10s #success=1 #failure=3 Readiness: tcp-socket :9092 delay=5s timeout=6s period=10s #success=1 #failure=5 Environment: KAFKA_AUTO_CREATE_TOPICS_ENABLE: false ALLOW_PLAINTEXT_LISTENER: yes BROKER_ID_COMMAND: hostname | awk -F'-' '{print $$NF}' HOSTNAME_COMMAND: hostname -f KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://:9092,OUTSIDE://_{HOSTNAME_COMMAND}:9094 KAFKA_LISTENERS: PLAINTEXT://:9092,OUTSIDE://:9094 KAFKA_ZOOKEEPER_CONNECT: zookeeper-service.kafka-shared-cluster.svc.cluster.local:2181/kafka KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,OUTSIDE:PLAINTEXT KAFKA_LOG_RETENTION_MS: 604800000 KAFKA_LOG_DIRS: /kafka KAFKA_SESSION_TIMEOUT_MS: 10000 KAFKA_MAX_POLL_INTERVAL_MS: 60000 KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 3000 KAFKA_JMX_OPTS: run command properties -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/opt/kafka/bin/../logs/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M Java heap size in bytes uintx ErgoHeapSizeLimit = 0 {product} uintx HeapSizePerGCThread = 87241520 {product} uintx InitialHeapSize := 1073741824 {product} uintx LargePageHeapSizeThreshold = 134217728 {product} uintx MaxHeapSize := 17179869184 {product} Question do you have any ideas why only this particular pod is failing, and any suggestion for further steps? 
-- EDIT -- Containers: kafka: Image: wurstmeister/kafka:latest Ports: 9092/TCP, 9094/TCP, 9999/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP State: Running Started: Tue, 08 Mar 2022 17:50:11 +0100 Last State: Terminated Reason: Error Exit Code: 137 Started: Tue, 08 Mar 2022 16:35:38 +0100 Finished: Tue, 08 Mar 2022 17:49:51 +0100 Ready: True Restart Count: 1 Limits: cpu: 24 memory: 64Gi Requests: cpu: 1 memory: 2Gi Liveness: tcp-socket :9092 delay=3600s timeout=5s period=10s #success=1 #failure=3 Readiness: tcp-socket :9092 delay=5s timeout=6s period=10s #success=1 #failure=5 Environment: KAFKA_AUTO_CREATE_TOPICS_ENABLE: false ALLOW_PLAINTEXT_LISTENER: yes BROKER_ID_COMMAND: hostname | awk -F'-' '{print $$NF}' HOSTNAME_COMMAND: hostname -f KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://:9092,OUTSIDE://_{HOSTNAME_COMMAND}:9094 KAFKA_LISTENERS: PLAINTEXT://:9092,OUTSIDE://:9094 KAFKA_ZOOKEEPER_CONNECT: zookeeper-service.kafka-shared-cluster.svc.cluster.local:2181/kafka KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,OUTSIDE:PLAINTEXT KAFKA_LOG_RETENTION_MS: 604800000 KAFKA_LOG_DIRS: /kafka KAFKA_SESSION_TIMEOUT_MS: 10000 KAFKA_MAX_POLL_INTERVAL_MS: 60000 KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 3000 KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3 KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 3 KAFKA_HEAP_OPTS: -Xmx6G -Xms6G KAFKA_DEFAULT_REPLICATION_FACTOR: 3 KAFKA_MIN_INSYNC_REPLICAS: 2 KAFKA_REPLICA_LAG_TIME_MAX_MS: 80000 KAFKA_NUM_RECOVERY_THREADS_PER_DATA_DIR: 6 Conditions: Type Status Initialized True Ready True ContainersReady True PodScheduled True Volumes: kafka-data: Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace) ClaimName: kafka-data-kafka-0 ReadOnly: false jolokia-agent: Type: ConfigMap (a volume populated by a ConfigMap) Name: jolokia-agent Optional: false Volumes: kafka-data: Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace) ClaimName: kafka-data-kafka-0 ReadOnly: false 
jolokia-agent: Type: ConfigMap (a volume populated by a ConfigMap) Name: jolokia-agent Optional: false default-token-d57jd: Type: Secret (a volume populated by a Secret) SecretName: default-token-d57jd Optional: false PVC Name: kafka-data-kafka-0 Namespace: kafka-shared-cluster StorageClass: local-storage Status: Bound Volume: pvc-1d23ba70-cb15-43c3-91b1-4febc8fd9896 Labels: app=kafka Annotations: pv.kubernetes.io/bind-completed: yes pv.kubernetes.io/bound-by-controller: yes volume.beta.kubernetes.io/storage-provisioner: rancher.io/local-path volume.kubernetes.io/selected-node: xxx Finalizers: [kubernetes.io/pvc-protection] Capacity: 10Gi Access Modes: RWO VolumeMode: Filesystem Used By: kafka-0 Events: <none>
Error 503 Backend fetch failed Guru Meditation: XID: 45654 Varnish cache server
I have created helm chart for varnish cache server which is running in kubernetes cluster , while testing with the "external IP" generated its throwing error , sharing below Sharing varnish.vcl, values.yaml and deployment.yaml below . Any suggestions how to resolve as I have hardcoded the backend/web server as .host="www.varnish-cache.org" with port : "80". My requirement is on executing curl -IL I should get the response with cached values not as described above (directly from backend server).. Any solutions/approach would be welcomed. varnish.vcl: VCL version 5.0 is not supported so it should be 4.0 or 4.1 even though actually used Varnish version is 6 vcl 4.1; import std; # The minimal Varnish version is 5.0 # For SSL offloading, pass the following header in your proxy server or load balancer: 'X-Forwarded-Proto: https' {{ .Values.varnishconfigData | indent 2 }} sub vcl_recv { # set req.backend_hint = default; # unset req.http.cookie; if(req.url == "/healthcheck") { return(synth(200,"OK")); } if(req.url == "/index.html") { return(synth(200,"OK")); } } probe index { .url = "/index.html"; .timeout = 60ms; .interval = 2s; .window = 5; .threshold = 3; } backend website { .host = "www.varnish-cache.org"; .port = "80"; .probe = index; #.probe = { # .url = "/favicon.ico"; #.timeout = 60ms; #.interval = 2s; #.window = 5; #.threshold = 3; # } } vcl_recv { if ( req.url ~ "/index.html/") { set req.backend = website; } else { Set req.backend = default; } } #DAEMON_OPTS="-a :80 \ #-T localhost:6082 \ #-f /etc/varnish/default.vcl \ #-S /etc/varnish/secret \ #-s malloc,256m" #-p http_resp_hdr_len=65536 \ #-p http_resp_size=98304 \ #sub vcl_recv { ## # Remove the cookie header to enable caching # unset req.http.cookie; #} #sub vcl_deliver { # if (obj.hits > 0) { # set resp.http.X-Cache = "HIT"; # } else { # set resp.http.X-Cache = "MISS"; # } #} values.yaml: # Default values for varnish. # This is a YAML-formatted file. # Declare variables to be passed into your templates. 
replicaCount: 1

image:
  repository: varnish
  tag: 6.3
  pullPolicy: IfNotPresent

nameOverride: ""
fullnameOverride: ""

service:
  # type: ClusterIP
  type: LoadBalancer
  port: 80

varnishconfigData: |-
  backend default {
    .host = "http://35.170.216.115/";
    .port = "80";
    .first_byte_timeout = 60s;
    .connect_timeout = 300s;
    .probe = {
      .url = "/";
      .timeout = 1s;
      .interval = 5s;
      .window = 5;
      .threshold = 3;
    }
  }

  sub vcl_backend_response {
    set beresp.ttl = 5m;
  }

ingress:
  enabled: false
  annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  path: /
  hosts:
    - chart-example.local
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

resources:
  limits:
    memory: 128Mi
  requests:
    memory: 64Mi

#resources: {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

nodeSelector: {}

tolerations: []

affinity: {}

Deployment.yaml:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: {{ include "varnish.fullname" . }}
  labels:
    app: {{ include "varnish.name" . }}
    chart: {{ include "varnish.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ include "varnish.name" . }}
      release: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ include "varnish.name" . }}
        release: {{ .Release.Name }}
    spec:
      volumes:
        - name: varnish-config
          configMap:
            name: {{ include "varnish.fullname" . }}-varnish-config
            items:
              - key: default.vcl
                path: default.vcl
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          env:
            - name: VARNISH_VCL
              value: /etc/varnish/default.vcl
          volumeMounts:
            - name: varnish-config
              mountPath: /etc/varnish/
          ports:
            - name: http
              containerPort: 80
              protocol: TCP
              targetPort: 80
          livenessProbe:
            httpGet:
              path: /healthcheck
              # port: http
              port: 80
            failureThreshold: 3
            initialDelaySeconds: 45
            timeoutSeconds: 10
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /healthcheck
              # port: http
              port: 80
            initialDelaySeconds: 10
            timeoutSeconds: 15
            periodSeconds: 5
          resources:
{{ toYaml .Values.resources | indent 12 }}
      {{- with .Values.nodeSelector }}
      nodeSelector:
{{ toYaml . | indent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
{{ toYaml . | indent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
{{ toYaml . | indent 8 }}
      {{- end }}

I did check the Varnish logs: I executed varnishlog -c and got the following output:

*   << Request  >> 556807
-   Begin          req 556806 rxreq
-   Timestamp      Start: 1584534974.251924 0.000000 0.000000
-   Timestamp      Req: 1584534974.251924 0.000000 0.000000
-   VCL_use        boot
-   ReqStart       100.115.128.0 26466 a0
-   ReqMethod      GET
-   ReqURL         /healthcheck
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: 100.115.128.11:80
-   ReqHeader      User-Agent: kube-probe/1.14
-   ReqHeader      Accept-Encoding: gzip
-   ReqHeader      Connection: close
-   ReqHeader      X-Forwarded-For: 100.115.128.0
-   VCL_call       RECV
-   VCL_return     synth
-   VCL_call       HASH
-   VCL_return     lookup
-   Timestamp      Process: 1584534974.251966 0.000042 0.000042
-   RespHeader     Date: Wed, 18 Mar 2020 12:36:14 GMT
-   RespHeader     Server: Varnish
-   RespHeader     X-Varnish: 556807
-   RespProtocol   HTTP/1.1
-   RespStatus     200
-   RespReason     OK
-   RespReason     OK
-   VCL_call       SYNTH
-   RespHeader     Content-Type: text/html; charset=utf-8
-   RespHeader     Retry-After: 5
-   VCL_return     deliver
-   RespHeader     Content-Length: 229
-   Storage        malloc Transient
-   Filters
-   RespHeader     Accept-Ranges: bytes
-   RespHeader     Connection: close
-   Timestamp      Resp: 1584534974.252121 0.000197 0.000155
-   ReqAcct        125 0 125 210 229 439
-   End
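As an aside, the second routing block in the varnish.vcl above will not even compile under VCL 4.x: subroutine definitions require the sub keyword, set is case-sensitive (Set is a syntax error), and req.backend was replaced by req.backend_hint in VCL 4.0. A minimal corrected sketch, reusing the website and default backends already defined in the question:

```
sub vcl_recv {
  # VCL 4.x selects a backend via req.backend_hint, not req.backend.
  if (req.url ~ "/index.html") {
    set req.backend_hint = website;
  } else {
    set req.backend_hint = default;
  }
}
```

Note this would also collide with the sub vcl_recv defined earlier in the same file; Varnish concatenates same-named subroutines, so the earlier synth returns for /healthcheck and /index.html would short-circuit before this routing runs.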
I don't think this will work:

.host = "www.varnish-cache.org";
.host = "100.68.38.132"

It has two host declarations, and the second one is missing the ";". Please try changing it to:

.host = "100.68.38.132";

Sharing the logs generated when running the following command; please look into it:

varnishlog -g request -q "ReqHeader:Host eq 'a2dc15095678711eaae260ae72bc140c-214951329.ap-southeast-1.elb.amazonaws.com'" -q "ReqUrl eq '/'"

*   << Request  >> 1512355
-   Begin          req 1512354 rxreq
-   Timestamp      Start: 1584707667.287292 0.000000 0.000000
-   Timestamp      Req: 1584707667.287292 0.000000 0.000000
-   VCL_use        boot
-   ReqStart       100.112.64.0 51532 a0
-   ReqMethod      GET
-   ReqURL         /
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Host: 52.220.214.66
-   ReqHeader      User-Agent: Mozilla/5.0 zgrab/0.x
-   ReqHeader      Accept: */*
-   ReqHeader      Accept-Encoding: gzip
-   ReqHeader      X-Forwarded-For: 100.112.64.0
-   VCL_call       RECV
-   ReqUnset       Host: 52.220.214.66
-   ReqHeader      host: 52.220.214.66
-   VCL_return     hash
-   VCL_call       HASH
-   VCL_return     lookup
-   VCL_call       MISS
-   VCL_return     fetch
-   Link           bereq 1512356 fetch
-   Timestamp      Fetch: 1584707667.287521 0.000228 0.000228
-   RespProtocol   HTTP/1.1
-   RespStatus     503
-   RespReason     Backend fetch failed
-   RespHeader     Date: Fri, 20 Mar 2020 12:34:27 GMT
-   RespHeader     Server: Varnish
-   RespHeader     Content-Type: text/html; charset=utf-8
-   RespHeader     Retry-After: 5
-   RespHeader     X-Varnish: 1512355
-   RespHeader     Age: 0
-   RespHeader     Via: 1.1 varnish (Varnish/6.3)
-   VCL_call       DELIVER
-   RespHeader     X-Cache: uncached
-   VCL_return     deliver
-   Timestamp      Process: 1584707667.287542 0.000250 0.000021
-   Filters
-   RespHeader     Content-Length: 284
-   RespHeader     Connection: keep-alive
-   Timestamp      Resp: 1584707667.287591 0.000299 0.000048
-   ReqAcct        110 0 110 271 284 555
-   End

**  << BeReq    >> 1512356
--  Begin          bereq 1512355 fetch
--  VCL_use        boot
--  Timestamp      Start: 1584707667.287401 0.000000 0.000000
--  BereqMethod    GET
--  BereqURL       /
--  BereqProtocol  HTTP/1.1
--  BereqHeader    User-Agent: Mozilla/5.0 zgrab/0.x
--  BereqHeader    Accept: */*
--  BereqHeader    Accept-Encoding: gzip
--  BereqHeader    X-Forwarded-For: 100.112.64.0
--  BereqHeader    host: 52.220.214.66
--  BereqHeader    X-Varnish: 1512356
--  VCL_call       BACKEND_FETCH
--  VCL_return     fetch
--  FetchError     backend default: unhealthy
--  Timestamp      Beresp: 1584707667.287429 0.000028 0.000028
--  Timestamp      Error: 1584707667.287432 0.000031 0.000002
--  BerespProtocol HTTP/1.1
--  BerespStatus   503
--  BerespReason   Service Unavailable
--  BerespReason   Backend fetch failed
--  BerespHeader   Date: Fri, 20 Mar 2020 12:34:27 GMT
--  BerespHeader   Server: Varnish
--  VCL_call       BACKEND_ERROR
--  BerespHeader   Content-Type: text/html; charset=utf-8
--  BerespHeader   Retry-After: 5
--  VCL_return     deliver
--  Storage        malloc Transient
--  Length         284
--  BereqAcct      0 0 0 0 0 0
--  End
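The key line in the BeReq record is FetchError backend default: unhealthy: the health probe against the default backend is failing, so Varnish refuses to fetch and synthesizes the 503. One likely cause is the .host value in values.yaml, which is written as a URL (http://35.170.216.115/); a Varnish backend .host must be a bare hostname or IP address, with no scheme and no path. A hedged sketch of a corrected default backend, reusing the values from the question:

```
# Sketch only: 35.170.216.115 is the IP from the question's values.yaml.
# .host must be a bare host or IP -- "http://..." makes the probe fail.
backend default {
  .host = "35.170.216.115";
  .port = "80";
  .first_byte_timeout = 60s;
  .connect_timeout = 300s;
  .probe = {
    .url = "/";
    .timeout = 1s;
    .interval = 5s;
    .window = 5;
    .threshold = 3;
  }
}
```

After reloading the VCL, probe status per backend can be inspected from inside the pod with varnishadm backend.list, which should show the backend transition to Healthy once the probe starts succeeding.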