How to deploy an Apache NiFi StatefulSet in cluster mode with an external ZooKeeper? - kubernetes

I tried to deploy Apache NiFi (as a StatefulSet) in Kubernetes, in cluster mode. At first I am trying with only one node, but I don't know where I am wrong in the YAML, or how to access the NiFi UI once the StatefulSet is deployed. I use an external ZooKeeper.
I am not sure whether I have to create a Service for each node of the cluster.
On the Kubernetes dashboard the NiFi pod is running fine.
I know that StatefulSets give pods stable network identities by means of what is known as a headless service, but how can I access the UI after that? (See the Service sketch after the ZooKeeper manifest below.)
NiFi StatefulSet YAML file:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nifi
  labels:
    name: nifi
    app: nifi
  annotations:
    app.kubernetes.io/name: nifi
    app.kubernetes.io/part-of: nifi
spec:
  serviceName: nifi
  # replicas: 2
  revisionHistoryLimit: 1
  # strategy:
  #   type: Recreate
  selector:
    matchLabels:
      app: nifi
  template:
    metadata:
      labels:
        app: nifi
    spec:
      automountServiceAccountToken: false
      enableServiceLinks: false
      restartPolicy: Always
      securityContext:
        runAsGroup: 1000
        runAsUser: 1000
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: nifi
        image: XXX
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
          name: nifi
        - containerPort: 8082
          name: cluster
        env:
        - name: "NIFI_SENSITIVE_PROPS_KEY"
          value: "nificlusterbulot"
        - name: NIFI_WEB_HTTP_HOST
          value: "nifi-0.NAMESPACE_NAME.svc.cluster.local"
        - name: NIFI_WEB_HTTP_PORT
          value: "8080"
        - name: NIFI_ANALYTICS_PREDICT_ENABLED
          value: "true"
        - name: NIFI_ELECTION_MAX_CANDIDATES
          value: "2"
        - name: NIFI_ELECTION_MAX_WAIT
          value: "1 min"
        - name: NIFI_CLUSTER_IS_NODE
          value: "true"
        - name: NIFI_JVM_HEAP_INIT
          value: "3g"
        - name: NIFI_JVM_HEAP_MAX
          value: "4g"
        - name: NIFI_CLUSTER_NODE_CONNECTION_TIMEOUT
          value: "2 min"
        - name: NIFI_CLUSTER_PROTOCOL_CONNECTION_HANDSHAKE_TIMEOUT
          value: "2 min"
        - name: NIFI_CLUSTER_NODE_PROTOCOL_MAX_THREADS
          value: "15"
        - name: NIFI_CLUSTER_NODE_PROTOCOL_PORT
          value: "8082"
        - name: NIFI_CLUSTER_NODE_READ_TIMEOUT
          value: "15"
        - name: NIFI_ZK_CONNECT_STRING
          value: "zookeeper:2181"
        - name: NIFI_CLUSTER_NODE_ADDRESS
          value: "nifi-0.nifi.NAMESPACE_NAME.cluster.local"
          # valueFrom:
          #   fieldRef:
          #     fieldPath: status.podIP
        # - name: HOSTNAME
        #   valueFrom:
        #     fieldRef:
        #       fieldPath: status.podIP
        livenessProbe:
          exec:
            command:
            - pgrep
            - java
          initialDelaySeconds: 60
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
          successThreshold: 1
        readinessProbe:
          exec:
            command:
            - pgrep
            - java
          initialDelaySeconds: 180
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
          successThreshold: 1
        resources:
          requests:
            cpu: 400m
            memory: 1Gi
          limits:
            cpu: 500m
            memory: 2Gi
      volumes:
      - name: pv-01
        persistentVolumeClaim:
          claimName: pv-claim
ZooKeeper YAML file:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper
  namespace: namespace_name
  labels:
    name: zookeeper
    app: zookeeper
  # annotations:
  #   app.kubernetes.io/name: zookeeper
  #   app.kubernetes.io/part-of: nifi
spec:
  revisionHistoryLimit: 1
  serviceName: zookeeper
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      automountServiceAccountToken: false
      enableServiceLinks: false
      restartPolicy: Always
      securityContext:
        runAsGroup: 1000
        runAsUser: 1000
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: zookeeper
        image: XXX
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 2181
          name: zk
        - containerPort: 2182
          name: zc
        # - containerPort: 8083
        #   name: web
        - containerPort: 5111
          name: cmd
        env:
        - name: ALLOW_ANONYMOUS_LOGIN
          value: "yes"
        - name: ZOO_ADMINSERVER_ENABLED
          value: "true"
        - name: ZOO_AUTOPURGE_PURGEINTERVAL
          value: "2"
        - name: ZOO_AUTOPURGE_SNAPRETAINCOUNT
          value: "10"
        - name: ZOO_INIT_LIMIT
          value: "10"
        - name: ZOO_STANDALONE_ENABLED
          value: "true"
        - name: ZOO_SYNC_LIMIT
          value: "6"
        - name: ZOO_TICK_TIME
          value: "4000"
        livenessProbe:
          exec:
            command:
            - which
            - java
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
          successThreshold: 1
        readinessProbe:
          tcpSocket:
            port: 2181
          initialDelaySeconds: 120
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
          successThreshold: 1
        resources:
          requests:
            cpu: 200m
            memory: 1Gi
          limits:
            cpu: 300m
            memory: 2Gi
        securityContext:
          allowPrivilegeEscalation: false
          privileged: false
          runAsGroup: 1000
          runAsUser: 1000
      volumes:
      - name: pv-01
        persistentVolumeClaim:
          claimName: pv-claim
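Regarding how to reach the UI once the StatefulSet is up: a StatefulSet is normally paired with a headless Service (matching serviceName: nifi) that provides the stable per-pod DNS names, plus an ordinary Service in front of the web port for the UI. A minimal sketch, assuming the app: nifi label and the ports from the manifest above (the nifi-ui name and the Service type are illustrative, not taken from the original setup):

# Headless Service: gives each pod a stable DNS name such as
# nifi-0.nifi.<namespace>.svc.cluster.local (required by serviceName: nifi).
apiVersion: v1
kind: Service
metadata:
  name: nifi
spec:
  clusterIP: None
  selector:
    app: nifi
  ports:
  - name: nifi
    port: 8080
  - name: cluster
    port: 8082
---
# Separate Service for the web UI; expose it via port-forward, an Ingress,
# or change the type to NodePort/LoadBalancer as needed.
apiVersion: v1
kind: Service
metadata:
  name: nifi-ui
spec:
  type: ClusterIP
  selector:
    app: nifi
  ports:
  - name: http
    port: 8080
    targetPort: 8080

With that in place, the UI can be reached with kubectl port-forward svc/nifi-ui 8080:8080 or through an Ingress/NodePort pointed at nifi-ui; the headless nifi Service is what makes names like nifi-0.nifi.<namespace>.svc.cluster.local resolve, so a separate Service per node is not needed.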

Related

Configuring Keycloak with an external Postgres database

How do we configure Keycloak to use an external Postgres (AWS RDS)?
We deployed it in Kubernetes using the Quarkus distro and updated the DB env variables in our deployment.yaml; however, it is still using the local H2 database and not Postgres.
For better understanding, here is the deployment.yaml we are using:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "5"
    kubectl.kubernetes.io/last-applied-configuration: |
  creationTimestamp: "2022-06-21T16:47:29Z"
  generation: 5
  labels:
    app: keycloak
  name: keycloak
  namespace: kc***
  resourceVersion: "29233550"
  uid: 3634683e-657c-4278-9002-82a3ce64b968
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: keycloak
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: keycloak
    spec:
      containers:
      - args:
        - start
        - --hostname=kc-test.k8.com
        - --https-certificate-file=/opt/pem/cert-pem/cert.pem
        - --https-certificate-key-file=/opt/pem/key-pem/key.pem
        - --log-level=DEBUG
        env:
        - name: KEYCLOAK_ADMIN
          value: ****
        - name: KEYCLOAK_ADMIN_PASSWORD
          value: *****
        - name: PROXY_ADDRESS_FORWARDING
          value: "true"
        - name: DB_ADDR
          value: jdbc:postgresql://database.c**7irl*****.us-east-1.rds.amazonaws.com/database
        - name: DB_DATABASE
          value: ****
        - name: DB_USER
          value: postgres
        - name: DB_SCHEMA
          value: public
        - name: DB_VENDOR
          value: POSTGRES
        - name: JGROUPS_DISCOVERY_PROTOCOL
          value: dns.DNS_PING
        - name: JGROUPS_DISCOVERY_PROPERTIES
          value: dns_query=keycloak
        - name: CACHE_OWNERS_COUNT
          value: "2"
        - name: CACHE_OWNERS_AUTH_SESSIONS_COUNT
          value: "2"
        image: quay.io/keycloak/keycloak:17.0.0
        imagePullPolicy: IfNotPresent
        name: keycloak
        ports:
        - containerPort: 7600
          name: jgroups
          protocol: TCP
        - containerPort: 8080
          name: http
          protocol: TCP
        - containerPort: 8443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /realms/master
            port: 8443
            scheme: HTTPS
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        resources: {}
        securityContext:
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /opt/pem/key-pem
          name: key-pem
        - mountPath: /opt/pem/cert-pem
          name: cert-pem
        - mountPath: /opt/keycloak/data
          name: keydata
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: key-pem
        name: key-pem
      - configMap:
          defaultMode: 420
          name: cert-pem
        name: cert-pem
      - emptyDir: {}
        name: keydata
status:
  availableReplicas: 3
  conditions:
  - lastTransitionTime: "2022-06-21T18:02:32Z"
    lastUpdateTime: "2022-06-21T18:02:32Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2022-06-21T18:01:53Z"
    lastUpdateTime: "2022-06-21T18:16:41Z"
    message: ReplicaSet "keycloak-5c84476694" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 5
  readyReplicas: 3
  replicas: 3
  updatedReplicas: 3
Is your external DB also in the same namespace?
If yes, you can use the approach below.
The Kubernetes secret for the external Postgres (AWS RDS), named database-secret-name below, contains all of the details listed.
Using this method, the values are fetched dynamically from the secret.
env:
- name: DB_DATABASE
  valueFrom:
    secretKeyRef:
      name: database-secret-name
      key: dbname
- name: DB_ADDR
  valueFrom:
    secretKeyRef:
      name: database-secret-name
      key: host
- name: DB_PORT
  valueFrom:
    secretKeyRef:
      name: database-secret-name
      key: port
- name: DB_PASSWORD
  valueFrom:
    secretKeyRef:
      name: database-secret-name
      key: password
- name: DB_USER
  valueFrom:
    secretKeyRef:
      name: database-secret-name
      key: user
If your external Postgres is in a different namespace, copy your database secret into the Keycloak namespace and give it a try.
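For reference, a minimal sketch of what such a secret could look like; the secret name and key names simply mirror the snippet above, and every value is a placeholder to be replaced with the real RDS details:

# Create this in the same namespace as the Keycloak Deployment.
apiVersion: v1
kind: Secret
metadata:
  name: database-secret-name
type: Opaque
stringData:
  dbname: keycloak            # placeholder
  host: <rds-endpoint>        # placeholder, e.g. the RDS hostname from DB_ADDR
  port: "5432"                # placeholder
  user: postgres              # placeholder
  password: <db-password>     # placeholder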
DB_ADDR is an env variable for Keycloak versions 16 and older. Use the docs for your Keycloak version: https://www.keycloak.org/server/all-config
Keycloak 17+ has KC_DB_URL:
db-url
The full database JDBC URL.
If not provided, a default URL is set based on the selected database vendor. For instance, if using 'postgres', the default JDBC URL would be 'jdbc:postgresql://localhost/keycloak'.
CLI: --db-url
Env: KC_DB_URL
Of course, also configure the other env variables properly for your Keycloak version.
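Putting that together for the Deployment in the question, a sketch of Quarkus-style (Keycloak 17+) database settings using the documented KC_DB* options; the URL reuses the masked RDS host from the question, and the secret reference assumes the database-secret-name secret from the earlier answer:

env:
# Quarkus-based Keycloak (17+) reads KC_* options instead of DB_VENDOR/DB_ADDR.
- name: KC_DB
  value: postgres
- name: KC_DB_URL
  value: jdbc:postgresql://database.c**7irl*****.us-east-1.rds.amazonaws.com/database
- name: KC_DB_USERNAME
  value: postgres
- name: KC_DB_PASSWORD
  valueFrom:
    secretKeyRef:
      name: database-secret-name
      key: password
# Note: early 17.x releases may also need 'start --auto-build' (or a prior
# 'kc.sh build') for a database vendor change to take effect.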

AKS - Pods created by HPA trigger are getting terminated immediately after they are created

When we looked into the events in AKS, we observed the error below for all the pods that were created and then terminated:
2m47s Warning FailedMount pod/app-fd6c6b8d9-ssr2t Unable to attach or mount volumes: unmounted volumes=[log-volume config-volume log4j2 secrets-app-inline kube-api-access-z49xc], unattached volumes=[log-volume config-volume log4j2 secrets-app-inline kube-api-access-z49xc]: timed out waiting for the condition
We already have 2 replicas running for the application, so we don't think the error is due to the access modes of the volumes.
Below is the HPA config:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: app-cpu-hpa
  namespace: namespace-dev
spec:
  maxReplicas: 5
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageValue: 500m
Below is the deployment config:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  labels:
    app: app
    group: app
    obs: appd
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      annotations:
        container.apparmor.security.beta.kubernetes.io/app: runtime/default
      labels:
        app: app
        group: app
        obs: appd
    spec:
      containers:
      - name: app
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 1000
          runAsGroup: 2000
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        resources:
          limits:
            cpu: {{ .Values.app.limits.cpu }}
            memory: {{ .Values.app.limits.memory }}
          requests:
            cpu: {{ .Values.app.requests.cpu }}
            memory: {{ .Values.app.requests.memory }}
        env:
        - name: LOG_DIR_PATH
          value: /opt/apps/
        volumeMounts:
        - name: log-volume
          mountPath: /opt/apps/app/logs
        - name: config-volume
          mountPath: /script/start.sh
          subPath: start.sh
        - name: log4j2
          mountPath: /opt/appdynamics-java/ver21.9.0.33073/conf/logging/log4j2.xml
          subPath: log4j2.xml
        - name: secrets-app-inline
          mountPath: "/mnt/secrets-app"
          readOnly: true
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /actuator/info
            port: {{ .Values.metrics.port }}
            scheme: "HTTP"
            httpHeaders:
            - name: Authorization
              value: "Basic XXX50aXXXXXX=="
            - name: cache-control
              value: "no-cache"
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
          initialDelaySeconds: 60
        livenessProbe:
          httpGet:
            path: /actuator/info
            port: {{ .Values.metrics.port }}
            scheme: "HTTP"
            httpHeaders:
            - name: Authorization
              value: "Basic XXX50aXXXXXX=="
            - name: cache-control
              value: "no-cache"
          initialDelaySeconds: 300
          periodSeconds: 5
          timeoutSeconds: 1
          successThreshold: 1
          failureThreshold: 3
      volumes:
      - name: log-volume
        persistentVolumeClaim:
          claimName: {{ .Values.apppvc.name }}
      - name: config-volume
        configMap:
          name: {{ .Values.configmap.name }}-configmap
          defaultMode: 0755
      - name: secrets-app-inline
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: "app-kv-secret"
          nodePublishSecretRef:
            name: secrets-app-creds
      - name: log4j2
        configMap:
          name: log4j2
          defaultMode: 0755
      restartPolicy: Always
      imagePullSecrets:
      - name: {{ .Values.imagePullSecrets }}
Can someone please let me know where the config might be going wrong?
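A side note on the HPA above, independent of the mount error: autoscaling/v2beta1 has been removed from Kubernetes (as of 1.25). A sketch of the same autoscaler expressed against the current autoscaling/v2 API, which replaces targetAverageValue with a target block (object names reused from the question):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-cpu-hpa
  namespace: namespace-dev
spec:
  minReplicas: 2
  maxReplicas: 5
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: AverageValue
        averageValue: 500m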

cp: cannot stat '/opt/flink/opt/flink-metrics-prometheus-*.jar': No such file or directory in apache flink

I am upgrading Apache Flink 1.10 to Apache Flink 1.11 in Kubernetes, but the jobmanager pod log shows:
cp: cannot stat '/opt/flink/opt/flink-metrics-prometheus-*.jar': No such file or directory
This is my jobmanager deployment YAML:
kind: Deployment
apiVersion: apps/v1
metadata:
  name: report-flink-jobmanager
  namespace: middleware
  selfLink: /apis/apps/v1/namespaces/middleware/deployments/report-flink-jobmanager
  uid: b7bd8f0d-cddb-44e7-8bbe-b96e68dbfbcd
  resourceVersion: '13655071'
  generation: 44
  creationTimestamp: '2020-06-08T02:11:33Z'
  labels:
    app.kubernetes.io/instance: report-flink
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flink
    app.kubernetes.io/version: 1.10.0
    component: jobmanager
    helm.sh/chart: flink-0.1.15
  annotations:
    deployment.kubernetes.io/revision: '6'
    meta.helm.sh/release-name: report-flink
    meta.helm.sh/release-namespace: middleware
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: report-flink
      app.kubernetes.io/name: flink
      component: jobmanager
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: report-flink
        app.kubernetes.io/name: flink
        component: jobmanager
    spec:
      volumes:
        - name: flink-config-volume
          configMap:
            name: report-flink-config
            items:
              - key: flink-conf.yaml
                path: flink-conf.yaml.tpl
              - key: log4j.properties
                path: log4j.properties
              - key: security.properties
                path: security.properties
            defaultMode: 420
        - name: flink-pro-persistent-storage
          persistentVolumeClaim:
            claimName: flink-pv-claim
      containers:
        - name: jobmanager
          image: 'flink:1.11'
          command:
            - /bin/bash
            - '-c'
            - >-
              cp /opt/flink/opt/flink-metrics-prometheus-*.jar
              /opt/flink/opt/flink-s3-fs-presto-*.jar /opt/flink/lib/ && wget
              https://repo1.maven.org/maven2/com/github/oshi/oshi-core/3.4.0/oshi-core-3.4.0.jar
              -O /opt/flink/lib/oshi-core-3.4.0.jar && wget
              https://repo1.maven.org/maven2/net/java/dev/jna/jna/5.4.0/jna-5.4.0.jar
              -O /opt/flink/lib/jna-5.4.0.jar && wget
              https://repo1.maven.org/maven2/net/java/dev/jna/jna-platform/5.4.0/jna-platform-5.4.0.jar
              -O /opt/flink/lib/jna-platform-5.4.0.jar && cp
              $FLINK_HOME/conf/flink-conf.yaml.tpl
              $FLINK_HOME/conf/flink-conf.yaml && $FLINK_HOME/bin/jobmanager.sh
              start; while :; do if [[ -f $(find log -name '*jobmanager*.log'
              -print -quit) ]]; then tail -f -n +1 log/*jobmanager*.log; fi;
              done
          workingDir: /opt/flink
          ports:
            - name: blob
              containerPort: 6124
              protocol: TCP
            - name: rpc
              containerPort: 6123
              protocol: TCP
            - name: ui
              containerPort: 8081
              protocol: TCP
            - name: metrics
              containerPort: 9999
              protocol: TCP
          env:
            - name: JVM_ARGS
              value: '-Djava.security.properties=/opt/flink/conf/security.properties'
            - name: FLINK_POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
            - name: APOLLO_META
              valueFrom:
                configMapKeyRef:
                  name: pro-config
                  key: apollo.meta
            - name: ENV
              valueFrom:
                configMapKeyRef:
                  name: pro-config
                  key: env
          resources: {}
          volumeMounts:
            - name: flink-config-volume
              mountPath: /opt/flink/conf/flink-conf.yaml.tpl
              subPath: flink-conf.yaml.tpl
            - name: flink-config-volume
              mountPath: /opt/flink/conf/log4j.properties
              subPath: log4j.properties
            - name: flink-config-volume
              mountPath: /opt/flink/conf/security.properties
              subPath: security.properties
            - name: flink-pro-persistent-storage
              mountPath: /opt/flink/data/
          livenessProbe:
            tcpSocket:
              port: 6124
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 15
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            tcpSocket:
              port: 6123
            initialDelaySeconds: 20
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      serviceAccountName: jobmanager
      serviceAccount: jobmanager
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: Recreate
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
status:
  observedGeneration: 44
  replicas: 1
  updatedReplicas: 1
  unavailableReplicas: 1
  conditions:
    - type: Available
      status: 'False'
      lastUpdateTime: '2020-08-19T06:26:56Z'
      lastTransitionTime: '2020-08-19T06:26:56Z'
      reason: MinimumReplicasUnavailable
      message: Deployment does not have minimum availability.
    - type: Progressing
      status: 'False'
      lastUpdateTime: '2020-08-19T06:42:56Z'
      lastTransitionTime: '2020-08-19T06:42:56Z'
      reason: ProgressDeadlineExceeded
      message: >-
        ReplicaSet "report-flink-jobmanager-7b8b9bd6bb" has timed out
        progressing.
Should I remove the missing jar from the command? How do I fix this?
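The layout of the optional jars under /opt/flink can differ between Flink images, so before editing the cp line it may help to see exactly which jars the flink:1.11 image actually ships. A throwaway pod like the sketch below (the pod name is just illustrative) lists the contents of the opt and plugins directories and then exits:

# One-off pod to inspect the optional jars bundled in the flink:1.11 image.
apiVersion: v1
kind: Pod
metadata:
  name: flink-opt-inspect
  namespace: middleware
spec:
  restartPolicy: Never
  containers:
    - name: inspect
      image: flink:1.11
      # '|| true' keeps the pod from failing if one of the directories is absent.
      command: ["/bin/bash", "-c", "ls -lR /opt/flink/opt /opt/flink/plugins || true"]

Check the pod log (kubectl logs flink-opt-inspect -n middleware) and adjust the cp paths in the command to whatever is actually present.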

Sometimes missing desiredContainers

My Zalenium is deployed in Kubernetes. I have set the option desiredContainers = 2 and it works, but sometimes the desired containers are not available. Tests still run properly even when the desired containers are missing. After a "restart" the containers appear again, but I have no idea why they sometimes disappear. Does anyone have an idea what's going on?
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: zalenium
  namespace: zalenium-omdc
  selfLink: /apis/extensions/v1beta1/namespaces/zalenium-omdc/deployments/zalenium
  uid: cbafe254-3e28-4889-a09e-ccfa500ff628
  resourceVersion: '25201258'
  generation: 24
  creationTimestamp: '2019-09-17T13:24:52Z'
  labels:
    app: zalenium
    instance: zalenium
  annotations:
    deployment.kubernetes.io/revision: '24'
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"app":"zalenium","instance":"zalenium"},"name":"zalenium","namespace":"zalenium-omdc"},"spec":{"replicas":1,"selector":{"matchLabels":{"instance":"zalenium"}},"template":{"metadata":{"labels":{"app":"zalenium","instance":"zalenium"}},"spec":{"containers":[{"args":["start"],"env":[{"name":"ZALENIUM_KUBERNETES_CPU_REQUEST","value":"250m"},{"name":"ZALENIUM_KUBERNETES_CPU_LIMIT","value":"1000m"},{"name":"ZALENIUM_KUBERNETES_MEMORY_REQUEST","value":"500Mi"},{"name":"ZALENIUM_KUBERNETES_MEMORY_LIMIT","value":"2Gi"},{"name":"DESIRED_CONTAINERS","value":"2"},{"name":"MAX_DOCKER_SELENIUM_CONTAINERS","value":"16"},{"name":"SELENIUM_IMAGE_NAME","value":"elgalu/selenium"},{"name":"VIDEO_RECORDING_ENABLED","value":"true"},{"name":"SCREEN_WIDTH","value":"1440"},{"name":"SCREEN_HEIGHT","value":"900"},{"name":"MAX_TEST_SESSIONS","value":"1"},{"name":"NEW_SESSION_WAIT_TIMEOUT","value":"1800000"},{"name":"DEBUG_ENABLED","value":"false"},{"name":"SEND_ANONYMOUS_USAGE_INFO","value":"true"},{"name":"TZ","value":"UTC"},{"name":"KEEP_ONLY_FAILED_TESTS","value":"false"},{"name":"RETENTION_PERIOD","value":"3"}],"image":"dosel/zalenium:3","imagePullPolicy":"IfNotPresent","livenessProbe":{"httpGet":{"path":"/status","port":4444},"initialDelaySeconds":90,"periodSeconds":5,"timeoutSeconds":1},"name":"zalenium","ports":[{"containerPort":4444,"protocol":"TCP"}],"readinessProbe":{"httpGet":{"path":"/status","port":4444},"timeoutSeconds":1},"resources":{"requests":{"cpu":"500m","memory":"500Mi"}},"volumeMounts":[{"mountPath":"/home/seluser/videos","name":"zalenium-videos"},{"mountPath":"/tmp/mounted","name":"zalenium-data"}]}],"serviceAccountName":"zalenium","volumes":[{"emptyDir":{},"name":"zalenium-videos"},{"emptyDir":{},"name":"zalenium-data"}]}}}}
spec:
  replicas: 1
  selector:
    matchLabels:
      instance: zalenium
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: zalenium
        instance: zalenium
    spec:
      volumes:
        - name: zalenium-videos
          emptyDir: {}
        - name: zalenium-data
          emptyDir: {}
      containers:
        - name: zalenium
          image: 'dosel/zalenium:3'
          args:
            - start
          ports:
            - containerPort: 4444
              protocol: TCP
          env:
            - name: ZALENIUM_KUBERNETES_CPU_REQUEST
              value: 250m
            - name: ZALENIUM_KUBERNETES_CPU_LIMIT
              value: 1000m
            - name: ZALENIUM_KUBERNETES_MEMORY_REQUEST
              value: 500Mi
            - name: ZALENIUM_KUBERNETES_MEMORY_LIMIT
              value: 2Gi
            - name: DESIRED_CONTAINERS
              value: '2'
            - name: MAX_DOCKER_SELENIUM_CONTAINERS
              value: '16'
            - name: SELENIUM_IMAGE_NAME
              value: elgalu/selenium
            - name: VIDEO_RECORDING_ENABLED
              value: 'false'
            - name: SCREEN_WIDTH
              value: '1920'
            - name: SCREEN_HEIGHT
              value: '1080'
            - name: MAX_TEST_SESSIONS
              value: '1'
            - name: NEW_SESSION_WAIT_TIMEOUT
              value: '7200000'
            - name: DEBUG_ENABLED
              value: 'false'
            - name: SEND_ANONYMOUS_USAGE_INFO
              value: 'true'
            - name: TZ
              value: UTC
            - name: KEEP_ONLY_FAILED_TESTS
              value: 'false'
            - name: RETENTION_PERIOD
              value: '3'
            - name: SEL_BROWSER_TIMEOUT_SECS
              value: '7200'
            - name: BROWSER_STACK_WAIT_TIMEOUT
              value: 120m
          resources:
            limits:
              memory: 1Gi
            requests:
              cpu: 500m
              memory: 500Mi
          volumeMounts:
            - name: zalenium-videos
              mountPath: /home/seluser/videos
            - name: zalenium-data
              mountPath: /tmp/mounted
          livenessProbe:
            httpGet:
              path: /status
              port: 4444
              scheme: HTTP
            initialDelaySeconds: 90
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /status
              port: 4444
              scheme: HTTP
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        dedicated: omdc
      serviceAccountName: zalenium
      serviceAccount: zalenium
      securityContext: {}
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: dedicated
                    operator: In
                    values:
                      - omdc
      schedulerName: default-scheduler
      tolerations:
        - key: dedicated
          operator: Equal
          value: omdc
          effect: NoSchedule
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
status:
  observedGeneration: 24
  replicas: 1
  updatedReplicas: 1
  readyReplicas: 1
  availableReplicas: 1
  conditions:
    - type: Available
      status: 'True'
      lastUpdateTime: '2019-10-22T06:57:52Z'
      lastTransitionTime: '2019-10-22T06:57:52Z'
      reason: MinimumReplicasAvailable
      message: Deployment has minimum availability.
    - type: Progressing
      status: 'True'
      lastUpdateTime: '2019-10-31T09:14:01Z'
      lastTransitionTime: '2019-09-17T13:24:52Z'
      reason: NewReplicaSetAvailable
      message: ReplicaSet "zalenium-6df85c7f49" has successfully progressed.
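As a side note, the manifest above declares the Deployment through the extensions/v1beta1 API, which was removed in Kubernetes 1.16. On newer clusters the same object has to be declared via apps/v1, which also makes spec.selector mandatory; a minimal sketch of the equivalent object (only the essential fields are shown, the rest of the pod spec carries over unchanged):

# Same Deployment on the non-deprecated API; apps/v1 requires an explicit
# selector that matches the pod template labels.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zalenium
  namespace: zalenium-omdc
  labels:
    app: zalenium
    instance: zalenium
spec:
  replicas: 1
  selector:
    matchLabels:
      instance: zalenium
  template:
    metadata:
      labels:
        app: zalenium
        instance: zalenium
    spec:
      serviceAccountName: zalenium
      containers:
        - name: zalenium
          image: dosel/zalenium:3
          args:
            - start
          ports:
            - containerPort: 4444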

Update Kafka in Kubernetes causes downtime

I'm running a 4-broker Kafka cluster in Kubernetes. The replication factor is 3 and the min ISR is 2.
In addition, there is a producer service (running Spring Stream) generating messages and a consumer service reading from the topic. I tried to update the Kafka cluster with a rolling update, hoping for no downtime, but during the update the producer's log filled up with this error:
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
According to my calculation, when 1 broker is down there shouldn't be a problem, because the min ISR is 2. However, it seems like the producer service is unaware of the rolling update and keeps sending messages to the same broker...
Any ideas how to solve it?
This is my kafka.yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: kafka
  namespace: default
  labels:
    app: kafka
spec:
  serviceName: kafka
  replicas: 4
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: kafka
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9308"
    spec:
      nodeSelector:
        middleware.node: "true"
      imagePullSecrets:
        - name: nexus-registry
      terminationGracePeriodSeconds: 300
      containers:
        - name: kafka
          image: kafka:2.12-2.1.0
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              cpu: 3000m
              memory: 1800Mi
            requests:
              cpu: 2000m
              memory: 1800Mi
          env:
            # Replication
            - name: KAFKA_DEFAULT_REPLICATION_FACTOR
              value: "3"
            - name: KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR
              value: "3"
            - name: KAFKA_MIN_INSYNC_REPLICAS
              value: "2"
            # Protocol Version
            - name: KAFKA_INTER_BROKER_PROTOCOL_VERSION
              value: "2.1"
            - name: KAFKA_LOG_MESSAGE_FORMAT_VERSION
              value: "2.1"
            - name: ENABLE_AUTO_EXTEND
              value: "true"
            - name: KAFKA_DELETE_TOPIC_ENABLE
              value: "true"
            - name: KAFKA_RESERVED_BROKER_MAX_ID
              value: "999999999"
            - name: KAFKA_AUTO_CREATE_TOPICS_ENABLE
              value: "true"
            - name: KAFKA_PORT
              value: "9092"
            - name: KAFKA_ADVERTISED_PORT
              value: "9092"
            - name: KAFKA_NUM_RECOVERY_THREADS_PER_DATA_DIR
              value: "10"
            - name: KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR
              value: "3"
            - name: KAFKA_LOG_RETENTION_BYTES
              value: "1800000000000"
            - name: KAFKA_ADVERTISED_HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: KAFKA_OFFSETS_RETENTION_MINUTES
              value: "10080"
            - name: KAFKA_ZOOKEEPER_CONNECT
              valueFrom:
                configMapKeyRef:
                  name: zk-config
                  key: zk.endpoints
            - name: KAFKA_LOG_DIRS
              value: /kafka/kafka-logs
          ports:
            - name: kafka
              containerPort: 9092
            - name: prometheus
              containerPort: 7071
          volumeMounts:
            - name: data
              mountPath: /kafka
          readinessProbe:
            tcpSocket:
              port: 9092
            timeoutSeconds: 1
            failureThreshold: 12
            initialDelaySeconds: 10
            periodSeconds: 30
            successThreshold: 1
        - name: kafka-exporter
          image: danielqsj/kafka-exporter:latest
          resources:
            requests:
              cpu: 100m
              memory: 100Mi
            limits:
              cpu: 500m
              memory: 500Mi
          ports:
            - containerPort: 9308
  volumeClaimTemplates:
    - metadata:
        name: data
        labels:
          app: kafka
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 2000Gi
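Not a full answer to the producer errors, but a PodDisruptionBudget is commonly paired with rolling updates of a broker StatefulSet like this one, so that voluntary disruptions (node drains, cluster upgrades) never take down more than one broker at a time. NotLeaderForPartitionException itself is a retriable Kafka error, so the producer side typically also needs retries configured to ride out the leader elections during the restart. A minimal sketch, assuming the app: kafka label from the manifest above:

# Allow at most one kafka broker pod to be voluntarily disrupted at a time.
# policy/v1 requires Kubernetes 1.21+; older clusters use policy/v1beta1.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: default
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: kafka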