This blog post details how to configure connection draining for a 'classic' (v1) load balancer using the AWS::ElasticLoadBalancing::LoadBalancer type, like so:
"ElasticLoadBalancer": {
"Type": "AWS::ElasticLoadBalancing::LoadBalancer",
"Properties": {
"ConnectionDrainingPolicy": {
"Enabled": "true",
"Timeout": "300"
},
...
}
}
How can I do this using the version 2 load balancer with type AWS::ElasticLoadBalancingV2::LoadBalancer?
My best guess from the documentation is that I should use LoadBalancerAttributes, but I can't find anything related to connection draining in the list of attributes here.
With the Application Load Balancer (ELB v2), this is configured on TargetGroups via TargetGroupAttributes, and it is called deregistration delay rather than connection draining.
deregistration_delay.timeout_seconds - The amount of time for Elastic Load Balancing to wait before changing the state of a deregistering target from draining to unused. The range is 0-3600 seconds. The default value is 300 seconds.
TargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    TargetGroupAttributes:
      - Key: deregistration_delay.timeout_seconds
        Value: '20'
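For completeness, a slightly more fleshed-out (but still minimal) target group might look like the sketch below; the Port, Protocol and VpcId values are placeholders rather than anything required for the deregistration delay itself:
TargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    Port: 80                  # placeholder backend port
    Protocol: HTTP            # placeholder protocol
    VpcId: !Ref MyVpc         # assumes a VPC resource or parameter named MyVpc
    TargetGroupAttributes:
      - Key: deregistration_delay.timeout_seconds
        Value: '20'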
We are using Linkerd 2.11.1 on Azure AKS Kubernetes. Among other things, there is a Deployment using an Alpine Linux image containing Apache/mod_php/PHP8 serving an API. HTTPS is terminated by Traefik v2 with cert-manager, so incoming traffic to the APIs is on port 80. The Linkerd proxy container is injected as a sidecar.
Recently I noticed that the API containers return 504 errors for a short period of time during a rolling deployment. In the sidecar's log, I found the following:
[ 0.000590s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
[ 0.001062s] INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
[ 0.001078s] INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
[ 0.001081s] INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
[ 0.001083s] INFO ThreadId(01) linkerd2_proxy: Tap interface on 0.0.0.0:4190
[ 0.001085s] INFO ThreadId(01) linkerd2_proxy: Local identity is default.my-api.serviceaccount.identity.linkerd.cluster.local
[ 0.001088s] INFO ThreadId(01) linkerd2_proxy: Identity verified via linkerd-identity-headless.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.001090s] INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
[ 0.014676s] INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity: default.my-api.serviceaccount.identity.linkerd.cluster.local
[ 3674.769855s] INFO ThreadId(01) inbound:server{port=80}: linkerd_app_inbound::detect: Handling connection as opaque timeout=linkerd_proxy_http::version::Version protocol detection timed out after 10s
My guess is that this protocol detection somehow leads to the 504 errors. However, if I add the Linkerd inbound-port annotation to the pod template (Terraform syntax):
resource "kubernetes_deployment" "my_api" {
metadata {
name = "my-api"
namespace = "my-api"
labels = {
app = "my-api"
}
}
spec {
replicas = 20
selector {
match_labels = {
app = "my-api"
}
}
template {
metadata {
labels = {
app = "my-api"
}
annotations = {
"config.linkerd.io/inbound-port" = "80"
}
}
I get the following:
time="2022-03-01T14:56:44Z" level=info msg="Found pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
time="2022-03-01T14:56:44Z" level=info msg="Found pre-existing CSR: /var/run/linkerd/identity/end-entity/csr.der"
[ 0.000547s] INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
thread 'main' panicked at 'Failed to bind inbound listener: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }', /github/workspace/linkerd/app/src/lib.rs:195:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Can somebody tell me why it fails to bind the inbound listener?
Any help is much appreciated,
thanks,
Pascal
Found it: Kubernetes asynchronously sends requests to shut down the pods and to stop sending traffic to them. If a pod shuts down faster than it is removed from the IP lists, it can still receive requests after it has already terminated.
To fix this, I added a preStop lifecycle hook to the application container:
lifecycle {
  pre_stop {
    exec {
      command = ["/bin/sh", "-c", "sleep 5"]
    }
  }
}
and the following annotation to the pod template:
annotations = {
  "config.alpha.linkerd.io/proxy-wait-before-exit-seconds" = "10"
}
Documented here:
https://linkerd.io/2.11/tasks/graceful-shutdown/
and here:
https://blog.gruntwork.io/delaying-shutdown-to-wait-for-pod-deletion-propagation-445f779a8304
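For anyone not using Terraform, here is roughly what the same fix looks like as a plain Kubernetes manifest (a sketch only; the image name is a placeholder and the rest mirrors the my-api Deployment above):
# Sketch: the same fix expressed as a plain Kubernetes manifest instead of Terraform.
# The names mirror the my-api Deployment above; the image is a placeholder.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
  namespace: my-api
spec:
  replicas: 20
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
      annotations:
        config.alpha.linkerd.io/proxy-wait-before-exit-seconds: "10"
    spec:
      containers:
        - name: my-api
          image: my-api:latest   # placeholder image
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]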
annotations = {
  "config.linkerd.io/inbound-port" = "80"
}
I don't think you want this setting. Linkerd will transparently proxy connections without you setting anything.
This setting configures Linkerd's proxy to try to listen on port 80. This would likely conflict with your web server's port configuration; but the specific error you're hitting is that the Linkerd proxy does not run as root and so it does not have permission to bind port 80.
I'd expect it all to work if you removed that annotation :)
I am running my Fabric network on Kubernetes and I have set up CA servers for all the organisations. I am able to register and enroll a user from the CLI, but when I use the fabric-ca-client library with Node.js to register and enroll users, I run into a connection timeout issue. At the same time, the CA server's logs show that it is able to process the request.
Edit 1: I am using the same code provided in fabric-samples to register and enroll the users.
All the pods communicate with each other using Kubernetes services.
This is how my connection profile looks:
"certificateAuthorities": {
"ca-org2": {
"url": "https://ca-org2:8054",
"caName": "ca-org2",
"tlsCACerts": {
"pem": ["-----BEGIN CERTIFICATE-----\nMIICBjCCAa2gAwIBAgIUHwBYatG6KhezYWHxdGgYGqs77PIwCgYIKoZIzj0EAwIw\nYDELMAkGA1UEBhMCVUsxEjAQBgNVBAgTCUhhbXBzaGlyZTEQMA4GA1UEBxMHSHVy\nc2xleTEZMBcGA1UEChMQb3JnMi5leGFtcGxlLmNvbTEQMA4GA1UEAxMHY2Etb3Jn\nMjAeFw0yMTAzMjAxMDI4MDBaFw0zNjAzMTYxMDI4MDBaMGAxCzAJBgNVBAYTAlVL\nMRIwEAYDVQQIEwlIYW1wc2hpcmUxEDAOBgNVBAcTB0h1cnNsZXkxGTAXBgNVBAoT\nEG9yZzIuZXhhbXBsZS5jb20xEDAOBgNVBAMTB2NhLW9yZzIwWTATBgcqhkjOPQIB\nBggqhkjOPQMBBwNCAAQUIABkRhfPdwoy2QrCY3oh8ZuzP5OprZJawVXO2ojid3j4\nC9W4l46QXR5J7iG5MLczguPZWB9dZWygRQdUQeoAo0UwQzAOBgNVHQ8BAf8EBAMC\nAQYwEgYDVR0TAQH/BAgwBgEB/wIBATAdBgNVHQ4EFgQURx/h3nkH0fq+3TlRPnQW\nWTHbR7YwCgYIKoZIzj0EAwIDRwAwRAIgCF+vcLFERb+VHa6Att0rh5yhpMd0bHEn\nmkNo0YfKuX4CICodtpp6AKtNWXreskaN+kRMH8eDmwvxkhvTK68ejv8U\n-----END CERTIFICATE-----\n"]
},
"httpOptions": {
"verify": false
}
}
}
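For reference, the fabric-samples helper builds the CA client from this section roughly like the sketch below (ccp is assumed to hold the parsed connection profile JSON above; my actual code follows the samples):
// Sketch based on the fabric-samples CA helper; `ccp` is assumed to hold the
// parsed connection profile JSON shown above.
const FabricCAServices = require('fabric-ca-client');

const caInfo = ccp.certificateAuthorities['ca-org2'];
const caTLSCACerts = caInfo.tlsCACerts.pem;

// This is the client that register/enroll calls are made against.
const caClient = new FabricCAServices(
    caInfo.url,
    { trustedRoots: caTLSCACerts, verify: false },
    caInfo.caName
);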
I found the solution to this issue. The problem was the connection timeout: my CA server was receiving the requests and was able to process them, but because of the short timeout the request was being cancelled on the client side. The solution was to increase the connection-timeout and request-timeout. The default value of both timeouts is 3s; I increased it to 30s and it started working. The default configuration can be found here:
{
  "request-timeout": 3000,
  "tcert-batch-size": 10,
  "crypto-hash-algo": "SHA2",
  "crypto-keysize": 256,
  "crypto-hsm": false,
  "connection-timeout": 3000
}
We can update the timeout values in the source code of the fabric-ca-client library, or simply use the fabric-common library to override these configuration values, like this:
const { Utils: utils } = require('fabric-common');
const path = require('path');

// Load our overrides from config.json on top of the default configuration.
const config = utils.getConfig();
config.file(path.resolve(__dirname, 'config.json'));
And here is our modified configuration file config.json
{
  "request-timeout": 30000,
  "tcert-batch-size": 10,
  "crypto-hash-algo": "SHA2",
  "crypto-keysize": 256,
  "crypto-hsm": false,
  "connection-timeout": 30000
}
I am trying to experiment with scaling one of my application pods running on my Raspberry Pi Kubernetes cluster using HPA + custom metrics, but I ran into several issues. Despite reading the documentation at https://github.com/DirectXMan12/k8s-prometheus-adapter and troubleshooting for the past two days, I am still having difficulty grasping why some of these problems are happening.
Firstly, I built an ARM-compatible image of k8s-prometheus-adapter and installed it using Helm. I can confirm it's running properly by checking the pod logs.
I have also set up a script which sends the Raspberry Pis' temperatures to Pushgateway, and I can query them via the Prometheus query node_temp, which returns the following series:
node_temp{job="kube4"} 42
node_temp{job="kube1"} 44
node_temp{job="kube2"} 39
node_temp{job="kube3"} 40
Now I want to be able to scale one of my application pods using the above temperature values, as an experiment to understand better how it works.
Below is my k8s-prometheus-adapter Helm values.yml file:
image:
  repository: jaanhio/k8s-prometheus-adapter-arm
  tag: latest
logLevel: 7
prometheus:
  url: http://10.17.0.12
rules:
  default: false
  custom:
    - seriesQuery: 'etcd_object_counts'
      resources:
        template: <<.Resource>>
      name:
        as: "etcd_object"
      metricsQuery: count(etcd_object_counts)
    - seriesQuery: 'node_temp'
      resources:
        template: <<.Resource>>
      name:
        as: "node_temp"
      metricsQuery: count(node_temp)
After installing via Helm, I ran kubectl get apiservices and can see v1beta1.custom.metrics.k8s.io listed.
I then ran kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq and got the following:
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "jobs.batch/node_temp",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "jobs.batch/etcd_object",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
I then tried to query the value of the registered node_temp metric using kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/jobs/*/node_temp but got the following response:
Error from server (InternalError): Internal error occurred: unable to list matching resources
Questions:
Why is the node_temp metric associated with the jobs.batch resource type?
Why am I not able to retrieve the value of the metric via kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/jobs/*/node_temp?
What is a definitive way of figuring out the path for the query? e.g. /apis/custom.metrics.k8s.io/v1beta1/jobs/*/node_temp. I kind of trial-and-errored until I got something resembling a response. I also see some other paths with namespaces in the query, e.g. /apis/custom.metrics.k8s.io/v1beta1/namespaces/*/metrics/foo_metrics.
Any help and advice will be greatly appreciated!
Why is the node_temp metric associated with the jobs.batch resource type?
It picks up the labels attached to the Prometheus metrics and tries to interpret them; in this case you clearly have job="kube4".
Why am I not able to retrieve the value of the metric via kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/jobs/*/node_temp?
The metrics are namespaced (see "namespaced": true in the output above), so you'll need /apis/custom.metrics.k8s.io/v1beta1/namespaces/*/jobs/*/node_temp.
What is a definitive way of figuring out the path for the query? e.g. /apis/custom.metrics.k8s.io/v1beta1/jobs/*/node_temp. I kind of trial-and-errored until I got something resembling a response. I also see some other paths with namespaces in the query, e.g. /apis/custom.metrics.k8s.io/v1beta1/namespaces/*/metrics/foo_metrics.
Check https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md#api-paths
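Putting that together, a concrete query against the namespaced path could look like the following; the namespace default and object name kube4 are purely illustrative assumptions, and depending on the adapter version the resource segment may need to be the group-qualified form jobs.batch shown in the discovery output above.
# Hypothetical example: "default" and "kube4" are placeholders, not values from your cluster.
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/jobs/kube4/node_temp" | jq
# Or with the group-qualified resource name, as listed by the discovery endpoint:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/jobs.batch/kube4/node_temp" | jq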
As a QA engineer in our company I am a daily user of Kubernetes, and we use Kubernetes Jobs to create performance test pods. One advantage of a Job, according to the docs, is
to create one Job object in order to reliably run one Pod to completion
But in our tests this feature creates endless pods if the previous ones fail, which occupies resources in our team's shared cluster, and deleting such pods takes a lot of time.
Currently the job manifest is like this:
{
  "apiVersion": "batch/v1",
  "kind": "Job",
  "metadata": {
    "name": "upgradeperf",
    "namespace": "ntg6-grpc26-tts"
  },
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "upgradeperfjob",
            "image": "mycompany.com:5000/ncs-cd-qa/upgradeperf:0.1.1",
            "command": [
              "python",
              "/jmeterwork/jmeter.py",
              "-gu",
              "git#gitlab-pri-eastus2.dev.mycompany.net:mobility-ncs-tools/tts-cdqa-tool.git",
              "-gb",
              "upgradeperf",
              "-t",
              "JMeter/testcases/ttssvc/JMeterTestPlan_ttssvc_cmpsize.jmx",
              "-JtestDataFile",
              "JMeter/testcases/ttssvc/testData/avaml_opus.csv",
              "-JthreadNum",
              "3",
              "-JthreadLoopCount",
              "1500",
              "-JresultsFile",
              "results_upgradeperf_cavaml_opus_t3_l1500.csv",
              "-Jhost",
              "mtl-blade32-03.mycompany.com",
              "-Jport",
              "28416"
            ]
          }
        ],
        "restartPolicy": "Never",
        "imagePullSecrets": [
          {
            "name": "docker-registry-secret"
          }
        ]
      }
    }
  }
}
In some cases, such as a misconfigured IP/port, 'reliably run one Pod to completion' is impossible, and recreating pods is a waste of time and resources.
So is it possible to limit a Kubernetes Job to creating a maximum number of pods (say 3) when they always fail, and if so, how?
Depending on your Kubernetes version, you can resolve this problem with one of these methods:
Set restartPolicy: OnFailure, then the failed container will be restarted in the same Pod, so you will not get lots of failed Pods; instead you will get one Pod with lots of restarts.
From Kubernetes 1.8 on, there is a backoffLimit parameter to control the retry behaviour of a failed Job. It defines how many times the Job is retried before it is treated as failed (default 6). For this parameter to work you must set restartPolicy: Never (see the sketch after this list).
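A minimal sketch of the manifest from the question with a backoffLimit added; the JMeter command is left out here for brevity, and backoffLimit: 3 is just the example cap from the question:
{
  "apiVersion": "batch/v1",
  "kind": "Job",
  "metadata": {
    "name": "upgradeperf",
    "namespace": "ntg6-grpc26-tts"
  },
  "spec": {
    "backoffLimit": 3,
    "template": {
      "spec": {
        "containers": [
          {
            "name": "upgradeperfjob",
            "image": "mycompany.com:5000/ncs-cd-qa/upgradeperf:0.1.1"
          }
        ],
        "restartPolicy": "Never",
        "imagePullSecrets": [
          {
            "name": "docker-registry-secret"
          }
        ]
      }
    }
  }
}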
You probably didn't set restartPolicy: Never in your pod spec; add that and I would expect it to match your expected behaviour better.
Where can I find a list of performance counter names to be used with a Service Fabric cluster? There is a list published here, but I would need the exact names to use in the cluster's ARM template. Currently I have the following configuration in the template:
"WadCfg": {
"DiagnosticMonitorConfiguration": {
"overallQuotaInMB": "1000",
"sinks": "applicationInsights",
"DiagnosticInfrastructureLogs": {},
"PerformanceCounters": {
"PerformanceCounterConfiguration": [
{
"counterSpecifier": "\\Processor(_Total)\\% Processor Time",
"sampleRate": "PT3M"
},
{
"counterSpecifier": "\\Memory\\Available MBytes",
"sampleRate": "PT3M"
}
]
}
But only the "Memory\Available MBytes" actually shows up in Application Insights.
Those counters are the actual Windows performance counters, so you just need to look them up. Some examples:
http://techgenix.com/Key-Performance-Monitor-Counters/
http://www.appadmintools.com/documents/windows-performance-counters-explained/
Judging by all this information, the performance counters all follow the same pattern:
first column\second column
\\Processor(_Total)\\% Processor Time
\\Memory\\Available MBytes
\\Network Interface(*)\\Bytes Received/sec
...
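Translated back into the template's counterSpecifier format, another counter from the pattern above would look like this (whether this particular counter actually shows up in Application Insights is something you would still need to verify):
{
  "counterSpecifier": "\\Network Interface(*)\\Bytes Received/sec",
  "sampleRate": "PT3M"
}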
You might be able to find some more counters by running typeperf directly on the Service Fabric VM and capturing the output. You can also run it locally to get an idea of what is available.
http://defaultreasoning.com/2009/06/25/list-all-performance-counters-on-a-windows-computer-and-export-it-to-a-file/
C:\>TypePerf.exe -q > counters.txt
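If you also want the available instance names (the part in parentheses), typeperf has a -qx switch that expands counters with their instances:
C:\>TypePerf.exe -qx > counters_with_instances.txt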