Fargate service stops because "ELB health check" fails

I'm brand new to the AWS world and I have an issue with my Fargate task: it is always stopped because the health check seems to fail:
Task failed ELB health checks in (target-group arn:aws:elasticloadbalancing:REGION:IDENTIFIER:targetgroup/TG_NAME/TG_ID)
I've read a lot of posts and run a lot of tests before posting this... and now I'm hoping I'm missing something obvious to someone more familiar with AWS.
Here is where I am:
My (Fargate) service is in a security group with these inbound rules:
TYPE PROTOCOL PORT_RANGE SOURCE
--------------------------------------------
HTTP TCP 80 0.0.0.0/0 // normally, only this one
All traffic All All 0.0.0.0/0 // but because I'm quite desperate
All traffic All All ::/0
The associated target group has a health check defined like this:
Protocol: HTTP
Route: /awshealth
Port: Traffic port
...
Success codes: 200
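With that configuration, all the target group needs is an HTTP 200 from /awshealth on the traffic port. The question doesn't say what the server is written in, so this Flask handler is purely illustrative of what the check expects:

# Hypothetical minimal health endpoint (Flask is used only for illustration;
# the actual server framework isn't stated in the question).
from flask import Flask

app = Flask(__name__)

@app.route("/awshealth")
def awshealth():
    # The target group only looks for a 200 status code.
    return "OK", 200

if __name__ == "__main__":
    # Listen on the traffic port (80 here, matching the security group rule).
    app.run(host="0.0.0.0", port=80)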
From my logs, I know that my /awshealth route is called and answers with status 200.
Nevertheless, my task stops after some time because of a health check failure (even though I could reach my server via the public DNS name associated with my load balancer right up until that moment).
Could anyone help me fix this?
Thanks in advance!
Note 1: My load balancer is associated with all my Availability Zones (and all my subnets) and shares the same VPC and security groups as my service.
Note 2: The service needs a few minutes to start, so I've added a health check grace period of 300 seconds to my service.
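(For reference, that grace period can also be set from the CLI; the cluster and service names below are placeholders.)

aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --health-check-grace-period-seconds 300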

It was a memory issue.
The server was starting correctly (which explains the 200 statuses on my /awshealth route)... but a few minutes later the CPU was running at 100% and the server shut down, which caused my service to stop.
I just added some memory and everything is OK now.
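For anyone landing here with the same symptom: on Fargate, memory (and CPU) is set at the task-definition level, so "adding memory" means registering a new task definition revision. A trimmed sketch with placeholder names and sizes:

{
  "family": "my-task",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "my-image:latest",
      "portMappings": [{ "containerPort": 80, "protocol": "tcp" }]
    }
  ]
}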

Related

Failed to accept an incoming connection: connection from "9.42.x.x" rejected, allowed hosts: "zabbix-server"

SUMMARY
I have installed Zabbix on an OpenShift cluster. I am trying to monitor a host (VM) outside the cluster, but the Zabbix server is unable to connect to it. In the /etc/zabbix/zabbix_agentd.conf file I have set the DNS name of the server, zabbix-server, but it looks like the server is trying to connect from a different public IP. I am not sure what this IP is.
OS / ENVIRONMENT / Used docker-compose files
I applied the kubernetes.yaml file present in this repo - https://github.com/zabbix/zabbix-docker/blob/6.2/kubernetes.yaml - on an OpenShift cluster.
CONFIGURATION
In the /etc/zabbix/zabbix_agentd.conf file Server=zabbix-server.
STEPS TO REPRODUCE
Apply the kubernetes.yaml file on an OpenShift cluster and try to monitor any external VM.
EXPECTED RESULTS
The Zabbix server should be able to connect to the VM.
ACTUAL RESULTS
Zabbix server logs.
Defaulted container "zabbix-server" out of: zabbix-server, zabbix-snmptraps
** Updating '/etc/zabbix/zabbix_server.conf' parameter "DBHost": 'mysql-server'...added
287:20230120:060843.131 Zabbix agent item "system.cpu.load[all,avg5]" on host "Host-C" failed: first network error, wait for 15 seconds
289:20230120:060858.592 Zabbix agent item "system.cpu.num" on host "Host-C" failed: another network error, wait for 15 seconds
289:20230120:060913.843 Zabbix agent item "system.sw.arch" on host "Host-C" failed: another network error, wait for 15 seconds
289:20230120:060929.095 temporarily disabling Zabbix agent checks on host "Host-C": interface unavailable
Logs from the agent installed on the VM:
350446:20230122:103232.230 failed to accept an incoming connection: connection from "9.x.x.219" rejected, allowed hosts: "zabbix-server"
350444:20230122:103332.525 failed to accept an incoming connection: connection from "9.x.x.219" rejected, allowed hosts: "zabbix-server"
350445:20230122:103432.819 failed to accept an incoming connection: connection from "9.x.x.210" rejected, allowed hosts: "zabbix-server"
350446:20230122:103533.114 failed to accept an incoming connection: connection from "9.x.x.217" rejected, allowed hosts: "zabbix-server"
If I add this IP to /etc/zabbix/zabbix_agentd.conf it works. But what IP is this? Is it a service, or a node/pod IP? It keeps changing, so I cannot update the conf file every time. I need something more stable.
Kindly help me out with this issue.
I don't know Zabbix, so I have to make some educated guesses about how both the agent and the server work.
But, to summarize: unlike something like Docker Compose, where you run the Zabbix server on a known machine, in OpenShift/Kubernetes you are deploying into a cluster of machines with their own networking. In other words, the whole point of OpenShift is that it controls where the application's pod gets deployed and will relocate/restart that pod as needed, with a different IP every time. (And the DNS name is meaningless, since the two systems aren't sharing DNS anyway.) Most likely the IPs you are seeing are the pods' randomly assigned addresses.
So what are you to do when an external application requires a predictable IP? Option 1 is to remove that requirement: using something like a certificate is more secure and more reliable than depending on an IP anyway. Another option is to use an egress IP, an OpenShift feature where you essentially use a proxy to present external applications with a consistent IP.
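One more option on the agent side (this comes from the Zabbix documentation, not from OpenShift): the Server parameter accepts comma-separated IP addresses, CIDR subnets, and DNS names, so allowing the cluster's pod network range would at least stop the IP churn. The CIDR below is only a placeholder; substitute your cluster's actual pod/egress network:

# /etc/zabbix/zabbix_agentd.conf on the monitored VM
# 10.128.0.0/14 is a placeholder; use your cluster's real pod or egress CIDR.
Server=zabbix-server,10.128.0.0/14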

K3s dial tcp lookup server misbehaving during letsencrypt staging

After successfully hosting a first service on a single-node cluster, I am trying to add a second service, with its own dnsName.
The first service uses Let's Encrypt successfully, and now I am trying out the second service with a test certificate and the staging endpoint/ClusterIssuer.
The error I am seeing once I describe the Letsencrypt Order is:
Waiting for HTTP-01 challenge propagation: failed to perform self check GET request 'http://example.nl/.well-known/acme-challenge/9kdpAMRFKtp_t8SaCB4fM8itLesLxPkgT58RNeRCwL0': Get "http://example.nl/.well-known/acme-challenge/9kdpAMRFKtp_t8SaCB4fM8itLesLxPkgT58RNeRCwL0": dial tcp: lookup example.nl on 10.43.0.11:53: server misbehaving
The address that is misbehaving points to the internal IP of my service/kube-dns, which means the request has already made it past my service/traefik, I think.
The cluster is running on a VPS, and I have also checked that the example.nl domain name is added to /etc/hosts with the VPS's IP, like so:
206.190.101.190 example1.nl
206.190.101.190 example.nl
The error is a bit vague to me because I do not know exactly what kube-dns is doing and why it thinks the server is misbehaving. Maybe I missed something now that there are two domain names to handle. Can anyone shed some light on it?
Feel free to ask for more ingress or other server config!
Everything was set up correctly; however, this issue definitely had something to do with DNS resolution. Not internally in the k3s cluster, but externally at the domain registrar.
I found it by using https://unboundtest.com for my domain and saw my old nameservers still being used.
I contacted the registrar and they had to change something for the domain in the DNS of the registry.
A pretty unique situation, but maybe helpful for people who also think the solution has to be found internally (inside k3s).
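For anyone debugging something similar, the same stale delegation can be spotted with dig by checking what public DNS currently returns for the domain (example.nl is the placeholder domain from the question):

# Which nameservers does public DNS currently return for the domain?
dig +short NS example.nl

# And does the domain resolve to the VPS IP from a public resolver?
dig +short A example.nl @1.1.1.1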

HAProxy: Random Layer4 timeouts when using httpchk

I'm currently investigating an issue with random "Layer4" timeouts being reported by the health checks used in HAProxy. The backend server being checked is proven to be up and responding at the time of these errors, as other traffic to the server keeps flowing through.
This makes me suspect the issue might be caused by our configuration.
Server health is currently configured as follows:
option httpchk GET /health HTTP/1.1\r\nHost:\ Haproxy\r\nConnection:\ close
http-check expect string OK
server server1 server1.internal.example.com check check-ssl port 443 verify none inter 3s fall 2 backup
Trying to understand the docs, I see the "http-check connect" and "linger" options mentioned. Would the "connect" directive make any actual difference to how the connection for the health check is set up, compared to our current config?
Any other feedback/observations on the above config are welcome.
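For what it's worth, on HAProxy 2.2+ the same check can be written with the http-check send syntax instead of packing headers into option httpchk; this is a sketch of an equivalent config, not a fix for the timeouts:

option httpchk
http-check send meth GET uri /health ver HTTP/1.1 hdr Host haproxy hdr Connection close
http-check expect string OK
server server1 server1.internal.example.com check check-ssl port 443 verify none inter 3s fall 2 backup

As far as I read the docs, http-check connect only changes how the check connection is established when you give it parameters that differ from the server line (another port, ssl/sni, the PROXY protocol, or linger to close the check connection cleanly instead of with an RST); with the check-ssl/port settings kept on the server line as above, adding it should not change the connection setup.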

TCP socket health check instead of HTTP health check on EC2 target group?

I have a TCP service. I created a TCP readiness probe for my service which appears to be working just fine.
Unfortunately, my EC2 target group wants to perform an HTTP health check on my instance. My service doesn't respond to HTTP requests, so my target group is considering my instance unhealthy.
Is there a way to change my target group's health check from "does it return an HTTP success response?" to "can a TCP socket be opened to it?"
(I'm also open to other ways of solving the problem if what I suggested above isn't possible or doesn't make sense.)
TCP is a valid protocol for health checks in two cases:
the classic flavor of the ELB (see docs)
the Network Load Balancer (see docs)
In case you're stuck with the Application Load Balancer, the only idea that comes to mind is to add a sidecar container that responds to HTTP/HTTPS based on your TCP service's status. You could do this with nginx, although it would probably be overkill.
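If a Network Load Balancer is an option, the TCP health check can be configured directly when creating the target group; all names and IDs below are placeholders:

aws elbv2 create-target-group \
  --name my-tcp-targets \
  --protocol TCP \
  --port 8080 \
  --vpc-id vpc-0123456789abcdef0 \
  --target-type instance \
  --health-check-protocol TCP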

Are service addresses available to the dc/os host OS?

I'm trying to have my DC/OS 1.8 Docker containers send log messages to a Logstash instance that is also running in DC/OS, by using the service address of the logstash service.
That doesn't appear to work, as Docker throws an error: logstash.marathon.l4lb.thisdcos.directory: no such host
Are service addresses not exposed to the host systems (or do I need to configure something for this)?
On DC/OS 1.7 I used a fixed host port in my logstash config and logstash.marathon.mesos as the host, but these .marathon.mesos hostnames seem to no longer exist in 1.8.
The service addresses work fine when I use them from within a container (for example to link my Prometheus service to my Alertmanager service), but from the host level they don't exist.
EDIT:
My statement about the missing marathon.mesos URLs was wrong. They do work, but I used the wrong one. For now this kind of fixes my problem: I configured logging using this host and a fixed container port.
For everybody trying the same thing: you have to re-add the fixed host port via JSON mode every time you make changes to the service config in the UI. The fixed host port setting is no longer available in the network tab of the UI, so the DC/OS UI will DELETE the host port config on every load.
Still no idea why the l4lb URLs don't work.
EDIT2
Still no idea, but I figured out that Minuteman generates crash and error logs every other second:
/opt/mesosphere/active/minuteman/minuteman/error.log:
CRASH REPORT Process <0.25809.2> with 0 neighbours exited with reason: {timeout,{gen_server,call,[{lashup_kv,'navstar#10.2.140.216'},{start_kv_sync_fsm,'minuteman#10.2.103.143',<0.25809.2>}]}} in gen_server:call/2 line 204
/opt/mesosphere/active/minuteman/minuteman/log/crash.log
2016-10-12 13:16:49 =CRASH REPORT====
crasher:
initial call: lashup_kv_sync_tx_fsm:init/1
pid: <0.29002.2>
registered_name: []
exception exit: {{timeout,{gen_server,call,[{lashup_kv,'navstar#10.2.140.216'},{start_kv_sync_fsm,'minuteman#10.2.103.143',<0.29002.2>}]}},[{gen_server,call,2,[{file,"gen_server.erl"},{line,204}]},{lashup_kv_sync_tx_fsm,init,1,[{file,"/pkg/src/minuteman/_build/default/lib/lashup/src/lashup_kv_sync_tx_fsm.erl"},{line,23}]},{gen_statem,init_it,6,[{file,"gen_statem.erl"},{line,554}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
ancestors: [lashup_kv_aae_sup,lashup_kv_sup,lashup_platform_sup,lashup_sup,<0.916.0>]
messages: []
links: [<0.992.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 610
stack_size: 27
reductions: 127
neighbours:
The DC/OS UI claims Spartan and Minuteman are healthy, but while the crash.log of the DNS dispatcher is empty, the l4lb gets new crashes every other second.
They should certainly be available from the host OS. Are these hosts running the "Spartan" and "Minuteman" services?
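One quick check from the host OS is to resolve the names from the question directly, to see whether the local resolver answers at all:

# Run on the DC/OS node that fails to resolve the service address
nslookup logstash.marathon.l4lb.thisdcos.directory
nslookup logstash.marathon.mesos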
My problem was twofold:
The l4lb did not run properly; that was only fixed after a total reinstall of the cluster.
The l4lb only supports TCP traffic. Because I wanted to use it to send container logs to Logstash over UDP (docker-gelf only supports UDP), this failed.