Migrating Established TCP connection with docker containers - sockets

Is it possible to transparently migrate an established TCP connection along with the Docker container from one node to another?
My use case is scaling/re-scheduling a web app that relies on WebSockets, but I believe there are more use cases for other application protocols and plain TCP.
What I'm looking for is a way to do it completely transparently for client applications. I'm aware it's possible to reconnect upon disconnection but this is not what I need.
I've been looking at the SockMI agent, but it still seems to be in beta and lacks documentation.
If I understand this correctly, the migration would require the following at a high level:
Trigger scaling action (when it all needs to start)
Launch replacement container on new node
Freeze container's processes on original node
Put TCP connections on hold
Transfer the processes and their state across to new node
Migrate the TCP connection

Is it possible to transparently migrate an established TCP connection ... from one node to another?
No.

Related

Dynamic port mapping for ECS tasks

I want to run a socket program in AWS ECS with the client and server in one task definition. I am able to run it when I use awsvpc network mode and connect to the server on localhost every time. This is good because I don't need to know the IP address of the server. The issue is that the server has to start on some port, and if I run 10 of these tasks, only 3 tasks (= the number of running instances) run at a time. This is clearly because 10 tasks cannot open the same port. I could manually check for open ports before starting the server and somehow write the chosen port to a Docker shared volume where the client can read it and connect. But this seems complicated and adds unnecessary code to my server. For services there is dynamic port mapping via an Application Load Balancer, but there isn't anything for simply running tasks.
How can I run multiple socket programs without having to manage the port number in AWS ECS?
If you're using awsvpc mode, each task gets its own ENI and there shouldn't be any port conflict. But each instance type has a limited number of ENIs available. You can raise that limit by enabling ENI trunking, although it is supported only on a handful of instance types:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/container-instance-eni.html#eni-trunking-supported-instance-types
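For setups where tasks do end up sharing a network namespace, the "find a free port and write it to a shared volume" idea from the question can be kept fairly small. The sketch below is only an illustration: it binds to port 0 so the kernel picks an unused port, then writes that port to a file. The function names and the file location are hypothetical, and the shared volume itself is assumed to be configured separately.

```python
import json
import socket
from pathlib import Path


def start_server_on_free_port(port_file: Path) -> socket.socket:
    """Bind a listening socket to an OS-assigned port and publish the port number."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 asks the kernel to pick any currently unused port.
    server.bind(("127.0.0.1", 0))
    server.listen()
    port = server.getsockname()[1]
    # Hypothetical hand-off: write the port where the client container can read it.
    port_file.write_text(json.dumps({"port": port}))
    return server


def read_server_port(port_file: Path) -> int:
    """Client side: read the port the server published."""
    return json.loads(port_file.read_text())["port"]
```

The client would call read_server_port on the same file and connect to localhost on the returned port, so neither side hard-codes a port number.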

docker swarm - connections from wildfly to postgres randomly hang

I'm experiencing a weird problem when deploying a docker stack (compose file).
I have a three node docker swarm - master and two workers.
All machines are CentOS 7.5 with kernel 3.10.0 and docker 18.03.1-ce.
Most things run on the master, one of which is a wildfly (v9.x) application server.
On one of the workers is a postgres database.
After deploying the stack things work normally, but after a while (or maybe after a specific action in the web app) requests start to hang.
Running netstat -ntp inside the wildfly container shows 52 bytes stuck in the Send-Q:
tcp 0 52 10.0.0.72:59338 10.0.0.37:5432 ESTABLISHED -
On the postgres side the connection is also in ESTABLISHED state, but the send and receive queues are 0.
It's always exactly 52 bytes. I read somewhere that ACK packets with timestamps are also 52 bytes. Is there any way I can verify that?
We have the following sysctl tunables set:
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_timestamps = 0
The first three were needed because of this.
All services in the stack are connected to the same default network that docker creates.
If I move the postgres service onto the same host as the wildfly service, the problem doesn't seem to surface. It also doesn't seem to show if I declare a separate network for postgres and attach it only to the services that need the database (and to the database itself, of course).
Has anyone come across a similar issue? Can anyone provide any pointers on how I can debug the problem further?
Turns out this is a known issue with pooled connections in swarm when the services run on different nodes.
Basically, the workaround is to set the above tunables and enable TCP keepalive on the socket. See here and here for more details.
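As a rough illustration of "enable TCP keepalive on the socket", here is a minimal Python sketch. The helper name enable_keepalive is hypothetical, the defaults simply mirror the sysctl values from the question, and the per-socket TCP_KEEP* options are Linux-specific, so they are guarded. A Java application server would set the equivalent option through its own driver or datasource configuration.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 600,
                     interval: int = 60, probes: int = 3) -> None:
    """Turn on TCP keepalive for one socket (defaults mirror the sysctls above)."""
    # Without SO_KEEPALIVE the kernel never sends probes, whatever the sysctls say.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Optional per-socket overrides of the system-wide values (Linux only).
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

The sysctl values only define the timing; keepalive probes are sent only on sockets that opt in, which is why both parts of the workaround are needed.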

haproxy streaming - no dependency on proxy

I have been testing HAProxy, which does cookie-based load balancing to our streaming servers. But let's say, for example, that HAProxy falls over (unlikely, I know):
the streamer gets disconnected. Is there a way of passing on the connection without relying on HAProxy, basically leaving the streamer connected to the destination and cutting all ties with HAProxy?
That is not possible by design.
HAProxy is a proxy (as the name suggests). As such, for each communication, you have two independent TCP-connections, one between the client and HAProxy and another between HAProxy and your backend server.
If HAProxy fails or you need to failover, the standing connections will have to be re-created. You can't pass over existing connections to another server since there is a lot of state attached to each connection that can't be transferred.
If you want to remove the loadbalancer from the equation after the initial connection initialization, you should look at Layer-3 loadbalancing solutions like LVS on Linux with Direct Routing. Note that these solutions are much less flexible than HAProxy. There is no such thing as a free lunch after all :)
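To make the "two independent TCP connections" point concrete, here is a toy relay in Python. It is not how HAProxy is implemented, only a minimal sketch of the same shape; the names pump and proxy_one_client are made up for this example.

```python
import socket
import threading


def pump(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes from src to dst until src closes, then half-close dst."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def proxy_one_client(listener: socket.socket, backend_addr: tuple) -> None:
    """Accept one client and relay it to the backend over a second connection."""
    client, _ = listener.accept()                     # connection 1: client <-> proxy
    backend = socket.create_connection(backend_addr)  # connection 2: proxy <-> backend
    t = threading.Thread(target=pump, args=(client, backend), daemon=True)
    t.start()
    pump(backend, client)
    t.join()
    client.close()
    backend.close()
```

Both sockets live inside the proxy process, each with its own sequence numbers and buffers. If that process dies, both connections die with it, and neither endpoint has a connection to the other that could take over.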

Amazon EC2 Elastic Load Balancer TCP disconnect after couple of hours

I am testing the reliability of TCP connections using Amazon Elastic Load Balancer compared to not using the Load Balancer to see if it has any impact.
I have set up a small Elastic Load Balancer on Amazon EC2 us-east zones with 8 t2.micro instances, using an auto scaling group without a policy and set to 8 min/max instances.
Each instance runs a simple TCP server that accepts connections on port 8017 and relays to the clients some data coming from another remote server located in my network. The same data is sent to all clients.
For the purpose of the test, the servers running on the micro instances are only sending 1 byte of data every 60 seconds (to be sure the connections don't time out).
I connected multiple clients from various outside networks using the ELB DNS name provided, and after maybe 6-24 hours, I always stop receiving data and eventually the connections all die.
All clients stop around the same time, even though they are on different networks/ISPs. Each "client" application holds about 10 TCP connections, and they all stop receiving data.
All server instances look fine after this happens; they still send data.
To do further testing and eliminate the TCP server code problem, I also have external clients connected directly to the public IP of a single instance, without the ELB, and the data doesn't stop and the connection is not lost in this case (so far).
The Load balancer Idle Timeout is set to 900 seconds.
The Cross-Zone load balancing is enabled and I am using the following zones: us-east-1e, us-east-1b, us-east-1c, us-east-1d
I read the documentation, and searched everywhere to see if this is a known behaviour, but I couldn't find any clear answer or confirmation of others having the same issue, but it seems clear it is happening in my case.
My question: Is this a known/expected behaviour for a TCP load balancer? If not, any idea what could be the problem in my setup?

Can I use an ES client that creates a local node and joins a cluster over TCP on Heroku?

I'm hoping to connect to a remote ElasticSearch cluster from a Scala app running on Heroku. I've never used ES before, but as I understand it, the most efficient way to connect from Java/Scala is to create a local data-less node that joins the cluster you want to query and talks to it over its native TCP interface. Is it possible (and/or allowed) to use such a client on Heroku?
Not at the time of writing; you have to use the HTTP interface. This will restrict your choice of client libraries too.