How to emulate network failures (chaos testing) on clusters with Cilium/eBPF

Could you please provide me with information about the available tools for emulating network failures on Cilium/eBPF-based Service Mesh solutions?
Previously I used Chaos Mesh https://chaos-mesh.org/ but emulating network-related issues (packet delay or loss) doesn't work.
I suppose Chaos Mesh simulates network issues by manipulating /etc/hosts, which Cilium bypasses.
I tested https://github.com/cilium/chaos-testing-examples but it seems to be no longer maintained or supported, and thus doesn't work.
Any help appreciated!
Kind regards

Related

Losing SSH Connectivity repeatedly on Google Compute Engine

I've been coding in VS Code remotely connected to an instance on Google Compute Engine. My internet connection speed is around 30-40 Mbps. What I have observed is that I keep losing the connection to the remote machine very frequently, and that this issue tends to occur when certain memory-intensive operations are run. So,
Question 1: Is there a relationship between RAM and SSH connectivity?
Question 2: Is my internet connection speed a problem? If so, what is the minimum speed necessary for a seamless coding experience?
The only relationship between RAM and the SSH service is that SSH also needs RAM to operate. In your case, you already have a clue: the SSH service crashes from time to time, mostly when memory-intensive operations are being performed. Your machine is running short on resources, so in order to keep the OS up, the out-of-memory killer shuts down processes, and SSH is one of those processes. Once you reset the machine, everything comes back to normal.
With your current speed, the connection is not an issue.
One of the best ways to tackle this is:
increase the resources of your VM (RAM)
then go back to your code and check the requirements and limitations of your app
You can also check the official SSH troubleshooting guide from Google: Troubleshooting SSH

Kubernetes - Monitoring networking latency/bandwidth between servers

I have a Kubernetes cluster running in Google Cloud consisting of 4 servers. What is the easiest way to monitor networking latency/bandwidth between the servers?
Is it possible to have a "scriptable" solution so I can repeat the deployment on different clusters in the future with minimal overhead?
Thanks
PS - Kind of new at this so apologies if I didn't get the terms exact
Have a look at this post: http://paulbakker.io/docker/docker-cloud-network-performance/
They use a Docker image on each host and test network performance between two machines. Maybe that's something you could do and even script ;)
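If you want something scriptable without pulling in that image, a very rough sketch of the latency half is below (assuming a Node.js runtime on the nodes, with placeholder peer IPs and port; for bandwidth you would still reach for something like iperf, as the linked post does):

```typescript
// latency-probe.ts - rough TCP connect latency check between nodes
// (hypothetical peer IPs and port; substitute your own)
import net from "node:net";

const peers = ["10.128.0.2", "10.128.0.3", "10.128.0.4"]; // placeholder node IPs
const port = 22; // any port with a listening service on each peer

function connectTimeMs(host: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const start = process.hrtime.bigint();
    const socket = net.connect({ host, port }, () => {
      socket.destroy();
      resolve(Number(process.hrtime.bigint() - start) / 1e6);
    });
    socket.setTimeout(2000, () => {
      socket.destroy();
      reject(new Error("timeout"));
    });
    socket.on("error", reject);
  });
}

(async () => {
  for (const peer of peers) {
    try {
      console.log(`${peer}: ${(await connectTimeMs(peer)).toFixed(2)} ms`);
    } catch (err) {
      console.error(`${peer}: ${(err as Error).message}`);
    }
  }
})();
```

This only approximates latency via TCP connect time; it's meant as a starting point for a repeatable script, not a replacement for proper network benchmarking.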

How is red/black deployment strategy achieved?

I recently ran across this Netflix blog article: http://techblog.netflix.com/2013/08/deploying-netflix-api.html
They are talking about red/black deployment, where they run the old and new code side by side and direct production traffic to both of them. If something goes wrong, they roll back.
How does the directing of the traffic work, and is it possible to apply this strategy with, e.g., two Docker containers?
One way of directing traffic is using Weighted Routing, as you can do in AWS Route 53.
Initially, 100% of the traffic goes to the server(s) running the old code. Then you gradually shift some of the traffic to the server(s) running the new code.
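As a rough sketch of how that shift could be scripted (assuming the AWS SDK for JavaScript v3, with a placeholder hosted zone ID, record name, and IPs), something like the following upserts two weighted records for the same name and lets you adjust the split over time:

```typescript
// weighted-routing.ts - shift Route 53 weights between the old and new stacks
// (hypothetical hosted zone ID, record name, and IPs; substitute your own)
import {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} from "@aws-sdk/client-route-53";

const client = new Route53Client({});
const hostedZoneId = "Z0000000000000"; // placeholder
const recordName = "api.example.com";  // placeholder

async function setWeights(oldWeight: number, newWeight: number): Promise<void> {
  await client.send(
    new ChangeResourceRecordSetsCommand({
      HostedZoneId: hostedZoneId,
      ChangeBatch: {
        Changes: [
          {
            Action: "UPSERT",
            ResourceRecordSet: {
              Name: recordName,
              Type: "A",
              SetIdentifier: "old-stack", // servers running the old code
              Weight: oldWeight,
              TTL: 60,
              ResourceRecords: [{ Value: "198.51.100.10" }], // placeholder IP
            },
          },
          {
            Action: "UPSERT",
            ResourceRecordSet: {
              Name: recordName,
              Type: "A",
              SetIdentifier: "new-stack", // servers running the new code
              Weight: newWeight,
              TTL: 60,
              ResourceRecords: [{ Value: "198.51.100.20" }], // placeholder IP
            },
          },
        ],
      },
    })
  );
}

// e.g. start at (100, 0), then step through (90, 10), (50, 50), (0, 100)
setWeights(90, 10).catch(console.error);
```

The same change can of course be made from the console, the AWS CLI, or Terraform; the weight values are simply the knob you turn.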
Also, as you can read in this blog, you can use Docker to achieve it:
Even with the best testing, things can go wrong after deployment and a rollback may be required. Containers make this easy and we’ve brought similar tools to the operating system with Project Atomic. Red/Black deployments can be done throughout the entire stack with Atomic and Docker.
I think they use Spinnaker to implement a red/black strategy. https://spinnaker.io/docs/concepts/

Messaging between Torque jobs in a cluster

So, I need to submit computation-intensive jobs (deep neural network training) to a Torque cluster that I lease computation time on, and I need to exchange a few megabytes of large float arrays between the active nodes every few minutes, as the nodes need to be working on the most recent version of the neural network in order to train it well.
I was wondering if there are any good communication options, or at least a way to tell each active job its sister jobs' IPs so it can connect to them over TCP. The nodes don't have access to the internet, and we can't run daemons on the job-submitting server.
The only options that I see would be:
some message-passing option in Torque (I'm fairly new to Torque)
the very error-prone option of using files to communicate, which I hate.
a way to query the IPs of the active nodes from the server.
There are a variety of ways to exchange information between nodes on a cluster, depending on the architecture of the cluster. Torque is a resource manager, so if the job is being submitted to the cluster using a batch script, there are a few environment variables (such as $PBS_NODEFILE) that should be able to give you the hostnames or IP addresses of the nodes being used by a job.
The exact syntax for finding the IP addresses and/or hostnames will depend on the scheduler/workload manager being used with Torque on your cluster. This link has documentation for the PBS Works workload manager.
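For example, on a Torque cluster the allocated hosts are typically listed (one entry per allocated core) in the file that $PBS_NODEFILE points to. A minimal sketch of reading it, shown here with Node.js purely for illustration (the same idea applies in whatever language your jobs use):

```typescript
// nodelist.ts - discover the hosts Torque allocated to this job via $PBS_NODEFILE
import { readFileSync } from "node:fs";
import { hostname } from "node:os";

const nodeFile = process.env.PBS_NODEFILE;
if (!nodeFile) {
  throw new Error("PBS_NODEFILE is not set - not running inside a Torque job?");
}

// The node file typically lists one hostname per allocated core, so deduplicate.
const hosts = Array.from(
  new Set(
    readFileSync(nodeFile, "utf8")
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.length > 0)
  )
);

// Hostnames may be short or fully qualified; adjust the comparison as needed.
const self = hostname();
const peers = hosts.filter((h) => h !== self && !h.startsWith(`${self}.`));

console.log(`this node: ${self}`);
console.log(`sister nodes to connect to over TCP: ${peers.join(", ")}`);
```

Once each job knows its sister hosts, it can open plain TCP connections to them without any daemon on the submission server.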
Parallel communication between nodes can be achieved in a variety of ways and will be partly dependent on the hardware available in the cluster. Using MPI is one of the most common ways to parallelize code for use on a cluster, and many implementations support multiple high-performance fabrics/interconnects such as InfiniBand. Some useful introductions to the different types of parallelism can be found here.
As an alternative to MPI, Remote Direct Memory Access (RDMA) can be used to pass and access information between nodes. If the cluster has InfiniBand network adapters, looking into the IB verbs API from the vendor would be an additional option for passing data between nodes.

How to deploy Node.js in cloud for high availability using multi-core, reverse-proxy, and SSL

I have posted this to ServerFault, but the Node.js community seems tiny there, so I'm hoping this bring more exposure.
I have a Node.js (0.4.9) application and am researching how to best deploy and maintain it. I want to run it in the cloud (EC2 or RackSpace) with high availability. The app should run on HTTPS. I'll worry about East/West/EU full-failover later.
I have done a lot of reading about keep-alive (Upstart, Forever), multi-core utilities (Fugue, multi-node, Cluster), and proxy/load balancers (node-http-proxy, nginx, Varnish, and Pound). However, I am unsure how to combine the various utilities available to me.
I have this setup in mind and need to iron out some questions and get feedback.
Cluster is the most actively developed and seemingly popular multi-core utility for Node.js, so use that to run 1 node "cluster" per app server on a non-privileged port (say 3000). Q1: Should Forever be used to keep the cluster alive, or is that just redundant?
Use 1 nginx per app server running on port 80, simply reverse proxying to node on port 3000. Q2: Would node-http-proxy be more suitable for this task even though it doesn't gzip or serve static files quickly?
Have a minimum of 2x servers as described above, with an independent server acting as a load balancer across these boxes. Use Pound listening on 443 to terminate HTTPS and pass HTTP to Varnish, which would round-robin load balance across the IPs of the servers above. Q3: Should nginx be used to do both instead? Q4: Should an AWS or RackSpace load balancer be considered instead (the latter doesn't terminate HTTPS)?
General Questions:
Do you see a need for (2) above at all?
Where is the best place to terminate HTTPS?
If WebSockets are needed in the future, what nginx substitutions would you make?
I'd really like to hear how people are setting up current production environments and which combination of tools they prefer. Much appreciated.
It's been several months since I asked this question, and there hasn't been a lot of answer flow. Both Samyak Bhuta and nponeccop had good suggestions, but I wanted to discuss the answers I've found to my questions.
Here is what I've settled on at this point for a production system, but further improvements are always being made. I hope it helps anyone in a similar scenario.
Use Cluster to spawn as many child processes as you desire to handle incoming requests on multi-core virtual or physical machines. This binds to a single port and makes maintenance easier. My rule of thumb is n - 1 Cluster workers. You don't need Forever on this, as Cluster respawns worker processes that die. To have resiliency even at the Cluster parent level, ensure that you use an Upstart script (or equivalent) to daemonize the Node.js application, and use Monit (or equivalent) to watch the PID of the Cluster parent and respawn it if it dies. You can try using the respawn feature of Upstart, but I prefer having Monit watching things, so rather than split responsibilities, I find it's best to let Monit handle the respawn as well.
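For reference, a minimal sketch of that Cluster setup (written against the current Node.js cluster API rather than the 0.4.x-era module, so treat it as illustrative):

```typescript
// app.ts - one Cluster parent, n - 1 workers, all sharing port 3000
import cluster from "node:cluster";
import http from "node:http";
import os from "node:os";

const PORT = 3000; // the non-privileged port nginx proxies to
const workerCount = Math.max(1, os.cpus().length - 1); // the "n - 1" rule of thumb

if (cluster.isPrimary) {
  for (let i = 0; i < workerCount; i++) {
    cluster.fork();
  }
  // Cluster itself respawns dead workers, so Forever is redundant at this level.
  cluster.on("exit", (worker, code, signal) => {
    console.log(`worker ${worker.process.pid} died (${signal ?? code}), forking a new one`);
    cluster.fork();
  });
} else {
  // Every worker listens on the same port; the primary hands out connections.
  http
    .createServer((req, res) => {
      res.writeHead(200, { "Content-Type": "text/plain" });
      res.end(`handled by pid ${process.pid}\n`);
    })
    .listen(PORT);
}
```

Upstart/Monit sit outside this script, watching the primary process as described above.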
Use 1 nginx per app server running on port 80, simply reverse proxying to your Cluster on whatever port you bound to in (1). node-http-proxy can be used, but nginx is more mature, more featureful, and faster at serving static files. Run nginx lean (don't log, don't gzip tiny files) to minimize its overhead.
Have a minimum of 2x servers as described above, in a minimum of 2 availability zones, and if in AWS, use an ELB that terminates HTTPS/SSL on port 443 and communicates over HTTP on port 80 to the node.js app servers. ELBs are simple and, if you desire, make it somewhat easier to auto-scale. You could run multiple nginx instances, either sharing an IP or round-robin balanced by your DNS provider, but I found this overkill for now. At that point, you'd remove the nginx instance on each app server.
I have not needed WebSockets so nginx continues to be suitable and I'll revisit this issue when WebSockets come into the picture.
Feedback is welcome.
You should not bother about serving static files quickly. If your load is small, node static file servers will do. If your load is big, it's better to use a CDN (Akamai, Limelight, CoralCDN).
Instead of forever you can use monit.
Instead of nginx you can use HAProxy. It is known to work well with websockets. Consider also proxying flash sockets as they are a good workaround until websocket support is ubiquitous (see socket.io).
HAProxy has some support for HTTPS load balancing, but not termination. You can try to use stunnel for HTTPS termination, but I think it's too slow.
Round-robin (or other statistical) load balancing works pretty well in practice, so there's no need to know about other servers' load in most cases.
Consider also using ZeroMQ or RabbitMQ for communications between nodes.
This is an excellent thread! Thanks to everyone that contributed useful information.
I've been dealing with the same issues the past few months setting up the infrastructure for our startup.
As people mentioned previously, we wanted a Node environment with multi-core support + web sockets + vhosts
We ended up creating a hybrid between the native cluster module and http-proxy and called it Drone - of course it's open sourced:
https://github.com/makesites/drone
We also released it as an AMI with Monit and Nginx
https://aws.amazon.com/amis/drone-server
I found this thread while researching how to add SSL support to Drone - thanks for recommending ELB, but I wouldn't rely on a proprietary solution for something so crucial.
Instead I extended the default proxy to handle all the SSL requests. The configuration is minimal while the SSL requests are converted to plain HTTP - but I guess that's preferable when you're passing traffic between ports...
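The general shape of such an SSL-terminating proxy (a sketch of the idea only, not Drone's actual code, assuming the http-proxy package and placeholder certificate paths and backend target) looks roughly like this:

```typescript
// ssl-proxy.ts - terminate HTTPS and pass plain HTTP to the app on port 3000
// (placeholder key/cert paths and backend target; substitute your own)
import https from "node:https";
import { readFileSync } from "node:fs";
import httpProxy from "http-proxy";

const proxy = httpProxy.createProxyServer({});

const tlsOptions = {
  key: readFileSync("/etc/ssl/private/app.key"),  // placeholder path
  cert: readFileSync("/etc/ssl/certs/app.crt"),   // placeholder path
};

https
  .createServer(tlsOptions, (req, res) => {
    // TLS is decrypted here; the backend only ever sees plain HTTP.
    proxy.web(req, res, { target: "http://127.0.0.1:3000" });
  })
  .listen(443);
```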
Feel free to look into it and let me know if it fits your needs. All feedback welcomed.
I have seen the combination of an AWS load balancer for load balancing and termination, node-http-proxy as a reverse proxy if you want to run multiple services per box, and cluster.js for multi-core support and process-level failover, doing extremely well.
forever.js on top of cluster.js could be a good option if you want to take extreme care in terms of failover, but that's hardly needed.