Running E2E conformance tests within a custom cluster as a Job - Kubernetes

I am thinking of running an e2e test as a Job within my Kubernetes cluster. So far I have the Makefile and the Docker image created and pushed to AWS. There are some tests that are currently failing, which I am trying to debug. Apart from that, is there anything else I need to be aware of? Any tips, hints, or resources will be greatly appreciated.
Thank you.

I've run Sonobuoy on a cluster while running some simple load-testing in parallel.
I'm not sure exactly which tests Sonobuoy ran at the time, but it was obvious that the HTTP server I was testing had degraded performance.
So: if you want E2E tests running alongside production workloads, take into account that they might affect your users and services.
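For the mechanics of the run itself, a one-shot batch/v1 Job is usually all you need. Here is a minimal sketch, assuming placeholder values for the image you pushed to AWS and its entrypoint:

apiVersion: batch/v1
kind: Job
metadata:
  name: e2e-conformance
spec:
  backoffLimit: 0          # don't retry a failed test run
  template:
    spec:
      restartPolicy: Never # let the pod exit when the tests finish
      containers:
      - name: e2e
        image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/e2e-tests:latest  # placeholder
        command: ["/run-e2e.sh"]                                              # placeholder

Results then come from the Job's pod logs. Also bear in mind that conformance tests typically need a service account with fairly broad RBAC permissions, since they create and delete resources across namespaces.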

Related

How do you deploy GeoIP on ECS Fargate?

How do I productionise https://hub.docker.com/r/fiorix/freegeoip so that it is launched as a Fargate task, and how do I take care of the geoipupdate functionality so that the GeoLite2-City.mmdb is updated in the task?
I have the required environment variables (GEOIPUPDATE_ACCOUNT_ID, GEOIPUPDATE_LICENSE_KEY and GEOIPUPDATE_EDITION_IDS), but I could not understand the deployment flow, as there are two separate Dockerfiles/images: one for geoip and one for geoipupdate.
Has someone deployed this on Fargate? If so, could you please list the high-level steps? I have already tried researching whether such a thing has been deployed on ECS, but I can only find examples for Lambda and EC2.
Thanks

How to start pods on demand in Kubernetes?

I need to create pods on demand in order to run a program. It will run according to demand, so it could be that nothing runs for 5 hours and then 10 requests need processing, and I might need to limit it so that only 5 run simultaneously because of resource limitations.
I am not sure how to build such a thing in Kubernetes.
Also worth noting: I would like to create a new Docker container for each run, and the container should exit when the run ends.
There are many options and you'll need to try them out. The core tool is the HorizontalPodAutoscaler. Systems like KEDA build on top of it to make metrics easier to manage. There are also serverless tools like Knative or Kubeless, and workflow tools like Tekton, Dagster, or Argo.
It really depends on your specifics.
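As a concrete sketch of the simplest case: if the work arrives as discrete runs, a plain batch/v1 Job already expresses "process 10 requests, at most 5 at a time, one fresh container per run" (image and command are placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: on-demand-batch
spec:
  completions: 10    # total runs to perform
  parallelism: 5     # never more than 5 pods at once
  template:
    spec:
      restartPolicy: Never                         # each container runs once and exits
      containers:
      - name: worker
        image: registry.example.com/worker:latest  # placeholder
        command: ["python", "process.py"]          # placeholder

Whatever receives the requests (your own service, or one of the tools above) would create a Job like this on demand; while there are no Jobs, nothing runs.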

Timeout waiting for network interface provisioning to complete

Does anyone know why an ECS Fargate task would fail with this error?
"Timeout waiting for network interface provisioning to complete." I am running an ECS Fargate task using Step Functions. The IAM role for the Step Function has access to the task definition, and the state machine code also looks good. The same step function worked fine before, but I ran into this error just now. Why would this happen? Is it occasional?
According to AWS support, intermittent failures of this nature are to be expected (with relatively low probability).
The recommendation was to set retryAttempts > 1 to handle these situations.
This can happen if there are problems within AWS. You can view the Network Interfaces page on the EC2 console and you may see errors loading, which is an indication of API problems within EC2. You can also check status.aws.amazon.com to look for errors. Note that AWS can be slow to acknowledge problems there, so you may experience the errors before they update the status page!
AWS support has a detailed post on resolving network interface provisioning errors for ECS on Fargate. Here's an excerpt:
If the Fargate service tries to attach an elastic network interface to the underlying infrastructure that the task is meant to run on, then you can receive the following error message: "Timeout waiting for network interface provisioning to complete."
Fargate faces intermittent API issues, usually while tasks are spinning up from Step Functions and AWS Batch jobs. As recommended in another answer, you can update MaxAttempts for the retry in the state machine definition:
"Retry": [
{
"MaxAttempts": 3,
}
]
Additionally, the IntervalSeconds and BackoffRate fields above give the retries exponential backoff, so each reattempt waits longer than the last.
I was hitting the same issue until I switched over to Fargate platform version 1.4.0.
It looks like there were some changes made to the networking side of things.
https://aws.amazon.com/blogs/containers/aws-fargate-launches-platform-version-1-4/
The default version is currently still 1.3.0, so maybe give 1.4.0 a try and see if it fixes things for you.
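If you would rather pin the platform version than rely on the default, the Step Functions ECS integration accepts it as a task parameter. A fragment, with placeholder cluster and task definition names:

"Parameters": {
  "Cluster": "my-cluster",
  "TaskDefinition": "my-task-def",
  "LaunchType": "FARGATE",
  "PlatformVersion": "1.4.0"
}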

Best way to deploy long-running high-compute app to GCP

I have a python app that builds a dataset for a machine learning task on GCP.
Currently I have to start an instance of a VM that we have, SSH in, and run the app, which completes in 2-24 hours depending on the size of the dataset requested.
Once the dataset is complete, the VM needs to be shut down so we don't incur additional charges.
I am looking to streamline this process as much as possible, so that we have a "1 click" or "1 command" solution, but I'm not sure the best way to go about it.
From what I've read so far it seems like containers might be a good way to go, but I'm inexperienced with Docker.
Can I set up a container that will pip install the latest app from our private GitHub and execute the dataset build before shutting down? How would I pass information to the container, such as where to get the config file? It's conceivable that we will have multiple datasets being generated at the same time based on different config files.
Is there a better gcloud feature that suits our purpose more effectively than containers?
I'm struggling to get information regarding these basic questions, it seems like container tutorials are dominated by web apps.
It would be useful to have a batch-like container service that runs a container until its process completes. I'm unsure whether such a service exists. I'm most familiar with Google Cloud Platform and this provides a wealth of compute and container services. However -- to your point -- these predominantly scale by (HTTP) requests.
One possibility may be Cloud Run, with jobs triggered using Cloud Pub/Sub. I see there are async capabilities too, which may be interesting (I've not explored them).
Another runtime for you to consider is Kubernetes itself. Kubernetes requires some overhead in having Google, AWS, or Azure manage a cluster for you (I strongly recommend you don't run Kubernetes yourself), and there is some inertia in matching the capacity of the cluster's nodes to the needs of your jobs, but as you scale the number of jobs these needs smooth out. A big advantage of Kubernetes is that it will scale nodes and pods as you need them: you tell Kubernetes to run X container jobs, and it does so (and cleans up) without much additional management on your part.
I'm biased and approach the container vs. VM question mostly from a perspective of defaulting to container-first. In this case, you'd receive several benefits from containerizing your solution:
- reproducibility: the same image is more likely to produce the same results
- deployability: you run a container instead of managing an OS and app stack and testing for consistency
- maintainability: a smaller image representing just your app means less work to maintain
One (beneficial!?) workflow change, if you choose to use containers, is that you will need to build your images before using them. Something like Knative combines these steps, but I'd stick with doing this yourself initially. A common solution is to trigger builds (Docker, GitHub Actions, Cloud Build) from your source code repo. Commonly you would run tests against the images that are built, but you may also run your machine-learning tasks this way.
Your containers would contain only your code. When you build your container images, you would pip install, perhaps pip install --requirement requirements.txt, to pull in the appropriate packages. Your data (models?) are better kept separate from your code when this makes sense. When your runtime platform runs containers for you, you provide configuration information (environment variables and/or flags) to the container.
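To make that configuration point concrete, here is a minimal sketch of a Kubernetes Job whose container receives the config location as an environment variable (project, image, bucket path, and variable name are all placeholders; the same environment-variable idea applies on Cloud Run):

apiVersion: batch/v1
kind: Job
metadata:
  name: dataset-build
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: builder
        image: gcr.io/my-project/dataset-builder:latest   # placeholder image
        env:
        - name: CONFIG_URI                # placeholder; whatever your app reads
          value: gs://my-bucket/configs/run-001.yaml
        resources:
          requests:
            cpu: "4"                      # size to your 2-24 hour workload
            memory: 16Gi

Launching several of these with different CONFIG_URI values covers the "multiple datasets at the same time" case.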
A startup script seems to fit the bill better than containers here. The instance always executes startup scripts as root, so you can do anything you like in them.
A startup script will perform automated tasks every time your instance boots up. Startup scripts can perform many actions, such as installing software, performing updates, turning on services, and any other tasks defined in the script.
Keep in mind that a startup script cannot stop an instance, but you can stop an instance from within the guest operating system.
This would be the ideal solution for the question you posed. It would require a small change in your Python app so that the operating system shuts down when the dataset is complete.
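A minimal sketch of such a startup script, with placeholder repository and module names, reading the config location from custom instance metadata:

#!/bin/bash
# Runs as root on every boot of the instance.
# Assumes a Debian-family image; install the build prerequisites.
apt-get update && apt-get install -y python3-pip git
# Read the config location from custom metadata set when the instance was created.
CONFIG_URI=$(curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/config-uri")
# Install the latest app from the (placeholder) repo and run the build.
pip3 install --upgrade "git+https://github.com/example-org/dataset-builder.git"
python3 -m dataset_builder --config "$CONFIG_URI"
# Power off from the guest OS once the build completes, so no further charges accrue.
shutdown -h now

Setting the metadata when you start the VM then gives you the "1 command" goal: one gcloud compute instances create (or start) per dataset.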
Q1) Can I set up a container that will pip install the latest app from our private GitHub and execute the dataset build before shutting down?
A1) Medium has a great article on installing a package from a private Git repo inside a container. You can execute the dataset build before shutting down.
Q2) How would I pass information to the container such as where to get the config file etc?
A2) You can use ENV in the Dockerfile to set an environment variable, or pass one at run time with docker run --env. These will be available within the container.
You may consider looking into the Docker documentation for more information about containers.
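Putting A1 and A2 together, a minimal Dockerfile sketch (the repository URL and module name are placeholders, and a private repo would additionally need a deploy token or SSH key at build time):

FROM python:3.11-slim
# git is needed for pip's git+https installs.
RUN apt-get update && apt-get install -y --no-install-recommends git \
 && rm -rf /var/lib/apt/lists/*
# Install the app from the (placeholder) GitHub repository.
RUN pip install "git+https://github.com/example-org/dataset-builder.git"
# Default config location; override at run time, e.g. docker run -e CONFIG_URI=...
ENV CONFIG_URI=""
CMD ["python", "-m", "dataset_builder"]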

Kubernetes ad hoc command scheduler

I am a beginner in Kubernetes and have just started playing around with it. I have a use case where I need to run commands, taken in through a UI (from a server that can run inside or outside the cluster), in the Kubernetes cluster. Let's say the commands are Python scripts like HelloWorld.py. When I enter a command, the server should launch a container which runs the command and exits. How do I go about this in Kubernetes? What should the scheduler look like?
You can try the training classes on katacoda.com, as follows:
https://learn.openshift.com/
https://www.katacoda.com/courses/kubernetes
They're interactive hands-on courses, so they're an interesting and easy way to make sense of OpenShift and Kubernetes.
I hope this helps you. ;)
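Beyond the tutorials, the usual building block for "run this command in a fresh container, then exit" is a one-shot Kubernetes Job, created per submitted command. A minimal sketch with kubectl (names and command are placeholders):

# Launch a one-shot Job that runs the script in a new container and exits.
kubectl create job helloworld-run-1 --image=python:3.11-slim \
  -- python -c "print('Hello World')"
# Check the output, then clean up.
kubectl logs job/helloworld-run-1
kubectl delete job helloworld-run-1

A server inside the cluster would typically do the same through the Kubernetes API with a client library rather than shelling out to kubectl.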