Custom Image Pulled Every Time in Google Dataproc Serverless - google-cloud-dataproc

I am using a custom image in Dataproc Serverless. When I execute a job, the image is pulled every time, which adds about one minute of extra processing time. We will be executing over 1,000 jobs in production, so this becomes a significant performance bottleneck.
Is there any way to tell Dataproc to cache the image so that it does not pull it every time?
Pulling image us.gcr.io/docker_image:version
About to run 'docker pull us.gcr.io/docker_image:version' with retries...
1.5: Pulling from docker_image
5eb5b503b376: Already exists
7967823e23a4: Pulling fs layer
8d68a13eb796: Pulling fs layer
72ed51b4aa20: Pulling fs layer
7967823e23a4: Download complete
7967823e23a4: Pull complete
8d68a13eb796: Verifying Checksum
8d68a13eb796: Download complete
8d68a13eb796: Pull complete
72ed51b4aa20: Download complete
72ed51b4aa20: Pull complete

Not yet; this is a work in progress and should be available in a couple of months.
Update: image streaming support for container images hosted in Google Artifact Registry was released to GA on October 1st, 2022.

Related

Make Airflow KubernetesPodOperator clear image cache without setting image_pull_policy on DAG?

I'm running Airflow on Google Composer. My tasks are KubernetesPodOperators, and by default for Composer, they run with the Celery Executor.
I have just updated the Docker image that one of the KubernetesPodOperators uses, but my changes aren't being reflected. I think the cause is that it is using a cached version of the image.
How do I clear the cache for the KubernetesPodOperator? I know that I can set image_pull_policy=Always in the DAG, but I want it to use cached images in the future, I just need it to refresh the cache now.
Here is my KubernetesPodOperator (except for the commented line):
processor = kubernetes_pod_operator.KubernetesPodOperator(
    task_id='processor',
    name='processor',
    arguments=[filename],
    namespace='default',
    pool='default_pool',
    image='gcr.io/proj/processor',
    # image_pull_policy='Always'  # I want to avoid this so I don't have to update the DAG
)
Update - March 3, 2021
I still do not know how to make the worker nodes in Google Composer reload their images once while using the :latest tag on images (or using no tag, as the original question states).
I do believe that @rsantiago's comment would work, i.e. doing a rolling restart. A downside of this approach is that, by default in Composer, worker nodes run in the same node pool as the Airflow infrastructure itself. This means that a rolling restart could possibly affect the Airflow scheduling system, web interface, etc. as well, although I haven't tried it so I'm not sure.
The solution that my team has implemented is adding version numbers to each image release, instead of using no tag, or the :latest tag. This ensures that you know exactly which image should be running.
Another thing that has helped is adding core.logging_level=DEBUG to the "Airflow Configuration Overrides" section of Composer. This will output the command that launched the docker image. If you're using version tags as suggested, this will display that tag.
I would like to note that setting up local debugging has helped tremendously. I am using PyCharm with the docker image as a "Remote Interpreter" which allows me to do step-by-step debugging inside the image to be confident before I push a new version.
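As a small illustration of the versioned-tag policy described above, a check like the following (entirely illustrative, not part of Airflow or Composer) could reject unpinned image references before a DAG is deployed:

```python
def require_pinned_image(image: str) -> str:
    """Reject image references that use no tag or the mutable :latest tag.

    Helper name and policy are illustrative; adapt to your own deploy checks.
    """
    # Split off a possible tag; a '/' after the last ':' means the colon
    # belonged to a registry port (e.g. myregistry:5000/app), not a tag.
    name, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        raise ValueError(f"image {image!r} has no tag; pin a version")
    if tag == "latest":
        raise ValueError(f"image {image!r} uses :latest; pin a version")
    return image
```

Running this over every `image=` argument in a pre-deploy check would catch the untagged reference from the original question before it reaches the workers.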

In Terraform, is there a way to refresh the state of a resource using TF files without using CLI commands?

I have a requirement to refresh the state of a resource "ibm_is_image" using TF files, without using CLI commands.
I know that we can import the state of a resource using "terraform import", but I need to do the same using IaC in TF files.
How can I achieve this?
Example:
In workspace1, I create a resource "f5_custom_image", which later gets deleted from the command line. In workspace2, the same code in the TF file assumes that "f5_custom_image" already exists, and it fails to read the custom image resource. So my code has to refresh the Terraform state of this resource for every execution of "terraform apply":
resource "ibm_is_image" "f5_custom_image" {
  depends_on       = ["data.ibm_is_images.custom_images"]
  href             = "${local.image_url}"
  name             = "${var.vnf_vpc_image_name}"
  operating_system = "centos-7-amd64"

  timeouts {
    create = "30m"
    delete = "10m"
  }
}
In Terraform's model, an object is fully managed by a single Terraform configuration and nothing else. Having an object be managed by multiple configurations or having an object be created by Terraform but then deleted later outside of Terraform is not a supported workflow.
Terraform is intended for managing long-lived architecture that you will gradually update over time. It is not designed to manage build artifacts like machine images that tend to be created, used, and then destroyed.
The usual architecture for this sort of use-case is to consider the creation of the image as a "build" step, carried out using some other software outside of Terraform, and then we use Terraform only for the "deploy" step, at which point the long-lived infrastructure is either created or updated to use the new image.
That leads to a build and deploy pipeline with a series of steps like this:
Use separate image build software to construct the image, and record the id somewhere from which it can be retrieved using a data source in Terraform.
Run terraform apply to update the long-lived infrastructure to make use of the new image. The Terraform configuration should include a data block to read the image id from wherever it was recorded in the previous step.
If desired, destroy the image using software outside of Terraform once Terraform has completed.
When implementing a pipeline like this, it's optional but common to also consider a "rollback" process to use in case the new image is faulty:
Reset the recorded image id that Terraform is reading from back to the id that was stored prior to the new build step.
Run terraform apply to update the long-lived infrastructure back to using the old image.
Of course, supporting that would require retaining the previous image long enough to prove that the new image is functioning correctly, so the normal build and deploy pipeline would need to retain at least one historical image per run to roll back to. With that said, if you have a means to quickly recreate a prior image during rollback, then this special workflow isn't strictly needed: you can instead implement rollback by "rolling forward" to an image constructed with the prior configuration.
An example software package commonly used to prepare images for use with Terraform on other cloud vendors is HashiCorp Packer, but sadly it looks like it does not have IBM Cloud support and so you may need to look for some similar software that does support IBM Cloud, or write something yourself using the IBM Cloud SDK.

CannotPullContainerError: AWS Batch Job

I am trying to run a job in AWS Batch. This is my first attempt.
I have a python script which reads files from a S3 bucket, processes them and makes tables in RDS Postgres.
I have made a docker image with my script, pandas, boto3, SQLAlchemy and pushed it to hub.docker.com
When I try to run a job in AWS Batch, I get the error below:
CannotPullContainerError: Error response from daemon: pull access denied for *dockerimagename*, repository does not exist or may require 'docker login'
What is a possible solution? I am stuck with this for a long time.
I had this issue when I was only putting the image name in the Container Image field of the Job Definition. So I was putting:
*dockerimagename*
when I should have been putting:
0123456789.dkr.ecr.us-east-1.amazonaws.com/*dockerimagename*
You can get the first part of that by going to your ECR > Repositories in the AWS console and copying the link from there (there's even a button to do it).
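The full image reference follows a predictable pattern, so it can also be assembled from the account id, region, and repository name (the values below are placeholders):

```python
def ecr_image_uri(account_id: str, region: str,
                  repository: str, tag: str = "latest") -> str:
    """Build the fully qualified ECR image reference AWS Batch expects:
    <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
    """
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"

# ecr_image_uri("0123456789", "us-east-1", "myimage", "v1")
# -> "0123456789.dkr.ecr.us-east-1.amazonaws.com/myimage:v1"
```

Note that this only works for images pushed to ECR; an image that lives only on hub.docker.com either needs to be public or needs registry credentials configured for the job, which is why pushing it to ECR is the usual fix.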

Deployed jobs stopped working with an image error?

Within the last few hours, I have become unable to execute deployed Data Fusion pipeline jobs - they end in an error state almost instantly.
I can run the jobs in Preview mode, but when trying to run deployed jobs this error appears in the logs:
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Selected software image version '1.2.65-deb9' can no longer be used to create new clusters. Please select a more recent image
I've tried with both an existing instance and a new instance, and all deployed jobs including the sample jobs give this error.
Any ideas? I cannot find any config options for which image is used for execution.
We are currently investigating an issue with the image for Cloud Dataproc used by Cloud Data Fusion. We had pinned a version of Dataproc VM image for the launch that is causing an issue.
We apologize for your inconvenience. We are working to resolve the issue as soon as possible.
We will provide updates on this thread.
Nitin

Coordinating Job containers and Volumes in a CI system

I'm working on a tinker Kubernetes-based CI system, where each build gets launched as a Job. I'm running these much like Drone CI does, in that each step in the build is a separate container. In my k8s CI case, I'm running each step as a container within a Job pod. Here's the behavior I'm after:
- A build volume is created; all steps will mount it. A Job is fired off with all of the steps defined as separate containers, in order of desired execution.
- The git step (container) runs, mounting the shared volume and cloning the sources.
- The 'run tests' step mounts the shared volume to a container spawned from an image with all of the dependencies pre-installed.
- If our tests pass, we proceed to the Slack announcement step, which is another container that announces our success.
I'm currently using a single Job pod with an emptyDir Volume for the shared build space. I did this so that we don't have to wait while a volume gets shuffled around between nodes/Pods. This also seemed like a nice way to ensure that things get cleaned up automatically at build exit.
The problem becomes that if I fire up a multi-container Job with all of the above steps, they execute at the same time. Meaning the 'run tests' step could fire before the 'git' step.
I've thought about coming up with some kind of logic in each of these containers to sleep until a certain unlock/"I'm done!" file appears in the shared volume, signifying the dependency step(s) are done, but this seems complicated enough to ask about alternatives before proceeding.
I could see giving in and using multiple Jobs with a coordinating Job, but then I'm stuck getting into Volume Claim territory (which is a lot more complicated than emptyDir).
To sum up the question:
Is my current approach worth pursuing, and if so, how to sequence the Job's containers? I'm hoping to come up with something that will work on AWS/GCE and bare metal deployments.
I'm hesitant to touch PVCs, since the management and cleanup bit is not something I want my system to be responsible for. I'm also not wanting to require networked storage when emptyDir could work so well.
Edit: Please don't suggest using another existing CI system, as this isn't helpful. I am doing this for my own gratification and experimentation. This tinker CI system is unlikely to ever be anything but my toy.
If you want all the build steps to run in containers, GitLab CI or Concourse CI would probably be a much better fit for you. I don't have experience with fabric8.io, but Frank.Germain suggests that it will also work.
Once you start getting complex enough that you need signaling between containers to order the build steps it becomes much more productive to use something pre-rolled.
As an option, you could use a static volume (i.e. a host path) as an artifact cache, and trigger the next container in the sequence from the current container, mounting the same volume between the stages. You could then just add a step to the beginning or end of the build to 'clean up' after your pipeline has run.
To be clear: I don't think that rolling your own CI is the most effective way to handle this, as there are already systems in place that will do exactly what you are looking for