Create Dataproc cluster with custom image: Failed to resolve image version - google-cloud-dataproc

When trying to create Dataproc cluster with custom image I am getting this ...
│ Error: Error creating Dataproc cluster: googleapi: Error 400: Failed to resolve image version 'my-ubuntu18-custom'. Accepted image versions: [preview, 2.0-centos, 1.2-deb9, 1.3-debian9, 1.2-debian9, 1.5-debian10, 1.5-centos8, 2.0-ubuntu18, 1.0-debian9, 1.1-debian9, 1.5-ubuntu18, preview-centos8, preview-debian10, preview-ubuntu18, 2.0-debian, 1.4-debian9, preview-debian, 2.0-debian10, 1.1-debian, 1.0-debian, 1.2-debian, preview-ubuntu, 1.5-centos, 1.3-deb9, 1.4-ubuntu18, 1.5-ubuntu, preview-centos, 1.4-debian10, 1.1-deb9, 1.3-ubuntu, 1.4-ubuntu, 1.0, 1.1, 2.0, 1.2, 1.3, 1.3-debian10, 1.4, 2.0-ubuntu, 1.5, 1.4-debian, 1.5-debian, 1.0-deb9, 1.3-debian, 1.3-ubuntu18, 2.0-centos8]. See https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions for additional information on image versioning., badRequest
The custom images list shows the custom image however ...
$ gcloud compute images list --no-standard-images | grep NAME:
NAME: my-ubuntu18-custom

You have to use the --image instead of --image-version flag if creating the Dataproc cluster using gcloud, and you have to specify the full image URI instead of just the short name.
You can find the full URI by looking for the "selfLink" when getting the full details of the image:
gcloud compute images describe my-ubuntu18-custom | grep selfLink

Related

Github Action failing to Build Images for the plugins being used in workflow

I am trying to use a plugin in my eks based k8s cluster,
I am using a Github Action controller that spawns on demand Container as Self Hosted runner
When the Github action start this plugin or any other that needs to build itself as a docker image fails with below error, any thoughts or ideas ?
This is my self hosted runner image Link
FYI : If i run a standalone alpine container in the cluster all typical cmd works, and this also works with default ubuntu based self hosted runner, so i dont think its the cluster
/usr/local/bin/docker build -t 60e226:1b6fc15462134e6fb8520b7df48cf7fd -f "/runner/_work/_actions/aquasecurity/trivy-action/master/Dockerfile" "/runner/_work/_actions/aquasecurity/trivy-action/master"
Sending build context to Docker daemon 644.6kB
Step 1/5 : FROM ghcr.io/aquasecurity/trivy:0.[3](https://github.com//docker-images/actions/runs/4134005760/jobs/7147011143#step:3:3)7.1
0.37.1: Pulling from aquasecurity/trivy
c158987b0551: Pulling fs layer
67a7d067ef7d: Pulling fs layer[6]Download complete
67a7d067ef7d: Pull complete
2ec1cdd48f38: Verifying Checksum
2ec1cdd48f38: Download complete
2ec1cdd48f38: Pull complete
fe56e6aa700e: Pull complete
Digest: sha256:7c[16](https://github.com//docker-images/actions/runs/4134005760/jobs/7147011143#step:3:16)7f7f3002948f1ec099555aa968bd8b8b097780603a38cc801fe965da0a69
Status: Downloaded newer image for ghcr.io/aquasecurity/trivy:0.37.1
---> c3e68408cd24
Step 2/5 : COPY entrypoint.sh /
---> 1f1da443ea86
Step 3/5 : RUN apk --no-cache add bash curl npm
---> Running in 647f7f479cac
fetch https://dl-cdn.alpinelinux.org/alpine/v3.[17](https://github.com//docker-images/actions/runs/4134005760/jobs/7147011143#step:3:17)/main/x86_64/APKINDEX.tar.gz
48ABC73BEB7F0000:error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:[18](https://github.com//docker-images/actions/runs/4134005760/jobs/7147011143#step:3:18)89:
WARNING: Ignoring https://dl-cdn.alpinelinux.org/alpine/v3.17/main: Permission denied
fetch https://dl-cdn.alpinelinux.org/alpine/v3.17/community/x86_64/APKINDEX.tar.gz
48ABC73BEB7F0000:error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:1889:
WARNING: Ignoring https://dl-cdn.alpinelinux.org/alpine/v3.17/community: Permission denied
ERROR: unable to select packages:
bash (no such package):
required by: world[bash]
curl (no such package):
required by: world[curl]
npm (no such package):
required by: world[npm]
The command '/bin/sh -c apk --no-cache add bash curl npm' returned a non-zero code: 3
Warning: Docker build failed with exit code 3, back off 6.807 seconds before retry.
It was expected to build the docker image and proceed with the github action workflow
Tried different flavors of image and nothing worked except for ubunut-latest
the plugin in question
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action#master
with:
image-ref: 'test:latest'
format: 'table'
exit-code: '1'
ignore-unfixed: true
vuln-type: 'os,library'
severity: 'CRITICAL,HIGH'

Docker compose to AWS ECS fails at the end

Im publishing a project via docker compose to AWS ECR but it fails on the last couple of steps. Its based on the new "docker compose" integration with an AWS context
The error i receive is:
MicroservicedocumentGeneratorService TaskFailedToStart: ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post https://api.ecr....
The image is in an ECR private repository along with the others from the compose file.
I have authenticated with:
aws ecr get-login-password
The docker compose is:
microservice_documentGenerator:
image: xxx.dkr.ecr.xxx.amazonaws.com/microservice_documentgenerator:latest
networks:
- publicnet
The original dockerfile is
FROM openjdk:11-jdk-slim
COPY /Microservice.DocumentGenerator/Microservice.DocumentGenerator.jar app.jar
ENTRYPOINT ["java","-jar","/app.jar"]
The output for before the error was:
[+] Running 54/54
- projext DeleteComplete 355.3s
- PublicnetNetwork DeleteComplete 310.5s
- LogGroup DeleteComplete 306.1s
- MicroservicedocumentGeneratorTaskExecutionRole DeleteComplete 272.2s
- MicroservicedocumentGeneratorTaskDefinition Del... 251.2s
- MicroservicedocumentGeneratorServiceDiscoveryEntry DeleteComplete 220.1s
- MicroservicedocumentGeneratorService DeleteComp... 211.9s
try authentication with:
aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
Plus can you mention from where you are making the call and if the server has the permission to make the call to ECR?

Airflow- dag_id could not be found issue when using kubernetes executor

I am using airflow stable helm chart and using Kubernetes Executor, new pod is being scheduled for dag but its failing with dag_id could not be found issue. I am using git-sync to get dags. Below is the error and kubernetes config values. Can someone please help me resolve this issue?
Error:
[2020-07-01 23:18:36,939] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-07-01 23:18:36,940] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags/dags/etl/sampledag_dag.py
Traceback (most recent call last):
File "/home/airflow/.local/bin/airflow", line 37, in <module>
args.func(args)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/utils/cli.py", line 75, in wrapper
return f(*args, **kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/bin/cli.py", line 523, in run
dag = get_dag(args)
File "/home/airflow/.local/lib/python3.6/site-packages/airflow/bin/cli.py", line 149, in get_dag
'parse.'.format(args.dag_id))
airflow.exceptions.AirflowException: dag_id could not be found: sampledag . Either the dag did not exist or it failed to parse.
Config:
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: false
AIRFLOW__KUBERNETES__GIT_REPO: git#git.com/dags.git
AIRFLOW__KUBERNETES__GIT_BRANCH: master
AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT: /dags
AIRFLOW__KUBERNETES__GIT_SSH_KEY_SECRET_NAME: git-secret
AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY: airflow-repo
AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG: tag
AIRFLOW__KUBERNETES__RUN_AS_USER: "50000"
sampledag
import logging
import datetime
from airflow import models
from airflow.contrib.operators import kubernetes_pod_operator
import os
args = {
'owner': 'airflow'
}
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
try:
print("Entered try block")
with models.DAG(
dag_id='sampledag',
schedule_interval=datetime.timedelta(days=1),
start_date=YESTERDAY) as dag:
print("Initialized dag")
kubernetes_min_pod = kubernetes_pod_operator.KubernetesPodOperator(
# The ID specified for the task.
task_id='trigger-task',
# Name of task you want to run, used to generate Pod ID.
name='trigger-name',
namespace='scheduler',
in_cluster = True,
cmds=["./docker-run.sh"],
is_delete_operator_pod=False,
image='imagerepo:latest',
image_pull_policy='Always',
dag=dag)
print("done")
except Exception as e:
print(str(e))
logging.error("Error at {}, error={}".format(__file__, str(e)))
raise
I had the same issue. I solved it by adding the following to my config:
AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH: repo/
What was happening is that the init container will download your dags in [AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT]/[AIRFLOW__KUBERNETES__GIT_SYNC_DEST] and AIRFLOW__KUBERNETES__GIT_SYNC_DEST by default is repo (https://airflow.apache.org/docs/stable/configurations-ref.html#git-sync-dest)
I am guessing that the problem could be incurred from the difference in your setup that causes: /opt/airflow/dags/dags/etl/sampledag_dag.py and AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT: /dags
I'd double check that these are what you want, and are what you expect.
I was facing the same issue while trying to use Kubernetes Executor using stable helm airflow chart. In my case, I was able to resolve it by changing
AIRFLOW__KUBERNETES__RUN_AS_USER: "50000" to AIRFLOW__KUBERNETES__GIT_SYNC_RUN_AS_USER: "65533" in the env section of helm chart.
Same value is mentioned in this link
I came to this conclusion as the init container (git sync) which was running before the temporary worker pod came up was not able to clone/sync the git dags to the worker pods. In my case, there was a permissions error (even when kube secret for ssh clone was correctly passed)
Note:
the git-sync init container returns no error even if it fails to fetch the DAGs
Kubernetes debugging information for init containers
kubectl get pods -n [NAMESPACE]
kubectl logs -n [NAMESPACE] [POD_ID] -c git-sync
Getting the same issue, I solved it with the suggestion from #gtrip to set the UID of the git-sync run user to 65533.
I would add the following debug hints:
the git-sync init container returns no error even if it fails to fetch the DAGs
Kubernetes debugging information for init containers
kubectl get pods -n [NAMESPACE]
kubectl logs -n [NAMESPACE] [POD_ID] -c git-sync

Bazel Kubernetes Object Error: no objects passed to apply (Google Container Registry)

I have a k8s_object rule to apply a deployment to my Google Kubernetes Cluster. Here is my setup:
load("#io_bazel_rules_docker//nodejs:image.bzl", "nodejs_image")
nodejs_image(
name = "image",
data = [":lib", "//:package.json"],
entry_point = ":index.ts",
)
load("#io_bazel_rules_k8s//k8s:object.bzl", "k8s_object")
k8s_object(
name = "k8s_deployment",
template = ":gateway.deployment.yaml",
kind = "deployment",
cluster = "gke_cents-ideas_europe-west3-b_cents-ideas",
images = {
"gcr.io/cents-ideas/gateway:latest": ":image"
},
)
But when I run bazel run //services/gateway:k8s_deployment.apply, I get the following error
INFO: Analyzed target //services/gateway:k8s_deployment.apply (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //services/gateway:k8s_deployment.apply up-to-date:
bazel-bin/services/gateway/k8s_deployment.apply
INFO: Elapsed time: 0.113s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Build completed successfully, 1 total action
$ /snap/bin/kubectl --kubeconfig= --cluster=gke_cents-ideas_europe-west3-b_cents-ideas --context= --user= apply -f -
2020/02/12 14:52:44 Unable to publish images: unable to publish image gcr.io/cents-ideas/gateway:latest
error: no objects passed to apply
error: no objects passed to apply
It doesn't push the new image to the Google Container Registry.
Strangely, this worked a few days ago. But I didn't change anything.
Here is the full code if you need to take a closer look: https://github.com/flolude/cents-ideas/blob/069c773ade88dfa8aff492f024a1ade1f8ed282e/services/gateway/BUILD
Update
I don't know if this has something to do with this issue but when I run
gcloud auth configure-docker
I get some warnings:
WARNING: `docker-credential-gcloud` not in system PATH.
gcloud's Docker credential helper can be configured but it will not work until this is corrected.
WARNING: Your config file at [/home/flolu/.docker/config.json] contains these credential helper entries:
{
"credHelpers": {
"asia.gcr.io": "gcloud",
"staging-k8s.gcr.io": "gcloud",
"us.gcr.io": "gcloud",
"gcr.io": "gcloud",
"marketplace.gcr.io": "gcloud",
"eu.gcr.io": "gcloud"
}
}
Adding credentials for all GCR repositories.
WARNING: A long list of credential helpers may cause delays running 'docker build'. We recommend passing the registry name to configure only the registry you are using.
gcloud credential helpers already registered correctly.
I had google-cloud-sdk installed via snap install. What I did to make it work is to remove google-cloud-sdk via
snap remove google-cloud-sdk
and then followed those instructions to install it via
sudo apt install google-cloud-sdk
Now it works fine

Spinnaker "Create Application" menu doesn't load

I'm quite new to the Spinnaker and have to ask for some help I guess. Does anyone knows why it could be that I can't create any Application and just keep seeing this screen.
My installation is through Halyard 1.5.0 and Ubuntu 14.04.
We don't use any cloud provider but I did configure Docker and Kubernetes part
And here is the error I see in the /var/log/spinnaker/echo/echo.log:
2017-11-16 13:52:29.901 INFO 13877 --- [ofit-/pipelines] c.n.s.echo.services.Front50Service : java.net.SocketTimeoutException: timeout
at okio.Okio$3.newTimeoutException(Okio.java:207)
at okio.AsyncTimeout.exit(AsyncTimeout.java:261)
at okio.AsyncTimeout$2.read(AsyncTimeout.java:215)
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:306)
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:300)
at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:196)
at com.squareup.okhttp.internal.http.Http1xStream.readResponse(Http1xStream.java:186)
at com.squareup.okhttp.internal.http.Http1xStream.readResponseHeaders(Http1xStream.java:127)
at com.squareup.okhttp.internal.http.HttpEngine.readNetworkResponse(HttpEngine.java:739)
at com.squareup.okhttp.internal.http.HttpEngine.access$200(HttpEngine.java:87)
at com.squareup.okhttp.internal.http.HttpEngine$NetworkInterceptorChain.proceed(HttpEngine.java:724)
at com.squareup.okhttp.internal.http.HttpEngine.readResponse(HttpEngine.java:578)
at com.squareup.okhttp.Call.getResponse(Call.java:287)
at com.squareup.okhttp.Call$ApplicationInterceptorChain.proceed(Call.java:243)
at com.squareup.okhttp.Call.getResponseWithInterceptorChain(Call.java:205)
at com.squareup.okhttp.Call.execute(Call.java:80)
at retrofit.client.OkClient.execute(OkClient.java:53)
at retrofit.RestAdapter$RestHandler.invokeRequest(RestAdapter.java:326)
at retrofit.RestAdapter$RestHandler.access$100(RestAdapter.java:220)
at retrofit.RestAdapter$RestHandler$1.invoke(RestAdapter.java:265)
at retrofit.RxSupport$2.run(RxSupport.java:55)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at retrofit.Platform$Base$2$1.run(Platform.java:94)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:204)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at okio.Okio$2.read(Okio.java:139)
at okio.AsyncTimeout$2.read(AsyncTimeout.java:211)
... 24 more
2017-11-16 13:52:29.901 INFO 13877 --- [ofit-/pipelines] c.n.s.echo.services.Front50Service : ---- END ERROR
#grizzthedj
thanks again for recommendations. It doesn't seem, however, solved the issue. I wonder if it has something to do with my Docker Registry or Kubernetes.
Here is what I have in my .hal/config:
dockerRegistry:
enabled: true
accounts:
- name: <hidden-name>
requiredGroupMembership: []
address: https://docker-registry.<hidden-name>.net/
cacheIntervalSeconds: 30
repositories:
- hellopod
- demoapp
primaryAccount: <hidden-name>
kubernetes:
enabled: true
accounts:
- name: <username>
requiredGroupMembership: []
dockerRegistries:
- accountName: <hidden-name>
namespaces: []
context: sre-os1-dev
namespaces:
- spinnaker
omitNamespaces: []
kubeconfigFile: /home/<username>/.kube/config
I suspect you may be using redis as the persistent storage type(I ran into the same issue).
If this is the case, persistent storage using redis doesn't seem to be working properly out-of-the-box, and it is not supported. I would try using an S3 target, if available.
More info here on support for redis
To configure S3 using Halyard, use the following commands:
echo <SECRET_ACCESS_KEY> | hal config storage s3 edit --access-key-id <ACCESS_KEY_ID> --endpoint <S3_ENDPOINT> --bucket <BUCKET_NAME> --root-folder spinnaker --secret-access-key
hal config storage edit --type s3
hal deploy apply
#grizzthedj,
Here is what I've found inside front50.log (I wiped out ID's of course for security reasons)
You may be right.
2017-11-20 12:40:29.151 INFO 682 --- [0.0-8080-exec-1] com.amazonaws.latency : ServiceName=[Amazon S3], AWSErrorCode=[NoSuchKey], StatusCode=[404], ServiceEndpoint=[https://s3-us-west-2.amazonaws.com], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: ...; S3 Extended Request ID: ...), S3 Extended Request ID: ...], RequestType=[GetObjectRequest], AWSRequestID=[...], HttpClientPoolPendingCount=0, RetryCapacityConsumed=0, HttpClientPoolAvailableCount=1, RequestCount=1, Exception=1, HttpClientPoolLeasedCount=0, ClientExecuteTime=[39.634], HttpClientSendRequestTime=[0.072], HttpRequestTime=[39.213], RequestSigningTime=[0.067], CredentialsRequestTime=[0.001, 0.0], HttpClientReceiveResponseTime=[39.059],
I had a similar issue on kubernetes/aws, when I opened up the chrome dev console I was getting lots of 404 errors trying to connect to localhost:8084, I had to reconfigure the deck and gate baseurls. This is what I did using halyard:
hal config security ui edit --override-base-url http://<deck-loadbalancer-dns-entry>:9000
hal config security api edit --override-base-url http://<gate-loadbalancer-dns-entry>:8084
i did hal deploy apply and when it came back I noticed the developer console was throwing cors errors so I had to do the following.
echo "host: 0.0.0.0" | tee \ ~/.hal/default/service-settings/gate.yml \ ~/.hal/default/service-settings/deck.yml
You may note the lack of TLS and cors config, this is a test system so make better choices in production :)