Deployed jobs stopped working with an image error? - google-cloud-data-fusion

For the last few hours I have been unable to execute deployed Data Fusion pipeline jobs - they just end in an error state almost instantly.
I can run the jobs in Preview mode, but when trying to run deployed jobs this error appears in the logs:
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Selected software image version '1.2.65-deb9' can no longer be used to create new clusters. Please select a more recent image
I've tried with both an existing instance and a new instance, and all deployed jobs including the sample jobs give this error.
Any ideas? I cannot find any config options for what image is used for execution.

We are currently investigating an issue with the Cloud Dataproc image used by Cloud Data Fusion. We had pinned a version of the Dataproc VM image for the launch, and that version is causing the issue.
We apologize for the inconvenience. We are working to resolve the issue as soon as possible.
We will post updates on this thread.
Nitin
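While this is being fixed, a possible workaround (a sketch only, not verified): deployed pipelines can be started through the CDAP REST API with runtime arguments, and the Dataproc provisioner reportedly exposes an imageVersion property that can be overridden per run via the system.profile.properties. prefix. The endpoint, token, pipeline name, and image version below are placeholders, and the property name itself is an assumption:

import requests

# All values below are placeholders; adjust for your instance.
CDAP_ENDPOINT = "https://<your-data-fusion-endpoint>/api"
ACCESS_TOKEN = "<output of: gcloud auth print-access-token>"
PIPELINE = "my_pipeline"

# Start the deployed pipeline's workflow with runtime arguments.
url = (f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/{PIPELINE}"
       "/workflows/DataPipelineWorkflow/start")
runtime_args = {
    # Assumption: the Dataproc provisioner accepts an imageVersion
    # property that this prefix overrides for a single run.
    "system.profile.properties.imageVersion": "1.3-deb9",
}
resp = requests.post(url, json=runtime_args,
                     headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
resp.raise_for_status()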

Related

CloudFormation: stack is stuck, CloudTrail events shows repeating DeleteNetworkInterface event

I am deploying a stack with CDK. It gets stuck in CREATE_IN_PROGRESS, and CloudTrail shows these events repeating:
DeleteNetworkInterface
CreateLogStream
What should I look at next to continue debugging? Is there a known reason for this to happen?
I also saw the exact same issue with the deployment of a CDK-based ECS/Fargate deployment.
In my instance, I was able to diagnose the issue by following the AWS support article https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-stack-stuck-progress/
What specifically diagnosed and then resolved it for me:
I updated my ECS service to set its desired task count to 0. At that point the CloudFormation stack completed successfully.
From that, it became obvious that the actual issue was related to the creation of the initial task for my ECS service. I was able to diagnose it by reviewing the output in the Deployments and Events tabs of the ECS service in the AWS Management Console. In my case, the task creation was failing because of an issue with accessing the associated ECR repository. There could obviously be other reasons, but they should show up there.
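If you'd rather script the scale-down than click through the console, a minimal boto3 sketch (cluster, service, and region names are placeholders):

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")  # placeholder region

# Scale the service to zero tasks so the stuck CloudFormation
# rollout can complete; names are placeholders.
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    desiredCount=0,
)

Once the stack finishes, set desiredCount back to its original value.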

AWS ECS won't start tasks: http request timed out enforced after 4999ms

I have an ECS cluster (Fargate), task, and service that I have had set up in Terraform for at least a year. I haven't touched it in a long while. My normal deployment for updating the code is to push a new container to the registry and then stop all tasks on the cluster with a script. Today, my service did not run a new task in response to the tasks being stopped. Its desired count is fixed, so it should have.
I went in and tried to run the task manually, and I'm seeing this error:
Unable to run task
Http request timed out enforced after 4999ms
When I try to do this, a new stopped task is added to my stopped tasks list. When I look into such a task, the stopped reason is "Deployment restart", and two of them are now showing "Task provisioning failed.", which I think might be tasks the service tried to start. Those tasks do not show a started timestamp; the ones I start in the console do.
My site is now down and I can't get it back up. Does anyone know of a way to debug this? Is AWS ECS experiencing problems right now? I checked the health monitors and I see no issues.
This was an AWS outage affecting Fargate in us-east-1. It's fixed now.
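For anyone who lands here later and wants to rule out their own configuration first, the stopped-task reasons can be pulled programmatically. A minimal boto3 sketch (the cluster name is a placeholder):

import boto3

ecs = boto3.client("ecs")
cluster = "my-cluster"  # placeholder

# List recently stopped tasks and print why each one stopped.
task_arns = ecs.list_tasks(cluster=cluster, desiredStatus="STOPPED")["taskArns"]
if task_arns:
    tasks = ecs.describe_tasks(cluster=cluster, tasks=task_arns)["tasks"]
    for task in tasks:
        print(task["taskArn"], "->", task.get("stoppedReason", "<no reason>"))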

How to overcome the IllegalAccessError during connector startup in Kafka

I am writing a connector for Kafka Connect. The error I see during startup of the connector is:
java.lang.IllegalAccessError: tried to access field org.apache.kafka.common.config.ConfigTransformer.DEFAULT_PATTERN from class org.apache.kafka.connect.runtime.AbstractHerder
The error seems to happen at https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractHerder.java#L449
Do I need to set this DEFAULT_PATTERN manually? Is it not set by default?
I am using the Docker image confluentinc/cp-kafka:5.0.1. The version of connect-api I am using in my connector app is org.apache.kafka:connect-api:2.0.0. I am running my setup inside Kubernetes.
The issue was resolved when I changed the image to confluentinc/cp-kafka:5.0.0-2.
I had already tried this option before posting the question, but the pod was stuck in a Pending state and refused to start, so I thought it could be an issue with the image. After some more research, I learned that Kubernetes is sometimes unable to allocate enough resources, and pods can then stay in the Pending state.
I tried the image confluentinc/cp-kafka:5.0.0-2 and it works fine.
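As a side note on the Pending state: the scheduler usually records the reason as an event on the pod (for example "0/3 nodes are available: 3 Insufficient memory"). A quick sketch with the official kubernetes Python client (namespace and pod name are placeholders):

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() in-cluster
v1 = client.CoreV1Api()

namespace, pod = "default", "my-connector-pod"  # placeholders

# Pending pods usually carry a scheduler event explaining the holdup.
events = v1.list_namespaced_event(
    namespace,
    field_selector=f"involvedObject.name={pod}",
)
for event in events.items:
    print(event.reason, "-", event.message)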

Why would running a container on GCE get stuck on "Metadata request unsuccessful" Forbidden (403)?

I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process so I need a large machine but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same google container registry by the same computer using the same Google login. The older one works fine but the newer one fails by getting stuck in an endless list of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Can anyone explain why one of my images doesn't have this problem (well, it prints a few of these messages but gets past them) while the other does (thousands of this message, and it ran for over 24 hours before I killed it)?
If I ssh into a GCE instance, both versions of the container pull and run just fine. I suspect the INTEGRITY_RULE checking from the logs, but I know nothing about how that works.
MORE INFO: this is down to "restart policy: never". Even a simple centos:7 container that says "hello world", deployed from the console, triggers this if the restart policy is never. At least in the short term I can fix this in the entrypoint script, as the instance will be destroyed when the monitor realises the process has finished.
I suggest you try creating a third container that's focused on the metadata-service functionality, to isolate the issue. It may be that there's a timing difference between the two containers that isn't being overcome.
Make sure you can curl the metadata service from the VM and that the request to the metadata service uses the VM's service account.
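Something like the following, run from inside the VM (or container), exercises the same path the container agent uses; the Metadata-Flavor header is required or the server answers 403:

import requests

# Query the GCE metadata server for a token from the default
# service account; a 403 here points at the same problem.
resp = requests.get(
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token",
    headers={"Metadata-Flavor": "Google"},
    timeout=5,
)
print(resp.status_code)
print(resp.json() if resp.ok else resp.text)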

How to monitor a job from another job in Talend Open Studio 5.3.1

Hi, I am a beginner with Talend Open Studio 5.3.1.
I am currently facing an issue in my project: I need to schedule a job to run every 10 seconds that monitors another job and outputs that job's status, i.e. whether it is running or idle.
Is this possible with Talend Open Studio 5.3.1?
Can anyone explain how to schedule a job every 10 seconds and display the status of another job as its output?
Any suggestions to help me solve this would be appreciated.
We should think a bit outside the box here. I'd solve this by using project-level logging: https://help.talend.com/display/TalendOpenStudioforBigDataUserGuide520EN/2.6+Customizing+project+settings
You'll have the job statuses stored in a database table; you just have to check whether the last execution of the job is still running or not (self-join the stats table).
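A sketch of that self-join in Python; the table and column names (stats, job, pid, message_type with 'begin'/'end' rows, moment) are assumptions about Talend's stats-logging schema, so adjust to whatever project-level logging actually created for you:

import sqlite3  # any DB-API driver works; sqlite3 keeps the sketch self-contained

conn = sqlite3.connect("talend_logs.db")  # placeholder database

# Assumption: logging writes one 'begin' and one 'end' row per execution,
# sharing a pid. A job is still running if its latest 'begin' has no 'end'.
row = conn.execute(
    """
    SELECT b.job, b.pid, b.moment
    FROM stats b
    LEFT JOIN stats e
           ON e.pid = b.pid AND e.message_type = 'end'
    WHERE b.job = ? AND b.message_type = 'begin' AND e.pid IS NULL
    ORDER BY b.moment DESC
    LIMIT 1
    """,
    ("job_to_monitor",),  # placeholder job name
).fetchone()

print("running" if row else "idle")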
Monitoring jobs is not supported in Talend Open Studio, but there are some workarounds:
Use a master job that launches the job to be monitored using the tRunJob component; the master job will then have an idea of what's going on.
Use empty files to synchronize your jobs: the monitored job creates an empty flag file with a distinctive name, and the master job checks for those files to determine the other jobs' states (see the sketch below).
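The flag-file idea can be as small as this (the path is a placeholder; the monitored job creates the flag at startup and removes it on exit, and the master job checks it on its 10-second schedule):

from pathlib import Path

FLAG = Path("/tmp/job_to_monitor.running")  # placeholder flag path

def job_started():
    FLAG.touch()  # called first by the monitored job

def job_finished():
    FLAG.unlink(missing_ok=True)  # called when the monitored job exits

def status():
    # called by the master job every 10 seconds
    return "running" if FLAG.exists() else "idle"

print(status())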
Much easier is to use Quartz.