I can't start creating a Service Fabric cluster.
When I start the creation process, the portal always shows a "Rainy Cloud" error and nothing can be entered. What is going on?
Thanks for reporting this, we found a problem in the portal that may be causing this. We'll be rolling out a fix in the next few days.
BTW, we have a repo on GitHub that you can use to report issues like this for a faster response: https://github.com/Azure/service-fabric-issues
The company I work for has just started migrating to the cloud. We've successfully deployed a number of resources into Azure using Terraform and Pipelines.
Where we're running into issues is deploying a Container App Environment: we have code that worked in a less locked-down environment (set up for a proof of concept), but we're now having trouble using that code in our go-forward environment.
When deploying, the Container App Environment spends about 30 minutes attempting to create before it returns a "context deadline exceeded" error. Looking in the Azure Portal, I can see the resource in the "Waiting" provisioning state, and I can also see the MC_ and AKS resources that get generated. It then fails around four hours later.
Any advice?
I suspect it's related to security on the Virtual Network that the subnets sit on, but I'm not seeing any logs on the deployment to confirm. The original subnets had a Network Security Group (NSG) assigned and I configured the rules that Microsoft provides; I then added a couple of subnets without an NSG assigned, but had no luck.
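For reference, a minimal Python sketch for polling the environment's provisioningState from code rather than refreshing the Portal. This assumes the azure-identity and azure-mgmt-resource packages; the subscription ID, resource ID, API version, and the exact state names are placeholders/assumptions to adjust for your own subscription.

```python
# Rough sketch: poll the Container App Environment's provisioningState via the
# generic Resources API. Subscription, resource ID, API version, and state names
# below are placeholders/assumptions.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
RESOURCE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<rg-name>"
    "/providers/Microsoft.App/managedEnvironments/<env-name>"
)
API_VERSION = "2023-05-01"  # assumed; verify against the Microsoft.App provider's supported versions

client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

while True:
    env = client.resources.get_by_id(RESOURCE_ID, API_VERSION)
    props = env.properties if isinstance(env.properties, dict) else {}
    state = props.get("provisioningState")
    print(f"provisioningState = {state}")
    if state not in ("Waiting", "InProgress"):  # assumed non-terminal states
        break
    time.sleep(60)
```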
My next step is to try provisioning it via the GUI and see if that works.
I managed to break our build in the "anything goes" environment.
The root cause is an incomplete configuration of the Virtual Network, which has custom DNS entries. This has now been passed to our network architects to resolve. If I can get more details on the fix they apply, I'll include them here for anyone else who runs into the issue.
I am deploying a stack with CDK. It gets stuck in CREATE_IN_PROGRESS, and CloudTrail shows these events repeating:
DeleteNetworkInterface
CreateLogStream
What should I look at next to continue debugging? Is there a known reason for this to happen?
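One place to look is the stack's own event list, which shows the logical resource CloudFormation is still waiting on and any status reason it has recorded. A minimal boto3 sketch (the stack name is a placeholder):

```python
# Sketch: list the most recent events for the stuck stack to see which logical
# resource CloudFormation is still waiting on. Stack name is a placeholder.
import boto3

cfn = boto3.client("cloudformation")

events = cfn.describe_stack_events(StackName="my-cdk-stack")["StackEvents"]
for e in events[:20]:  # events are returned newest first
    print(
        e["Timestamp"],
        e["LogicalResourceId"],
        e["ResourceStatus"],
        e.get("ResourceStatusReason", ""),
    )
```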
I also saw the exact same issue with the deployment of a CDK-based ECS/Fargate stack.
In my case, I was able to diagnose the issue by following the AWS support article https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-stack-stuck-progress/
What specifically diagnosed and then resolved it for me:
I updated my ECS service to set its desired task count to 0. At that point the CloudFormation stack did complete successfully.
From that, it became obvious that the actual issue was related to the creation of the initial task for my ECS service. I was able to diagnose it by reviewing the output in the Deployment and Events tabs of the ECS service in the AWS Management Console. In my case, task creation was failing because of an issue with accessing the associated ECR repository. Obviously there could be other reasons, but they should show up there.
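For anyone who prefers to do the same two steps from code rather than the console, a hedged boto3 sketch (cluster and service names are placeholders):

```python
# Sketch of the two steps above: set the ECS service's desired count to 0 so the
# CloudFormation stack can settle, then read the service events that explain why
# task launches were failing. Cluster/service names are placeholders.
import boto3

ecs = boto3.client("ecs")
CLUSTER = "my-cluster"  # placeholder
SERVICE = "my-service"  # placeholder

# Step 1: stop ECS from retrying task launches while the stack is stuck.
ecs.update_service(cluster=CLUSTER, service=SERVICE, desiredCount=0)

# Step 2: the reason for the failed launches (e.g. ECR access) shows up in the events.
svc = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
for event in svc["events"][:10]:  # most recent first
    print(event["createdAt"], event["message"])
```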
This morning we noticed that all Kubernetes clusters in all projects (2 projects, 2 clusters per project) showed unavailable / ERROR in the Google Cloud Console.
The dashboard shows no current issues: https://status.cloud.google.com/
It basically looks like the master nodes are down: the API does not respond and the clusters cannot be edited in the UI. Before the weekend everything was up, and since at least yesterday evening they have all shown in red.
The deployed services fortunately respond, but we cannot manage the cluster in any way.
I reported it here too:
https://issuetracker.google.com/issues/172841082
Did anyone else encounter this, and is there any way to restart the master node or trigger it to restart? I cannot edit the cluster, so an upgrade is not possible either.
I read elsewhere that only SRE folks from Google can (re)start them.
It's beyond me how this can happen.
By the way, auto-repair is set to on, and I followed the troubleshooting page; basically all paths lead to "master node down, nothing to be done".
Any help would be greatly appreciated, or simply an SRE doing a start-node action ;).
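For what it's worth, the GKE management API (container.googleapis.com) is separate from the cluster's own API server, so it may still answer even when the master is unreachable. A minimal sketch with the google-cloud-container client (project, location, and cluster names are placeholders):

```python
# Sketch: read the cluster's status and status message from the GKE management API,
# which at least tells you what the control plane reports about itself.
# Project/location/cluster names are placeholders.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()

name = "projects/my-project/locations/europe-west1/clusters/my-cluster"
cluster = client.get_cluster(name=name)

print(cluster.status)          # e.g. Status.ERROR / Status.DEGRADED
print(cluster.status_message)  # may contain the actual reason for the error state
```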
Thank you #dany L, it was indeed a billing issue.
I'm surprised there is nothing like a message in the Cloud Console and that one has to go to Billing specifically to find out about this.
After billing was fixed, it took a few minutes before the clusters were available again, then everything looked back to normal.
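For anyone else hitting this, a small sketch (assuming the google-cloud-billing package) that checks programmatically whether billing is still enabled on a project, which would have surfaced the problem much faster than digging through the Console:

```python
# Sketch: check whether billing is enabled for a project via the Cloud Billing API.
# The project ID is a placeholder.
from google.cloud import billing_v1

client = billing_v1.CloudBillingClient()
info = client.get_project_billing_info(name="projects/my-project")

print("billing account:", info.billing_account_name or "<none>")
print("billing enabled:", info.billing_enabled)
```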
The Cloud Composer documentation explicitly states that:
Due to an issue with the Kubernetes Python client library, your Kubernetes pods should be designed to take no more than an hour to run.
However, it doesn't provide any more context than that, and I can't find a definitively relevant issue on the Kubernetes Python client project.
To test it, I ran a pod for two hours and saw no problems. What issue creates this restriction, and how does it manifest?
I'm not deeply familiar with either the Cloud Composer or Kubernetes Python client library ecosystems, but sorting the GitHub issue tracker by most comments shows this open item near the top of the list: https://github.com/kubernetes-client/python/issues/492
It sounds like there is a token expiration issue:
#yliaog this is an issue for us, as we are running kubernetes pods as batch processes and tracking the state of the pods with a static client. Once the client object is initialized, it does no refresh, and therefore any job that takes longer than 60 minutes will fail. Looking through python-base, it seems like we could make a wrapper class that generates a new client (or refreshes the config) every n minutes, or checks status prior to every call (as #mvle suggested). The best fix would be in swagger-codegen, but a temporary solution would probably be very useful for a lot of people.
- #flylo, https://github.com/kubernetes-client/python/issues/492#issuecomment-376581140
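A rough sketch of the wrapper idea from that comment, assuming the standard kubernetes Python client: rebuild the API client from the kubeconfig every n minutes so the credentials never go stale.

```python
# Rough sketch of the workaround suggested above: instead of holding one static
# client for the whole job, rebuild it (re-reading the kubeconfig, which gives the
# auth provider a chance to refresh its token) whenever it is older than n minutes.
import time

from kubernetes import client, config


class RefreshingCoreV1:
    def __init__(self, max_age_seconds=15 * 60):
        self._max_age = max_age_seconds
        self._built_at = 0.0
        self._api = None

    def api(self):
        if self._api is None or time.time() - self._built_at > self._max_age:
            config.load_kube_config()  # re-reads credentials from the kubeconfig
            self._api = client.CoreV1Api()
            self._built_at = time.time()
        return self._api


# Usage: call .api() before every request instead of caching CoreV1Api yourself.
v1 = RefreshingCoreV1()
pods = v1.api().list_namespaced_pod(namespace="default")
print(len(pods.items), "pods in default")
```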
https://issues.apache.org/jira/browse/AIRFLOW-3253 is the reason (and hopefully, my fix will be merged soon). As the others suggested, this affects anyone using the Kubernetes Python client with GCP auth. If you are authenticating with a Kubernetes service account, you should see no problem.
If you are authenticating via a GCP service account with gcloud (e.g. using the GKEPodOperator), you will generally see this problem with jobs that take longer than an hour because the auth token expires after an hour.
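As an illustration of the "Kubernetes service account" path, a hedged sketch of building a client directly from a service-account bearer token, which does not depend on the hourly gcloud token at all (host, CA path, and token are placeholders; inside a pod you would normally just call config.load_incluster_config()):

```python
# Sketch: build a client from a Kubernetes service-account bearer token instead of
# gcloud credentials, so nothing expires after an hour. Host, CA path, and token
# are placeholders.
from kubernetes import client

configuration = client.Configuration()
configuration.host = "https://<cluster-endpoint>"           # placeholder
configuration.ssl_ca_cert = "/path/to/cluster-ca.crt"       # placeholder
configuration.api_key = {"authorization": "Bearer <service-account-token>"}  # placeholder

v1 = client.CoreV1Api(client.ApiClient(configuration))
print(len(v1.list_namespaced_pod(namespace="default").items), "pods in default")
```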
There are more insights here too.
Currently, long-running jobs on GKE always eventually fail with a 404 error (https://bitbucket.org/snakemake/snakemake/issues/932/long-running-jobs-on-kubernetes-fail). We believe that the problem is in the Kubernetes client, as we determined that although _refresh_gcp_token is being called when the token is expired, the next API call still fails with a 404 error.
You can see here that Snakemake uses the kubernetes python client.
I have been having a problem for the last couple of hours: upon deployment of a pipeline, it goes into the provisioning state and then gets stuck there.
It eventually fails with one of these two errors: "Failed to reach service" or "Internal Server Error."
The state of the pipeline is stuck at "Pending Update".
I have even created a Data Factory from scratch, created new linked services and datasets, and created the pipeline within that data factory, and the same thing happens.
What could be causing this issue?
I should add that I had been using Data Factory and deploying pipelines without any problems; this was a sudden issue.
We have had occasional service incidents in the past which have transiently impacted provisioning of ADF entities. If you are still experiencing this issue, please open a support ticket with the specific info for your data factory. We appreciate your feedback!