gcloud run deploy stuck at Routing traffic - gcloud

So today I started deploying some stuff to gcloud and I got the usual process:
⠶ Building and deploying... Recreating retired Revision.
✓ Uploading sources...
✓ Building Container... Logs are available at [https://console.cloud.google.com/cloud-build/builds/4ee7f621-c548-4630-acbc-6bcd7daa3275?project=638685146920].
✓ Creating Revision...
⠶ Routing traffic...
But the unusual thing is that the Routing traffic step took forever, and after some time I got:
Deployment failed
ERROR: gcloud crashed (WaitException): last_result=<googlecloudsdk.api_lib.run.condition.Conditions object at 0x7f179f99c2b0>, last_retrial=1326, time_passed_ms=1799923,time_to_wait=1000

SOLUTION: Just wait a few hours, and it will be back up

Deleting all my code and running git reset --hard origin/master worked for me.
Originally I ran npm i and still had the same problem; maybe there is hidden gcloud data that needed to be deleted.
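If waiting doesn't help, the state of the revision that "Routing traffic" is waiting on can be inspected directly. A minimal sketch, assuming a service called my-service in europe-west1 (both are placeholders):
# List revisions and see which one is failing to become ready
gcloud run revisions list --service my-service --region europe-west1
# Show the readiness conditions the deploy command is polling on
gcloud run services describe my-service --region europe-west1 --format 'yaml(status.conditions)'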

Related

AWS ECS won't start tasks: http request timed out enforced after 4999ms

I have an ECS cluster (Fargate), task, and service that I have had set up in Terraform for at least a year. I haven't touched it for a long while. My normal deployment for updating the code is to push a new container to the registry and then stop all tasks on the cluster with a script. Today, my service did not run a new task in response to that task being stopped. Its desired count is fixed, so it should have.
I have gone in and tried to manually run this, and I'm seeing this error.
Unable to run task
Http request timed out enforced after 4999ms
When I try to do this, a new stopped task is added to my stopped tasks list. When I look into that task, the stopped reason is "Deployment restart", and two of them are now showing "Task provisioning failed.", which I think might be tasks the service tried to start. But these tasks do not show a started timestamp. The ones I start in the console have a started timestamp.
My site is now down and I can't get it back up. Does anyone know of a way to debug this? Is AWS ECS experiencing problems right now? I checked the health monitors and I see no issues.
This was an AWS outage affecting Fargate in us-east-1. It's fixed now.
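For anyone hitting similar symptoms when there isn't an outage, the stopped reason can be pulled from the CLI rather than clicking through stopped tasks in the console. A minimal sketch, assuming a cluster named my-cluster (placeholder):
# List recently stopped tasks
aws ecs list-tasks --cluster my-cluster --desired-status STOPPED
# Show why a given task stopped (task ARN taken from the previous command)
aws ecs describe-tasks --cluster my-cluster --tasks <task-arn> --query 'tasks[].[stoppedReason,stopCode]'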

Azure app service deployment fails at core-js postinstall

I am deploying a Teams app using a custom deployment template and a git repo URL. The deployment was successful previously, but since last week it has been failing at the core-js postinstall step. Below is the log for the same.
log file
Please let me know what I am missing and why it fails only at core-js.
log: > core-js#3.22.7 postinstall C:\home\site\repository\Source\Microsoft.Teams.Apps.SubmitIdea\ClientApp\node_modules\core-js
node -e "try{require('./postinstall')}catch(e){}"
Command 'starter.cmd "C:\home\site\d ...' was aborted due to no output nor CPU activity for 60 seconds. You can increase the SCM_COMMAND_IDLE_TIMEOUT app setting (or WEBJOBS_IDLE_TIMEOUT if this is a WebJob) if needed.
starter.cmd "C:\home\site\deployments\tools\deploy.cmd"
node -e "try{require('./postinstall')}catch(e){}"
Input string was not in a correct format.
My deployment issue got fixed when I changed the default Node version to ~16 instead of 16.15.0 in the deployment template. Azure supports 16.13.0 as of now.
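If you would rather change this outside the deployment template, the same setting can be applied as an app setting. A hedged sketch using the Azure CLI (the app and resource group names are placeholders):
# Pin the default Node version to the latest supported 16.x on Windows App Service
az webapp config appsettings set --name my-teams-app --resource-group my-rg --settings WEBSITE_NODE_DEFAULT_VERSION="~16"
# Optionally raise the idle timeout that killed the core-js postinstall step (see the log above)
az webapp config appsettings set --name my-teams-app --resource-group my-rg --settings SCM_COMMAND_IDLE_TIMEOUT=300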

How to sync the user directory on Bitbucket Server to Jira with both running on AKS?

When trying to sync the user directories of Jira to other Atlassian products (Confluence and Bitbucket Server running on AKS), a 403 error is returned.
Upon looking into this error, the following steps have been attempted:
https://confluence.atlassian.com/stashkb/unable-to-connect-to-jira-for-authentication-forbidden-403-323391874.html
The IP addresses have been added to the whitelist of Jira. The next step in the solutions found online is to restart the Jira service.
This, however, causes issues: after running the stop/start-jira.sh scripts inside the pod, the service comes back with none of the previous settings, and all configuration, including backups, is gone, taking us back to square one.
Cluster size (current set-up): 3 x Standard D8 v3 (8 vCPUs, 32 GiB memory) on AKS
Used the following images, installed through the UI:
atlassian/jira-software
cptactionhank/docker-atlassian-jira
Exec into the pod, go to /opt/atlassian/jira/bin, and run ./(start|stop)-jira.sh.
What actually happens is that when going back to the URL, the Jira instance has been reset and all configuration files in the pod for the service are lost.
The pod's logs commonly show exit code 137 when restarting.
Update: the following Helm chart has also been used and achieved the same result:
https://github.com/int128/devops-kompose/tree/master/atlassian-jira-software
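Exit code 137 means the container was SIGKILLed, most often by the kernel OOM killer, and losing all configuration on restart suggests the Jira home directory is not on a persistent volume. A minimal sketch for confirming both, assuming the pod is called jira-0 (placeholder):
# Check whether the last restart was an OOM kill
kubectl describe pod jira-0 | grep -A5 "Last State"
# See which volumes are mounted and whether any persistent volume claims exist for the Jira home
kubectl get pod jira-0 -o jsonpath='{.spec.volumes}'
kubectl get pvc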

Why would running a container on GCE get stuck on "Metadata request unsuccessful: Forbidden (403)"?

I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process, so I need a large machine, but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same Google Container Registry by the same computer using the same Google login. The older one works fine, but the newer one fails by getting stuck in an endless list of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Can anyone please explain why one of my images doesn't have this problem (well, it gives a few of these messages but gets past them) and the other does (thousands of these messages, and it took over 24 hours before I killed it)?
If I ssh in to a GCE instance then both versions of the container pull and run just fine. I'm suspecting the INTEGRITY_RULE checking from the logs but I know nothing about how that works.
MORE INFO: this is down to "restart policy: never". Even a simple centos:7 container that says "hello world" deployed from the console triggers this if the restart policy is never. At least in the short term I can fix this in the entrypoint script, as the instance will be destroyed when the monitor realises that the process has finished.
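For reference, the restart policy is set when the container VM is created, so it can be made explicit there instead of being worked around in the entrypoint. A hedged sketch with gcloud (instance and image names are placeholders):
gcloud compute instances create-with-container etl-vm \
  --container-image gcr.io/my-project/etl:latest \
  --container-restart-policy always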
I suggest you try creating a 3rd container that's focused on the metadata service functionality to isolate the issue. It may be that there's a timing difference between the 2 containers that's not being overcome.
Make sure you can curl the metadata service from the VM and that the request to the metadata service is using the VM's service account.
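A minimal sketch of that check from inside the VM, using the standard metadata endpoint:
# Should return the VM's service account email rather than a 403
curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"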

Service Fabric "Waiting for upgrade..." using VSTS

I've configured upgrades on my VSTS release of a Service Fabric app containing 5 services to a single-node test environment on Azure. Unfortunately, when it gets to the release part, it just hangs, saying "Waiting for upgrade..." over and over again. I left it for 15 hours and it still said the same thing. The initial deployment went ahead without issue.
I've looked at various posts about turning off health-check timeouts, but this has not been successful. I've also tried setting the upgrade mode to UnmonitoredAuto, with no success.
I've RDP'd onto the environment and checked processor/memory usage in Task Manager; CPU is pretty much at 0% and memory usage is very low.
Is there anything else I can do to stop the upgrade hanging?
OK, I've managed to fix this. It was happening because there is a PreUpgradeSafetyCheck that runs before rolling out an upgrade. This is not relevant for a single-node cluster, as downtime is inevitable for single-node clusters.
The status of an upgrade can be found using Get-ServiceFabricApplicationUpgrade, which shows the status above.
To fix this, there is a flag, UpgradeReplicaSetCheckTimeoutSec, in the release task. Setting its value to 0 sorts things out.
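If you are driving the upgrade from PowerShell instead of the release task, the same setting is exposed as a parameter on the upgrade cmdlet. A hedged sketch; the application name, type version, and upgrade mode are placeholders for your own values:
# Assumes Connect-ServiceFabricCluster has already been run against the cluster
# Setting the replica-set check timeout to 0 skips the safety check that never completes on a single-node cluster
Start-ServiceFabricApplicationUpgrade -ApplicationName "fabric:/MyApp" `
  -ApplicationTypeVersion "1.0.1" -UnmonitoredAuto `
  -UpgradeReplicaSetCheckTimeoutSec 0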