Active Deploy `begin` step fails after upgrade to devops toolchain - ibm-cloud

We recently upgraded our IBM Bluemix DevOps project to a toolchain, as recommended by IBM, and it no longer deploys. The pipeline configuration seems to have migrated over correctly, and the first step of the deploy process even works, creating a new instance of the app. However, when it gets to the active-deploy-begin step it fails with this error:
--- ERROR: Unknown status:
--- ERROR: label: my-app_220-to-my-app_2 space: my-space routes: my-app.mybluemix.net
phase: rampup start group: my-app_220 app (1) successor group: my-app_2 app (1) algorithm: rb
deployment id: 84630da7-8663-466a-bb99-e02d2eb17a90 transition type: manual
rampup duration: 4% of 2m test duration: 1s
rampdown duration: 2m status: in_progress status messages: <none>
It appears to have started the build number from 1 instead of continuing from the previous number of 220. I've tried deleting the service at the app level from the Bluemix web interface to no avail. Any help or pointers will be much appreciated.
UPDATE:
Things I've tried:
Deleting the app and running the build process to create a new instance. This worked the first time, since it detected that it was just the initial build, but the second time it ran it failed with the same Unknown status error.
Deleting all the previous deployment records, to eliminate the possibility that the failure was caused by a deployment label name conflict, i.e. my-app_1-to-my-app_2.
Also, interestingly, the active deploy step works from the cf command line using the active-deploy-create my-app_1 my-app_2 command, so it seems the issue might be with the script that runs the Active Deploy commands for the pipeline.
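For reference, a minimal sketch of that manual CLI path, assuming the IBM Active Deploy cf plugin is installed and reusing the group names from the example above (the API endpoint, org, and space are placeholders, not values from the question):

```powershell
# Runs the same from PowerShell or any other shell; cf is an external command.

# Log in and target the org/space the pipeline deploys to (placeholders).
cf login -a https://api.ng.bluemix.net -o my-org -s my-space

# Start an Active Deploy from the existing group to the newly pushed one.
# Presumably this is the command the pipeline's begin step wraps.
cf active-deploy-create my-app_1 my-app_2
```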

This issue was also reported at https://github.com/Osthanes/update_service/issues/54, where you will find instructions on how to get it fixed.

Related

Azure Devops deployment failure: The 'Performing deployment' operation conflicts with the pending 'Performing deployment' operation

I am trying to create a DevOps pipeline to deploy an Azure Function.
Each time I try I get the error:
BadRequest: The 'Performing deployment' operation conflicts with the
pending 'Performing deployment' operation started at 2022-08-16T13:01:47.6881726Z.
Please retry operation later.
I have waited 2 hours and still get this error.
In the resource group, I cannot see any pending deployments, only failed deployments.
Also, the Get-AzDeployment cmdlet returns no data, so I can't find any deployments that may be blocking.
Any ideas how to resolve this?
The message usually indicates that another pending deployment operation is still ongoing and is blocking the new deployment. However, you don't see any results from running the Get-AzDeployment cmdlet. Additionally, you can run the Get-AzDeploymentOperation cmdlet: it lists all the operations that were part of a deployment, which helps identify and gives more information about the exact operations that failed for a particular deployment.
https://learn.microsoft.com/en-us/powershell/module/az.resources/get-azdeploymentoperation?view=azps-8.3.0
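A short sketch of using those cmdlets; the deployment and resource group names below are placeholders, not values from the question:

```powershell
# Subscription-scope deployments and their individual operations.
Get-AzDeployment
Get-AzDeploymentOperation -DeploymentName 'my-deployment' | Format-List

# If the deployment was submitted at resource-group scope (what the portal's
# Deployments blade shows), the resource-group variants apply instead.
Get-AzResourceGroupDeployment -ResourceGroupName 'my-rg'
Get-AzResourceGroupDeploymentOperation -ResourceGroupName 'my-rg' -DeploymentName 'my-deployment'
```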
I found the issue was due to exporting the ARM template from the portal.
For some reason the portal included a bunch of "deployments" in the template. These deployments, which had already run, were clashing when I ran the template.
I've no idea why it includes old deployments in the template, but deleting them resolved that particular issue.
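A hedged sketch of automating that cleanup, assuming the exported file is template.json in the current directory; it drops the nested Microsoft.Resources/deployments resources that the portal export added:

```powershell
$path = '.\template.json'   # assumed file name/location
$template = Get-Content -Path $path -Raw | ConvertFrom-Json

# Keep every resource except the already-run nested deployments.
$template.resources = @($template.resources |
    Where-Object { $_.type -ne 'Microsoft.Resources/deployments' })

# -Depth is needed so nested objects are not truncated on re-serialization.
$template | ConvertTo-Json -Depth 100 | Set-Content -Path $path
```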

AWS ECS won't start tasks: http request timed out enforced after 4999ms

I have an ECS cluster (Fargate), task, and service that I have had set up in Terraform for at least a year, and I haven't touched it for a long while. My normal deployment for updating the code is to push a new container to the registry and then stop all tasks on the cluster with a script. Today, my service did not run a new task in response to that task being stopped. Its desired count is fixed, so it should have.
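A rough sketch of the kind of redeploy script described above; the cluster, service, and ECR image names are placeholders, and it shells out to the AWS CLI, so it runs the same from PowerShell or bash:

```powershell
$cluster = 'my-cluster'   # placeholder
$service = 'my-service'   # placeholder

# Push the new image, then stop every running task so the service's fixed
# desired count forces replacements that pull the new image.
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest

$tasks = aws ecs list-tasks --cluster $cluster --service-name $service `
    --query 'taskArns[]' --output text
foreach ($task in ($tasks -split '\s+' | Where-Object { $_ })) {
    aws ecs stop-task --cluster $cluster --task $task | Out-Null
}

# Alternative that avoids stopping tasks by hand: ask ECS to roll them itself.
aws ecs update-service --cluster $cluster --service $service --force-new-deployment
```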
I have gone in and tried to manually run the task, and I'm seeing this error:
Unable to run task
Http request timed out enforced after 4999ms
When I try to do this, a new stopped task is added to my stopped tasks list. When I look into that task, the stopped reason is "Deployment restart", and two of them are now showing "Task provisioning failed", which I think might be tasks the service tried to start. But those tasks do not show a started timestamp; the ones I start in the console do have a started timestamp.
My site is now down and I can't get it back up. Does anyone know of a way to debug this? Is AWS ECS experiencing problems right now? I checked the health monitors and I see no issues.
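For debugging questions like this, a couple of hedged AWS CLI calls (cluster/service names and the task ARN are placeholders) usually surface the reason, via the service's event log and the stopped task's stop details:

```powershell
# Recent service events often explain why tasks are not being placed/started.
aws ecs describe-services --cluster my-cluster --services my-service `
    --query 'services[0].events[:10]'

# Inspect one of the stopped tasks directly.
$stoppedTaskArn = '<arn of one of the stopped tasks>'   # placeholder
aws ecs describe-tasks --cluster my-cluster --tasks $stoppedTaskArn `
    --query 'tasks[0].[stoppedReason,stopCode]'
```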
This was an AWS outage affecting Fargate in us-east-1. It's fixed now.

Azure app service deployment fails at core-js postinstall

I am deploying a Teams app using a custom deployment template and a Git repo URL. The deployment was previously successful, but since last week it has been failing at the core-js postinstall step. Below is the log.
log file
Please let me know what I am missing and why it fails only at core-js.
log: > core-js#3.22.7 postinstall C:\home\site\repository\Source\Microsoft.Teams.Apps.SubmitIdea\ClientApp\node_modules\core-js
node -e "try{require('./postinstall')}catch(e){}"
Command 'starter.cmd "C:\home\site\d ...' was aborted due to no output nor CPU activity for 60 seconds. You can increase the SCM_COMMAND_IDLE_TIMEOUT app setting (or WEBJOBS_IDLE_TIMEOUT if this is a WebJob) if needed.
starter.cmd "C:\home\site\deployments\tools\deploy.cmd"
node -e "try{require('./postinstall')}catch(e){}"
Input string was not in a correct format.
My deployment issue was fixed when I changed the default Node version to ~16 instead of 16.15.0 in the deployment template. Azure supports 16.13.0 as of now.
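A hedged way to inspect the relevant settings on the App Service itself (resource group and app names are placeholders): on Windows App Service the default Node version is typically surfaced as the WEBSITE_NODE_DEFAULT_VERSION app setting, which is what the template's "Default node version" parameter feeds, and the 60-second abort in the log corresponds to SCM_COMMAND_IDLE_TIMEOUT.

```powershell
$app = Get-AzWebApp -ResourceGroupName 'my-rg' -Name 'my-teams-app'

# Check the current values before changing the deployment template parameter.
$app.SiteConfig.AppSettings |
    Where-Object { $_.Name -in 'WEBSITE_NODE_DEFAULT_VERSION', 'SCM_COMMAND_IDLE_TIMEOUT' }
```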

I'm having issues with DevOps production deployment - Unable to edit or replace deployment

Up until yesterday morning I was able to deploy Data Factory v2 changes in my release pipeline. Then last night, during deployment, I received an error that the connection was forcibly closed. Now when I try to deploy to the production environment, I get this error: "Unable to edit or replace deployment 'ArmTemplate_18': previous deployment from '12/10/2019 10:19:27 PM' is still active (expiration time is '12/17/2019 10:19:23 PM')". Am I supposed to wait a week for this error to clear itself?
This message indicates that there’s another deployment going on, with the same name, in the same ARM Resource Group. In order to perform your new deployment, you’ll need to either:
Wait for the existing deployment to complete
Stop the in-progress / active deployment
You can stop an active deployment by using the Stop-AzureRmResourceGroupDeployment PowerShell command or the azure group deployment stop command in the xPlat CLI tool. Please refer to this case.
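A minimal PowerShell sketch of that approach, using the deployment name from the error above; the resource group name is a placeholder, and with the newer Az module the equivalents are Get-AzResourceGroupDeployment / Stop-AzResourceGroupDeployment:

```powershell
# Find the deployment that is still running in the resource group.
Get-AzureRmResourceGroupDeployment -ResourceGroupName 'my-rg' |
    Where-Object { $_.ProvisioningState -eq 'Running' }

# Cancel it so the new deployment can be submitted.
Stop-AzureRmResourceGroupDeployment -ResourceGroupName 'my-rg' -Name 'ArmTemplate_18'
```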
Alternatively, you can open the target resource group in the Azure portal, go to the Deployments tab, find the deployments that have not completed, cancel them, and then start a new deployment. You can refer to this issue for details.
In addition, there was a recent availability degradation event for Azure DevOps, which could also have had an impact. The engineers have now mitigated this event.

Azure Service Fabric Cluster Update

I have a cluster in Azure and it failed to update automatically, so I'm trying a manual update. I tried via the portal and it failed, so I kicked off an update using PowerShell, which failed as well. The update starts, then just sits at "UpdatingUserConfiguration", and after an hour or so fails with a timeout. I have removed all application types and checked my certs for "NETWORK SERVICE". The cluster is a 5-VM, single node type, Windows cluster.
Error
Set-AzureRmServiceFabricUpgradeType : Code: ClusterUpgradeFailed,
Message: Cluster upgrade failed. Reason Code: 'UpgradeDomainTimeout',
Upgrade Progress:
'{"upgradeDescription":{"targetCodeVersion":"6.0.219.9494","
targetConfigVersion":"1","upgradePolicyDescription":{"upgradeMode":"UnmonitoredAuto","forceRestart":false,"u
pgradeReplicaSetCheckTimeout":"37201.09:59:01","kind":"Rolling"}},"targetCodeVersion":"6.0.219.9494","target
ConfigVersion":"1","upgradeState":"RollingBackCompleted","upgradeDomains":[{"name":"1","state":"Completed"},
{"name":"2","state":"Completed"},{"name":"3","state":"Completed"},{"name":"4","state":"Completed"}],"rolling
UpgradeMode":"UnmonitoredAuto","upgradeDuration":"02:02:07","currentUpgradeDomainDuration":"00:00:00","unhea
lthyEvaluations":[],"currentUpgradeDomainProgress":{"upgradeDomainName":"","nodeProgressList":[]},"startTime
stampUtc":"2018-05-17T03:13:16.4152077Z","failureTimestampUtc":"2018-05-17T05:13:23.574452Z","failureReason"
:"UpgradeDomainTimeout","upgradeDomainProgressAtFailure":{"upgradeDomainName":"1","nodeProgressList":[{"node
Name":"_mstarsf10_1","upgradePhase":"PreUpgradeSafetyCheck","pendingSafetyChecks":[{"kind":"EnsureSeedNodeQu
orum"}]}]}}'.
Any ideas on what I can do about an "EnsureSeedNodeQuorum" error?
The root cause was that there were only 3 seed nodes in the cluster, as a result of the cluster being built with a VM scale set that had "overprovision" set to true. Lesson learned: remember to set "overprovision" to false.
I ended up deleting the cluster and scale set and recreating them from my stored ARM template.
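For reference, a hedged check of that setting on an existing scale set (resource group and scale set names are placeholders); in the ARM template itself this corresponds to "overprovision": false in the properties of the Microsoft.Compute/virtualMachineScaleSets resource.

```powershell
# Inspect the node-type scale set; Overprovision should be False for a
# Service Fabric node type so that seed nodes are not removed after creation.
$vmss = Get-AzureRmVmss -ResourceGroupName 'my-sf-rg' -VMScaleSetName 'mstarsf10'
$vmss.Overprovision
```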