OpenSearch 1.3 > 2.3 upgrade, CloudFormation fails on domain update - aws-cloudformation

I recently updated our CDK code to move our OpenSearch cluster from version 1.3 to 2.3. The cluster itself seems to have upgraded to a healthy state and is still accessible / usable by our application, but CloudFormation failed when attempting to update our domain resource with:
Resource handler returned message: "Resource handler returned message: "Invalid request provided: DP Nodes are OOS, Tags operation is not allowed""
This kicked the stack into UPDATE_ROLLBACK_FAILED; the rollback itself cannot succeed because the cluster cannot be downgraded back to 1.3.
I'm struggling to find any information about the error it's kicking out, and I'm not quite sure how to resolve it to unblock the CloudFormation stack.
Things I have tried:
Digging through CloudWatch logs only revealed information pertaining to queries.
Forcing the rollback to occur without the Domain resource. This got me back to an UPDATE_COMPLETE state, but each subsequent deploy of this stack will fail again, since the core issue is not resolved.

This was an odd presentation of a permissions issue. As I was reading through some docs, I stumbled upon this section, which discusses changes to tag-based access control.
This led me to start looking into CloudTrail a bit, where I stumbled upon the exact error that fired when this deploy happened. It was a little odd because the assumed role granted admin access to CloudFormation, but the last line of the event record caught my eye:
"sourceIPAddress": "cloudformation.amazonaws.com",
"userAgent": "cloudformation.amazonaws.com",
"errorCode": "ValidationException",
"errorMessage": "DP Nodes are OOS, Tags operation is not allowed",
"eventSource": "es.amazonaws.com",
Upon adding es.amazonaws.com to the trust relationship of that role, the deploy fully re-ran successfully.
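For anyone hitting the same wall: the change amounted to allowing es.amazonaws.com to assume the role alongside CloudFormation. Roughly, the trust policy ended up looking like the sketch below (simplified from my setup; treat it as a sketch rather than the exact policy):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "cloudformation.amazonaws.com",
          "es.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
If you manage the role outside CDK, something like aws iam update-assume-role-policy --role-name <your-role> --policy-document file://trust-policy.json will apply it.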
Hopefully this helps someone else.

Related

Kubeflow fails to deploy using both CLI and Console

I deleted my KF cluster last night to create a new one (using the kubectl cluster command, not kfctl delete), and when I tried to create a new one, it fails; it does not work with either the CLI or the Console. I found other people have run into this issue before, for example (here and here).
"However, as I said even with CLI my deployment fails, the error from console is:
ailed to apply: (kubeflow.error): Code 500 with message: coordinator Apply failed for gcp: (kubeflow.error): Code 500 with message: gcp apply could not update deployment manager Error could not update storage-kubeflow.yaml; Insert deployment error: googleapi: Error 403: Request had insufficient authentication scopes.
More details:
Reason: insufficientPermissions, Message: Insufficient Permission"
and the error I get from Console is:
"Please enable APIs for your project and try again
Please enable cloud resource manager API: https://console.developers.google.com/apis/api/cloudresourcemanager.googleapis.com/ and iam API: https://console.developers.google.com/apis/api/iam.googleapis.com/"
Note that this error is wrong; all of these APIs are already enabled. I'm quite sure this is a bug in KF, but I'm not sure how to find a workaround. Any thoughts?
With the CLI, I'm using my own account, which has "owner" privileges.
Thanks
It seems you have an issue with IAM and the installation of Kubeflow, a third-party product that is not itself supported by us; nevertheless, I went ahead and dug up some information about this machine learning product.
The main issues (and it seems you have already covered permissions) are permissions, the number of projects, and some fine-grained points.
I was checking and found the following things that may help:
a) Troubleshooting Kubeflow [1]
b) Deploying Kubeflow on GKE [2]
c) Kubeflow auto-deployer for GKE [3]
There is also some discussion about mismatched permission settings in Kubeflow that may be worth reading [4].
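As a quick sanity check on the Console error you pasted, you can confirm from the CLI which APIs are actually enabled (the project ID below is a placeholder):
gcloud services list --enabled --project my-project | grep -E 'cloudresourcemanager|iam'
For the "insufficient authentication scopes" error on the CLI path, re-authenticating with gcloud auth login (and gcloud auth application-default login for application-default credentials) sometimes clears stale credentials.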
Finally, there is a support group ("google-kubeflow-support#google.com") that also operates on a best-effort basis, given the nature of Kubeflow; it may come in handy.
I trust this information will be useful for you in solving your issue.

WSO2 API Manager - APIs missing after recreating a pod

We have a setup of WSO2 API Management in a distributed pattern (pattern-3) in Kubernetes. We are using a PostgreSQL DB which is running outside the Kubernetes cluster for all the databases.
I have published some APIs in the publisher and am able to invoke them from the store.
I had to make a change to api-manager.xml in the configmap files for the API Publisher and API Store, and recreated the pods. When the pods were available again, I observed that the APIs I had published earlier, which had been working, are not visible anymore.
I tried to add the same APIs again, and it complains that APIs by those names already exist.
Following is the log from the publisher pod:
[2019-05-16 08:19:38,266] ERROR - APIProviderHostObject Error occurred while adding the document. PizzaShack API Documentation already exists for API PizzaShackAPI-1.0.0
[2019-05-16 08:19:38,273] ERROR - docs:jag org.wso2.carbon.apimgt.api.APIManagementException: Error occurred while adding the document. PizzaShack API Documentation already exists for API PizzaShackAPI-1.0.0
While creating the API again on the Publisher, the following error is displayed: "Duplicate API Name"
It clearly seems to be some synchronization issue. How can this issue be fixed?
I had shared the same Carbon DB instance across the components. This was causing the issue. Using a separate instance for each component in distributed mode solved it.
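For reference, separating the instances means pointing the WSO2_CARBON_DB datasource in each component's repository/conf/datasources/master-datasources.xml at its own database. A sketch for the Publisher, assuming PostgreSQL (as in the question) and placeholder host/database/credential values:
<datasource>
    <name>WSO2_CARBON_DB</name>
    <jndiConfig><name>jdbc/WSO2CarbonDB</name></jndiConfig>
    <definition type="RDBMS">
        <configuration>
            <!-- placeholder host and database; each component gets its own DB -->
            <url>jdbc:postgresql://db-host:5432/publisher_carbon_db</url>
            <username>wso2carbon</username>
            <password>wso2carbon</password>
            <driverClassName>org.postgresql.Driver</driverClassName>
        </configuration>
    </definition>
</datasource>
The Store and other components would carry the same block with their own database names (e.g. store_carbon_db).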

IBM Cloud Private CE - Unauthorized Access to Catalog

I have installed ICP CE 2.1.0 on a Google Cloud VM, and the installation went well, with no errors in the installation process. When accessing the GUI I am able to see deployments and services, but as soon as I access any part of the Catalog I get a blank white page with the text:
{"statusCode":401,"details":"Unexpected response code 401 from request:\nGET https://xx.xxx.xxx.xx:8443/console/api/v1/header?serviceId=catalog-ui&dev=false&accessUrl=https%3A%2F%2Fxx.xxx.xxx.xx%3A8443* ...... }
I have tried killing the individual pods, but I get the same error. When looking at the pod logs for the catalog-ui, I see error 500 messages.
Has anyone experienced this, or can anyone tell me why this is the case? I understand that a cloud VM is maybe not the best use case, but it should work?
Can you confirm the version level of ICP? Your post mentioned "ICP CE 2.1.0", but if you check the user icon (top right corner) and click About, you should be able to see the full version details.
The reason for asking is that, at the 2.1.0.0 level, there was an intermittent catalog issue just like you describe. Generally it was caused by resource constraints on the Kubernetes cluster.
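If you want to rule out resource pressure while you check the version, something like the following (assuming kubectl access to the cluster and metrics being collected) will show node headroom and the state of the catalog pods:
kubectl top nodes
kubectl get pods --all-namespaces | grep -i catalog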
Details for ICP 2.1.0.3, which is the latest available release:
https://www.ibm.com/support/knowledgecenter/SSBS6K_2.1.0.3/getting_started/whats_new.html

Can I force delete an AWS CloudFormation stack that is In Progress of Rollback

An AWS CloudFormation rollback (e.g., UPDATE_ROLLBACK_IN_PROGRESS) has been in progress forever, like over an hour and a half. I want to delete the stack altogether or force stop any activity. Is this possible?
Thanks!
Another common cause of blocked stack updates/rollbacks is an error in an ECS::Service resource update; it doesn't look like that is currently detected (in some cases?). CloudFormation waits for the service event saying the service has reached a steady state, so simply updating the service to something that works (e.g. setting desired tasks to 0) will unblock it, as shown below. Try to get the state back to what CloudFormation expects before sending more updates, though, to avoid further problems.
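For example, scaling the stuck service down to zero tasks usually lets it reach a steady state that CloudFormation can observe (cluster and service names below are placeholders):
aws ecs update-service --cluster my-cluster --service my-service --desired-count 0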
I guess your stack resources were changed or deleted outside of CloudFormation.
You can find the official guidance below.
Manually sync resources so that they match the original stack's template, and then continue rolling back the update. For example, if you manually deleted a resource that AWS CloudFormation is attempting to roll back to, you must manually create that resource with the same name and properties it had in the original stack.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/troubleshooting.html#troubleshooting-errors-update-rollback-failed
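Once the resources match what CloudFormation expects, you can resume the rollback from the console (Stack actions > Continue update rollback) or via the CLI; the --resources-to-skip flag lets you skip resources that genuinely cannot be rolled back (stack and logical resource names below are placeholders):
aws cloudformation continue-update-rollback --stack-name my-stack
aws cloudformation continue-update-rollback --stack-name my-stack --resources-to-skip MyBrokenResource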
or (as @talentedmrjones said)
To fix the stack, contact AWS customer support.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/troubleshooting.html#troubleshooting-errors-nested-stacks-are-stuck
In my case, I was able to get out of the same situation by re-creating the deleted resource.
In my case it was an EC2 security group that could not be deleted because it was referenced by another EC2 security group.
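To find which group holds the reference, you can filter on the referenced group ID and then revoke the offending rule before retrying (the group ID below is a placeholder; outbound references have a separate egress.ip-permission.group-id filter):
aws ec2 describe-security-groups --filters Name=ip-permission.group-id,Values=sg-0123456789abcdef0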
When dealing with a custom resource, it is possible to construct a mocked-up version of the return URL.
The easiest way to do this is to grab the URL that was used during the create. If you can get your hands on it, replace the section after the last %2F with the "Client Request Token", which you can get from the event log for the CloudFormation stack.
If not, then here's the format of the URL you'll have to construct:
https://{region}.console.aws.amazon.com/cloudformation/home?region={region}#/stacks?filter=active&tab=events&stackId={stack arn}%2F{stack name}%2F{client request token}
Run that URL as a GET and it will cause the resource to fail the rollback or delete.
You can try deleting the resources manually, and then the update rollback will complete successfully.
Sometimes this will occur if your user role is missing permissions to delete roles. This can be tested by trying to manually delete roles or users that were created by the CloudFormation stack.
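For example, picking one of the stack's roles and attempting a manual delete will surface a permissions problem directly (the role name below is a placeholder; an AccessDenied response points at your own permissions rather than the stack):
aws iam delete-role --role-name my-stack-SomeRole-ABC123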
I had something like this happen once, and the stack seemed stuck forever in UPDATE_ROLLBACK_IN_PROGRESS status. I'd recommend submitting a ticket to AWS support. That was the only way I was able to resolve it.
I was able to delete mine by manually deleting everything via the AWS dashboard. I ended up having a couple of dangling roles that just needed deletion.
I met the same problem.
The console told me some resources depend on others, so they can't be deleted, and in that state the rollback is unavailable.
I just deleted the whole VPC and the resources in that VPC.
CloudFormation retries the deletion every 10-20 minutes, so when it retries, it finds the resources have already been deleted, skips the deletion, and everything is smooth after that.
Yes, use this command to delete stacks stuck in 'DELETE_IN_PROGRESS' state.
You can easily run this in AWS CloudShell also.
Go to Lambda Function -> Monitor -> CloudWatch Logs. Look for the log entry where "RequestType" is "Delete" and copy the necessary fields into the command below:
curl -H 'Content-Type: ''' -X PUT -d '{"Status": "SUCCESS","PhysicalResourceId": "Add your physical resource ID", "StackId": "Add your StackId","RequestId": "Add your RequestID","LogicalResourceId": "LambdaFunction"}' 'Add your ResponseURL Here'
Example:
curl -H 'Content-Type: ''' -X PUT -d '{"Status": "SUCCESS","PhysicalResourceId": "cutomRes-LambdaFunction-1NC1ORF", "StackId": "arn:aws:cloudformation:us-east-1:3343:stack/cutomRes/f52a-11eb-b5df-0a5c2cc1","RequestId": "d70931a2-364b-413e-a2","LogicalResourceId": "LambdaFunction"}' 'https://cloudformation-custom-resource-response-useast1.s3.amazonaws.com/arn%3Aaws%/cutomRes/f5466f6Expires=7200&X-Amz-Credential=AKIA6L7Q4OWT3GW5BT7K%2F20210330%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=1db1f83f'
Do note that the example contains a URL that may have been modified so that it does not work, for security purposes. It is for demonstration purposes only.
You will need to investigate why exactly the rollback is taking so long (e.g., whether it's due to a resource that was modified or deleted outside of the CloudFormation stack, or a Custom Resource that failed to return the expected signals).
I went to the stack's Resources tab and checked why some of them couldn't be deleted, then I deleted them manually first.
Usually, it works with just a quick refresh.
We need to go to the Resources section and check which resource it is trying to delete. Go to that resource and check why CloudFormation is not able to delete it.
What we can do is try deleting the resource manually and check the error or dependency. Fix that, and the stack will then continue. Depending on your stack design and dependencies, you might have to delete or fix multiple resources manually, as in the sketch below.
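From the CLI, the failed stack events usually name the blocking resource and the reason (the stack name below is a placeholder):
aws cloudformation describe-stack-events --stack-name my-stack --query "StackEvents[?contains(ResourceStatus, 'FAILED')].[LogicalResourceId, ResourceStatusReason]" --output table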
Check this for more details:
https://aws.amazon.com/premiumsupport/knowledge-center/cloudformation-stack-stuck-progress/

Google cloud datalab deployment unsuccessful - sort of

This is a different scenario from other questions on this topic. My deployment almost succeeded, and I can see the following lines at the end of my log:
[datalab].../#015Updating module [datalab]...done.
Jul 25 16:22:36 datalab-deploy-main-20160725-16-19-55 startupscript: Deployed module [datalab] to [https://main-dot-datalab-dot-.appspot.com]
Jul 25 16:22:36 datalab-deploy-main-20160725-16-19-55 startupscript: Step deploy datalab module succeeded.
Jul 25 16:22:36 datalab-deploy-main-20160725-16-19-55 startupscript: Deleting VM instance...
The landing page keeps showing a wait bar indicating the deployment is still in progress. I have tried deploying several times in the last couple of days.
About the additions described on the landing page -
An App Engine "datalab" module is added. - When I click on the pop-out URL "https://datalab-dot-.appspot.com/" it throws an error page with "404 page not found".
A "datalab" Compute Engine network is added. - Under "Compute Engine > Operations" I can see a create-instance operation for the datalab deployment with my ID, and a delete-instance operation with the *******-ompute#developer.gserviceaccount.com ID. Not sure what that means.
A Datalab branch is added to the git repo - yes, and with all the components.
I think the deployment is partially successful. When I visit the landing page again, the only option I see is to deploy Datalab again, not to start it. Can someone spot the problem? I appreciate the help.
I read the other posts on this topic and tried to verify my deployment using "https://console.developers.google.com/apis/api/source/overview?project=" and I get the following message:
The API doesn't exist or you don't have permission to access it
You can try looking at the App Engine dashboard here, to verify that there is a "datalab" service deployed.
If that is missing, then you need to redeploy (or switch to the new locally-run version).
If that is present, then you should also be able to see a "datalab" network here, and a VM instance named something like "gae-datalab-main-..." here. If either of those is missing, try going back to the App Engine console, deleting the "datalab" service, and redeploying.
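If you prefer the command line for that last step, the gcloud equivalents (assuming the Cloud SDK is configured for your project) would be something like:
gcloud app services list
gcloud app services delete datalab
and then re-run the deployment from the landing page.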