Is there a way to run AWS cloudformation update in dry run mode? Cos I realised that the aws validate does not pick up all the errors and when you run the update new errors are being thrown.
The closest you could get to this is running mock testing which will require a fair amount of configuration time; it also cannot 100% guarantee you won't encounter any errors.
Moto has an extensive library dedicated to mock running AWS infrastructure and would be worth checking out, with the core endpoints of cloudformation included.
Related
I am developing a moderately complex (for me) serverless infrastructure on AWS that consists of about 50 lambdas and would like to start fleshing out the API documentation but am finding it very tedious. Right now any time I want to make a minor schema change, documentation change, etc. in my YAML, I am re-deploying the whole Cloud Formation and then regenerating the OAS3 API with
sls deploy
aws apigateway get-export --parameters extensions='apigateway' \
--rest-api-id $API_ID --stage-name dev --export-type oas30 \
latest_changes.json
This is obviously pretty time consuming and I feel like there must be a better way. I poked around with with serverless-documentation plugin, but that still seems to require a redeploy (and only works with OAS2), and I've now started investigating serverless-offline (which I wish I knew about in the past), but before I go down that rabbit hole I wanted to see if there's a better way to do this.
I was ultimately able to do what I want using the serverless-openapi plugin (I'm actually using the pvdlg fork), which uses a custom:documentation style very similar to the OAS2 serverless-aws-documentation plugin from serverless.com. With that plugin installed I can then do sls openapi generate -f json -o oas3api.json to output swagger that's pretty close to fully documented, though I did have to write a small script to ingest the swagger, add in per-path security sections (which the plugin doesn't seem to support) and an info block for things like a contact name and TOS, and spit it back out again.
But at least it means no more waiting for AWS to update the CloudFormation just so I can update my docs!
Build servers are generally detached from the VPC running the instance. Be it Cloud Build on GCP, or utilising one of the many CI tools out there (CircleCI, Codeship etc), thus running DB schema updates is particularly challenging.
So, it makes me wonder.... When's the best place to run database schema migrations?
From my perspective, there are four opportunities to automatically run schema migrations or seeds within a CD pipeline:
Within the build phase
On instance startup
Via a warm-up script (synchronously or asynchronously)
Via an endpoint, either automatically or manually called post deployment
The primary issue with option 1 is security. With Google Cloud Sql/Google Cloud Build, it's been possible for me to run (with much struggle), schema migrations/seeds via a build step and a SQL proxy. To be honest, it was a total ball-ache to set up...but it works.
My latest project is utilising MongoDb, for which I've connected in migrate-mongo if I ever need to move some data around/seed some data. Unfortunately there is no such SQL proxy to securely connect MongoDb (atlas) to Cloud Build (or any other CI tools) as it doesn't run in the instance's VPC. Thus, it's a dead-end in my eyes.
I'm therefore warming (no pun intended) to the warm-up script concept.
With App Engine, the warm-up script is called prior to traffic being served, and on the host which would already have access via the VPC. The warmup script is meant to be used for opening up database connections to speed up connectivity, but assuming there are no outstanding migrations, it'd be doing exactly that - a very light-weight select statement.
Can anyone think of any issues with this approach?
Option 4 is also suitable (it's essentially the same thing). There may be a bit more protection required on these endpoints though - especially if a "down" migration script exists(!)
It's hard to answer you because it's an opinion based question!
Here my thoughts about your propositions
It's the best solution for me. Of course you have to take care to only add field and not to delete or remove existing schema field. Like this, you can update your schema during the Build phase, then deploy. The new deployment will take the new schema and the obsolete field will no longer be used. On the next schema update, you will be able to delete these obsolete field and clean your schema.
This solution will decrease your cold start performance. It's not a suitable solution
Same remark as before, in addition to be sticky to App Engine infrastructure and way of working.
No real advantage compare to the solution 1.
About security, Cloud Build will be able to work with worker pool soon. Still in alpha but I expect in the next month an alpha release of it.
Does anyone know why an ECS Fargate task would fail with this error?
Timeout waiting for network interface provisioning to complete. I am running an ECS Fargate task using step functions. The IAM role for step function have access to the task def.The state machine code also looks good. The same step function worked fine before but i ran into this error just now. Want to know why this would happen? is it occasional?
According to AWS support, intermittent failures of this nature are to be expected (with relatively low probability).
The recommendation was to set retryAttempts > 1 to handle these situations.
This can happen if there are problems within AWS. You can view the Network Interfaces page on the EC2 console and you may see errors loading, which is an indication of API problems within EC2. You can also check status.aws.amazon.com to look for errors. Note that AWS can be slow to acknowledge problems there, so you may experience the errors before they update the status page!
AWS support has a detailed post on resolving network interface provision errors for ECS on Fargate. Here's an excerpt from the same
If the Fargate service tries to attach an elastic network interface to the underlying infrastructure that the task is meant to run on, then you can receive the following error message: "Timeout waiting for network interface provisioning to complete."
Fargate faces intermittent API issues usually while spinning up in Step functions and AWS Batch jobs. And as recommended in another answer you can update the MaxAttempts for retry in the definition.
"Retry": [
{
"MaxAttempts": 3,
}
]
Additionally, reattempts can be automated with an exponential backoff and retry logic in AWS Step Functions.
I was hitting the same issue until I switched over to fargate platform 1.4.0
It looks like there were some changes made to the networking side of things.
https://aws.amazon.com/blogs/containers/aws-fargate-launches-platform-version-1-4/
The default version is currently still set at 1.3.0 so maybe give that a try and see if it fixes it for you.
The Cloud Composer documentation explicitly states that:
Due to an issue with the Kubernetes Python client library, your Kubernetes pods should be designed to take no more than an hour to run.
However, it doesn't provide any more context than that, and I can't find a definitively relevant issue on the Kubernetes Python client project.
To test it, I ran a pod for two hours and saw no problems. What issue creates this restriction, and how does it manifest?
I'm not deeply familiar with either the Cloud Composer or Kubernetes Python client library ecosystems, but sorting the GitHub issue tracker by most comments shows this open item near the top of the list: https://github.com/kubernetes-client/python/issues/492
It sounds like there is a token expiration issue:
#yliaog this is an issue for us, as we are running kubernetes pods as
batch processes and tracking the state of the pods with a static
client. Once the client object is initialized, it does no refresh, and
therefore any job that takes longer than 60 minutes will fail. Looking
through python-base, it seems like we could make a wrapper class that
generates a new client (or refreshes the config) every n minutes, or
checks status prior to every call (as #mvle suggested). The best fix
would be in swagger-codegen, but a temporary solution would probably
be very useful for a lot of people.
- #flylo, https://github.com/kubernetes-client/python/issues/492#issuecomment-376581140
https://issues.apache.org/jira/browse/AIRFLOW-3253 is the reason (and hopefully, my fix will be merged soon). As the others suggested, this affects anyone using the Kubernetes Python client with GCP auth. If you are authenticating with a Kubernetes service account, you should see no problem.
If you are authenticating via a GCP service account with gcloud (e.g. using the GKEPodOperator), you will generally see this problem with jobs that take longer than an hour because the auth token expires after an hour.
There are more insights here too.
Currently, long-running jobs on GKE always eventually fail with a 404 error (https://bitbucket.org/snakemake/snakemake/issues/932/long-running-jobs-on-kubernetes-fail). We believe that the problem is in the Kubernetes client, as we determined that although _refresh_gcp_token is being called when the token is expired, the next API call still fails with a 404 error.
You can see here that Snakemake uses the kubernetes python client.
I'm comparing salt-cloud and terraform as tools to manage our infrastructure at GCE. We use salt stack to manage VM configurations, so I would naturally prefer to use salt-cloud as an integral part of the stack and phase out terraform as a legacy thing.
However my use case is critical on VM deployment time because we offer PaaS solution with VMs deployed on customer request, so need to deliver ready VMs on a click of a button within seconds.
And what puzzles me is why salt-cloud takes so long to deploy basic machines.
I have created neck-to-neck simple test with deploying three VMs based on default CentOS7 image using both terraform and salt-cloud (both in parallel mode). And the time difference is stunning - where terraform needs around 30 seconds to deploy requested machines (which is similar to time needed to deploy through GCE GUI), salt-cloud takes around 220 seconds to deploy exactly same machines under same account in the same zone. Especially strange is that first 130 seconds salt-cloud does not start deploying and does seemingly nothing at all, and only after around 130 seconds pass it shows message deploying VMs and those VMs appear in GUI as in deployment.
Is there something obvious that I'm missing about salt-cloud that makes it so slow? Can it be sped up somehow?
I would prefer to user full salt stack, but with current speed issues it has I cannot really afford that.
Note that this answer is a speculation based on what I understood about terraform and salt-cloud, I haven't verified with an experiment!
I think the reason is that Terraform keeps state of the previous run (either locally or remotely), while salt-cloud doesn't keep state and so queries the cloud before actually provisioning anything.
These two approaches (keeping state or querying before doing something) are needed, since both tools are idempotent (you can run them multiple times safely).
For example, I think that if you remove the state file of Terraform and re-run it, it will assume there is nothing in the cloud and will actually instantiate a duplicate. This is not to imply that terraform does it wrong, it is to show that state is important and Terraform docs say clearly that when operating in a team the state should be saved remotely, exactly to avoid this kind of problem.
Following my line of though, this should also mean that if you either run salt-cloud in verbose debug mode or look at the network traffic generated by it, in the first 130 secs you mention (before it says "deploying VMs"), you should see queries from salt-cloud to the cloud provider to dynamically construct the state.
Last point, the fact that salt-cloud doesn't store the state of a previous run doesn't mean that it is automatically safe to use in a team environment. It is safe to use as long as no two team members run it at the same time. On the other hand, terraform with remote state on Consul allows for example to lock, so that team concurrent usage will always be safe.