How to find the auto-created service connection when deploying to AKS - azure-devops

During a pipeline run, under deployment job, providing a deployment environment eliminates the need of providing service connection manually. I'd guess, it's either creating a new SC at this time or it would have created SC at the time of environment creation and using the same.
Either ways, is there a way to find out which Service connection is being used from the logs of pipeline run or from anywhere else?
In our environment, I see a lot of service connection for one environment and a cleanup is necessary to get things in place.
I tried giving SC manually along with environment and it works as expected. So, going forward, I can use this method. But for cleanup, I'd still like to know which one gets used when not specified! (none of the auto-created SCs show any execution history, but I know the deployment has happened multiple times)

As a Kubernetes resource in an environment is referencing Kubernetes service connection, you can use this API to list the serviceEndpointId of a Kubernetes resource, which is also the resourceId of the referenced service connection.
GET https://dev.azure.com/{organization}/{project}/_apis/distributedtask/environments/{environmentId}/providers/kubernetes/{resourceId}?api-version=7.0
Applied with the value of the serviceEndpointId from the response of the above API, we can proceed to use this API to get the referenced service connection details.
GET https://dev.azure.com/{organization}/{project}/_apis/serviceendpoint/endpoints/{endpointId}?api-version=7.0

Related

Container App Environment creation timing out

Where I work has just started migrating to the cloud. We've successfully deployed a number of resources using Terraform and Pipelines into Azure.
Where we are running into issues is deploying a Container App Environment, we have code that was working in a less locked down environment (setup for Proof of Concept), but are now having issues using that code in our go-forward.
When deploying, the Container App Environment spends 30mins attempting to create before it returns a context deadline exceeded error. Looking in Azure Portal, I can see the resource in "Waiting" provisioning state and I can also see the MC_ and AKS resources that get generated. It then fails around 4hrs later.
Any advice?
I am suspecting it's related to security on the Virtual Network that the subnets are sitting on, but I'm not seeing any logs on the deployment to confirm. The original subnets had a Network Security Group (NSG) assigned and I configured the rules that Microsoft provide before I added a couple of subnets without an NSG assigned and no luck.
My next step is to try provisioning it via the GUI and see if that works.
I managed to break our build in the "anything goes" environment.
The root cause is an incomplete configuration of the Virtual Network which has custom DNS entries. This has now been passed to our network architects to resolve. If I can get more details on the fix they apply I'll include that here for anyone else that runs into the issue.

How to create a cron job in a Kubernetes deployed app without duplicates?

I am trying to find a solution to run a cron job in a Kubernetes-deployed app without unwanted duplicates. Let me describe my scenario, to give you a little bit of context.
I want to schedule jobs that execute once at a specified date. More precisely: creating such a job can happen anytime and its execution date will be known only at that time. The job that needs to be done is always the same, but it needs parametrization.
My application is running inside a Kubernetes cluster, and I cannot assume that there always will be only one instance of it running at the any moment in time. Therefore, creating the said job will lead to multiple executions of it due to the fact that all of my application instances will spawn it. However, I want to guarantee that a job runs exactly once in the whole cluster.
I tried to find solutions for this problem and came up with the following ideas.
Create a local file and check if it is already there when starting a new job. If it is there, cancel the job.
Not possible in my case, since the duplicate jobs might run on other machines!
Utilize the Kubernetes CronJob API.
I cannot use this feature because I have to create cron jobs dynamically from inside my application. I cannot change the cluster configuration from a pod running inside that cluster. Maybe there is a way, but it seems to me there have to be a better solution than giving the application access to the cluster it is running in.
Would you please be as kind as to give me any directions at which I might find a solution?
I am using a managed Kubernetes Cluster on Digital Ocean (Client Version: v1.22.4, Server Version: v1.21.5).
After thinking about a solution for a rather long time I found it.
The solution is to take the scheduling of the jobs to a central place. It is as easy as building a job web service that exposes endpoints to create jobs. An instance of a backend creating a job at this service will also provide a callback endpoint in the request which the job web service will call at the execution date and time.
The endpoint in my case links back to the calling backend server which carries the logic to be executed. It would be rather tedious to make the job service execute the logic directly since there are a lot of dependencies involved in the job. I keep a separate database in my job service just to store information about whom to call and how. Addressing the startup after crash problem becomes trivial since there is only one instance of the job web service and it can just re-create the jobs normally after retrieving them from the database in case the service crashed.
Do not forget to take care of failing jobs. If your backends are not reachable for some reason to take the callback, there must be some reconciliation mechanism in place that will prevent this failure from staying unnoticed.
A little note I want to add: In case you also want to scale the job service horizontally you run into very similar problems again. However, if you think about what is the actual work to be done in that service, you realize that it is very lightweight. I am not sure if horizontal scaling is ever a requirement, since it is only doing requests at specified times and is not executing heavy work.

Service Fabric - How to repair a failing stateful application

I have a stateful service that configures state backups for the primary replica on RunAsync using an Azure storage account.
The other day someone inadvertently deleted the storage account being used for backups. On our next deployment, the services began throwing errors as they initialize due to this 404 error response.
I have noticed that during a deployment fabric apparently shuffles around the old version of the service spinning up new primaries as needed to free up the vm it is upgrading. If the old version of the code fails to instantiate by throwing an exception, the upgrade process will fail causing a rollback.
My problem is, once I create a new storage account, I am still left seemingly no way to bring the existing services back to healthy states. My existing services are using Storage account urls with AccountKeys that no longer exists in azure. Attempts to upgrade fail because the old service instances can’t instantiate due to now bad configuration.
Are there any ways to deal with this situation?
The simplest thing would be to use an unmonitored manual upgrade to force through the change that would point the service to the new storage account.
However, this puts a lot of management overhead on you, particularly if there are many other services, since you need to be careful to perform all safety and functionality checks manually so as not to regress anything.
The recommend solution is to use the ServiceTypeHealthPolicyMap described here to "mask out" the unhealthy service (since you expect it to be unhealthy during the upgrade). You may also need to adjust some of the other upgrade parameters depending on the exact situation.
A third recommendation, or maybe something to improve in the future, would be to make the upgrade to change the account information a configuration only upgrade. This would ensure that SF tries to change the config in-place without restarting the services (by default), which would prevent the existing services from failing over during the upgrade and encountering issues. This is demonstrated in this example.

How to set automatic rollbacks in CodeDeploy with CloudFormation?

I'm creating a Deployment Group in CodeDeploy with a CloudFormation template.
The Deployment Group is successfully created and the application is deployed perfectly fine.
The CF resource that I defined (Type: AWS::CodeDeploy::DeploymentGroup) has the "Deployment" property set. The thing is that I would like to configure automatic rollbacks for this deployment, but as per CF documentation for "AutoRollbackConfiguration" property: "Information about the automatic rollback configuration that is associated with the deployment group. If you specify this property, don't specify the Deployment property."
So my understanding is that if I specify "Deployment", I cannot set "AutoRollbackConfiguration"... Then how are you supposed to configure any rollback for the deployment? I don't see any other resource property that relates to rollbacks.
Should I create a second DeploymentGroup resource and bind it to the same instances that the original Deployment Group has? I'm not sure this is possible or makes sense but I ran out of options.
Thanks,
Nicolas
First i like to describe why you cannot specify both, deployment and rollback configuration:
Whenever you specify a deployment directly for the group, you already state which revision you like to deploy. This conflicts with the idea of CloudFormation of having resources managed by it without having a drift in the actual configuration of those resources.
I would recommend the following:
Use CloudFormation to deploy the 'underlying' infrastructure (the deployment group, application, roles, instances, etc.)
Create a CodePipline within this infrastructure template, which then includes a CodeDeploy deployment action (https://docs.aws.amazon.com/codepipeline/latest/userguide/action-reference-CodeDeploy.html, https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-codepipeline-pipeline-stages-actions-actiontypeid.html)
The pipeline can triggered whenever you have a new version inside you revision location
This approach clearly separates the underlying stuff, which is not changing dynamically and the actual application deployment, done using a proper pipeline.
Additionally in this way you can specify how you like to deploy (green/blue, canary) and how/when rollbacks should be handled. The status of your deployment also to be seen inside CodePipeline.
I didn't mention it but what you are suggesting about CodePipeline is exactly what I did.
In fact, I have one CloudFormation template that creates all the infrastructure and includes the DeploymentGroup. With this, the application is deployed for the first time to my EC2 instances.
Then I have another CF template for CI/CD purposes with a CodeDeploy stage/action that references the previous DeploymentGroup. Whenever I push some code to my repository, the Pipeline is triggered, code is built and new version successfully deployed to the instances.
However, I don't see how/where in any of the CF templates to handle/configure the rollback for the DeploymentGroup as you were saying. I think I get the idea of your explanation about the conflict CF might have in case of having a drift, but my impression is that in case of errors during the CF stack creation, CF rollback should just remove the DeploymentGroup you're trying to create. In other words, for me there's no CodeDeploy deployment rollback involved in that scenario, just removing the resource (DeploymentGroup) CF was trying to create.
One thing that really impresses me is that you can enable/disable automatic rollbacks for the DeploymentGroup through the AWS Console. Just edit and go to Advanced Configuration for the DeploymentGroup and you have a checkbox. I tried it and triggered the Pipeline again and worked perfectly. I made a faulty change to make the deployment fail in purpose, and then CodeDeploy automatically reverted back to the previous version of my application... completely expected behavior. Doesn't make much sense that this simple boolean/flag option is not available through CF.
Hope this makes sense and helps clarifying my current situation. Any extra help would be highly appreciated.
Thanks again

Why shouldn't you run Kubernetes pods for longer than an hour from Composer?

The Cloud Composer documentation explicitly states that:
Due to an issue with the Kubernetes Python client library, your Kubernetes pods should be designed to take no more than an hour to run.
However, it doesn't provide any more context than that, and I can't find a definitively relevant issue on the Kubernetes Python client project.
To test it, I ran a pod for two hours and saw no problems. What issue creates this restriction, and how does it manifest?
I'm not deeply familiar with either the Cloud Composer or Kubernetes Python client library ecosystems, but sorting the GitHub issue tracker by most comments shows this open item near the top of the list: https://github.com/kubernetes-client/python/issues/492
It sounds like there is a token expiration issue:
#yliaog this is an issue for us, as we are running kubernetes pods as
batch processes and tracking the state of the pods with a static
client. Once the client object is initialized, it does no refresh, and
therefore any job that takes longer than 60 minutes will fail. Looking
through python-base, it seems like we could make a wrapper class that
generates a new client (or refreshes the config) every n minutes, or
checks status prior to every call (as #mvle suggested). The best fix
would be in swagger-codegen, but a temporary solution would probably
be very useful for a lot of people.
- #flylo, https://github.com/kubernetes-client/python/issues/492#issuecomment-376581140
https://issues.apache.org/jira/browse/AIRFLOW-3253 is the reason (and hopefully, my fix will be merged soon). As the others suggested, this affects anyone using the Kubernetes Python client with GCP auth. If you are authenticating with a Kubernetes service account, you should see no problem.
If you are authenticating via a GCP service account with gcloud (e.g. using the GKEPodOperator), you will generally see this problem with jobs that take longer than an hour because the auth token expires after an hour.
There are more insights here too.
Currently, long-running jobs on GKE always eventually fail with a 404 error (https://bitbucket.org/snakemake/snakemake/issues/932/long-running-jobs-on-kubernetes-fail). We believe that the problem is in the Kubernetes client, as we determined that although _refresh_gcp_token is being called when the token is expired, the next API call still fails with a 404 error.
You can see here that Snakemake uses the kubernetes python client.