How to wait for full cloud-initialization before VM is marked as running - azure-devops

I am currently configuring a virtual machine to work as an agent within Azure (with Ubuntu as the image). The additional configuration runs through a cloud-init file.
Among other things, it contains the 'fix' below within bootcmd and multiple steps within runcmd.
However, the machine already shows the state running in the Azure portal while it is still in the cloud configuration phase (cloud_config_modules). As a result, pipelines see the machine as ready for use before everything is installed and configured, and they break.
I tried a couple of things that did not have the desired effect, after which I stumbled on an article/bug report that proposed a solution.
The proposed solution worked, but when I switched to a RHEL image it stopped working.
I noticed this image does not use walinuxagent, as the solution states, but waagent, so I tried replacing that as in the example below, without any success.
bootcmd:
- mkdir -p /etc/systemd/system/waagent.service.d
- echo "[Unit]\nAfter=cloud-final.service" > /etc/systemd/system/waagent.service.d/override.conf
- sed "s/After=multi-user.target//g" /lib/systemd/system/cloud-final.service > /etc/systemd/system/cloud-final.service
- systemctl daemon-reload
After this, I also tried moving the runcmd steps into bootcmd. This resulted in a boot that took ages and eventually froze.
Since I am not that familiar with RHEL or Linux in general, I wanted to ask whether anyone has suggestions for what else I could try.
(Is there some other configuration that makes waagent wait for cloud-final.service?)

However the machine already had the state running, while still running the cloud configuration phase (cloud_config_modules).
Could you please be more specific? Where did you read the machine state?
The reason I ask is that cloud-init status will report status: running until cloud-init is done running, at which point it will report status: done.
What is the purpose of waiting until cloud-init is done? I'm not sure exactly what you are expecting to happen, but here are a couple of things that might help.
If you want to execute a script "at the end" of cloud-init initialization, you could put the script directly in runcmd, and if you want to wait for cloud-init in an external script you could do cloud-init status --wait, which will print a visual indicator and eventually return once cloud-init is complete.
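For the external-script case, a minimal Python sketch of blocking on cloud-init status --wait might look like the following. The function name wait_for_cloud_init and the injectable run parameter are illustrative assumptions, and this sketch treats only exit code 0 as success.

```python
import subprocess


def wait_for_cloud_init(run=subprocess.run):
    """Block until cloud-init finishes; return True if it exited with code 0."""
    # `cloud-init status --wait` returns only once cloud-init is no longer running.
    result = run(
        ["cloud-init", "status", "--wait"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

A pipeline step could run something like this on the agent before scheduling work on it. It does not change the state shown in the Azure portal.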

On reasonably recent Azure Linux VM images, cloud-init rather than WALinuxAgent acts as the VM provisioner. The VM is marked as provisioned by the Azure cloud-init datasource module very early during cloud-init processing (source), before any of the cloud-init modules that can be configured through user data have run. WALinuxAgent is only responsible for provisioning Azure VM extensions. It does not appear to be possible to delay sending the 'VM ready' signal to Azure without modifying the VM image and patching the source code of the cloud-init Azure datasource.

Related

Running a Node.js server in Azure Release Pipeline

I have a Node.js web server which, as part of a CD process, I want to deploy to a staging server using Azure Release Pipeline. The problem is, that if I just run a Powershell script:
# Run-Server.ps1
node my-server.js
The Pipeline will hold since the node process blocks the Powershell session.
What I want is to be able to launch the service, and then in the next deployment just kill the node process and run it again with the new code.
So I figured I'll use Start-Process. If I run it locally:
> Start-Process node -ArgumentList ./server.js
I can now exit the Powershell session and the server will continue running. So I thought I can implement it the same way in my Release Pipeline.
But it turns out that once the Release Pipeline ends running, the server is no longer available - the node process is gone.
Can you help me figure out why that is? Is there another way of achieving this? I suppose it's a pretty common use case, so there must be best practices out there for how this should be done.
Another way to achieve this is to use a full-blown web server to host and manage the node process. For example, on Windows you could use IIS with the iisnode module. This is more reliable and gives you a few other benefits:
process management (automatic start, restart on failure, etc.)
security - you can configure the user that node process will run as
scalability on multi-core CPUs
Then the process of app deployment would be just copying files to the right directory - the web server should pick up the change automatically.
By default, a pipeline job cleans up all of the child processes it spawned when it exits. This is what kills your node server.
Set the Process.Clean variable to false to override the default behavior.

MDT step by step deployment capture not generating wim

New to MDT.
So I am following through the MS step by step guides:
https://learn.microsoft.com/en-us/windows/deployment/windows-10-poc
https://learn.microsoft.com/en-us/windows/deployment/windows-10-poc-mdt
I am at step 28 of the second guide:
Deploy Windows 10 in a test lab using Microsoft Deployment Toolkit
There, the deployment wizard has been launched in a VM on the host system, and I have watched the process continue for an hour. It finally finishes, but it does not create the .wim on the server share as expected and as referred to in the bootstrap.ini:
Bootstrap.ini
[Settings]
Priority=Default
[Default]
DeployRoot=\\SRV1\MDTBuildLab$
UserDomain=CONTOSO
UserID=MDT_BA
UserPassword=pass#word1
SkipBDDWelcome=YES
I have verified that the share "DeployRoot" exists and can be connected to using the provided credentials and that the share has the correct permissions to create/delete files.
Not sure what I'm missing. My expectation was that a .wim would be created in \\SRV1\MDTBuildLab$\Captures, but there is nothing in that folder.
Just before stopping, the deployment wizard reboots several times in quick succession, which doesn't look correct to me, but as I have never witnessed a successful capture I can't say for sure that this isn't what's supposed to happen.
I'm not even sure where I can view any log files to figure out why it fails.
Any assistance appreciated!
Further Info:
Activated monitoring. It gets to step 86 of 93. The last thing I see is "Applying WinPE (BD)" or something similar, and then it restarts. Several quick reboots follow (the loading bar appears for a second or two and then it reboots again), which I think are failing, and finally it gives up. The process never completes!
When I attempt to mount the client REFW10X64-001.vhdx to check the logs I am greeted with this message
The disk image isn't initialized, contains partitions that aren't recognizable, or contains volumes that haven't been assigned drive letters. Please use the Disk Management snap-in to make sure that the disk, partitions, and volumes are in a usable state.
So it looks like the last step totally screwed the disk! Which would explain the last several boots failing to load anything.
So no errors, no warnings, no logs, no finish, and no wim generated.
How do I troubleshoot this?
I know this post is old, but the normal behavior would be as follows:
Using the boot image, you boot into WinPE
The task sequence is started and the OS gets applied to the disk
Reboot
Boot into full Windows where the task sequence also continues
Under full Windows, one of the last steps is that WinPE gets applied again
Reboot
Computer boots automatically into WinPE
The wim file gets created (WinPE is running on the RAM disk and the regular C: drive (and any additional drives) is being mirrored into the wim file)
Computer performs the FINISHACTION.
We would need at least BDD.log and smsts.log to further troubleshoot. My guess is that WinPE was not applied correctly.

Cloud Init on Google Compute Engine (GCE) with Centos7/8 doesn't run properly on First boot, but fine after any other reboot

We have a CentOS 8 image (tried 7 as well) and I am adding some config so that it acts as a router.
The issue is that, for some reason, the first time the instance is created, cloud-init doesn't read the network config we pass using the user-data metadata:
#cloud-config
network
version: 1
etc...
We configure eth1 to use dhcp and get cloud-init to manage it, as well as add a route.
Works perfectly every time after the initial boot up (and stop>start again).
To me it feels like cloud-init is not aware of the config, but when I go into the machine and run cloud-init query userdata I can see the data. Even then, running cloud-init clean && cloud-init init doesn't do anything. The same commands work fine if the machine was rebooted.
Try running cloud-init analyze show both times (instance creation and consecutive reboot) and check for any differences.
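One way to do that comparison, assuming the output of each cloud-init analyze show run has been saved to text, is a plain line diff. The function name diff_analyze_output and the labels first-boot and after-reboot are made up for this sketch.

```python
import difflib


def diff_analyze_output(first_boot, later_boot):
    """Return unified-diff lines between two saved `cloud-init analyze show` outputs."""
    return list(
        difflib.unified_diff(
            first_boot.splitlines(),
            later_boot.splitlines(),
            fromfile="first-boot",
            tofile="after-reboot",
            lineterm="",
        )
    )
```

Timestamps and durations will differ between any two boots, so the lines worth looking at are stages or modules that appear in only one of the two outputs.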
Sadly, cloud providers somewhat abuse the capabilities of cloud-init, though not entirely without reason. cloud-init allows for customization of vendor/user provided configuration (who overrides what), changing the order of boot stages, etc.
This is done mostly because different cloud providers need network/provisioning/storage at different times. For example, AWS attaches storage after network (EBS only), Azure provides VM only after storage is attached and it's natively provided as NTFS (they really format the drive if you need anything else), etc.
These shenanigans, while understandable (datacenter infrastructure defines user availability) make cloud-init's documentation merely a suggestion for the user to investigate.
From my experience, Azure is the closest to original implementation. Possibly they haven't learned yet how to utilize the potential in their favor.
My general suggestion for any instance customization (almost always works) is to write a script with write_files and execute it with runcmd, because runcmd runs at the final stage and provides the best override opportunity (bootcmd, by contrast, runs early in the boot and on every boot). Edit hosts, change firewall rules - most of the stuff will not require reboot.
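As a rough illustration of the write_files plus runcmd approach, the user data could be generated like this. The function name build_cloud_config is hypothetical, and the sketch emits JSON on the assumption that the YAML parser reading the user data accepts it.

```python
import json


def build_cloud_config(script_path, script_body):
    """Render user data that writes a script with write_files and runs it with runcmd."""
    config = {
        "write_files": [
            {"path": script_path, "permissions": "0755", "content": script_body}
        ],
        # A list entry in runcmd is executed directly rather than through a shell.
        "runcmd": [[script_path]],
    }
    # Assumption: JSON is accepted wherever YAML user data is expected.
    return "#cloud-config\n" + json.dumps(config, indent=2) + "\n"
```

The returned string would then be passed as the instance's user data in whatever way the provider expects.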

Best way to deploy long-running high-compute app to GCP

I have a python app that builds a dataset for a machine learning task on GCP.
Currently I have to start an instance of a VM that we have, and then SSH in, and run the app, which will complete in 2-24 hours depending on the size of the dataset requested.
Once the dataset is complete the VM needs to be shutdown so we don't incur additional charges.
I am looking to streamline this process as much as possible, so that we have a "1 click" or "1 command" solution, but I'm not sure the best way to go about it.
From what I've read about so far it seems like containers might be a good way to go, but I'm inexperienced with docker.
Can I setup a container that will pip install the latest app from our private GitHub and execute the dataset build before shutting down? How would I pass information to the container such as where to get the config file etc? It's conceivable that we will have multiple datasets being generated at the same time based on different config files.
Is there a better gcloud feature that suits our purpose more effectively than containers?
I'm struggling to get information regarding these basic questions, it seems like container tutorials are dominated by web apps.
It would be useful to have a batch-like container service that runs a container until its process completes. I'm unsure whether such a service exists. I'm most familiar with Google Cloud Platform and this provides a wealth of compute and container services. However -- to your point -- these predominantly scale by (HTTP) requests.
One possibility may be Cloud Run and to trigger jobs using Cloud Pub/Sub. I see there's async capabilities too and this may be interesting (I've not explored).
Another runtime for you to consider is Kubernetes itself. While Kubernetes requires some overhead in having Google, AWS or Azure manage a cluster for you (I strongly recommend you don't run Kubernetes yourself) and some inertia in the capacity of the cluster's nodes vs. the needs of your jobs, as you scale the number of jobs, you will smooth these needs. A big advantage with Kubernetes is that it will scale (nodes|pods) as you need them. You tell Kubernetes to run X container jobs, it does it (and cleans-up) without much additional management on your part.
I'm biased and approach the container vs image question mostly from a perspective of defaulting to container-first. In this case, you'd receive several benefits from containerizing your solution:
reproducible: the same image is more likely to produce the same results
deployability: container run vs. manage OS, app stack, test for consistency etc.
maintainable: smaller image representing your app, less work to maintain it
One (beneficial!?) workflow change if you choose to use containers is that you will need to build your images before using them. Something like Knative combines these steps but, I'd stick with doing-this-yourself initially. A common solution is to trigger builds (Docker, GitHub Actions, Cloud Build) from your source code repo. Commonly you would run tests against the images that are built but you may also run your machine-learning tasks this way too.
Your containers would contain only your code. When you build your container images, you would pip install, perhaps pip install --requirement requirements.txt, to pull the appropriate packages. Your data (models?) are better kept separate from your code when this makes sense. When your runtime platform runs containers for you, you provide configuration information (environment variables and|or flags) to the container.
The use of a startup script seems to better fit the bill compared to containers. The instance always executes startup scripts as root, thus you can do anything you like, as the command will be executed as root.
A startup script will perform automated tasks every time your instance boots up. Startup scripts can perform many actions, such as installing software, performing updates, turning on services, and any other tasks defined in the script.
Keep in mind that a startup script cannot stop an instance but you can stop an instance through the guest operating system.
This would be the ideal solution for the question you posed. This would require you to make a small change in your Python app where the Operating system shuts off when the dataset is complete.
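The small change to the Python app could look roughly like this. The name build_then_shutdown and the injectable run parameter are assumptions for the sketch, and the shutdown command needs root, which holds when the app is launched from a startup script.

```python
import subprocess


def build_then_shutdown(build, run=subprocess.run):
    """Run the dataset build, then ask the guest OS to power off."""
    try:
        build()
    finally:
        # Power off even if the build raised, so the VM does not keep running.
        run(["shutdown", "-h", "now"], check=False)
```

Shutting down in the finally branch means a failed build does not leave the VM running and billing, at the cost of making a failure harder to inspect afterwards.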
Q1) Can I setup a container that will pip install the latest app from our private GitHub and execute the dataset build before shutting down?
A1) Medium has a great article on installing a package from a private git repo inside a container. You can execute the dataset build before shutting down.
Q2) How would I pass information to the container such as where to get the config file etc?
A2) You can use ENV to set an environment variable. These will be available within the container.
You may consider looking into the Docker documentation for more information about containers.
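On the app side, reading such an environment variable might look like the sketch below. DATASET_CONFIG is a hypothetical variable name and config.yaml a hypothetical default, neither of which comes from the question.

```python
import os


def config_path_from_env(env=os.environ, default="config.yaml"):
    """Return the dataset config location passed to the container via the environment."""
    # DATASET_CONFIG is a hypothetical variable name chosen for this sketch.
    return env.get("DATASET_CONFIG", default)
```

Each concurrent dataset build would then be started with a different value for the variable, for example through the -e option of docker run.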

When is cloud-init run and how does it find its data?

I'm currently dealing with CoreOS, and so far I think I got the overall idea and concept. One thing that I did not yet get is execution of cloud-init.
I understand that cloud-init is a process that does some configuration for CoreOS. What I do not yet understand is…
When does CoreOS run cloud-init? On first boot? On each boot? …?
How does cloud-init know where to find its configuration data? I've seen that there is config-drive and that totally makes sense, but is this the only way? What exactly is the role of the user-data file? …?
CoreOS runs cloud-init a few times during the boot process. Right now this happens at each boot, but that functionality may change in the future.
The first pass is the OEM cloud-init, which is baked into the image to set up networking and other features required for that provider. This is done for EC2, Rackspace, Google Compute Engine, etc., since they all have different requirements. You can see these files on GitHub.
The second pass is the user-data pass, which is handled differently per provider. For example, EC2 allows the user to input free-form text in their UI, which is stored in their metadata service. The EC2 OEM has a unit that reads this metadata and passes it to the second cloud-init run. On Rackspace/OpenStack, config-drive is used to mount a read-only filesystem that contains the user-data. The Rackspace and OpenStack OEMs know to mount and look for the user-data file at that location.
The latest version of CoreOS also has a flag to fetch a remote file to be evaluated for use with PXE booting.
The CoreOS distribution docs have a few more details as well.