How does chef-solo --daemonize work, and what's the point? - chef-solo

I understand the purpose of chef-client --daemonize, because it's a service that Chef Server can connect to and interact with.
But chef-solo is a command that simply brings the current system inline with specifications and then is done.
So what is the point of chef-solo --daemonize, and what specifically does it do? For example, does it autodetect when the system falls out of line with spec? Does it do so via polling or tapping into filesystem events? How does it behave if you update the cookbooks and node files it depends on when it's already running?

You might also ask why chef-solo supports the --splay and --interval arguments.
Don't forget that chef-server is not the only source of data.
Configuration values can rely on a bunch of other sources (APIs, OHAI, DNS...).
The most classic one is OHAI - think of a cookbook that configures memcached. You would probably want to keep X amount of RAM for the operating system and the rest goes to memcached.
Available RAM can be changed when running inside a VM, even without rebooting it.
That might be a good reason to run chef-solo as a daemon with frequent chef-runs, like you're used to when using chef-client with a chef-server.
As for your other questions:
Q: Does it autodetect when the system falls out of line with spec?
Does it do so via polling or tapping into filesystem events?
A: Chef doesn't respond to changes. Instead, it runs frequently and makes sure the current state is in sync with the desired state - which can be based on chef-server inventory, API calls, OHAI attributes, etc. The desired state is constructed from scratch every time you run Chef, at the compile stage when all the resources are generated. Read about it here
Q: How does it behave if you update the cookbooks and node files it depends on when it's already running?
A: Usually when running chef-solo, one uses the --json flag to specify a JSON file with node attributes and a run-list. When running in --daemonize mode with chef-solo, the node attributes are read only for the first run. For the rest of the runs, it's as if you were running it without a --json flag. I couldn't figure out a way to make it work as if you were running it with --json all over again, however, you can use the --override-runlist option to at least make the runlist stick.
Note that the attributes you're specifying in your JSON won't make it past the first run. This is possibly a bug.

Related

Cloud Init on Google Compute Engine (GCE) with Centos7/8 doesn't run properly on First boot, but fine after any other reboot

We have a CentOS 8 (tried 7 as well) image and I am adding some config to act as a router.
The issue is, for some reason, the first time the instance is created, cloud init doesn't read the network config we pass using the user-data metadata
#cloud-config
network
version: 1
etc...
We configure eth1 to use dhcp and get cloud-init to manage it, as well as add a route.
Works perfectly every time after the initial boot up (and stop>start again).
To me it feels like cloud-init is not aware of the config, but when I go in the machine and do cloud-init query userdata i can see the data, and even then if I do cloud-init clean && cloud-init init it doesn't do anything. The same commands work fine if the machine was rebooted
Try running cloud-init analyze show both times (instance creation and consecutive reboot) and check for any differences.
Sadly, cloud providers kind-of abuse the abilities of cloud-init, not to a complete fault. cloud-init allows for customization of vendor/user provided configuration (who overrides what), changing the order of boot stages, etc.
This is done mostly because different cloud providers need network/provisioning/storage at different times. For example, AWS attaches storage after network (EBS only), Azure provides VM only after storage is attached and it's natively provided as NTFS (they really format the drive if you need anything else), etc.
These shenanigans, while understandable (datacenter infrastructure defines user availability) make cloud-init's documentation merely a suggestion for the user to investigate.
From my experience, Azure is the closest to original implementation. Possibly they haven't learned yet how to utilize the potential in their favor.
My general suggestion for any instance customization (almost always works) is to write a script with write_files and execute them with bootcmd/runcmd, because these run at the final stage, and provide for best override opportunity. Edit hosts, change firewall rules - most of the stuff will not require reboot.

Best way to deploy long-running high-compute app to GCP

I have a python app that builds a dataset for a machine learning task on GCP.
Currently I have to start an instance of a VM that we have, and then SSH in, and run the app, which will complete in 2-24 hours depending on the size of the dataset requested.
Once the dataset is complete the VM needs to be shutdown so we don't incur additional charges.
I am looking to streamline this process as much as possible, so that we have a "1 click" or "1 command" solution, but I'm not sure the best way to go about it.
From what I've read about so far it seems like containers might be a good way to go, but I'm inexperienced with docker.
Can I setup a container that will pip install the latest app from our private GitHub and execute the dataset build before shutting down? How would I pass information to the container such as where to get the config file etc? It's conceivable that we will have multiple datasets being generated at the same time based on different config files.
Is there a better gcloud feature that suits our purpose more effectively than containers?
I'm struggling to get information regarding these basic questions, it seems like container tutorials are dominated by web apps.
It would be useful to have a batch-like container service that runs a container until its process completes. I'm unsure whether such a service exists. I'm most familiar with Google Cloud Platform and this provides a wealth of compute and container services. However -- to your point -- these predominantly scale by (HTTP) requests.
One possibility may be Cloud Run and to trigger jobs using Cloud Pub/Sub. I see there's async capabilities too and this may be interesting (I've not explored).
Another runtime for you to consider is Kubernetes itself. While Kubernetes requires some overhead in having Google, AWS or Azure manage a cluster for you (I strongly recommend you don't run Kubernetes yourself) and some inertia in the capacity of the cluster's nodes vs. the needs of your jobs, as you scale the number of jobs, you will smooth these needs. A big advantage with Kubernetes is that it will scale (nodes|pods) as you need them. You tell Kubernetes to run X container jobs, it does it (and cleans-up) without much additional management on your part.
I'm biased and approach the container vs image question mostly from a perspective of defaulting to container-first. In this case, you'd receive several benefits from containerizing your solution:
reproducible: the same image is more probable to produce the same results
deployability: container run vs. manage OS, app stack, test for consistency etc.
maintainable: smaller image representing your app, less work to maintain it
One (beneficial!?) workflow change if you choose to use containers is that you will need to build your images before using them. Something like Knative combines these steps but, I'd stick with doing-this-yourself initially. A common solution is to trigger builds (Docker, GitHub Actions, Cloud Build) from your source code repo. Commonly you would run tests against the images that are built but you may also run your machine-learning tasks this way too.
Your containers would container only your code. When you build your container images, you would pip install, perhaps pip install --requirement requirements.txt to pull the appropriate packages. Your data (models?) are better kept separate from your code when this makes sense. When your runtime platform runs containers for you, you provide configuration information (environment variables and|or flags) to the container.
The use of a startup script seems to better fit the bill compared to containers. The instance always executes startup scripts as root, thus you can do anything you like, as the command will be executed as root.
A startup script will perform automated tasks every time your instance boots up. Startup scripts can perform many actions, such as installing software, performing updates, turning on services, and any other tasks defined in the script.
Keep in mind that a startup script cannot stop an instance but you can stop an instance through the guest operating system.
This would be the ideal solution for the question you posed. This would require you to make a small change in your Python app where the Operating system shuts off when the dataset is complete.
Q1) Can I setup a container that will pip install the latest app from our private GitHub and execute the dataset build before shutting down?
A1) Medium has a great article on installing a package from a private git repo inside a container. You can execute the dataset build before shutting down.
Q2) How would I pass information to the container such as where to get the config file etc?
A2) You can use ENV to set an environment variable. These will be available within the container.
You may consider looking into Docker for more information about container.

Can I configure icecream (icecc) to do zero local jobs

I'm trying to build a project on a rather underpowered system (intel compute stick with 1GB of RAM). Some of the compilation steps run out of memory. I've configured icecc so that it can send some jobs to a more powerful machine, but it seems that icecc will always do at least one job on the local machine.
I've tried setting ICECC_MAX_JOBS="0" in /etc/icecc/icecc.conf (and restarting iceccd), but the comments in this file say:
# Note: a value of "0" is actually interpreted as "1", however it
# also sets ICECC_ALLOW_REMOTE="no".
I also tried disabling the icecc daemon on the compute stick by running /etc/init.d/icecc stop. However, it seems that icecc is still putting one job on the local machine (perhaps if the daemon is off it's putting all jobs on the local machine?).
The project is makefile based and it appears that I'm stuck on a bottleneck step where calling make with -j > 1 still only issues one job, and this compilation is expiring the system memory.
The only work around I can think of is to actually compile on a different system and then ship the binaries back over but I expect to enter a tweak/build/evaluate cycle on this platform so I'd like to be able to work from the compute stick directly.
Both systems are running ubuntu 14.04 if that helps.
I believe it is not supported since if there are network issues, icecc resorts to compiling on the host machine itself. Best solution would be to compile on the remote machine and copy back the resulting binary.
Have you tried setting ICECC_TEST_REMOTEBUILD in client's terminal (where you run make)?
export ICECC_TEST_REMOTEBUILD=1
In my tests this always forces all sources to be compiled remotely.
Just remember that linking is always done on local machine.

When is cloud-init run and how does it find its data?

I'm currently dealing with CoreOS, and so far I think I got the overall idea and concept. One thing that I did not yet get is execution of cloud-init.
I understand that cloud-init is a process that does some configuration for CoreOS. What I do not yet understand is…
When does CoreOS run cloud-init? On first boot? On each boot? …?
How does cloud-init know where to find its configuration data? I've seen that there is config-drive and that totally makes sense, but is this the only way? What exactly is the role of the user-data file? …?
CoreOS runs cloudinit a few times during the boot process. Right now this happens at each boot, but that functionality may change in the future.
The first pass is the OEM cloud-init, which is baked into the image to set up networking and other features required for that provider. This is done for EC2, Rackspace, Google Compute Engine, etc since they all have different requirements. You can see these files on Github.
The second pass is the user-data pass, which is handled differently per provider. For example, EC2 allows the user to input free-form text in their UI, which is stored in their metadata service. The EC2 OEM has a unit that reads this metadata and passes it to the second cloud-init run. On Rackspace/Openstack, config-drive is used to mount a read-only filesystem that contains the user-data. The Rackspace and Openstack OEMs know to mount and look for the user-data file at that location.
The latest version of CoreOS also has a flag to fetch a remote file to be evaluated for use with PXE booting.
The CoreOS distribution docs have a few more details as well.

Good practices of Websphere MQ production deployment

I'm about to prepare a deployment specification for the Websphere MQ production environment. As always I hate reinventing the wheel hence the question:
Is there an article, specififaction of best practices when it comes to deploying and maintaining the Webshpere MQ production environment?
Here are more specific doubts of mine:
Configuration versioning (MQSC, dmpmqcfg, etc).
Deploying new objects (MQSC or manual instructions?)
Deployment automation (maybe basing on the diff of dmpmqcfg?).
Deploying and versioning configuration alterations.
Currently I am simply creating MQ objects manually and versioning the output of dmpmqcfg. However, in a while there are going to be too many deployments to handle it like this.
That's an extremely broad question so I'll try to respond before a moderator deletes it. :-)
The answer depends on many things such as whether MQ clusters are in use, the approaches to high availability and disaster recovery, the security requirements, whether the QMgrs are configured as dedicated or shared infrastructure, etc. However, there are a few patterns that I follow in almost all cases, including non-Production. This is because things like monitoring and security tend to get dropped at deployment time if not tested in Dev and don't work as expected in Prod.
I use a script to create my QMgrs in Production to insure that basics like generating the X.509 certificate (or CSR) is always done according to standards, that any exits or exit parm files are present, that certain SupportPac executables (like q) are present in /opt/mqm/bin, circular queues, etc. It also checks for negative factors such as GSKit not being installed.
I have a baseline script that is run against all QMgrs. This script sets up the DLQ, any queues for monitoring agents, enables events as required, sets up system services, trigger monitors, listeners, etc. The exception is B2B gateway QMgrs which are handled in a class all their own and have very specific configurations not used on the internal network.
cluster.
I have several classes of QMgr with specific configuration requirements. These include cluster repositories (where primary and secondary are distinct sub-types), service-provider QMgrs, and service consumer QMgrs. These all have secondary scripts run against them.
I have scripts per-cluster to join or suspend a QMgr in cases where clustering is used (which for me is almost 100% since v7.1).
These set up a QMgr's infrastructure. Then I maintain scripts for each application. So for example, if there's a Payroll app, I'll have queues and possibly topics with names containing a PAY node such as PAY.EMPLOYEE.UPDT.REQ.V032.PRD. Corresponding to that will be a single script for all PAY.** queues. Used to be one for setmqaut commands too, but these are now in the same script as the objects. I only ever have one version of the script and keep a history of changes in the script. This way when I need to recreate a QMgr, I just run all the scripts for it. Similarly, if I need to deploy the PAY objects on another QMgr, I just copy the script to that server.
When defining objects for clusters, I always do a DEFINE NOREPLACE that contains all the run-time attributes such as whether the queue is enabled in the cluster. The queue is always defined as disabled in the cluster and for triggering but because I use NOREPLACE re-running the script doesn't change whatever state it has in, say, a month. Those things that are configuration and not run-time, such as the description, are handled in an ALTER immediately after the DEFINE and these are updated each time the script is run. There's an article on this here.
Finally, the scripts I use are of the self-executing, self-documenting variety. For example, many people put all the MQSC commands into a script then do something like:
runmqsc < payroll.mqsc > payroll.out
TONS of problems here. The main one is that it relies on the operator to know a lot and execute the script right all the time. For example, suppose (s)he forgets to capture the output? Or overwrites a previous output? Or doesn't get STDERR because (s)he needs to do the 2>&1 at the end and doesn't know redirection that well?
So my scripts are all written in ksh handle all the capturing of output, complete with time and date stamping and STDERR, can freely mix MQSC with OS commands, etc. All you do is go to the scripts directory for that QMgr and . ./*ksh to build/rebuild a QMgr.
I do of course also take regular configuration dumps, but these are more for running queries and reports like "how many QMgrs have this channel defined and where are they?" kind of thing.
Also, when taking backups there is almost NEVER a good reason to back up a QMgr at a point in time. However, if it is required be sure to stop the QMgr first. Also, think long and hard about capturing certificates in a backup. Many people are good about locking the certificate directory so only mqm can read it but often the backups are unprotected. As long as you aren't trying to restore on top of Production, many shops let you restore the Production /var/mqm/* files to your own sandbox. If the QMgr's KDB files are included, you just lost them. An alternative is to put the certificates in /etc or some other directory that is protected but not backed up with the QMgr's directories.