Run multiple instances of a process on a number of servers

I would like to run multiple instances of a randomized algorithm. For performance reasons, I'd like to distribute the tasks across several machines.
Typically, I run my program as follows:
./main < input.txt > output.txt
and it takes about 30 minutes to return a solution.
I would like to run as many instances of this as possible, and ideally not change the code of the program. My questions are:
1 - What online services offer computing resources that would suit my need?
2 - Practically, how should I launch remotely all the processes, get notified of the termination, and then aggregate the results (basically, pick up the best solution). Is there a simple framework that I could use or should I look into ssh-based scripting?

1 - What online services offer computing resources that would suit my need?
Amazon EC2.
2 - Practically, how should I launch remotely all the processes, get notified of the termination, and then aggregate the results (basically, pick up the best solution). Is there a simple framework that I could use or should I look into ssh-based scripting?
Amazon EC2 has an API for launching virtual machines. Once they're launched, you can indeed use ssh to control jobs, and I would recommend this solution. I would expect that other software for distributed job management exists, but it isn't likely to be any simpler to configure than ssh.
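For example, a bare-bones ssh-based sketch (the host names, run counts, and the way a solution's score is read from the output are all assumptions about your setup) could look like this:

#!/bin/bash
# Sketch: fan out runs of ./main across several hosts over ssh, then
# pick the best result. Assumes ./main and input.txt already exist on
# each remote host, and that passwordless ssh is set up.
HOSTS="host1 host2 host3"        # hypothetical host names
RUNS_PER_HOST=4

for h in $HOSTS; do
  for i in $(seq 1 "$RUNS_PER_HOST"); do
    # Each instance runs in the background; its output comes back over ssh.
    ssh "$h" "./main < input.txt" > "output.$h.$i.txt" &
  done
done
wait   # block until every remote run has finished

# Aggregate: assumes the first line of each output file is a numeric
# score and that lower is better -- adapt to your actual output format.
best=$(for f in output.*.txt; do
  echo "$(head -n1 "$f") $f"
done | sort -n | head -n1 | cut -d' ' -f2)
echo "Best solution: $best"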

Related

Best way to deploy long-running high-compute app to GCP

I have a python app that builds a dataset for a machine learning task on GCP.
Currently I have to start an instance of a VM that we have, then SSH in and run the app, which will complete in 2-24 hours depending on the size of the dataset requested.
Once the dataset is complete the VM needs to be shutdown so we don't incur additional charges.
I am looking to streamline this process as much as possible, so that we have a "1 click" or "1 command" solution, but I'm not sure the best way to go about it.
From what I've read about so far it seems like containers might be a good way to go, but I'm inexperienced with docker.
Can I set up a container that will pip install the latest app from our private GitHub and execute the dataset build before shutting down? How would I pass information to the container, such as where to get the config file, etc.? It's conceivable that we will have multiple datasets being generated at the same time based on different config files.
Is there a better gcloud feature that suits our purpose more effectively than containers?
I'm struggling to get information regarding these basic questions, it seems like container tutorials are dominated by web apps.
It would be useful to have a batch-like container service that runs a container until its process completes. I'm unsure whether such a service exists. I'm most familiar with Google Cloud Platform and this provides a wealth of compute and container services. However -- to your point -- these predominantly scale by (HTTP) requests.
One possibility may be Cloud Run, triggering jobs using Cloud Pub/Sub. I see there are async capabilities too, which may be interesting (I've not explored them).
Another runtime for you to consider is Kubernetes itself. Kubernetes brings some overhead in having Google, AWS or Azure manage a cluster for you (I strongly recommend you don't run Kubernetes yourself) and some inertia in matching the capacity of the cluster's nodes to the needs of your jobs, but as you scale the number of jobs these costs smooth out. A big advantage of Kubernetes is that it will scale (nodes|pods) as you need them: you tell Kubernetes to run X container jobs, and it does so (and cleans up) without much additional management on your part.
I'm biased and approach the container question mostly from a perspective of defaulting to container-first. In this case, you'd get several benefits from containerizing your solution:
reproducible: the same image is more likely to produce the same results
deployability: container run vs. manage OS, app stack, test for consistency etc.
maintainable: smaller image representing your app, less work to maintain it
One (beneficial!?) workflow change, if you choose to use containers, is that you will need to build your images before using them. Something like Knative combines these steps, but I'd stick with doing this yourself initially. A common solution is to trigger builds (Docker, GitHub Actions, Cloud Build) from your source code repo. Commonly you would run tests against the images that are built, but you may run your machine-learning tasks this way too.
Your containers would contain only your code. When you build your container images, you would pip install, perhaps pip install --requirement requirements.txt, to pull in the appropriate packages. Your data (models?) are better kept separate from your code when this makes sense. When your runtime platform runs containers for you, you provide configuration information (environment variables and|or flags) to the container.
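As a rough sketch of that workflow (the image name, bucket path, and variable name are all hypothetical):

# Build the image once; pip install of your requirements happens inside
# the image build, not at run time.
docker build --tag dataset-builder:latest .

# Launch one container per dataset, passing configuration through an
# environment variable; --rm removes the container when the job exits.
docker run --rm \
  --env CONFIG_URL="gs://my-bucket/configs/dataset-a.yaml" \
  dataset-builder:latest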
A startup script seems to fit the bill better than containers. The instance always executes startup scripts as root, so you can do anything you like.
A startup script will perform automated tasks every time your instance boots up. Startup scripts can perform many actions, such as installing software, performing updates, turning on services, and any other tasks defined in the script.
Keep in mind that a startup script cannot stop an instance but you can stop an instance through the guest operating system.
This would be the ideal solution for the question you posed. It would require a small change in your Python app so that the operating system shuts off when the dataset is complete.
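A minimal sketch of such a startup script (the repo path, app entry point, and config location are assumptions) might be:

#!/bin/bash
# Sketch of a GCE startup script: runs as root on every boot.
cd /opt/dataset-builder || exit 1
pip install --requirement requirements.txt

# Build the dataset; the config path could instead be read from
# instance metadata so each VM builds a different dataset.
python build_dataset.py --config /opt/dataset-builder/config.yaml

# Power the VM off from inside the guest so it stops incurring compute
# charges (the instance moves to the TERMINATED state).
shutdown -h now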
Q1) Can I set up a container that will pip install the latest app from our private GitHub and execute the dataset build before shutting down?
A1) Medium has a great article on installing a package from a private git repo inside a container. You can execute the dataset build before shutting down.
Q2) How would I pass information to the container, such as where to get the config file, etc.?
A2) You can use ENV to set an environment variable. These will be available within the container.
You may consider looking into Docker for more information about containers.

What is the canonical way to deploy Scala/Akka microservices?

We are going to end up with dozens of these microservices (most are Akka-based), and I'm unsure how to best manage their deployment. Specifically, they are built to be independent of each other and as specialized and distributed as possible.
My question stems from the fact that all of them are too small for their own individual JVMs; even if we were to host them on AWS nano instances, we'll still end up with about 40 machines if you factor in redundancy, and such a high number is simply not needed. Three medium size instances could (and do) easily handle the entire workload.
Currently, I just group them into "container" applications, somewhat randomly, and then run these container applications on larger JVMs.
However, there has to be a better way. I am not aware of any application servers for Akka where you can just "deploy actors", so I wanted to get some insight on how others run Akka microservices in production (and specifically how to manage deployment).
This is probably not limited to Scala and Akka, but most other platforms have dedicated app servers where you deploy these things.
IMHO, the canonical way is to use a service orchestration tool, and that would indeed run them in individual processes, each with their own JVM.
That's the only way you get the decoupling, isolation, and resilience you want with microservices; only this way will you be able to deploy, update, stop, and start them individually.
You're saying:
"My question stems from the fact that all of them are too small for their own individual JVMs; even if we were to host them on AWS nano instances"
You seem to treat JVM and Amazon VMs as equivalent, but that's not the case. You can have multiple JVM processes on a single virtual machine.
I suggest you have a look at service orchestration tools such as the Lightbend Production Suite / Service Orchestration, or Kubernetes.
These are just examples, there are others. Note that this tool category will give you a lot of features you'll sooner or later need anyway, such as easy scaling, log consolidation, service lookup, health checks / service failure handling etc.
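To make this concrete, here is a minimal Kubernetes sketch (image and service names are hypothetical) where each service gets its own JVM in its own container, while many containers share a handful of nodes:

# Each microservice is its own deployment: its own process, JVM, and
# lifecycle, even though the pods are packed onto shared nodes.
kubectl create deployment billing --image=registry.example.com/billing:1.0
kubectl create deployment invoicing --image=registry.example.com/invoicing:1.0

# Scale each service independently for redundancy.
kubectl scale deployment billing --replicas=3
kubectl scale deployment invoicing --replicas=2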

Good practices of WebSphere MQ production deployment

I'm about to prepare a deployment specification for the WebSphere MQ production environment. As always, I hate reinventing the wheel, hence the question:
Is there an article or specification of best practices when it comes to deploying and maintaining a WebSphere MQ production environment?
Here are more specific doubts of mine:
Configuration versioning (MQSC, dmpmqcfg, etc).
Deploying new objects (MQSC or manual instructions?)
Deployment automation (maybe based on the diff of dmpmqcfg?).
Deploying and versioning configuration alterations.
Currently I am simply creating MQ objects manually and versioning the output of dmpmqcfg. However, in a while there are going to be too many deployments to handle it like this.
That's an extremely broad question so I'll try to respond before a moderator deletes it. :-)
The answer depends on many things such as whether MQ clusters are in use, the approaches to high availability and disaster recovery, the security requirements, whether the QMgrs are configured as dedicated or shared infrastructure, etc. However, there are a few patterns that I follow in almost all cases, including non-Production. This is because things like monitoring and security tend to get dropped at deployment time if not tested in Dev and don't work as expected in Prod.
I use a script to create my QMgrs in Production to ensure that basics like generating the X.509 certificate (or CSR) are always done according to standards, that any exits or exit parm files are present, that certain SupportPac executables (like q) are present in /opt/mqm/bin, that circular queues are set up, etc. It also checks for negative factors such as GSKit not being installed.
I have a baseline script that is run against all QMgrs. This script sets up the DLQ, any queues for monitoring agents, enables events as required, sets up system services, trigger monitors, listeners, etc. The exception is B2B gateway QMgrs which are handled in a class all their own and have very specific configurations not used on the internal network.
I have several classes of QMgr with specific configuration requirements. These include cluster repositories (where primary and secondary are distinct sub-types), service-provider QMgrs, and service consumer QMgrs. These all have secondary scripts run against them.
I have scripts per-cluster to join or suspend a QMgr in cases where clustering is used (which for me is almost 100% since v7.1).
These set up a QMgr's infrastructure. Then I maintain scripts for each application. So for example, if there's a Payroll app, I'll have queues and possibly topics with names containing a PAY node such as PAY.EMPLOYEE.UPDT.REQ.V032.PRD. Corresponding to that will be a single script for all PAY.** queues. Used to be one for setmqaut commands too, but these are now in the same script as the objects. I only ever have one version of the script and keep a history of changes in the script. This way when I need to recreate a QMgr, I just run all the scripts for it. Similarly, if I need to deploy the PAY objects on another QMgr, I just copy the script to that server.
When defining objects for clusters, I always do a DEFINE NOREPLACE that contains all the run-time attributes, such as whether the queue is enabled in the cluster. The queue is always defined as disabled in the cluster and for triggering, but because I use NOREPLACE, re-running the script doesn't change whatever state it's in, say, a month later. Things that are configuration and not run-time, such as the description, are handled in an ALTER immediately after the DEFINE, and these are updated each time the script is run. There's an article on this here.
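In MQSC terms the pattern looks roughly like this (the queue manager and queue names are hypothetical):

# Run-time state is set once via DEFINE ... NOREPLACE; descriptive
# configuration is reapplied via ALTER on every run.
runmqsc PAY.QM1 <<'EOF'
* Created once; re-running the script leaves its current state alone
DEFINE QLOCAL('PAY.EMPLOYEE.UPDT.REQ.V032.PRD') +
       CLUSTER(' ') NOTRIGGER NOREPLACE
* Reapplied on every run; safe because it carries no run-time state
ALTER QLOCAL('PAY.EMPLOYEE.UPDT.REQ.V032.PRD') +
      DESCR('Payroll employee update requests')
EOF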
Finally, the scripts I use are of the self-executing, self-documenting variety. For example, many people put all the MQSC commands into a script then do something like:
runmqsc < payroll.mqsc > payroll.out
TONS of problems here. The main one is that it relies on the operator to know a lot and execute the script correctly every time. For example, suppose (s)he forgets to capture the output? Or overwrites a previous output? Or doesn't get STDERR because (s)he needs to add 2>&1 at the end and doesn't know redirection that well?
So my scripts are all written in ksh, handle all the capturing of output (complete with time and date stamping and STDERR), can freely mix MQSC with OS commands, etc. All you do is go to the scripts directory for that QMgr and . ./*ksh to build/rebuild a QMgr.
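A stripped-down version of that idea (the queue manager and file names are assumptions):

#!/bin/ksh
# Sketch: self-capturing wrapper around runmqsc so the operator never
# has to remember redirection.
QMGR=PAY.QM1
STAMP=$(date +%Y%m%d.%H%M%S)
LOG="payroll.$STAMP.out"

{
  echo "=== $QMGR payroll objects, run at $STAMP ==="
  runmqsc "$QMGR" < payroll.mqsc
} > "$LOG" 2>&1   # STDOUT and STDERR captured, time-stamped, every time

echo "Output captured in $LOG"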
I do of course also take regular configuration dumps, but these are more for running queries and reports like "how many QMgrs have this channel defined and where are they?" kind of thing.
Also, when taking backups there is almost NEVER a good reason to back up a QMgr at a point in time. However, if it is required be sure to stop the QMgr first. Also, think long and hard about capturing certificates in a backup. Many people are good about locking the certificate directory so only mqm can read it but often the backups are unprotected. As long as you aren't trying to restore on top of Production, many shops let you restore the Production /var/mqm/* files to your own sandbox. If the QMgr's KDB files are included, you just lost them. An alternative is to put the certificates in /etc or some other directory that is protected but not backed up with the QMgr's directories.

How does one create a cluster from different servers in MATLAB?

I have servers A, B & C with 8 cores each. Right now, I'm running jobs in parallel on a single server with 8 workers. Is there any way I can harness the power of all three for a single job? The servers are all inter-accessible via ssh (all three exist behind a gateway, so no password required either)
I'm going to assume you're currently using Parallel Computing Toolbox. To use multiple servers together, you need the following things:
MATLAB Distributed Computing Server licences for the MATLAB workers running on the machines.
Some sort of scheduler to schedule the jobs across the machines. MDCS comes with a basic scheduler called the "Jobmanager". There are also various freely available schedulers for Linux systems, such as Torque.
The installation instructions for MDCS are quite detailed and will lead you through all the stages you need to complete to get parallel jobs running.

Queuing systems - what is a good way to start up multiple workers?

How have you set-up one or more worker scripts for queue-oriented systems?
How do you arrange to start up - and restart if necessary - worker scripts as required? (I'm thinking of tools such as init.d/, the Ruby-based 'god', DJB's Daemontools, etc.)
I'm developing an asynchronous queue/worker system, in this case using PHP & Beanstalkd (though the actual language and daemon aren't important). The tasks themselves are not too hard - encoding an array with the commands and parameters into JSON for transport through the Beanstalkd daemon, and picking them up in a worker script to action them as required.
There are a number of other similar queue/worker setups out there, such as Starling, Gearman, Amazon's SQS, and other more 'enterprise'-oriented systems like IBM's MQ and RabbitMQ. If you run something like Gearman or SQS, how do you start and control the worker pool? The question is about the initial worker startup, and then being able to add extra workers and shut them down at will (though I can send a message through the queue to shut them down - as long as some 'watcher' won't automatically restart them). This is not a PHP problem; it's about straight Unix processes: setting up one or more processes to run on startup, or adding more workers to the pool.
A bash script to loop the worker is already in place - it calls the PHP script, which collects and runs tasks from the queue, occasionally exiting so it can clean itself up (it can also pause a few seconds on failure, or via a planned event). This works fine, and building the worker processes on top of it won't be very hard at all.
Getting a good worker controller system is about flexibility: starting one or two automatically on machine start, being able to add a couple more from the command line when the queue is busy, and shutting down the extras when no longer required.
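The loop wrapper in question is roughly of this shape (paths and script names are hypothetical):

#!/bin/bash
# Sketch: keep a PHP worker alive; the worker exits periodically to
# clean itself up and is relaunched here.
while true; do
  php /opt/workers/worker.php
  sleep 3   # brief pause so a crashing worker can't spin the CPU
done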
I've been helping a friend who's working on a project that involves a Gearman-based queue that will dispatch various asynchronous jobs to various PHP and C daemons on a pool of several servers.
The workers have been designed to behave just like classic unix/linux daemons, thanks to simple shell scripts in /etc/init.d/, and commands like :
invoke-rc.d myWorker start|stop|restart|reload
This mechanism is simple and efficient. And as it relies on standard linux features, even people with a limited knowledge of your app can launch a daemon or stop one, if they know how it's called system-wise (aka "myWorker" in the above example).
Another advantage of this mechanism is it makes your workers pool management easy as well. You could have 10 daemons on your machine (myWorker1, myWorker2, ...) and have a "worker manager" start or stop them depending on the queue length. And as these commands can be run through ssh, you can easily manage several servers.
This solution may sound cheap, but if you build it with well-coded daemons and reliable management scripts, I don't see why it would be less efficient than big-bucks solutions, for any average (as in "non critical") project.
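A stripped-down sketch of such an init script (the paths and the worker command are hypothetical):

#!/bin/sh
# Sketch of /etc/init.d/myWorker: classic daemon-style control for a
# queue worker process.
PIDFILE=/var/run/myWorker.pid

case "$1" in
  start)
    # nohup keeps the worker alive after the controlling shell exits
    nohup php /opt/workers/worker.php >>/var/log/myWorker.log 2>&1 &
    echo $! > "$PIDFILE"
    ;;
  stop)
    [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" && rm -f "$PIDFILE"
    ;;
  restart)
    "$0" stop
    "$0" start
    ;;
  *)
    echo "Usage: $0 {start|stop|restart}"
    exit 1
    ;;
esac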
Real message queuing middleware like WebSphere MQ or MSMQ offers "triggers", where a service that is part of the MQM will start a worker when new messages are placed into a queue.
AFAIK, no "web service" queuing system can do that, by the nature of the beast. However I have only looked hard at SQS. There you have to poll the queue, and in Amazon's case overly eager polling is going to cost you some real $$.
I've recently been working on such a tool. It's not entirely finished (though it shouldn't take more than a few more days before I have something I could call 1.0) and clearly not ready for production yet, but the important parts are already coded. Anybody can have a look at the code here: https://gitorious.org/workers_pool.
Supervisor is a good process-monitoring tool. It includes a web UI where you can monitor and manage workers.
Here is a simple config file for a worker.
[program:demo]
command=php worker.php ; php command to run worker file
numprocs=2 ; number of processes
process_name=%(program_name)s_%(process_num)03d ; unique name for each process if numprocs > 1
directory=/var/www/demo/ ; directory containing worker file
stdout_logfile=/var/www/demo/worker.log ; log file location
autostart=true ; auto start program when supervisor starts
autorestart=true ; auto restart program if it exits
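Once the file is in place (typically under /etc/supervisor/conf.d/), loading it and checking on the workers looks something like this:

supervisorctl reread              # detect the new/changed config file
supervisorctl update              # apply it and start the demo group
supervisorctl status 'demo:*'     # both processes should show RUNNING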