Lightweight virtualization of processes

I'm trying to figure out whether (and how) it is possible to run a user process in an isolated context (memory, network and other resources).
Let's assume that the program x is stored on the filesystem of a host machine (h).
I would like to execute x in an isolated, hosted context (c), in other words without creating a full virtual guest OS.
The process produces output files inside its context c. I would then like to use those files in the h context.
I have heard about LXC, Docker, dockerlite, OpenVZ, etc., but it seems that one must create a container starting from an OS image.
So, in short, is there a way to run x in c and get the results (if any) back in h?

Using Docker you could create c (a container) and share a directory from the host (h) where you put your results from x. Please see the volume docs on docs.docker.io.
c doesn't need to contain a full OS image. The busybox base container, for example, is about 2.5MB.
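For illustration, a minimal sketch of that setup, assuming x lives at /opt/x on h and writes its output wherever it is told to (the paths and x's --output flag are hypothetical, and x must be self-contained enough to run against busybox's minimal userland):

# Directory on h that will receive the output of x
mkdir -p /tmp/x-results

# /opt/x is mounted read-only into the container c; /tmp/x-results is shared
# read-write, so anything x writes to /results inside c shows up on h.
docker run --rm \
  -v /opt/x:/opt/x:ro \
  -v /tmp/x-results:/results \
  busybox /opt/x --output /results

# The results are now available on the host
ls /tmp/x-results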

Install Postgres on removable volume on Linux?

Cloud platforms like Linode.com often provide hot-pluggable storage volumes that you can easily attach and detach from a Linux virtual machine without restarting it.
I am looking for a way to install Postgres so that its data and configuration end up on a volume that I have mounted to the virtual machine. The end result should allow me to shut down the machine, detach the volume, spin up another machine with an identical version of Postgres already installed, attach the volume and have Postgres work just like it did on the old machine, with all the data, file system permissions and server-wide configuration intact.
Is such a thing possible? Is there a reliable way to move installations (i.e. databases and configuration, not the actual binaries) of Postgres across machines?
CLARIFICATION: the virtual machine has two disks:
the "built-in" one which is created when the VM is created and mounted to /. That's where Postgres gets installed to and you can't move this disk.
the hot-pluggable disk, which you can easily attach and detach from a running VM. This is where I want the Postgres data and configuration to live, so I can detach the disk (after shutting down the VM to prevent data loss/corruption) and attach it to another VM when I want my data to move, and have it behave as it did on the old VM (i.e. no failures to start Postgres, no errors about permissions or missing files, etc.).
This works just fine. It is not really any different to starting and stopping PostgreSQL and not removing the disk. There are a couple of things to consider though.
You have to make sure it is stopped and all writes are synced before unmounting the volume. Obvious enough, and I doubt you'd be able to unmount before the sync completed, but it's worth repeating.
You will want the same version of PostgreSQL, probably on the same version of operating system with the same locales too. Different distributions might compile it with different options.
Although you can put configuration and data in the same directory hierarchy, most distros tend to put config in /etc. If you compile from source yourself this won't be a problem. Alternatively, you can usually override the default locations or, and this is probably simpler, bind-mount the data and config directories into the places your distro expects.
Note that if your storage allows you to connect the same volume to multiple hosts in some sort of "read only" mode that won't work.
Edit: steps from comment moved into body for easier reading.
1. Start up PG, create a table, and put one row in it.
2. Stop PG.
3. Mount your volume at /mnt/db.
4. rsync /var/lib/postgresql/NN/main to /mnt/db/pg_data and /etc/postgresql/NN/main to /mnt/db/pg_etc.
5. Rename /var/lib/postgresql/NN/main, adding .OLD to the name, and do the same with the /etc directory.
6. Bind-mount the directories from /mnt to replace them.
7. Restart PG.
8. Test.
9. Repeat, returning to step 8, until you are happy.
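The same steps as a condensed shell sketch (Debian/Ubuntu-style layout; NN stands for the PostgreSQL version, and the device name of the hot-pluggable volume is an assumption):

sudo systemctl stop postgresql
sudo mount /dev/sdb1 /mnt/db

# Copy data and config onto the volume, preserving ownership and permissions
sudo rsync -a /var/lib/postgresql/NN/main/ /mnt/db/pg_data/
sudo rsync -a /etc/postgresql/NN/main/ /mnt/db/pg_etc/

# Keep the originals around, then bind-mount the copies into the expected places
sudo mv /var/lib/postgresql/NN/main /var/lib/postgresql/NN/main.OLD
sudo mv /etc/postgresql/NN/main /etc/postgresql/NN/main.OLD
sudo mkdir /var/lib/postgresql/NN/main /etc/postgresql/NN/main
sudo mount --bind /mnt/db/pg_data /var/lib/postgresql/NN/main
sudo mount --bind /mnt/db/pg_etc /etc/postgresql/NN/main

sudo systemctl start postgresql

Add the bind mounts to /etc/fstab if you want them to survive a reboot.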

Best way to deploy long-running high-compute app to GCP

I have a python app that builds a dataset for a machine learning task on GCP.
Currently I have to start one of our VM instances, SSH in, and run the app, which completes in 2-24 hours depending on the size of the dataset requested.
Once the dataset is complete, the VM needs to be shut down so we don't incur additional charges.
I am looking to streamline this process as much as possible, so that we have a "1 click" or "1 command" solution, but I'm not sure the best way to go about it.
From what I've read so far, it seems like containers might be a good way to go, but I'm inexperienced with Docker.
Can I setup a container that will pip install the latest app from our private GitHub and execute the dataset build before shutting down? How would I pass information to the container such as where to get the config file etc? It's conceivable that we will have multiple datasets being generated at the same time based on different config files.
Is there a better gcloud feature that suits our purpose more effectively than containers?
I'm struggling to find information about these basic questions; container tutorials seem to be dominated by web apps.
It would be useful to have a batch-like container service that runs a container until its process completes. I'm unsure whether such a service exists. I'm most familiar with Google Cloud Platform and this provides a wealth of compute and container services. However -- to your point -- these predominantly scale by (HTTP) requests.
One possibility may be Cloud Run, triggering jobs using Cloud Pub/Sub. I see there are async capabilities too, and these may be interesting (I've not explored them).
Another runtime for you to consider is Kubernetes itself. Kubernetes requires some overhead in having Google, AWS or Azure manage a cluster for you (I strongly recommend you don't run Kubernetes yourself), and there is some inertia in matching the capacity of the cluster's nodes to the needs of your jobs, but as you scale the number of jobs these needs smooth out. A big advantage of Kubernetes is that it will scale nodes and pods as you need them: you tell Kubernetes to run X container jobs, and it does so (and cleans up) without much additional management on your part.
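For example, if you did go the Kubernetes route, a one-off dataset build could be submitted as a Job; the image name and command below are hypothetical:

# Run the dataset build as a one-off Kubernetes Job
kubectl create job build-dataset-42 \
  --image=gcr.io/my-project/dataset-builder:latest \
  -- python build_dataset.py --config gs://my-bucket/configs/run42.yaml

# Wait for completion (your builds take up to 24h), then inspect the logs
kubectl wait --for=condition=complete job/build-dataset-42 --timeout=24h
kubectl logs job/build-dataset-42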
I'm biased and approach the container-versus-VM question mostly from a container-first default. In this case, you'd get several benefits from containerizing your solution:
reproducible: the same image is more likely to produce the same results
deployability: you run a container instead of managing an OS and an app stack and testing for consistency, etc.
maintainable: smaller image representing your app, less work to maintain it
One (beneficial!?) workflow change, if you choose to use containers, is that you will need to build your images before using them. Something like Knative combines these steps, but I'd stick with doing this yourself initially. A common solution is to trigger builds (Docker, GitHub Actions, Cloud Build) from your source code repo. Commonly you would run tests against the images that are built, but you may also run your machine-learning tasks this way.
Your containers would contain only your code. When you build your container images, you would pip install, perhaps pip install --requirement requirements.txt, to pull the appropriate packages. Your data (models?) is better kept separate from your code when this makes sense. When your runtime platform runs containers for you, you provide configuration information (environment variables and|or flags) to the container.
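As a sketch of that split (file and variable names are assumptions, not an established convention): the image bakes in only the code and its dependencies, and the configuration location is supplied when the container is run.

# Dockerfile (illustrative)
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --requirement requirements.txt
COPY . .
# CONFIG_URI is just a default; the real value is provided at run time
ENV CONFIG_URI=""
ENTRYPOINT ["python", "build_dataset.py"]

# Build once, then run one container per dataset with its own config
docker build -t dataset-builder .
docker run --rm -e CONFIG_URI=gs://my-bucket/configs/run42.yaml dataset-builder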
A startup script seems to fit the bill better than containers. The instance always executes startup scripts as root, so you can do anything you like in them.
A startup script will perform automated tasks every time your instance boots up. Startup scripts can perform many actions, such as installing software, performing updates, turning on services, and any other tasks defined in the script.
Keep in mind that a startup script cannot stop an instance but you can stop an instance through the guest operating system.
This would be an ideal solution for the question you posed. It would require you to make a small change in your Python app so that the operating system shuts down when the dataset is complete.
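A sketch of that pattern, where the repository name, metadata key and zone are assumptions: the startup script is attached when the instance is created, runs the build, and then the guest OS powers itself off.

# One command per dataset: create a throwaway VM that builds it and powers itself off
gcloud compute instances create builder-run42 \
  --zone=us-central1-a \
  --metadata=config-uri=gs://my-bucket/configs/run42.yaml \
  --metadata-from-file=startup-script=startup.sh

# startup.sh (runs as root on boot; auth for a private repo is omitted here)
#!/bin/bash
set -e
CONFIG_URI=$(curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/config-uri")
pip install git+https://github.com/my-org/dataset-builder.git
python -m dataset_builder --config "$CONFIG_URI"
# As noted above, the guest OS can stop the instance itself
shutdown -h now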
Q1) Can I setup a container that will pip install the latest app from our private GitHub and execute the dataset build before shutting down?
A1) Medium has a great article on installing a package from a private git repo inside a container. You can execute the dataset build before shutting down.
Q2) How would I pass information to the container such as where to get the config file etc?
A2) You can use ENV to set an environment variable. These will be available within the container.
You may consider looking into Docker for more information about containers.
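Roughly, the two pieces look like this (repository name, token handling and the CONFIG_URI variable are assumptions; note that plain build args end up in the image history, so use build secrets for anything sensitive):

# Dockerfile fragment: install the app from a private GitHub repo at build time
ARG GITHUB_TOKEN
RUN pip install git+https://${GITHUB_TOKEN}@github.com/my-org/dataset-builder.git

# Pass the token when building, and the config location when running
docker build --build-arg GITHUB_TOKEN=$GITHUB_TOKEN -t dataset-builder .
docker run --rm -e CONFIG_URI=gs://my-bucket/configs/run42.yaml dataset-builder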

How to handle file paths in distributed environment

I'm working on setting up a distributed celery environment to do OCR on PDF files. I have about 3M PDFs and OCR is CPU-bound so the idea is to create a cluster of servers to process the OCR.
As I'm writing my task, I've got something like this:
@app.task
def do_ocr(pk, file_path):
    content = run_tesseract_command(file_path)
    item = Document.objects.get(pk=pk)
    item.content = content
    item.save()
The question I have is: what is the best way to make the file_path work in a distributed environment? How do people usually handle this? Right now all my files simply live in a directory on one of our servers.
If you are in a Linux environment, the easiest way is to mount a remote filesystem, using sshfs, under /mnt for each node in the cluster. Then you can pass the node name to the do_ocr function and work as if all data were local to the current node.
For example, say your cluster has N nodes named node1, ..., nodeN.
Let's configure node1: for each other node, mount its remote filesystem. Here's a sample /etc/fstab for node1:
sshfs#user@node2:/var/your/app/pdfs /mnt/node2 fuse port=<port>,defaults,user,noauto,uid=1000,gid=1000 0 0
....
sshfs#user@nodeN:/var/your/app/pdfs /mnt/nodeN fuse port=<port>,defaults,user,noauto,uid=1000,gid=1000 0 0
On the current node (node1), create a symlink named after the current server, pointing to the PDFs' path:
ln -s /var/your/app/pdfs node1
Your /mnt folder should then contain the remote filesystems and the symlink:
user#node1:/mnt$ ls -lsa
0 lrwxrwxrwx 1 user user 16 apr 12 2016 node1 -> /var/your/app/pdfs
0 lrwxrwxrwx 1 user user 16 apr 12 2016 node2
...
0 lrwxrwxrwx 1 user user 16 apr 12 2016 nodeN
Then your function should look like this:
import os

MOUNT_POINT = '/mnt'

@app.task
def do_ocr(pk, node_name, file_path):
    content = run_tesseract_command(os.path.join(MOUNT_POINT, node_name, file_path))
    item = Document.objects.get(pk=pk)
    item.content = content
    item.save()
It works as if all files were on the current machine, while the remote access logic works for you transparently.
Well, there are multiple ways to handle it, but let's stick to one of the simplest ones:
since you'd like to process a large number of files using multiple servers, my first suggestion would be to use the same OS on each server, so you won't have to worry about cross-platform compatibility
using the word 'cluster' suggests that all of those servers should know their mutual state, which adds complexity; try to switch to a farm of stateless workers (by 'stateless' I mean "not knowing about the others", as they should be aware of at least their own state, e.g.: IDLE, IN_PROGRESS, QUEUE_FULL or more if needed)
for the file list processing part you could use pull or push model:
push model could be easily implemented by a simple app that crawls the files and dispatches them (e.g. over SCP, FTP, whatever) to a set of available servers (see the sketch after this list); servers can monitor their local directories for changes and pick up new files to process; it's also very easy to scale - just spin up more servers and update the push client (even at runtime); the only limit is your push client's performance
pull model is a little bit trickier, because you have to handle more complexity; having a set of servers implies having a proper starting index and offset per node, which makes error handling more difficult, and it doesn't scale easily (imagine adding twice as many servers to speed up the processing and having to update indices and offsets properly on each node... it seems like an error-prone solution)
I assume that the network traffic isn't a big concern - having 3M files to process will generate it somewhere, one way or the other..
collecting/storing the results is a different ballpark, but here the list of possible solutions is limitless
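For the push model, a minimal dispatcher could look something like this (host names, directories and the round-robin policy are all assumptions):

import itertools
import subprocess
from pathlib import Path

WORKERS = ["node1", "node2", "node3"]        # servers running the OCR code
INCOMING = Path("/var/your/app/pdfs")        # directory being crawled
REMOTE_INBOX = "/var/ocr/inbox/"             # directory each worker watches

def dispatch():
    targets = itertools.cycle(WORKERS)
    for pdf in INCOMING.glob("*.pdf"):
        host = next(targets)
        # Push the file to the next worker over SCP; the worker picks it up from its inbox
        subprocess.run(["scp", str(pdf), f"{host}:{REMOTE_INBOX}"], check=True)
        # Move the file aside so it is not dispatched twice
        pdf.rename(INCOMING / "dispatched" / pdf.name)

if __name__ == "__main__":
    (INCOMING / "dispatched").mkdir(exist_ok=True)
    dispatch()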
Since I'm missing a lot of details about your architecture and your application specifics, you can take this as a guiding answer rather than a strict one.
You can take this approach, in the following order:
1- Deploy an internal file server that stores all the files in one place and serves them.
Example:
http://internal-ip-address/storage/filenameA.pdf
http://internal-ip-address/storage/filenameB.pdf
http://internal-ip-address/storage/filenameC.pdf
and so on ...
2- Install/Deploy Redis
3- Create an upload client/service/process that takes the files you want to upload and puts them in the above storage location (/storage/), so your files will be available once they are uploaded; at the same time, push the full file URL to a predefined Redis list/queue (built on a linked list data structure), like this: http://internal-ip-address/storage/filenameA.pdf
You can get more details about LPUSH and RPOP under Redis Lists here: http://redis.io/topics/data-types-intro
Examples:
A file upload form, that stores the files directly to storage area
A file upload utility/command-line tool/background process that you can create yourself or base on some existing tool, which uploads files to the storage location, taking them from a specific source, be it a web address or some other server that has your files
4- Now we come to your celery workers: each one of your workers should pull (RPOP) one of the file URLs from the Redis queue, download the file from your internal file server (built in the first step), and do the required processing the way you want it done (a minimal worker sketch follows the quote below).
An important thing to note from Redis documentation:
Lists have a special feature that make them suitable to implement queues, and in general as a building block for inter process communication systems: blocking operations.
However it is possible that sometimes the list is empty and there is nothing to process, so RPOP just returns NULL. In this case a consumer is forced to wait some time and retry again with RPOP. This is called polling, and is not a good idea in this context because it has several drawbacks.
So Redis implements commands called BRPOP and BLPOP which are versions of RPOP and LPOP able to block if the list is empty: they'll return to the caller only when a new element is added to the list, or when a user-specified timeout is reached.
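Putting steps 3 and 4 together, the worker side could look roughly like this (queue name, host name and the save_result helper are assumptions; run_tesseract_command is the function from the question, and the redis and requests packages are assumed to be installed):

import tempfile

import redis
import requests

QUEUE = "pdf_urls"
r = redis.Redis(host="internal-redis", port=6379)

def process_forever():
    while True:
        # BRPOP blocks until the upload service LPUSHes a new file URL, so no polling
        _key, raw_url = r.brpop(QUEUE)
        url = raw_url.decode()

        # Download the PDF from the internal file server to a local temp file
        resp = requests.get(url)
        resp.raise_for_status()
        with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
            tmp.write(resp.content)
            tmp.flush()
            content = run_tesseract_command(tmp.name)   # OCR, as in the question

        save_result(url, content)   # e.g. update the Document row this URL belongs to

if __name__ == "__main__":
    process_forever()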
Let me know if that answers your question.
Things to keep in mind
You can add as many workers as you want, since this solution is very scalable; your only bottleneck is the Redis server, which you can cluster, and whose queue you can persist in case of a power outage or server crash.
You can replace Redis with RabbitMQ, Beanstalk, Kafka, or any other queuing/messaging system, but Redis was chosen here for its simplicity and the myriad of features it provides out of the box.

How can I specify where my local developer's service fabric cluster is created?

My problem: I am learning Service Fabric, and doing simple tutorials, and the local cluster is filling up my C drive. I run the projects in Visual Studio. It first creates a cluster in a folder SfDevCluster. That takes up 842 MB of space. Then it deploys the services and web api sites. Remember, these are trivial tutorials with almost nothing in them. Now, I notice that I have a folder with a Size = 1.22 TB and Size on Disk of 9.4 GB. I'm not sure how to interpret that. But it consumes the remaining space on my C drive and sets off alarms.
I have other drives with lots of space. I would love to specify that those be used. Is there a way to do that with the service fabric cluster used by Visual Studio? Or is there a way to constrain the overly ambitious size allocations? And if you understand this, can you explain what these unusual folder sizes mean?
In the old days, I would have a hard drive with lots of space. But now, my developer machine has a much faster, but more expensive SSD drive, and space is at a premium. So I need more control of the cluster location.
You can set up a local cluster pointing to a non-system drive by running the DevClusterSetup script in PowerShell. You can find the script under %programfiles%\Microsoft SDKs\Service Fabric\ClusterSetup\. The command line you want is:
.\DevClusterSetup.ps1 -PathToClusterDataRoot <desired_app_and_data_location> -PathToClusterLogRoot <desired_tracelog_location>
If you already have a cluster running, this script will remove it and create a new one (note that this will delete any deployed apps and their data). Once you have the new cluster running, Visual Studio will automatically use that when you deploy locally.
As for the file sizes - this is mostly due to the log file used for replication of state stored in reliable collections. A large, sparse file is preallocated up-front, which is why you see a difference between size and size on disk. We are planning to make these values configurable so that they can be dialed down on local clusters.
In the Service Fabric SDK folder (C:\Program Files\Microsoft SDKs\ServiceFabric), you will find a ClusterSetup folder.
In there you will find ClusterManifestTemplate.json files for the different configurations of the local cluster. These are json configuration files used by the powershell scripts that create and manage the local service fabric cluster.
At the bottom of these files, in "fabricSettings", the values of FabricDataRoot and FabricLogRoot are set based on "%systemDrive%". If you replace this with "D:", it should result in a local cluster on the D drive.
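For reference, the relevant block at the bottom of ClusterManifestTemplate.json looks roughly like this after the change (the exact layout varies between SDK versions, so treat this as illustrative):

"fabricSettings": [
  {
    "name": "Setup",
    "parameters": [
      { "name": "FabricDataRoot", "value": "D:\\SfDevCluster\\Data" },
      { "name": "FabricLogRoot", "value": "D:\\SfDevCluster\\Log" }
    ]
  }
]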
After making these changes, I stopped my local fabric, deleted the current fabric folders from my C drive, and rebooted my machine. When I then start a debug session in VS 2017, it creates the local dev fabric on my D drive and deploys the application to that location. (I do notice that some empty folders are created on my C drive, but these are not used.)
What you can also do is reset the local cluster once in a while. This can easily be done using the Service Fabric Local Cluster Manager application.

Copying a virtual machine data drive in Microsoft Azure

Added more details at the bottom of the question.
We are testing deployment scenarios in Azure VM preview and have run into an issue.
Here is our scenario. We have a software stack that we use in all of our servers. We have created an image with all of that stack installed on an attached data drive. We have created an image of the VM that we can use as a template. Now what we want to do is create a VM based on that template, create a copy of the data drive, and attach it to the newly created VM in an automated manner.
Our problem is that while we have found lots of information about creating drives, we can't find any guidance on how to copy the data drive using Azure for Powershell.
Any thoughts, code, or RTFMs happily accepted.
Cheers,
Terence
We have successfully created an operating system image that we can use to create VMs. But there is a data disk that holds our standard software stack, which we want to reuse by copying it across VMs. The scenario that we are trying to implement is:
1. Create a VM from a standard VM image; call it PBIMaster.
2. Attach a disk as F: to that VM, called PBIMasterDisk.
3. Install all of the software required for our app on F: (too big for the OS disk, and besides, sticking it on the OS disk seems messy).
4. Build an image from PBIMaster, call it PBIMasterImage, and save it.
5. Create a new VM from that image; call it Node1.
6. Copy PBIMasterDisk to a new Azure disk; call it Node1SoftwareDisk.
7. Attach Node1SoftwareDisk to Node1 as F:.
8. Since the image has the correct registry settings from the previous installs, our stack is ready to go.
9. Add appropriate endpoints.
Rinse and repeat for each additional node.
Hopefully that makes our scenario clearer.
Thanks.
If I understood your objective correctly, you have already uploaded two VHDs in your subscription and you have also created a VM based on your OS disk VHD1:
OS Disk (VHD1)
Data Disk (VHD2)
Now you want to copy VHD2 to VHD3 and then attach VHD3 to your VM (which is based on OS disk) via Powershell.
As of now, there is no PowerShell command that will let you copy a data disk (VHD2) to another data disk (i.e. VHD3).
I haven't tried it, but you can use the approach described in the following post to copy your data disk:
http://blogs.msdn.com/b/windowsazurestorage/archive/2012/06/12/introducing-asynchronous-cross-account-copy-blob.aspx
This method copies blobs directly at the cloud storage level, so there is no bandwidth usage towards on-premises and potentially zero cost if you stay in the same data center. Try using the same subscription and see if that solves your problem.
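If your Azure PowerShell version already includes the classic storage cmdlets, the copy can be scripted roughly like this (account, container and blob names are placeholders):

# Server-side, asynchronous copy of the data-disk VHD within the storage account
$ctx = New-AzureStorageContext -StorageAccountName "mystorageacct" -StorageAccountKey "<key>"

Start-AzureStorageBlobCopy -Context $ctx `
    -SrcContainer "vhds" -SrcBlob "PBIMasterDisk.vhd" `
    -DestContainer "vhds" -DestBlob "Node1SoftwareDisk.vhd"

# The copy runs inside the storage service; wait for it to finish
Get-AzureStorageBlobCopyState -Context $ctx -Container "vhds" -Blob "Node1SoftwareDisk.vhd" -WaitForComplete

# Register the copied VHD as a data disk; attach it to Node1 afterwards (e.g. Add-AzureDataDisk -Import)
Add-AzureDisk -DiskName "Node1SoftwareDisk" `
    -MediaLocation "https://mystorageacct.blob.core.windows.net/vhds/Node1SoftwareDisk.vhd"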