Docker workflow for scientific computing

I'm trying to imagine a workflow that could be applied in a scientific work environment. My work involves scientific coding, basically with Python, pandas, numpy and friends. Sometimes I have to use modules that are not common standards in the scientific community, and sometimes I have to integrate compiled code into my chain of simulations. Most of the time, the code I run is parallelised with the IPython notebook.
What do I find interesting about docker?
The fact that I could create a Docker image containing my code and its working environment. I could then send that image to my colleagues without asking them to change their own work environment, e.g., without making them install an outdated version of a module just so they can run my code.
A rough draft of the workflow I have in mind goes something like this:
Develop locally until I have a version I want to share with somebody.
Build a Docker image, possibly with a hook from a git repo.
Share the image.
Can somebody give me some pointers on what I should take into account to develop this workflow further? A point that intrigues me: can code running in a Docker container launch parallel processes on the several cores of the machine, e.g., an IPython notebook connected to a cluster?

Docker can launch multiple processes/threads on multiple cores. Running several processes in one container usually calls for a supervisor (see: https://docs.docker.com/articles/using_supervisord/ ).
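A minimal sketch of such a setup, following the pattern in the linked article (the program names and paths here are placeholders, assuming you want a notebook server plus a separate worker process):
supervisord.conf:
[supervisord]
nodaemon=true
[program:notebook]
command=ipython notebook --ip=0.0.0.0 --no-browser
[program:worker]
command=python /opt/worker.py
and in the Dockerfile:
RUN apt-get update && apt-get install -y supervisor
COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf
CMD ["/usr/bin/supervisord", "-c", "/etc/supervisor/supervisord.conf"]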
You should probably build an image that contains the things you always use and use it as a base for all your projects. (That would save you the pain of writing a complete Dockerfile each time.)
Why not develop directly in a container and use the commit command to save your progress to a local Docker registry? Then share the final image with your colleagues.
How to set up a local registry: https://blog.codecentric.de/en/2014/02/docker-registry-run-private-docker-image-repository/
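A hedged sketch of that commit-and-share loop (the image and registry names are placeholders; it assumes a private registry is already running on port 5000 as described in the link):
docker commit <container-id> sci-env:v1            # snapshot the container you developed in
docker tag sci-env:v1 localhost:5000/sci-env:v1    # point the image at your local registry
docker push localhost:5000/sci-env:v1              # publish it
docker pull <registry-host>:5000/sci-env:v1        # what a colleague runs to fetch it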

Even though you'll have a full container, I think a package manager like conda can still be a solid part of the base image for your workflow.
FROM ubuntu:14.04
RUN apt-get update && apt-get install curl -y
# Install miniconda
RUN curl -LO http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh
RUN bash Miniconda-latest-Linux-x86_64.sh -p /miniconda -b
RUN rm Miniconda-latest-Linux-x86_64.sh
ENV PATH=/miniconda/bin:${PATH}
RUN conda update -y conda
(Adapted from a nice example showing Docker + Miniconda + Flask.)
With regard to doing source activate <env> in the Dockerfile, you need to run it through bash:
RUN /bin/bash -c "source activate <env> && <do something in the env>"
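Building on that, a hedged sketch of a base image for the stack you mention might continue like this (the environment name sci and the package list are assumptions, adjust to taste):
# create a conda environment with the scientific stack
RUN conda create -y -n sci python=2.7 numpy pandas ipython-notebook
# use the bash trick above to run something inside the environment at build time
RUN /bin/bash -c "source activate sci && python -c 'import numpy, pandas'"
# by default, serve a notebook that colleagues can reach from the host
EXPOSE 8888
CMD ["/bin/bash", "-c", "source activate sci && ipython notebook --ip=0.0.0.0 --port=8888 --no-browser"]
Colleagues can then run the shared image with docker run -p 8888:8888 <image> and open the notebook in a browser; the notebook process sees all the cores Docker gives the container.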

Install MongoDB on Manjaro

I'm facing difficulties installing the MongoDB community server on Manjaro Linux.
There isn't official documentation on how to install it on Arch-based systems and Pacman can't find it in the AUR repos.
Has anyone ever tried to install it?
Here is what I did to install it.
As the package is not available in the official Arch repositories and can't be installed using pacman, you need to follow a few steps to install it.
First, you need to get the URL for the repo of prebuilt binaries from the AUR. It can be found on the AUR website, and at the time of writing it was https://aur.archlinux.org/mongodb-bin.git
Simply clone the repo in your home directory or anywhere else. Do git clone https://aur.archlinux.org/mongodb-bin.git, then head to the cloned directory, cd mongodb-bin.
Now, all you need to do is run the makepkg -si command to build the package. The -s flag will handle the dependencies for you, and the -i flag will install the built package.
After makepkg finishes, don't forget to start mongodb.service: run systemctl start mongodb, and if needed enable it at boot with systemctl enable mongodb.
Type mongo in the terminal; if the Mongo shell starts, you are all set.
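Put together, the whole procedure is:
git clone https://aur.archlinux.org/mongodb-bin.git
cd mongodb-bin
makepkg -si                      # -s pulls in dependencies, -i installs the built package
sudo systemctl start mongodb     # start the service
sudo systemctl enable mongodb    # optional: start it at boot
mongo                            # should drop you into the Mongo shell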
Later edit (8.2.2021): This package is now available in AUR.
It is available in the AUR, so you can view it with pamac using the -a flag, e.g.:
pamac search -a mongodb-bin
pamac info -a mongodb-bin
And then build and install it with (this can be done after manually cloning, too):
pamac build mongodb-bin
Note that there's also a package named mongodb, but mongodb-bin is a newer release (you can check the version numbers with the search or info arguments).
I've been using mongodb via docker for a couple of years.
In my experience, it's easier than installing it the regular way (assuming you already have Docker installed).
1. Ensure you have docker installed
If you don't already have it, you can install it via pacman/pamac, because it's in the official Arch/Manjaro package repositories. The easiest way is to run the following command:
sudo pacman -S docker
2. Run a single docker command
sudo docker run -d -p 27017:27017 -v ~/mongodb_data:/data/db mongo
This command will run MongoDB on port 27017 and place its data files in the folder ~/mongodb_data.
If you're running this command for the first time, it will also download all the required files.
Now you're running a local instance of MongoDB, and you can connect to it with your favorite DB management tool or from your code.
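For a quick sanity check you can also open a shell inside the container (a hedged example: the container name comes from docker ps, and newer mongo images ship mongosh while older ones ship mongo):
sudo docker ps                            # find the container id or name
sudo docker exec -it <container-id> mongosh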

nvidia-docker - can cuda_runtime be available while building a container?

While attempting to compile darknet during the build of a Docker container, I constantly run into the error include/darknet.h:11:30: fatal error: cuda_runtime.h: No such file or directory.
I am building the container from the instructions here: https://github.com/NVIDIA/nvidia-docker/wiki/Deploy-on-Amazon-EC2. I have a simple Dockerfile I am testing with - the relevant parts:
FROM nvidia/cuda:9.2-runtime-ubuntu16.04
...
WORKDIR /
RUN apt-get install -y git
RUN git clone https://github.com/pjreddie/darknet.git
WORKDIR /darknet
# Set OpenCV makefile flag
RUN sed -i '/OPENCV=0/c\OPENCV=1' Makefile
RUN sed -i '/GPU=0/c\GPU=1' Makefile
#RUN ln -s /usr/local/cuda-9.2 /usr/local/cuda
# HERE I have been playing with commands to show me the state of the docker image to try to troubleshoot the problem
RUN find / -name "cuda_runtime.h"
RUN ls /usr/local/cuda/lib64/
RUN less /usr/local/cuda/README
RUN make
Most of the documentation I see references using the NVIDIA libraries when running a container, but darknet compiles differently when built with GPU support, so I need cuda_runtime.h available at build time.
Perhaps I misunderstand what nvidia-docker is doing. I'm assuming that nvidia-docker exists because the NVIDIA code must be installed on the actual host machine and not inside the container, and that some mechanism is used to share the "native" code with the containers so the GPU can be managed. Is that correct?
Should I even be trying to build darknet when building my container or should I be installing it on the host machine, then making it available somehow to the container? This seems to go against the portability of the containers but I can live with some constraints to get access to the GPU.
FROM nvidia/cuda:9.2-runtime-ubuntu16.04
Your image only has bits and pieces of CUDA-9.2 needed to run a CUDA app, but does not have the bits needed to build one.
You need to use the -devel variant.
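In this case that means changing the first line of the Dockerfile, e.g.:
FROM nvidia/cuda:9.2-devel-ubuntu16.04
The devel images include the CUDA headers (cuda_runtime.h among them) and the build toolchain, so the RUN make step can compile darknet with GPU=1.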

How to run OWASP Zed Attack Proxy ZAP's zap-api-scan.py without requiring docker

Is there a way to run zap-api-scan.py outside of docker?
I tried the steps below to run this Python script outside of Docker. However, the script itself checks whether it is running in Docker and, if it is not, starts Docker itself.
git clone https://github.com/zaproxy/zaproxy.git
easy_install six
pip install python-owasp-zap-v2.4
pip uninstall chardet
pip install "chardet==3.0.2"
python zaproxy/docker/zap-api-scan.py
Answered on the ZAP User Group: https://groups.google.com/d/msg/zaproxy-users/ITE1W4V0H1Y/UFO6teGrBwAJ
Basically you just need to edit or comment out the parts that start ZAP in docker and ensure that your ZAP instance is configured in the same way the script sets ZAP up.
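A hedged sketch of what that can look like once the Docker-launching code has been edited out (the port, API key and target URL are placeholders):
# start a local ZAP daemon for the script to talk to
zap.sh -daemon -host 127.0.0.1 -port 8080 -config api.key=changeme
# then point the edited script at your API definition
python zaproxy/docker/zap-api-scan.py -t https://example.com/openapi.json -f openapi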

How should I handle Perl module updates when maintaining docker images?

I'm working on building a docker image to be able to run all of our Perl applications. The applications require hundreds of CPAN modules to be installed. The full build of the docker image takes about an hour to complete.
After doing the initial image, I'm not sure how best to handle ongoing updates.
We could keep a single Dockerfile in git, and then modify this as required and push new builds up to Docker Hub. However, if the person doing the build doesn't have all of the intermediate images, then adding a single CPAN module could be an extremely tedious process, and it might take an hour before they even know whether the new module installs correctly. It would also download every CPAN module again, which seems a bit risky, as a newer version of some module might contain a breaking change.
Alternatively, the person doing the build could pull the latest Docker Hub image, install the CPAN module interactively, commit the build and push the new image to Docker Hub. However, then we only have our Docker Hub images, but no master Dockerfile.
Or another option would be to create a Dockerfile for each new build, which references the previous dockerhub image. This seems overly complicated though.
Option 1) seems wrong. I'm fairly sure we don't want to be rebuilding the entire image from the base OS just to install one additional module. However being dependent on images without Dockerfiles seems risky as well.
You could use the standard package manager for your underlying OS in your Docker image.
For example, if it's Red Hat-based, then use yum, and only use CPAN for modules that are not available as OS packages:
FROM centos:centos7
# install the build toolchain and the modules that are packaged for the OS
RUN yum -y install gcc perl perl-App-cpanminus perl-Config-Tiny && yum clean all
# fall back to CPAN only for modules that have no OS package
RUN cpanm install Some::Module && rm -fr /root/.cpanm
taken from here and modified
I would try to have a base image which the actual applications then build on.
I would also avoid doing things interactively (i.e. script everything in a Dockerfile), as you want to be able to repeat the build when upstream dependencies change, which Docker Hub can do for you.
EDIT
You can convert Perl modules into your own packages using dh-make-perl.
You can load these into your own Ubuntu repo using reprepro or a paid solution such as Artifactory.
These can then be installed using apt-get when you use your repo as a source from within a Dockerfile.
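A hedged sketch of that pipeline (Config::Tiny is only an example module; the generated package name depends on the module):
# on a Debian/Ubuntu build host: turn a CPAN distribution into a .deb
dh-make-perl --build --cpan Config::Tiny
# publish the .deb to your own apt repo (e.g. with reprepro), then in a Dockerfile:
#   RUN apt-get update && apt-get install -y libconfig-tiny-perl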
When I have tried a similar thing before, there were a few problems:
Your apps don't work with the latest version of modules
There are far more dependencies than you expected
Some modules won't package
The benefits are:
You keep the build tools (gcc, etc) off the app servers
You know much more about your dependencies

Why does Capistrano need modifications to use something like pythonbrew?

As I understand it, all that Capistrano does is SSH into the server and execute the commands we want it to (mostly).
I've used rvm in a couple of past projects and had to install the rvm-capistrano gem. Otherwise, it failed to find the executables (or so I recall), even though we had a proper .rvmrc file (with the correct ruby and the correct gemset) in the repository.
Similarly, today I was setting up deployment for a project for which I'm using pythonbrew, and a simple "cd #{deploy_to}/current && pythonbrew venv use myenv && gunicorn_django -c gunicorn.py" gave me an error message saying "cannot find the executable gunicorn_django". This, I suppose, is because the virtualenv was not activated correctly. But didn't we activate the environment when we did "pythonbrew venv use myenv"? The complete command works fine if I SSH into the server and execute it in the shell, but it doesn't when I do it via Capistrano.
My question is - why does Capistrano need modifications to play along with programs like rvm and pythonbrew, even though all it's doing is executing a couple of commands over ssh?
That's because its SSH'ing in doesn't activate your shell's environment, so it's not picking up the source statements that enable the magic. Just do an explicit rvm use ... (or the pythonbrew equivalent) before running commands instead of assuming the cd will pick that up automatically; it should be fine then. If you had been using Fabric, there is the prefix() context manager that you could use to be sure that's run before each command.
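For pythonbrew specifically, a hedged sketch of the kind of command you would have Capistrano send over SSH (the bashrc path is pythonbrew's default install location, and /path/to/current stands in for #{deploy_to}/current):
bash -lc 'source "$HOME/.pythonbrew/etc/bashrc" && pythonbrew venv use myenv && cd /path/to/current && gunicorn_django -c gunicorn.py'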