Mounting Page Blob as a VHD in a batch file - mongodb

This is a follow-up to my Stack Overflow post: how do I mount a page blob as a VHD on a worker role instance? After the drive is mounted, I will pass it as the value of the --dbpath parameter to the mongo instance.
In a nutshell, I'm trying to start a single mongo instance with its data directory on an Azure blob (for durability). I'm building on the HelloWorld example on Azure's site; instead of starting a Tomcat instance, I will start a mongo instance.

I suggest you follow this guide: http://www.codeproject.com/Articles/81413/Windows-Azure-Drives-Part-1-Configure-and-Mounting. It explains how to mount the drive, and it also shows how you can save the drive letter as an environment variable.
This is useful because, when you're starting the mongo instance, you can simply use that environment variable together with --dbpath. It may be best to encapsulate all the mounting code in a small console application so that you can run it before starting the mongo instance.
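For illustration, here is a minimal sketch of what the startup batch file could look like, assuming the mounting code has already saved the drive letter in an environment variable (the variable name MONGO_DRIVE and the paths are hypothetical):

rem Hypothetical startup.cmd: assumes the mounting code stored the drive letter (e.g. "F:") in MONGO_DRIVE
rem Start mongod with its data directory on the mounted Azure drive
mongod.exe --dbpath %MONGO_DRIVE%\data --logpath %MONGO_DRIVE%\mongod.log --logappend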

I’m not sure if you can mount a drive in Java. Currently this feature is not available in Windows Azure Storage Client for Java: https://github.com/WindowsAzure/azure-sdk-for-java. There’s no native (C++) API either. So you may need to use .NET to mount the drive, and then start your Java process from your .NET application. For now, you can also submit a feature request on http://www.mygreatwindowsazureidea.com/forums/34192-windows-azure-feature-voting.
Best Regards,
Ming Xu.

Related

MLflow storing artifacts (Google Cloud Storage) but not displaying them in the MLflow UI

I am working on a Docker environment (docker-compose) with a Jupyter notebook Docker image and a Postgres Docker image for running ML models, and I'm using Google Cloud Storage to store the model artifacts. Storing the models in cloud storage works fine, but I can't get them to show up in the MLflow UI. I have seen similar problems, but none of the solutions used Google Cloud Storage as the storage location for artifacts. The error message says the following: "Unable to list artifacts stored under <gs-location> for the current run. Please contact your tracking server administrator to notify them of this error, which can happen when the tracking server lacks permission to list artifacts under the current run's root artifact directory." What could possibly be causing this problem?
I had exactly the same issue. The keywords: docker-compose, Google Cloud Storage, success in storing to GCS, but failure in listing the artifacts in the UI.
In my case, it turned out that if the docker-compose file assigns the env vars by reading from a .env file (e.g. GOOGLE_APPLICATION_CREDENTIALS), the server might start before the assignment takes effect. The quick fix is to assign the env var directly under the environment: key instead of using the env_file: key, as sketched below.
For sensitive data that you still need to keep in a .env file, you can add a wait time for the server, and add depends_on: in the docker-compose file to make sure the database container starts before the MLflow server if you are using a database-backed store.
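A rough sketch of that part of the compose file, written here as a shell snippet that generates a docker-compose.override.yml (the service names, key path, and volume mount are placeholders, not the asker's actual setup):

# Sketch only: set the credential env var directly under environment: and order startup with depends_on:
cat > docker-compose.override.yml << 'EOF'
services:
  mlflow:
    environment:
      GOOGLE_APPLICATION_CREDENTIALS: /secrets/gcp-key.json
    volumes:
      - ./gcp-key.json:/secrets/gcp-key.json:ro
    depends_on:
      - postgres
EOF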
I faced the same issue when running MLflow locally. It got resolved after adding GOOGLE_APPLICATION_CREDENTIALS to the environment variables.
https://googleapis.dev/python/google-api-core/latest/auth.html
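For example (the key file path is a placeholder):

# Sketch: point the Google client libraries at a service account key before starting MLflow locally
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
mlflow ui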

Transfer MongoDB dump on external hard drive to google cloud platform

As part of my thesis project, I have been given a MongoDB dump of size 240 GB, which is on my external hard drive. I'll have to use this data to run my Python scripts for a short duration. However, since my dataset is huge and I cannot mongoimport it into my local MongoDB server (I don't have enough internal memory), my professor gave me a $100 Google Cloud Platform coupon so I can use Google's cloud computing resources.
So far I have researched that I can do it this way:
Create a Compute Engine instance in GCP and install MongoDB on it. Transfer the MongoDB dump to the remote instance and run the scripts there to get the output.
This method works well, but I'm looking for a way to create a remote database server on GCP so that I can run my scripts locally, something like one of the following.
Creating a remote mongodb server on GCP so that I can establish a remote mongo connection to run my scripts locally.
Transferring the mongodb dump to google's datastore so then I can use the datastore API to remotely connect and run my scripts locally.
I have given some thought to using MongoDB Atlas, but because of the size of the data I would be billed heavily, and I cannot use my GCP coupon there.
Any help or suggestions on how either of the two methods can be implemented are appreciated.
There are two parts to your question.
First, you can create a Compute Engine VM with MongoDB installed and load your backup onto it. Then, open the right firewall rules to allow the connection from your local environment to the Compute Engine VM. The connection will be made with a simple login/password.
You can use a static IP on your VM. That way, if the VM reboots, it will keep the same IP (which makes your local connection easier).
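As a sketch of that first option (instance name, region/zone, and your client IP are placeholders), the gcloud commands could look roughly like this:

# Sketch: reserve a static IP, create the VM with it, and open MongoDB's port only to your own IP
gcloud compute addresses create mongo-ip --region=us-central1
gcloud compute instances create mongo-vm --zone=us-central1-a --address=mongo-ip --tags=mongodb
gcloud compute firewall-rules create allow-mongo --allow=tcp:27017 --source-ranges=YOUR_PUBLIC_IP/32 --target-tags=mongodb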
Second, BE CAREFUL with Datastore. It's a good product, a serverless, document-oriented NoSQL database, but it's absolutely not a MongoDB equivalent. You can't perform aggregations, you are limited in search capabilities, and so on. It's designed for specific use cases (I don't know yours, but don't assume it is a MongoDB equivalent!).
Anyway, if you use Datastore, you will have to use a service account or install the Google Cloud SDK on your local environment to be authenticated and able to call the Datastore API. There is no login/password in this case.
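For example, either of the following (the key file path is a placeholder) lets local code authenticate without a login/password:

# Sketch: authenticate local code against Google Cloud APIs (Datastore included)
gcloud auth application-default login
# or, with a service account key file:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json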

Accessing the StreamSets web UI on a different node in the cluster than where it is installed, which file system does it 'look in'?

I have a cluster of machines hosting Hadoop (MapR) and have installed StreamSets on one of the nodes (say node002) following the RPM documentation. However, I am accessing the web UI for the Data Collector from another node, node001.
My question is: when I specify file paths (e.g. an origin directory), which file system is the web UI going to be referring to? E.g. if I put an origin directory as /home/myuser/mydata, will the pipeline created in the web UI look for that directory on node001 or node002? I'm new to using StreamSets, so a more detailed answer would be appreciated. Thanks.
Ultimately I am asking this because I am currently getting "FileNotFound" and "permission denied" errors while trying to follow the documentation's tutorial, and I am trying to debug the situation.
From the StreamSets community forums: "It will be the path to the local file on the machine running that particular SDC instance."
The FileNotFound and permission errors have to do with the fact that the default user for the sdc service is a user called sdc. I'm still working out a proper fix, but you can produce a workable prototype by setting the read and write access on the directories in question to allow public access (this answers the posted question).
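As a rough sketch of that workaround (the directory is the one from the question; a real setup should use tighter permissions or run SDC as a different user):

# Workaround sketch: give the sdc service user access to the origin directory
sudo chmod -R a+rwX /home/myuser/mydata
# or, more narrowly, grant access only to the sdc user with an ACL:
sudo setfacl -R -m u:sdc:rwX /home/myuser/mydata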

Data written with gsutil is not visible with gcsfuse

I have installed gcsfuse to support an app requiring a posix-like mount point.
Existing data written with gsutil is not visible, but data written via the browser (Cloud Storage > Storage Browser) is.
According to https://cloud.google.com/storage/docs/gcsfuse -
You can simultaneously read and write to Google Cloud Storage using the Fuse Adapter and tools like gsutil. For example, if you write an object using the Fuse Adapter, it will immediately be available to read with gsutil, or vice versa, without the need to re-mount the bucket or reboot the Compute Engine instance.
Has anyone been successful collaborating with gcsfuse and gsutil?
I feel like I'm missing something.
Thanks!
This is likely because gsutil doesn't create directory placeholder objects, and gcsfuse by default requires them in order for a directory to be visible. To confirm: when you write an object with gsutil in a directory that you can already see (e.g. the root), does it show up?
You can work around this in one of two ways:
Create the directory placeholders for the directories you're missing. The easiest way to do this for a missing object foo/bar/baz is using a gcsfuse mount:
mkdir -p foo/bar
Run gcsfuse with the --implicit-dirs flag. Make sure to read the documentation linked above for caveats, though.
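For example (bucket name and mount point are placeholders):

# Sketch: mount the bucket so directories implied by object names like foo/bar/baz become visible
gcsfuse --implicit-dirs my-bucket /mnt/my-bucket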

Docker and sensitive information used at run-time

We are dockerizing an application (written in Node.js) that will need to access some sensitive data at run-time (API tokens for different services) and I can't find any recommended approach to deal with that.
Some information:
The sensitive information is not in our codebase, but it's kept on another repository in encrypted format.
On our current deployment, without Docker, we update the codebase with git, and then we manually copy the sensitive information via SSH.
The Docker images will be stored in a private, self-hosted registry.
I can think of some different approaches, but all of them have some drawbacks:
Include the sensitive information in the Docker images at build time. This is certainly the easiest one; however, it makes them available to anyone with access to the image (I don't know if we should trust the registry that much).
Like 1, but having the credentials in a data-only image.
Create a volume in the image that links to a directory in the host system, and manually copy the credentials over SSH like we're doing right now. This is very convenient too, but then we can't spin up new servers easily (maybe we could use something like etcd to synchronize them?)
Pass the information as environment variables. However, we have 5 different pairs of API credentials right now, which makes this a bit inconvenient. Most importantly, however, we would need to keep another copy of the sensitive information in the configuration scripts (the commands that will be executed to run Docker images), and this can easily create problems (e.g. credentials accidentally included in git, etc).
PS: I've done some research but couldn't find anything similar to my problem. Other questions (like this one) were about sensitive information needed at build-time; in our case, we need the information at run-time.
I've used your options 3 and 4 to solve this in the past. To rephrase/elaborate:
Create a volume in the image that links to a directory in the host system, and manually copy the credentials over SSH like we're doing right now.
I use config management (Chef or Ansible) to set up the credentials on the host. If the app takes a config file that needs API tokens or database credentials, I use config management to create that file from a template. Chef can read the credentials from an encrypted data bag or from attributes, set up the files on the host, and then start the container with a volume just like you describe.
Note that in the container you may need a wrapper to run the app. The wrapper copies the config file from wherever the volume is mounted to wherever the application expects it, then starts the app.
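A minimal sketch of such a wrapper, assuming the host directory is mounted at /secrets and the app reads /app/config/credentials.json (the paths and start command are made up for illustration):

#!/bin/sh
# Hypothetical entrypoint wrapper: copy credentials from the mounted volume
# to where the application expects them, then hand off to the real process.
set -e
cp /secrets/credentials.json /app/config/credentials.json
exec node /app/server.js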
Pass the information as environment variables. However, we have 5 different pairs of API credentials right now, which makes this a bit inconvenient. Most importantly, however, we would need to keep another copy of the sensitive information in the configuration scripts (the commands that will be executed to run Docker images), and this can easily create problems (e.g. credentials accidentally included in git, etc).
Yes, it's cumbersome to pass a bunch of env variables using -e key=value syntax, but this is how I prefer to do it. Remember the variables are still exposed to anyone with access to the Docker daemon. If your docker run command is composed programmatically it's easier.
If not, use the --env-file flag as discussed here in the Docker docs. You create a file with key=value pairs, then run a container using that file.
$ cat >> myenv << END
FOO=BAR
BAR=BAZ
END
$ docker run --env-file myenv your-image    # "your-image" is a placeholder for the image you want to run
That myenv file can be created using chef/config management as described above.
If you're hosting on AWS, you can leverage KMS here. Keep either the env file or the config file (the one passed to the container in a volume) encrypted via KMS. In the container, use a wrapper script to call out to KMS, decrypt the file, move it into place, and start the app. This way the config data is not exposed on disk.
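A sketch of that KMS wrapper, assuming the encrypted file is mounted at /secrets/config.json.enc and was produced with aws kms encrypt (paths, region, and the start command are placeholders):

#!/bin/sh
# Hypothetical startup wrapper: decrypt the KMS-encrypted config into place, then start the app.
set -e
aws kms decrypt --region us-east-1 --ciphertext-blob fileb:///secrets/config.json.enc \
    --output text --query Plaintext | base64 --decode > /app/config/config.json
exec node /app/server.js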