Data written with gsutil is not visible with gcsfuse - google-cloud-storage

I have installed gcsfuse to support an app that requires a POSIX-like mount point.
Existing data written with gsutil is not visible, but data written via the browser (Cloud Storage > Storage Browser) is.
According to https://cloud.google.com/storage/docs/gcsfuse -
You can simultaneously read and write to Google Cloud Storage using the Fuse Adapter and tools like gsutil. For example, if you write an object using the Fuse Adapter, it will immediately be available to read with gsutil, or vice versa, without the need to re-mount the bucket or reboot the Compute Engine instance.
Has anyone been successful collaborating with gcsfuse and gsutil?
I feel like I'm missing something.
Thanks!

This is likely because gsutil doesn't create directory placeholder objects, and gcsfuse by default requires them in order for a directory to be visible. To confirm: when you write an object with gsutil in a directory that you can already see (e.g. the root), does it show up?
You can work around this in one of two ways:
Create the directory placeholders for the directories you're missing. The easiest way to do this for a missing object foo/bar/baz is using a gcsfuse mount:
mkdir -p foo/bar
Run gcsfuse with the --implicit-dirs flag. Make sure to read the documentation linked above for caveats, though.
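For reference, a mount using that flag might look like the following (the bucket name and mount point are placeholders):
gcsfuse --implicit-dirs my-bucket /mnt/my-bucket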

Related

Default UNIX permissions of Mongodb files in the hard drive

I noticed that the files in the data/ directory, which hosts the databases and collections, have the r (read) permission for others.
So basically, anyone can read the data! Isn't it strange, or is it something I'm missing?
I found no way to change this behavior in the mongodb configuration (Ubuntu 18.04). When you search for mongodb file permissions, you only find threads about user permissions inside the database.
Thank you!
I'm going to assume you're using WiredTiger, the default storage engine for MongoDB. Either way, the same concept applies.
You'll see that the .wt files (the ones you're talking about), although readable by permission, are not very readable to the eye. Try looking for yourself with less <example>.wt.
They're stored in a specific format, with compression and some encryption. Realistically, they shouldn't be retrievable from outside of your server - and your users on the server should be trusted, or given limited access to the locations of these files.
In short, if you apply the proper policies, and keep your actual database and server secure, then this is normal and expected. I hope this makes sense.
When you launch mongod you need to specify a path to the data directory, and this directory must already exist.
You can set the permissions on this directory to deny world-read access by running:
chmod o-rwx /path/to/data/dir
Normally this would be done prior to the first start of mongod.
Once this is done, none of the files in the data directory will be world-readable regardless of their individual permissions.
MongoDB does not need to have a provision to do this because it never creates the data directory.
A different way of accomplishing a similar end result is to use umask, but changing the permissions on the data directory is generally more reliable.
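For the umask route, a rough sketch would be the following (the data path is a placeholder, and the exact mongod options depend on your setup):
umask 027    # files mongod creates will not be world-readable
mongod --dbpath /path/to/data/dir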

Symlink to HTTP URL

I am basically trying to move most of my static images to an Amazon S3 bucket but my site still needs to look for the originals through the filesystem in order to generate thumbnails.
/home/user/public_html/upload/2015/*.gif to http://s3.amazonbucketurl/upload/2015/*.gif
I read that symlinks can't point to HTTP URLs. What options do I have?
The correct solution is to modify your application to load what it needs from S3 directly, because you are correct -- symlinks cannot natively reference http destinations.
The work-around is to use a mechanism like s3fs which gives the illusion that your S3 bucket is a mounted filesystem on your server... which you can symlink to.
This is not a genuine and proper solution, because S3 is not really a filesystem (it's an object store) and thus does not precisely follow filesystem semantics. You will not see the same performance you expect from a local filesystem or from accessing S3 directly, natively in your code, because s3fs (of necessity) has extra work to do to try to bridge the impedance gap between "filesystem" and "object store." It does an admirable job, but it is trying to accomplish a task that is conceptually impossible to execute perfectly.
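If you do go the s3fs route, a minimal sketch looks roughly like this (the bucket name, mount point, and key values are placeholders; assumes s3fs-fuse is installed):
echo ACCESS_KEY_ID:SECRET_ACCESS_KEY > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs
s3fs my-bucket /mnt/s3 -o passwd_file=~/.passwd-s3fs
# after moving the original directory out of the way:
ln -s /mnt/s3/upload/2015 /home/user/public_html/upload/2015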
There's no magic bullet solution.

Docker and sensitive information used at run-time

We are dockerizing an application (written in Node.js) that will need to access some sensitive data at run-time (API tokens for different services) and I can't find any recommended approach to deal with that.
Some information:
The sensitive information is not in our codebase, but it's kept in another repository in encrypted format.
On our current deployment, without Docker, we update the codebase with git, and then we manually copy the sensitive information via SSH.
The Docker images will be stored in a private, self-hosted registry.
I can think of some different approaches, but all of them have some drawbacks:
Include the sensitive information in the Docker images at build time. This is certainly the easiest one; however, it makes them available to anyone with access to the image (I don't know if we should trust the registry that much).
Like 1, but having the credentials in a data-only image.
Create a volume in the image that links to a directory in the host system, and manually copy the credentials over SSH like we're doing right now. This is very convenient too, but then we can't spin up new servers easily (maybe we could use something like etcd to synchronize them?)
Pass the information as environment variables. However, we have 5 different pairs of API credentials right now, which makes this a bit inconvenient. Most importantly, however, we would need to keep another copy of the sensitive information in the configuration scripts (the commands that will be executed to run Docker images), and this can easily create problems (e.g. credentials accidentally included in git, etc).
PS: I've done some research but couldn't find anything similar to my problem. Other questions (like this one) were about sensitive information needed at build-time; in our case, we need the information at run-time.
I've used your options 3 and 4 to solve this in the past. To rephrase/elaborate:
Create a volume in the image that links to a directory in the host system, and manually copy the credentials over SSH like we're doing right now.
I use config management (Chef or Ansible) to set up the credentials on the host. If the app takes a config file needing API tokens or database credentials, I use config management to create that file from a template. Chef can read the credentials from an encrypted data bag or attributes, set up the files on the host, then start the container with a volume just like you describe.
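As an illustration, starting the container with that directory mounted read-only might look like this (the host path and image name are placeholders):
docker run -d -v /etc/myapp/secrets:/secrets:ro myorg/myapp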
Note that in the container you may need a wrapper to run the app. The wrapper copies the config file from wherever the volume is mounted to wherever the application expects it, then starts the app.
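A wrapper along these lines is usually enough (paths and the start command are placeholders; the app here is assumed to be the Node.js service from the question):
#!/bin/sh
# copy the mounted config to where the app expects it, then exec the app
cp /secrets/config.json /app/config/config.json
exec node /app/server.js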
Pass the information as environment variables. However, we have 5 different pairs of API credentials right now, which makes this a bit inconvenient. Most importantly, however, we would need to keep another copy of the sensitive information in the configuration scripts (the commands that will be executed to run Docker images), and this can easily create problems (e.g. credentials accidentally included in git, etc).
Yes, it's cumbersome to pass a bunch of env variables using -e key=value syntax, but this is how I prefer to do it. Remember the variables are still exposed to anyone with access to the Docker daemon. If your docker run command is composed programmatically it's easier.
If not, use the --env-file flag as discussed here in the Docker docs. You create a file with key=value pairs, then run a container using that file.
$ cat >> myenv << END
FOO=BAR
BAR=BAZ
END
$ docker run --env-file myenv <image>
That myenv file can be created using chef/config management as described above.
If you're hosting on AWS you can leverage KMS here. Keep either the env file or the config file (that is passed to the container in a volume) encrypted via KMS. In the container, use a wrapper script to call out to KMS, decrypt the file, move it in to place and start the app. This way the config data is not exposed on disk.
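For example, a wrapper for the KMS variant could look roughly like this (file paths are placeholders; assumes the instance profile is allowed to call kms:Decrypt and the AWS CLI is installed):
#!/bin/sh
# decrypt the KMS-encrypted config, put it where the app expects it, then start
aws kms decrypt --ciphertext-blob fileb:///secrets/config.json.enc \
    --output text --query Plaintext | base64 --decode > /app/config/config.json
exec node /app/server.js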

Using Google Cloud Storage with rsync

I am new to Google Cloud. We have historically used AWS for online backups -- essentially, our local servers ran rsync to an EC2 instance at AWS and it all worked fine. I'm now trying to migrate from AWS to Google and of course the setup is pretty different. With gsutil rsync it looked to me as though I wouldn't need to spin up a Compute Engine instance at all; I could just push stuff straight into the gs://aws_mnt bucket.
Having installed the SDK on our AWS instance, I was able to push all our backups to the gs://aws_mnt bucket very easily using gsutil cp -n.
But going forward I want to run a cron job on the local server which uses rsync rather than cp for obvious reasons.
I have two issues:
Despite reading the appropriate documentation (here), I am so stupid I can't figure out how to permanently authorise the local server so I don't have to do gcloud auth login and get a code from a browser each session; for a cron job that's not really going to work.
When I try to use gsutil rsync from the local server to the gs://aws_mnt bucket that was pre-populated from AWS, I get an error:
gsutil rsync /mnt/archive/backups gs://aws_mnt/kahless
Building synchronization state...
Skipping cloud sub-directory placeholder object gs://aws_mnt/kahless/
Starting synchronization
There is some discussion of this error on github and I've produced detailed output from
gsutil -D -m rsync /mnt/archive/backups gs://aws_mnt/kahless
But since this is a brand-new install of the SDK, I can't imagine that the issue in that thread hasn't already been dealt with, so I must be doing something wrong?
Rus
In response to your questions:
Once you have configured credentials using gcloud auth, the 'gcloud auth login' command will cause them to be selected until you log in with a different credential... and that state will persist and will not require you to go through the browser session again unless/until you revoke those credentials. Note: if you're thinking of running commands from an unattended script (e.g., via cron), please consider using service account credentials. For more details please see https://developers.google.com/cloud/sdk/gcloud/#gcloud.auth
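For example, a cron-friendly setup might activate a service account once with a downloaded key file (the account name and key path are placeholders):
gcloud auth activate-service-account my-service-account@my-project.iam.gserviceaccount.com --key-file=/path/to/key.json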
That "skipping..." message is not an error - it's just informing you that gsutil is skipping trying to download the placeholder object, because such objects aren't needed in (and would interfere with) directories in the local file system. I'll update the message in the next version of gsutil to make this more clear. So, what you saw was that the second run of gsutil rsync found nothing to do after comparing the source and destination, and completed normally.

On resume gsutil seems to re-upload files

I'm trying to upload data to Google Cloud Storage from a disk with ~3000 files totalling 1TB. I'm using gsutil cp -R <disk-top-directory> <bucket>. My understanding is that, if gsutil is resumed/restarted, it uses checksums to determine when a file has already been uploaded and skips over it.
It doesn't appear to be doing this: it appears to be resuming the upload from the top and replacing the files all over again. When I run successive gsutil ls -Rl <bucket/disk-top-directory> ten minutes apart and compare them with diff, I see what appears to be the same files with the same sizes but a changed (newer) date. (i.e. consistent with the same file being re-uploaded.)
For example:
< 404104811 2014-04-08T14:13:44Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
---
> 404104811 2014-04-08T14:43:48Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
The machine I'm using to read the disk and transfer files is running Ubuntu 13.10. I installed gsutil using the pip instructions for Debian and Ubuntu.
Am I misunderstanding how gsutil's resumable transfers are supposed to work? If not, is there any diagnosis and fix to get the correct resume behavior? Thanks in advance!
You need to use the -n (No-clobber) switch to prevent the re-uploading of objects that already exist at the destination.
gsutil cp -Rn <disk-top-directory> <bucket>
From the help (gsutil help cp)
  -n            No-clobber. When specified, existing files or objects at the
                destination will not be overwritten. Any items that are skipped
                by this option will be reported as being skipped. This option
                will perform an additional HEAD request to check if an item
                exists before attempting to upload the data. This will save
                retransmitting data, but the additional HTTP requests may make
                small object transfers slower and more expensive.
Also, according to this, when transferring files over 2MB, gsutil automatically uses a resumable transfer mode.
If you're open to working with the (still beta) gsutil v4, that version of gsutil has an rsync command. You can get this by running:
gsutil update gs://prerelease/gsutil_4.0beta2pre_minus_m_sugg.tar.gz
Please be sure to read the release notes before switching to this major new release, especially if you're using gsutil v3 in scripts.
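Once on v4, a typical invocation mirroring a local directory into a bucket looks something like this (paths are placeholders):
gsutil -m rsync -r <disk-top-directory> gs://<bucket>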