My issue may be a result of my misunderstanding of global consistency in Google Cloud Storage, but since I had not experienced this issue until recently (mid November) and it now seems easily reproducible, I wanted some clarification. The issue started happening in a piece of Spark code running on Compute Engine using bdutil, but I can reproduce it from the command line with gsutil.
My code deletes a destination path and then immediately renames a source path to that destination path. With global consistency I would expect that, since the destination path no longer exists, the source would be renamed to the destination; instead the source is being nested inside the destination as if the destination still existed, which is not consistent.
The Hadoop code to reproduce looks like:
fs.delete(new Path(dest), true)
fs.rename(new Path(src), new Path(dest))
From command line I can reproduce with:
gsutil -m rm -r gs://mybucket/dest
gsutil -m cp -r gs://mybucket/src gs://mybucket/dest
If the reason is that list operations are eventually consistent, and the FileSystem implementation uses list operations to determine whether the destination still exists, then I understand; in that case, is there a recommended way to ensure the destination no longer exists before renaming?
Thanks,
Luke
Travis's answer is a couple of years old and no longer true. The object list operation is strongly consistent now. Read Google's post.
Read-after-write (including delete) operations are strongly consistent, so for example, if you did:
gsutil -m rm -r gs://mybucket/dest
# Command output shows it removed gs://mybucket/dest/file1
gsutil cp gs://mybucket/dest/file1 my_local_dir/file1
That would always fail.
However, to determine whether a "directory" exists, gsutil must perform an eventually-consistent listing operation to find out whether any object in Google Cloud Storage's flat namespace has a prefix matching that "directory" name. This can lead to the problem you described, and I expect that the Hadoop code behaves similarly.
There isn't a strongly consistent workaround for this problem because there's no way to check for the existence of a prefix in a strongly-consistent way.
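A best-effort mitigation (only a sketch, and itself based on the eventually-consistent listing, so it narrows the window rather than closing it) is to poll the listing until the prefix appears empty before recreating the destination; the bucket and prefix below are the ones from your question:
# Not strongly consistent - just waits until the listing no longer shows the prefix.
until ! gsutil -q ls "gs://mybucket/dest/**" > /dev/null 2>&1; do
  sleep 1   # give the eventually-consistent listing time to catch up
done
gsutil -m cp -r gs://mybucket/src gs://mybucket/dest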
Related
I am trying to copy all files from a Cloud Storage bucket recursively and, as far as I have investigated, I am having a problem with the -m flag.
The command that I am running:
gsutil -m cp -r gs://{{ src_bucket }} {{ bucket_backup }}
I am getting something like this:
CommandException: 1 file/object could not be transferred.
where the number of files/objects differs every time.
After investigating I have tried to reduce the number of threads/processes used with the -m option, but this has not helped, so I am looking for some advice about this. I have 170 MiB of data in the bucket, which is approximately 300k files. I need to download them as fast as possible.
Update:
Logs with the -L flag:
[Errno 2] No such file or directory: '<path>/en_.gstmp' -> '<path>/en'
There are 6 errors like that.
The root of the issue might be that both a directory and a file of the same name exist in the GCS bucket. Try executing the command with the -L flag, so you will get additional logs on the execution and will be able to find the file that is causing this error.
I would suggest you delete that file, make sure there is no directory of that name in the bucket, and then upload the file to the bucket again.
Also check whether any directories were created with the JAR name; delete them and then proceed with copying the files.
Also check whether the required file already exists at the destination; if so, delete it at the destination and execute the copy again.
There are alternatives to cp; for example, it is possible to transfer files using rsync, as described here.
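For instance, a minimal rsync-based sketch (the local destination path is a placeholder to adapt):
gsutil -m rsync -r gs://{{ src_bucket }} /path/to/local_backup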
You can also check similar threads: thread1, thread2 & thread3
We use Slurm workload manager to submit jobs to our high performance cluster. During runtime of a job, we need to copy the input files from a network filesystem to the node's local filesystem, run our analysis there and then copy the output files back to the project directory on the network filesystem.
While the workflow management system Snakemake integrates with Slurm (by defining profiles) and allows each rule/step in the workflow to be run as a Slurm job, I haven't found a simple way to specify for each rule whether a tmp folder should be used (with all the implications stated above) or not.
I would be very happy about simple suggestions on how to realise this behaviour.
I am not entirely sure if I understand correctly. I am guessing you do not want to copy the input of each rule to a certain directory, run the rule, then copy the output back to another filesystem, since that would be a lot of unnecessary file moving. So for the first half of the answer I assume that before execution you move your files to /scratch/mydir.
I believe you could use the --directory option (https://snakemake.readthedocs.io/en/stable/executing/cli.html). However, I find this works poorly, since Snakemake then has difficulty finding the config.yaml and samples.tsv.
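For reference, this is the kind of invocation I mean (just a sketch; the scratch path and the explicit --configfile location are placeholders, and passing --configfile explicitly may or may not get around the lookup problem):
snakemake --directory /scratch/mydir --configfile /path/to/workflow/config.yaml --cores 4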
The way I solve this is just by adding a working dir in front of my paths in each rule...
rule example:
    input:
        config["cwd"] + "{sample}.txt"
    output:
        config["cwd"] + "processed/{sample}.txt"
    shell:
        """
        touch {output}
        """
So all you then have to do is change cwd in your config.yaml.
local:
  cwd: ./
slurm:
  cwd: /scratch/mydir
You would then have to manually copy them back to your long-term filesystem or make a rule that would do that for you.
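The copy-back itself can be as simple as one shell command run once the workflow has finished (a sketch; both paths are placeholders):
rsync -a /scratch/mydir/processed/ /network/project/processed/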
Now if however you do want to copy your files from filesystem A -> B, do your rule, and then move the result from B -> A, then I think you want to make use of shadow rules. I think the docs properly explain how to use that so I just give a link :).
I've set up some Nearline buckets and enabled versioning and object lifecycle management. The use case is to replace my current backup solution, Crashplan.
Using gsutil I can see the different versions of a file using a command like gsutil ls -al gs://backup/test.txt.
First, is there any way of finding files that don't have a live version (e.g. deleted) but still have a version attached?
Second, is there any easier way of managing versions? For instance if I delete a file from my PC, it will no longer have a live version in my bucket but will still have the older versions associated. Say, if I didn't know the file name would I just have to do a recursive ls on the entire bucket and sift through the output?
Would love a UI that supported versioning.
Thanks.
To check whether the object currently has no live version, use the x-goog-if-generation-match header equal to 0, for example:
gsutil -h x-goog-if-generation-match:0 cp file.txt gs://bucket/file.txt
will fail (PreconditionException: 412 Precondition Failed) if the file has a live version and will succeed if it has only archived versions.
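To find such objects across the whole bucket (your first question), one rough approach is to compare the listing of all versions with the listing of live objects. This is only a sketch, using the bucket name from your question, and it relies on the generation suffix that gsutil ls -a appends after #:
gsutil ls -a gs://backup/** | sed 's/#[0-9]*$//' | sort -u > all_names.txt
gsutil ls gs://backup/** | sort -u > live_names.txt
comm -23 all_names.txt live_names.txt   # names that exist only as archived versions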
In order to automatically synchronize your local folder and a folder in the bucket (or the other way around), use gsutil rsync:
gsutil rsync -r -d ./test gs://bucket/test/
notice the trailing / in gs://bucket/test/; without it you will receive:
CommandException: arg (gs://graham-dest/test) does not name a directory, bucket, or bucket subdir.
-r synchronizes all the directories in ./test recursively to gs://bucket/test/
-d will delete all files from gs://bucket/test/ that are not found in ./test
Regarding a UI, there already exists a feature request. I don't know anything about third-party applications, however.
I have date-wise folders in the form root-dir/yyyy/mm/dd,
under which there are many files present.
I want to update the timestamp of all the files falling within a certain date range,
for example 2 weeks, i.e. 14 folders, so that these files can be picked up by my file-streaming data ingestion process.
What is the easiest way to achieve this?
Is there a way in the UI console, or is it done through gsutil?
Please help.
GCS objects are immutable, so the only way to "update" the timestamp would be to copy each object on top of itself, e.g., using:
gsutil cp gs://your-bucket/object1 gs://your-bucket/object1
(and looping over all objects you want to do this to).
This is a fast (metadata-only) operation, which will create a new generation of each object, with a current timestamp.
Note that if you have versioning enabled on the bucket, doing this will create an extra version of each file you copy this way.
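A rough sketch of that loop for a 14-day range (assumes bash with GNU date; the bucket name and root-dir are placeholders to adapt):
BUCKET=gs://your-bucket
for d in $(seq 0 13); do
  prefix=$(date -d "-${d} days" +%Y/%m/%d)
  gsutil ls "$BUCKET/root-dir/$prefix/**" 2>/dev/null | while read -r obj; do
    gsutil cp "$obj" "$obj"   # metadata-only self-copy gives the object a current timestamp
  done
done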
When you say "folders in the form of root-dir/yyyy/mm/dd", do you mean that you're copying those objects into your bucket with names like gs://my-bucket/root-dir/2016/12/25/christmas.jpg? If not, see Mike's answer; but if they are named with that pattern and you just want to rename them, you could use gsutil's mv command to rename every object with that prefix:
$ export BKT=my-bucket
$ gsutil ls gs://$BKT/**
gs://my-bucket/2015/12/31/newyears.jpg
gs://my-bucket/2016/01/15/file1.txt
gs://my-bucket/2016/01/15/some/file.txt
gs://my-bucket/2016/01/15/yet/another-file.txt
$ gsutil -m mv gs://$BKT/2016/01/15 gs://$BKT/2016/06/20
[...]
Operation completed over 3 objects/12.0 B.
# We can see that the prefixes changed from 2016/01/15 to 2016/06/20
$ gsutil ls gs://$BKT/**
gs://my-bucket/2015/12/31/newyears.jpg
gs://my-bucket/2016/06/20/file1.txt
gs://my-bucket/2016/06/20/some/file.txt
gs://my-bucket/2016/06/20/yet/another-file.txt
I've run into a weird problem after uploading a lot of images with gsutil - the uploaded files cannot be seen via the Google Cloud Console and gsutil itself complains if I try to do a 'gsutil ls '. I am 99% sure it is related to the use of "å" or "Å" together with spaces in the directory name.
All uploads were done recursively from a root folder (large image collection in multiple levels of subdirectories). If I try to upload the files again, gsutil skips them since they are already there, so the upload feature does something - it just isn't working in the same way as the list and download.
An example:
gsutil cp -R -n /Volumes/Photos/digitalfotografen.dk/2009/2009-05-30\ Søgården\ -\ bryllup/ gs://digitalfotografen/2009/
Skipping existing item: gs://digitalfotografen/2009/2009-05-30 Søgården - bryllup/Søgården 0128.CR2
...
OK - so the files are there, but browsing the directory through the Google Cloud Console shows "No results".
Also:
gsutil ls gs://digitalfotografen/2009/2009-06-27 Søgården - reklamefotos/20090627_IMG_0128.CR2
CommandException: "ls" command does not support "file://" URIs. Did you mean to use a gs:// URI?
I tried escaping spaces and used quotation marks in different ways with no luck.
Now, here is the interesting thing:
gsutil cp -R -n /Volumes/Photos/digitalfotografen.dk/2009/2009-05-30\ Søgården\ -\ bryllup/ gs://digitalfotografen/2009/
Copying file:///Volumes/Photos/digitalfotografen.dk/2009/2009-05-30 Søgården - bryllup/Søgården 0128.CR2 [Content-Type=application/octet-stream]...
Here I copied the folder specifically with escaped spaces on the source side, and now the files are uploaded again. This creates a second folder with the same name (at least it appears so in the Cloud Console) and the files are now visible in both folders.
We use three different characters in the Danish character set that are outside standard US ASCII ("æøå" and the capital "ÆØÅ"), but the problem only seems to affect "å" and "Å" - the two others, alone or in combination, work fine. My hunch is that "å" and "Å" may translate into something entirely different in ASCII that throws things off track when gsutil is allowed to handle the directory naming on its own based on the name of the root folder (doing a multiple-level recursion), but works when the user specifies the escaped name of the root folder.
This may be a Python issue rather than a gsutil issue, but I am in no way qualified to identify this since I have very close to zero knowledge of programming outside a bit of hodgepodge shell scripting.
We ran into trouble with gsutil in Ubuntu WSL on Windows 10.
The gsutil command works perfectly in the shell but does not work when it is included in a shell script:
gsutil -m ls -lr gs://project.appspot.com/
Error:
CommandException: "ls" command does not support "file://" URLs. Did you mean to use a gs:// URL?
A workaround could be to call the script /usr/lib/google-cloud-sdk/platform/gsutil/gsutil directly instead of the link /usr/bin/gsutil:
/usr/lib/google-cloud-sdk/platform/gsutil/gsutil -m ls -lr gs://project.appspot.com/
I don't know why, but it works.
Thanks Marion for providing us with such an uncommon bug :-)
I know this is an old error, but nevertheless I had a similar issue to the one described above.
CommandException: "ls" command does not support "file://" URLs. Did you mean to use a gs:// URL?
I was calling gsutil from Scala code.
import sys.process._

object Main {
  def main(args: Array[String]): Unit = {
    // Output of "gsutil ls"; each line is a gs:// URL
    val clients = s"gsutil ls gs://<bucket name>".!!
    // e.g. "2019-01-01\n" - note the trailing newline from the shell
    val beforeDate: String = "date +%Y-%m-%d -d '-8 days'".!!
    val clientList = clients.split("\n").map(f => f.split('/').apply(1)).toList
    for (x <- clientList) {
      // stripLineEnd removes the trailing newline so the gs:// path is valid
      val countImg = (s"gsutil -m ls gs://<bucket name>/$x/${beforeDate.stripLineEnd}" #| "wc -l").!!
      println(countImg)
    }
  }
}
What I found was that there was a line-end character on beforeDate; when I stripped that, the error went away. So the error occurs when there is a "special" character in the gs://... path. Be sure to strip variables of any "special" characters.
And all this happened just because I was too lazy to use java.time.LocalDate to generate the beforeDate variable. I hope this will help others who encounter the same error.