I got this output:
$ gsutil ls gs://ml_models_c/ref7/test/model/2/
gs://ml_models_c/ref7/test/model/2/ <= why this?
gs://ml_models_c/ref7/test/model/2/saved_model.pb
gs://ml_models_c/ref7/test/model/2/variables/
$ gsutil ls gs://seldon-models/tfserving/mnist-model/1/
gs://seldon-models/tfserving/mnist-model/1/saved_model.pb
gs://seldon-models/tfserving/mnist-model/1/variables/
Why is there a gs://ml_models_c/ref7/test/model/2/ entry in the first command's output?
Why doesn't the second command list its own path like that?
It seems that I can rm it.
Thanks
At the API level, Cloud Storage doesn't have the concept of folders; everything is stored as objects whose names are long strings that may contain slashes.
In this case, you likely have an object named gs://ml_models_c/ref7/test/model/2/, but no object named gs://seldon-models/tfserving/mnist-model/1/.
If you don't need the gs://ml_models_c/ref7/test/model/2/ object, you can delete it and it will no longer show up in the output of gsutil ls.
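For example, removing the placeholder object (assuming nothing else in your workflow depends on it) is just:
gsutil rm gs://ml_models_c/ref7/test/model/2/
After that, gsutil ls will list only the real objects under that prefix.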
I have this command:
gsutil rsync -r -x '".*.jpg$"' File Share\data\Home Drive gs://sdefs01/Home Drive
This is to exclude any .jpg files from being copied to my Google bucket.
However, it returns an error:
CommandException: The rsync command accepts at most 2 arguments.
The command example I'm referring to is from a Google Cloud support page.
Please help.
You need to put the source directory path inside double quotes, as it contains spaces; the destination URL contains a space as well, so quote it too.
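For example, a sketch of the corrected command (assuming the Windows command prompt, where double quotes are the quoting character):
gsutil rsync -r -x ".*\.jpg$" "File Share\data\Home Drive" "gs://sdefs01/Home Drive"
The -x argument is a Python regular expression, so the dot before jpg is escaped here as well.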
I am creating a KSH script to check whether a subdirectory exists in a GCS bucket. I am writing the script like this:
#!/bin/ksh
set -e
set -o pipefail
gsutil -q stat ${DESTINATION_PATH}/
PATH_EXIST=$?
if [ ${PATH_EXIST} -eq 0 ] ; then
    # do something
fi
A weird thing happens when ${DESTINATION_PATH}/ does not exist: the script exits without ever evaluating PATH_EXIST=$?. If ${DESTINATION_PATH}/ exists, the script runs normally as expected.
Why does that happen? How can I do this better?
The statement set -e means that your script exits as soon as a command returns a non-zero status.
The gsutil stat command can be used to check whether an object exists:
gsutil -q stat gs://some-bucket/some-object
It has an exit status of 0 for an existing object and 1 for a non-existent object.
However, using it on subdirectories is advised against:
Note: Unlike the gsutil ls command, the stat command does not support operations on sub-directories. For example, if you run the command:
gsutil -q stat gs://some-bucket/some-subdir/
gsutil will look for information about an object called some-subdir/ (with a trailing slash) inside the bucket some-bucket, as opposed to operating on objects nested under gs://some-bucket/some-subdir/. Unless you actually have an object with that name, the operation will fail.
The reason your command does not fail when ${DESTINATION_PATH}/ exists is that, if you create the folder using the Cloud Console (i.e. the UI), a placeholder object is created with that name. But to be clear, folders don't exist in Google Cloud Storage; they are just a visualization of the hierarchy of object names in the bucket.
So if you upload an object named newFolder/object to your bucket and newFolder does not exist, the folder will appear to be "created", but gsutil -q stat ${DESTINATION_PATH}/ will still return exit code 1. However, if you create the folder through the UI and run the same command, it will return exit code 0. So follow the documentation, and avoid using stat to check whether a directory exists.
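A quick way to see this (bucket and object names here are just placeholders): upload an object under a brand-new prefix, then stat the prefix itself:
$ gsutil cp local.txt gs://my-bucket/newFolder/object
$ gsutil -q stat gs://my-bucket/newFolder/ ; echo $?
1
The prefix shows up in the console as a folder, but there is no object literally named newFolder/, so stat fails.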
Instead, if you want to check whether a subdirectory exists, check whether it contains any objects:
gsutil -q stat ${DESTINATION_PATH}/*
This will return 0 if there is at least one object under the subdirectory and 1 otherwise.
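Putting that together with your script, a minimal sketch might look like this (the echo lines are placeholders; note that a command tested directly by if is exempt from set -e, which is why the original script died before ever reaching PATH_EXIST=$?):
#!/bin/ksh
set -e
set -o pipefail

# A command tested directly by `if` does not trigger `set -e`,
# so a missing path no longer aborts the script before we can react.
if gsutil -q stat "${DESTINATION_PATH}/*" ; then
    # do something
    echo "${DESTINATION_PATH} exists"
else
    echo "${DESTINATION_PATH} does not exist" >&2
fi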
I have date-wise folders of the form root-dir/yyyy/mm/dd, under which many files are present.
I want to update the timestamp of all the files falling under a certain date range, for example 2 weeks, i.e. 14 folders, so that these files can be picked up by my file-streaming data ingestion process.
What is the easiest way to achieve this?
Is there a way in the UI console, or is it done through gsutil?
Please help.
GCS objects are immutable, so the only way to "update" the timestamp would be to copy each object on top of itself, e.g., using:
gsutil cp gs://your-bucket/object1 gs://your-bucket/object1
(and looping over all objects you want to do this to).
This is a fast (metadata-only) operation, which will create a new generation of each object, with a current timestamp.
Note that if you have versioning enabled on the bucket, doing this will create an extra version of each file you copy this way.
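A minimal sketch of that loop for a single day's folder (the bucket name and date prefix are placeholders, and this assumes object names contain no whitespace); repeat it, or widen the wildcard, for each of the 14 date folders:
for obj in $(gsutil ls "gs://your-bucket/root-dir/2016/01/15/**"); do
    gsutil cp "$obj" "$obj"   # metadata-only rewrite; gives the object a fresh timestamp
done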
When you say "folders in the form of root-dir/yyyy/mm/dd", do you mean that you're copying those objects into your bucket with names like gs://my-bucket/root-dir/2016/12/25/christmas.jpg? If not, see Mike's answer; but if they are named with that pattern and you just want to rename them, you could use gsutil's mv command to rename every object with that prefix:
$ export BKT=my-bucket
$ gsutil ls gs://$BKT/**
gs://my-bucket/2015/12/31/newyears.jpg
gs://my-bucket/2016/01/15/file1.txt
gs://my-bucket/2016/01/15/some/file.txt
gs://my-bucket/2016/01/15/yet/another-file.txt
$ gsutil -m mv gs://$BKT/2016/01/15 gs://$BKT/2016/06/20
[...]
Operation completed over 3 objects/12.0 B.
# We can see that the prefixes changed from 2016/01/15 to 2016/06/20
$ gsutil ls gs://$BKT/**
gs://my-bucket/2015/12/31/newyears.jpg
gs://my-bucket/2016/06/20/file1.txt
gs://my-bucket/2016/06/20/some/file.txt
gs://my-bucket/2016/06/20/yet/another-file.txt
I have this wget command:
sudo wget --user-agent='some-agent' --referer=http://some-referrer.html -N -r -nH --cut-dirs=x --timeout=xxx --directory-prefix=/directory/for/downloaded/files -i list-of-files-to-download.txt
-N will check if there is actually a newer file to download.
-r will turn the recursive retrieving on.
-nH will disable the generation of host-prefixed directories.
--cut-dirs=x will skip creating the first x remote directory components locally.
--timeout=xxx will, well, timeout :)
--directory-prefix will store files in the desired directory.
This works nicely, no problem.
Now, to the issue:
Let's say my list-of-files-to-download.txt has these kinds of files:
http://website/directory1/picture-same-name.jpg
http://website/directory2/picture-same-name.jpg
http://website/directory3/picture-same-name.jpg
etc...
You can see the problem: on the second download, wget will see we already have a picture-same-name.jpg, so it won't download the second or any of the following ones with the same name. I cannot mirror the directory structure because I need all the downloaded files to be in the same directory. I can't use the -O option because it clashes with -N, and I need that. I've tried to use -nd, but it doesn't seem to work for me.
So, ideally, I need to be able to:
a.- wget from a list of URLs the way I do now, keeping my parameters.
b.- get all the files into the same directory, with the ability to rename each file.
Does anybody have any solution to this?
Thanks in advance.
I would suggest 2 approaches -
Use the "-nc" or the "--no-clobber" option. From the man page -
-nc
--no-clobber
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including -nc. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.
When running Wget without -N, -nc, -r, or -p, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named file.1. If that file is downloaded yet again, the third copy will be named file.2, and so on. (This is also the behavior with -nd, even if -r or -p are in effect.) When -nc is specified, this behavior is suppressed, and Wget will refuse to download newer copies of file. Therefore, "no-clobber" is actually a misnomer in this mode---it's not clobbering that's prevented (as the numeric suffixes were already preventing clobbering), but rather the multiple version saving that's prevented.
When running Wget with -r or -p, but without -N, -nd, or -nc, re-downloading a file will result in the new copy simply overwriting the old. Adding -nc will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored.
When running Wget with -N, with or without -r or -p, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file. -nc may not be specified at the same time as -N.
A combination with -O/--output-document is only accepted if the given output file does not exist.
Note that when -nc is specified, files with the suffixes .html or .htm will be loaded from the local disk and parsed as if they had been retrieved from the Web.
As you can see from this man page entry, the behavior might be unpredictable/unexpected. You will need to see if it works for you.
Another approach would be to use a bash script. I am most comfortable using bash on *nix, so forgive the platform dependency. However, the logic is sound, and with a bit of modification you can get it to work on other platforms/shells as well.
Sample pseudocode bash script -
while read -r i; do
    wget <all your flags except the -i flag> "$i" -O /path/to/custom/directory/filename
done < list-of-files-to-download.txt
You can modify the script to download each file to a temporary file, parse $i to get the filename from the URL, check whether that file already exists on disk, and then decide what to rename the temp file to.
This offers much more control over your downloads.
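For instance, a hedged sketch of that idea (the target directory and the choice of prefixing each file with its parent directory name are illustrative, and, as you noted, -O cannot be combined with -N, so timestamp checking is dropped here):
while read -r url; do
    fname=$(basename "$url")                # e.g. picture-same-name.jpg
    prefix=$(basename "$(dirname "$url")")  # e.g. directory1, keeps colliding names apart
    tmp=$(mktemp)
    # add your other wget flags (user-agent, referer, timeout, ...) here
    if wget -O "$tmp" "$url"; then
        mv "$tmp" "/directory/for/downloaded/files/${prefix}-${fname}"
    else
        rm -f "$tmp"
    fi
done < list-of-files-to-download.txt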
I'm using wget in a cron job to fetch a .jpg file into a web server folder once per minute (with the same filename each time, overwriting). This folder is "live", in that the web server also serves that image from there. However, if someone browses to that page while the image is being fetched, the browser reports it as a JPG with errors. So what I need is something similar to what Firefox does when downloading a file: wget should write to a temporary file, either in /var or in the destination folder but with a temporary name, until it has the whole thing, and then rename it in an atomic (or at least negligible-duration) step.
I've read the wget man page and there doesn't seem to be a command line option for this. Have I missed it? Or do I need to do two commands in my cron job, a wget and a move?
There is no way to do this purely with GNU Wget.
wget's job is to download files, and it does that. A simple one-line script can achieve what you're looking for:
$ wget -O myfile.jpg.tmp example.com/myfile.jpg && mv myfile.jpg{.tmp,}
Since mv is atomic on Linux (as long as the source and destination are on the same filesystem), you get an atomic update to a fully ready file.
Just wanted to share my solution:
alias wget='func(){ (wget --tries=0 --retry-connrefused --timeout=30 -O download_pkg.tmp "$1" && mv download_pkg.tmp "${1##*/}") || rm download_pkg.tmp; unset -f func; }; func'
This creates a function that takes a URL parameter and downloads the file under a temporary name. If the download succeeds, the file is renamed to the proper filename, extracted from parameter $1 with ${1##*/}; if it fails, the temp file is deleted. If the operation is aborted, the temp file will simply be replaced on the next run. Finally, unset -f removes the function definition after the alias has executed.
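For example, with that alias in place, running something like wget http://example.com/pkg/archive.tar.gz (a made-up URL) first writes to download_pkg.tmp and only renames it to archive.tar.gz once the download has completed successfully.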