How to correctly use `gsutil -q stat` in scripts? - google-cloud-storage

I am creating a KSH script to check whether a subdirectory exists in a GCS bucket. I am writing the script like this:
#!/bin/ksh
set -e
set -o pipefail
gsutil -q stat ${DESTINATION_PATH}/
PATH_EXIST=$?
if [ ${PATH_EXIST} -eq 0 ] ; then
# do something
fi
A weird thing happens when ${DESTINATION_PATH}/ does not exist: the script exits without evaluating PATH_EXIST=$?. If ${DESTINATION_PATH}/ exists, the script runs normally as expected.
Why does this happen? How can I do this better?

The statement set -e means that your script will exit as soon as a command exits with a non-zero status.
The gsutil stat command can be used to check whether an object exists:
gsutil -q stat gs://some-bucket/some-object
It has an exit status of 0 for an existing object and 1 for a non-existent object.
However, using it with subdirectories is advised against:
Note: Unlike the gsutil ls command, the stat command does not support
operations on sub-directories. For example, if you run the command:
gsutil -q stat gs://some-bucket/some-subdir/
gsutil will look for
information about an object called some-subdir/ (with a trailing
slash) inside the bucket some-bucket, as opposed to operating on
objects nested under gs://some-bucket/some-subdir/. Unless you
actually have an object with that name, the operation will fail.
The reason your command is not failing when ${DESTINATION_PATH}/ exists is that if you create the folder using the Cloud Console, i.e. the UI, a placeholder object is created with that name. But to be clear: folders don't exist in Google Cloud Storage; they are just a visualization of the bucket's object hierarchy.
So if you upload an object named newFolder/object to your bucket and newFolder does not exist, it will be "created", but gsutil -q stat ${DESTINATION_PATH}/ will return exit code 1. However, if you create the folder using the UI and run the same command, it will return exit code 0. So follow the documentation and avoid using stat to check whether a directory exists.
Instead, if you want to check whether a subdirectory exists, check whether it contains any objects:
gsutil -q stat ${DESTINATION_PATH}/*
Which will return 0 if any object is in the subdirectory and 1 otherwise.
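For example, here is a minimal sketch of the original KSH script rewritten this way (DESTINATION_PATH is assumed to be set elsewhere). Running the command directly in the if condition also keeps set -e from aborting the script when stat returns a non-zero status:
#!/bin/ksh
set -e
set -o pipefail
# Commands tested in an `if` do not trigger `set -e`, so the script
# keeps running even when no object matches the wildcard.
if gsutil -q stat "${DESTINATION_PATH}/*" ; then
    echo "subdirectory exists"
    # do something
else
    echo "subdirectory does not exist"
fi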

Related

gsutil: why ls returns directory itself

I got
$ gsutil ls gs://ml_models_c/ref7/test/model/2/
gs://ml_models_c/ref7/test/model/2/ <= why this?
gs://ml_models_c/ref7/test/model/2/saved_model.pb
gs://ml_models_c/ref7/test/model/2/variables/
$ gsutil ls gs://seldon-models/tfserving/mnist-model/1/
gs://seldon-models/tfserving/mnist-model/1/saved_model.pb
gs://seldon-models/tfserving/mnist-model/1/variables/
Why is there gs://ml_models_c/ref7/test/model/2/ in the first command's output?
Why does the second command not return the directory itself?
It seems that I can rm it.
Thanks
At the API level, Cloud Storage doesn't have the concept of folders; everything is stored under long object names that may contain slashes.
In this case, you likely have an object named gs://ml_models_c/ref7/test/model/2/, but no object named gs://seldon-models/tfserving/mnist-model/1/.
If you don't need the gs://ml_models_c/ref7/test/model/2/ object, you can delete it and it will no longer show up in the results of gsutil ls.
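For example, a hedged one-liner that removes only that zero-byte placeholder object (the nested objects such as saved_model.pb are not touched):
# Deletes just the object literally named "2/", not the objects under that prefix
gsutil rm gs://ml_models_c/ref7/test/model/2/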

LSF moving files into created output dir

When executing a job on LSF you can specify the working directory and create an output directory, e.g.
bsub -cwd /home/workDir -outdir /home/%J program inputfile
where it will look for inputfile in the specified working directory. The -outdir will create a new directory based on the JobId.
What I'm wondering is how you pipe the results created from the run in the working directory to the newly created output dir.
You can't add a command like
mv * /home/%J
as the underlying OS has no understanding of the %J identifier. Is there an option in LSF for piping the data inside the job, where it knows the jobId?
You can use the environment variable $LSB_JOBID.
mv * /data/${LSB_JOBID}/
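For illustration, a hypothetical job script using that variable might look like the sketch below; program, inputfile, and results.* are placeholders, and /data/%J is assumed to be the output directory requested with bsub -outdir:
#!/bin/sh
# LSF exports LSB_JOBID inside the running job, so the script can
# build the per-job path itself (program / results.* are placeholders).
program inputfile
mkdir -p "/data/${LSB_JOBID}"     # defensive: -outdir normally creates it already
mv results.* "/data/${LSB_JOBID}/"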
If you copy the data inside your job script, the job will hold its compute resources during the data copy. If you're copying a small amount of data, that's not a problem. But if it's a large amount of data, you can use bsub -f so that other jobs can start while the data copy is ongoing.
bsub -outdir "/data/%J" -f "/data/%J/final < bigfile" sh script.sh
bigfile is the file that your job creates on the compute host. It will be copied to /data/%J/final after the job finishes. It even works on a non-shared filesystem.

How to check if any given object exist in google cloud storage bucket through bash

I would like to programmatically check whether an object exists in a particular Google Cloud Storage bucket. Based on the object's availability I would perform further operations.
I have gone through https://cloud.google.com/storage/docs/gsutil/commands/stat and the doc mentions that gsutil -q is useful for writing scripts, because the exit status will be 0 for an existing object and 1 for a non-existent object. But when I use the command it does not work properly. Please let me know if anyone has tried this before.
#!/bin/bash
gsutil -q stat gs://<bucketname>/object
return_value=$?
if [ $return_value != 0 ]; then
echo "folder exist"
else
echo "folder does not exist"
fi
I see that you have already found the answer to your issue; however, I will post this answer here in order to give more context on how the gsutil stat command works and why your code was not working.
gsutil is a Python application used for accessing and working with Cloud Storage from the command line. It has many commands available, and the one that you used is gsutil stat, which outputs the metadata of an object while retrieving the minimum possible data, without having to list all the objects in a bucket. This command is also strongly consistent, which makes it an appropriate solution for certain types of applications.
Using the gsutil stat gs://<BUCKET_NAME>/<BUCKET_OBJECT> command, returns something like the following:
gs://<BUCKET_NAME>/<BUCKET_OBJECT>.png:
Creation time: Tue, 06 Feb 2018 14:49:58 GMT
Update time: Tue, 06 Feb 2018 14:49:58 GMT
Storage class: MULTI_REGIONAL
Content-Length: 6119
Content-Type: image/png
Hash (crc32c): <CRC32C_HASH>
Hash (md5): <MD5_HASH>
ETag: <ETAG>
Generation: <TIMESTAMP>
Metageneration: 1
However, if you run it with the -q flag, it will return an exit status of 0 if the object exists, or 1 if it does not, which makes it useful for writing scripts such as the one you shared.
Finally, there are some additional points to keep in mind when working with subdirectories inside a bucket:
A command such as gsutil -q stat gs://my_bucket/my_subdirectory will retrieve the data of an object called my_subdirectory, not of a directory itself.
A command such as gsutil -q stat gs://my_bucket/my_subdirectory/ will operate over the subdirectory itself, and not over the nested files, so it will just tell you whether the subdirectory exists or not (this is why your code was failing).
You have to use something like gsutil -q stat gs://my_bucket/my_subdirectory/my_nested_file.txt in order to retrieve the metadata of a file nested under a subdirectory.
So, in short, your issue was that you were not specifying the path properly. It is not that gsutil is too sensitive about paths; this behavior is working as intended, because you may have a situation like the following, where a file and a folder share the same name. You should be able to retrieve either of them, which is why you must specify the trailing / that indicates whether you mean the directory or the file:
gs://my_bucket/
|_ my_subdirectory            # This is a file
|_ my_subdirectory/           # This is a folder
   |_ my_nested_file.txt      # This is a nested file
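To make the distinction concrete, here is a hedged sketch of how stat behaves against the hypothetical layout above:
gsutil -q stat gs://my_bucket/my_subdirectory                     # checks the *file* named my_subdirectory
gsutil -q stat gs://my_bucket/my_subdirectory/                    # checks for an object literally named my_subdirectory/
gsutil -q stat gs://my_bucket/my_subdirectory/my_nested_file.txt  # checks the nested file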
You have the conditional check inverted: exit status 0 means success, i.e., the gsutil stat command found the given object.
The issue is that we should use / after the object name so that the gsutil -q stat command recognizes the path properly. If I remove the /, it does not work. I am surprised that Google is so sensitive about how paths are written.
#!/bin/bash
gsutil -q stat gs://<bucketname>/object/
return_value=$?
if [ $return_value = 0 ]; then
echo "folder exist"
else
echo "folder does not exist"
fi
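For completeness, a slightly tidier sketch of the same check, using the command directly in the if condition (the bucket and object names are placeholders):
#!/bin/bash
# Exit status 0 means the placeholder object "object/" exists.
if gsutil -q stat "gs://<bucketname>/object/"; then
    echo "folder exists"
else
    echo "folder does not exist"
fi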

Folders not showing up in Bucket storage

So my problem is that I have a few files not showing up in gcsfuse when mounted. I see them in the online console and if I 'ls' with gsutil.
Also, if I manually create the folder in the bucket, I can then see the files inside it, but I need to create it first. Any suggestions?
gs://mybucket/
    dir1/
        ok.txt
        dir2/
            lafu.txt
If I mount mybucket with gcsfuse and do 'ls', it only returns dir1/ok.txt.
Then, when I create the folder dir2 inside dir1 through the mount point, 'lafu.txt' suddenly shows up.
By default, gcsfuse won't show a directory that is only "implicitly" defined by a file with a slash in its name. For example, if your bucket contains an object named dir/foo.txt, you won't be able to find it unless there is also an object named dir/.
You can work around this by setting the --implicit-dirs flag, but there are good reasons why this is not the default. See the documentation for more information.
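For example, a hedged mount command (the bucket name and mount path are placeholders):
# Mount with implicit directory support enabled
gcsfuse --implicit-dirs mybucket /path/to/mount/point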
Google Cloud Storage doesn't have folders. The various interfaces use different tricks to pretend that folders exist, but ultimately there's just an object whose name contains a bunch of slashes. For example, "pictures/january/0001.jpg" is the full name of a single object.
If you need to be sure that a "folder" exists, put an object inside it.
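For example, one hedged way to do that from the command line (the object names here are placeholders) is to copy a small marker object under the desired prefix:
# Any object under the dir1/dir2/ prefix makes dir2/ show up in gsutil ls and the Console
touch placeholder.txt
gsutil cp placeholder.txt gs://mybucket/dir1/dir2/placeholder.txt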
@Brandon Yarbrough suggests creating the needed directory entries in the GCS bucket. This avoids the performance penalty described by @jacobsa.
Here is a bash script for doing so:
#!/bin/bash
# 1. Mount $BUCKET_NAME at $MOUNT_PT
# 2. Run this script
MOUNT_PT=${1:-$HOME/mnt}
BUCKET_NAME=$2
DEL_OUTFILE=${3:-y} # Set to y or n
echo "Reading objects in $BUCKET_NAME"
OUTFILE=dir_names.txt
gsutil ls -r gs://$BUCKET_NAME/** | while read BUCKET_OBJ
do
dirname "$BUCKET_OBJ"
done | sort -u > $OUTFILE
echo "Processing directories found"
cat $OUTFILE | while read DIR_NAME
do
LOCAL_DIR=`echo "$DIR_NAME" | sed "s=gs://$BUCKET_NAME/==" | sed "s=gs://$BUCKET_NAME=="`
#echo $LOCAL_DIR
TARG_DIR="$MOUNT_PT/$LOCAL_DIR"
if ! [ -d "$TARG_DIR" ]
then
echo "Creating $TARG_DIR"
mkdir -p "$TARG_DIR"
fi
done
if [ $DEL_OUTFILE = "y" ]
then
rm $OUTFILE
fi
echo "Process complete"
I wrote this script, and have shared it at https://github.com/mherzog01/util/blob/main/sh/mk_bucket_dirs.sh.
This script assumes that you have mounted a GCS bucket locally on a Linux (or similar) system. The script first specifies the GCS bucket and location where the bucket is mounted. It then identifies all "directories" in the GCS bucket which are not visible locally, and creates them.
This (for me) fixed the issue with folders (and associated objects) not showing up in the mounted folder structure.

using wget to overwrite file but use temporary filename until full file is received, then rename

I'm using wget in a cron job to fetch a .jpg file into a web server folder once per minute (with the same filename each time, overwriting). This folder is "live" in that the web server also serves that image from there. However, if someone browses to that page while the image is being fetched, the partially written jpg is treated as corrupt and the browser says so. So what I need is, similar to how Firefox downloads a file, for wget to write to a temporary file, either in /var or in the destination folder but with a temporary name, until it has the whole thing, and then rename it in an atomic (or at least negligible-duration) step.
I've read the wget man page and there doesn't seem to be a command line option for this. Have I missed it? Or do I need to do two commands in my cron job, a wget and a move?
There is no way to do this purely with GNU Wget.
wget's job is to download files, and it does that. A simple one-line script can achieve what you're looking for:
$ wget -O myfile.jpg.tmp example.com/myfile.jpg && mv myfile.jpg{.tmp,}
Since mv is atomic, at least on Linux when the source and destination are on the same filesystem, you get an atomic switch to the fully downloaded file.
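As a concrete (hedged) example, the crontab entry below assumes the image lives at https://example.com/cam.jpg and is served from /var/www/html; adjust both paths to your setup:
# m h dom mon dow   command
* * * * * wget -q -O /var/www/html/cam.jpg.tmp https://example.com/cam.jpg && mv /var/www/html/cam.jpg.tmp /var/www/html/cam.jpg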
Just wanted to share my solution:
alias wget='func(){ (wget --tries=0 --retry-connrefused --timeout=30 -O download_pkg.tmp "$1" && mv download_pkg.tmp "${1##*/}") || rm download_pkg.tmp; unset -f func; }; func'
It creates a function that receives a URL parameter and downloads the file under a temporary name. If the download is successful, the file is renamed to the correct filename, extracted from parameter $1 with ${1##*/}; if it fails, the temp file is deleted. If the operation is aborted, the temp file will be replaced on the next run. Finally, unset -f removes the function definition once the alias has executed.