Folders not showing up in Bucket storage - google-cloud-storage

So my problem is that a have a few files not showing up in gcsfuse when mounted. I see them in the online console and if I 'ls' with gsutils.
Also, if If I manually create the folder in the bucket, i then can see the files inside it, but I need to create it first. Any suggestions?
gs://mybucket/
dir1/
ok.txt
dir2
lafu.txt
If I mount mybucket with gcsfuse and do 'ls' it only returns dir1/ok.txt.
Then I'll create the folder dir2 inside dir1 at the root of the mounting point, and suddenly 'lafu.txt' shows up.

By default, gcsfuse won't show a directory "implicitly" defined by a file with a slash in its name. For example if your bucket contains an object named dir/foo.txt, you won't be able to find it unless there is also an object nameddir/.
You can work around this by setting the --implicit-dirs flag, but there are good reasons why this is not the default. See the documentation for more information.

Google Cloud Storage doesn't have folders. The various interfaces use different tricks to pretend that folders exist, but ultimately there's just an object whose name contains a bunch of slashes. For example, "pictures/january/0001.jpg" is the full name of a single object.
If you need to be sure that a "folder" exists, put an object inside it.

#Brandon Yarbrough suggests creating needed directory entries in the GCS bucket. This avoids the performance penalty described by #jacobsa.
Here is a bash script for doing so:
# 1. Mount $BUCKET_NAME at $MOUNT_PT
# 2. Run this script
MOUNT_PT=${1:-HOME/mnt}
BUCKET_NAME=$2
DEL_OUTFILE=${3:-y} # Set to y or n
echo "Reading objects in $BUCKET_NAME"
OUTFILE=dir_names.txt
gsutil ls -r gs://$BUCKET_NAME/** | while read BUCKET_OBJ
do
dirname "$BUCKET_OBJ"
done | sort -u > $OUTFILE
echo "Processing directories found"
cat $OUTFILE | while read DIR_NAME
do
LOCAL_DIR=`echo "$DIR_NAME" | sed "s=gs://$BUCKET_NAME/==" | sed "s=gs://$BUCKET_NAME=="`
#echo $LOCAL_DIR
TARG_DIR="$MOUNT_PT/$LOCAL_DIR"
if ! [ -d "$TARG_DIR" ]
then
echo "Creating $TARG_DIR"
mkdir -p "$TARG_DIR"
fi
done
if [ $DEL_OUTFILE = "y" ]
then
rm $OUTFILE
fi
echo "Process complete"
I wrote this script, and have shared it at https://github.com/mherzog01/util/blob/main/sh/mk_bucket_dirs.sh.
This script assumes that you have mounted a GCS bucket locally on a Linux (or similar) system. The script first specifies the GCS bucket and location where the bucket is mounted. It then identifies all "directories" in the GCS bucket which are not visible locally, and creates them.
This (for me) fixed the issue with folders (and associated objects) not showing up in the mounted folder structure.

Related

How to correctly use `gsutil -q stat` in scripts?

I am creating a KSH script to check whether a subdirectory is exist on GCS bucket. I am writing the script like this:
#!/bin/ksh
set -e
set -o pipefail
gsutil -q stat ${DESTINATION_PATH}/
PATH_EXIST=$?
if [ ${PATH_EXIST} -eq 0 ] ; then
# do something
fi
Weird thing happens when the ${DESTINATION_PATH}/ does not exist, the script exit without evaluating PATH_EXIST=$?. If ${DESTINATION_PATH}/ is exist, the script will run normally as expected.
Why does that thing happen? How can I do better?
The statement set -e implies that your script will be exited if a command exits with a non-zero status.
The gsutil stat command can be used to check wheter an object exists:
gsutil -q stat gs://some-bucket/some-object
It has an exit status of 0 for an existing object and 1 for a non-existent object.
However it is advised against to use it with subdirectories:
Note: Unlike the gsutil ls command, the stat command does not support
operations on sub-directories. For example, if you run the command:
gsutil -q stat gs://some-bucket/some-subdir/
gsutil will look for
information about an object called some-subdir/ (with a trailing
slash) inside the bucket some-bucket, as opposed to operating on
objects nested under gs://some-bucket/some-subdir/. Unless you
actually have an object with that name, the operation will fail.
The reason because your command is not failing when your ${DESTINATION_PATH}/ exists is because if you create the folder using the Cloud Console i.e the UI, then a placeholder object will be created with its name. But let me be clear, folders don't exist in Google Cloud Storage, they are just a visualization of the bucket objects hierarchy.
So if you upload an object named newFolder/object to your bucket and the newFolder does not exists, it will be "created" but your gsutil -q stat ${DESTINATION_PATH}/ will return exit code 1. However if you create the folder using the UI and run the same command it will return exit 0. Thus follow the documentation, and avoid using it for checking if a directory exists.
Instead if you want to check whether a subdirectory exists just check if it contains any object inside:
gsutil -q stat ${DESTINATION_PATH}/*
Which will return 0 if any object is in the subdirectory and 1 otherwise.

What order does find(1) list files in?

On extfs, if there are only file-creations and no -deletions in a directory, I expect that find . -type f would list the files either in their chronological order of creation (or mtime), or if not, at least in their reverse chronological order... depending on how a directory's contents are traversed.
But that isn't the behavior I'm seeing.
The following code, eg, creates a fresh set of directories and files:
#!/bin/bash -u
for i in a/ a/{1,2,3,4,5} b/ b/{1,2,3,4,5}; do
if echo "$i" | egrep -q "/$"; then
echo "Creating dir $i"
mkdir -p "$i"
else
echo "Creating file $i"
touch "$i"
fi
sleep 0.500
done
Output of the above snippet:
Creating dir a/
Creating file a/1
Creating file a/2
Creating file a/3
Creating file a/4
Creating file a/5
Creating dir b/
Creating file b/1
Creating file b/2
Creating file b/3
Creating file b/4
Creating file b/5
However, find lists files in somewhat random order. For example, a/2 doesn't follows a/1, and b/2 doesn't follow b/1:
$ find . -type f
./a/1
./a/3
./a/4
./a/2 <----
./a/5
./b/1
./b/3
./b/4
./b/2 <----
./b/5
Any idea why this should happen?
My main problem is: I have a very large volume storing 100s of 1000s of files. I need to traverse these files and directories in the order of their creation/modification (mtime) and pipe each file to another process for further processing. But I don't necessarily want to first create a temporary list of this large set of files and then sort it based on mtime before piping it to my process.
find lists objects in the order that they are reported by the underlying filesystem implementation. You can tell ls to show you this "raw" order by passing it the -f option.
The order could be anything at all -- alphabetical, by mtime, by atime, by length of name, by permissions, or something completely different. The ordering can even vary from one listing to the next.
It's common for filesystems to report in an order that reflects the filesystem's strategy for allocating directory slots to files. If this is some sort of hash-based strategy based on filename then the order can appear nonsensical. This is what happens with widely-used Linux and BSD filesystem implementations. Since you mention extfs this is probably what causes the ordering you're seeing.
So, if you need the output from find to be ordered in a particular way then you'll have to create that order yourself. Maybe based on something like:
find . -type f -exec ls -ltr --time-style=+%s {} \; | sort -n -k6

Logrotate files in multiple sub directories to backup location in same folder structure

Im trying to use logrotate with very little experience, Currently working i have the files rotating, compressing and renaming into the same folder. Now i need instead of dropping the files in the same place, i need to have them dropped in another location. They also need to have the same folder structure and if it isn't there than it needs to create the new folder. All the compressed files need to be added and not override the existing files
I'm thinking that the olddir will drop them into a destination folder but not sure on how to have it drop it in the corresponding folder or create it if its not already there.
Example source
var/log/device1/*.log
var/log/device2/*.log
var/log/device3/*.log
Example Destination to drop .gz files into
opt/archive/device1/
opt/archvie/device2/
(needs to create opt/archive/device3 and put rotated file in here)
Didn't end up finding a way to move with logrotate but came up with script to do the same sort of thing. pretty simplistic and wont work for more than 1 level deep of subfolders.
#!/bin/bash
source="/opt/log/host"
destination="/opt/archive/"
for i in $(find $source -maxdepth 2 -type f -name "*.gz")
do
#removing /opt/log/host from string
dd="$( echo "$i" | sed -e 's#^/opt/log/host/##' )"
#removing everything after the first /
ff=$( echo "$dd" | cut -f1 -d"/" )
#setting the correct destination string
ee=$destination$dd
#create new folders if they do not exist
mkdir -p -- "$destination$ff"
#move files
mv $i $ee
done

AWS S3, Deleting files from local directory after upload

I have backup files in different directories in one drive. Files in those directories can be quite big up to 800GB or so. So I have a batch file with a set of scripts which upload/syncs files to S3.
See example below:
aws s3 sync R:\DB_Backups3\System s3://usa-daily/System/ --exclude "*" --include "*/*/Diff/*"
The upload time can vary but so far so good.
My question is, how do I edit the script or create a new one which checks in the s3 bucket that the files have been uploaded and ONLY if they have been uploaded then deleted them from the local drive, if not leave them on the drive?
(Ideally it would check each file)
I'm not familiar with aws s3, or aws cli command that can do that? Please let me know if I made myself clear or if you need more details.
Any help will be very appreciated.
Best would be to use mv with --recursive parameter for multiple files
When passed with the parameter --recursive, the following mv command recursively moves all files under a specified directory to a specified bucket and prefix while excluding some files by using an --exclude parameter. In this example, the directory myDir has the files test1.txt and test2.jpg:
aws s3 mv myDir s3://mybucket/ --recursive --exclude "*.jpg"
Output:
move: myDir/test1.txt to s3://mybucket2/test1.txt
Hope this helps.
As the answer by #ketan shows, Amazon aws client cannot do batch move.
You can use WinSCP put -delete command instead:
winscp.com /log=S3.log /ini=nul /command ^
"open s3://S3KEY:S3SECRET#s3.amazonaws.com/" ^
"put -delete C:\local\path\* /bucket/" ^
"exit"
You need to URL-encode special characters in the credentials. WinSCP GUI can generate an S3 script template, like the one above, for you.
Alternatively, since WinSCP 5.19, you can use -username and -password switches, which do not need any encoding:
"open s3://s3.amazonaws.com/ -username=S3KEY -password=S3SECRET" ^
(I'm the author of WinSCP)

File movement issue on NFS file system on Unix box

Currently there are 4.5 million files in a single directory on an NFS file system. As a result any read or write operation on that directory is causing a huge delay.
In order to over come this problem, all the files in that directory will be moved onto different directories based on the year of its creation.
Apparently, the find command that we are using with the -ctime option is not working because of the huge file volume.
We tried listing the files based on the year of creation and then feed the list to a script that will move them in a for loop. But even this failed as ls -lrt went for a hang.
Is there any other way to tackle this problem?
Please help.
Script contents:
1) filelist.sh
ls -tlr|awk '{print $8,$9,$6,$7}'|grep ^2011|awk '{print $2,$1,$3,$4}' 1>>inboundstore_$1.txt 2>>Error_$1.log
ls -tlr|awk '{print $8,$9,$6,$7}'|grep ^2011|wc -l 1>>count_$1.log
2) filemove.sh
INPUT_FILE=$1 ##text file which has the list of files from the previous script
FINAL_LOCATION=$2 ##destination directory
if [ -r $INPUT_FILE ]
then
for file in `cat $INPUT_FILE`
do
echo "TIME OF FILE COPY OF [$file] IS : `date`" >> xyz/IBSCopyTime.log
mv $file $FINAL_LOCATION
done
else
echo "$INPUT_FILE does not exist"
fi
Use the readdir iterator.