gsutil command to delete old files from last day - google-cloud-storage

I have a bucket in Google Cloud Storage with a tmp folder in it. Thousands of files are created in this directory every day. I want to delete files that are older than 1 day every night. I could not find a gsutil argument for this job, so I had to use a classic, simple shell script instead. But the files are being deleted very slowly.
I have 650K files accumulated in the folder, and 540K of them must be deleted. But my shell script ran for a whole day and only managed to delete 34K files.
The gsutil lifecycle feature is not able to do exactly what I want: it cleans the whole bucket, while I just want to regularly delete the files under a certain folder. I also want the deletion to run faster.
I'm open to your suggestions and your help. Can I do this with a single gsutil command, or a different method?
Here is the simple script I created for testing (I put it together as a temporary way to delete files in bulk):
## step 1 - I list the files together with their dates and save them to the file list1.txt.
gsutil -m ls -la gs://mygooglecloudstorage/tmp/ | awk '{print $2,$3}' > /tmp/gsutil-tmp-files/list1.txt
## step 2 - I filter the information saved in list1.txt. Based on the current date, I save the older files to list2.txt.
cat /tmp/gsutil-tmp-files/list1.txt | awk -F "T" '{print $1,$2,$3}' | awk '{print $1,$3}' | awk -F "#" '{print $1}' |grep -v `date +%F` |sort -bnr > /tmp/gsutil-tmp-files/list2.txt
## step 3 - After the above process, I prepend the gsutil delete command to each line and turn the result into a shell script.
cat /tmp/gsutil-tmp-files/list2.txt | awk '{$1 = "/root/google-cloud-sdk/bin/gsutil -m rm -r "; print}' > /tmp/gsutil-tmp-files/remove-old-files.sh
## step 4 - I set the script permissions and delete the old lists.
chmod 755 /tmp/gsutil-tmp-files/remove-old-files.sh
rm -rf /tmp/gsutil-tmp-files/list1.txt /tmp/gsutil-tmp-files/list2.txt
## step 5 - I run the shell script and delete it after it is done.
/bin/sh /tmp/gsutil-tmp-files/remove-old-files.sh
rm -rf /tmp/gsutil-tmp-files/remove-old-files.sh

There is a very simple way to do this, for example:
gsutil -m ls -l gs://bucket-name/ | grep 2017-06-23 | grep .jpg | awk '{print $3}' | gsutil -m rm -I
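If the goal is instead to delete everything not created today (the same grep -v `date +%F` logic the script above uses), a rough, untested variant of that one-liner might look like this; the bucket path comes from the question, and the awk condition skips the TOTAL footer and any prefix lines so only object URLs reach gsutil rm:
gsutil -m ls -l gs://mygooglecloudstorage/tmp/ | grep -v "$(date +%F)" | awk 'NF == 3 && $3 ~ /^gs:/ {print $3}' | gsutil -m rm -I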

There isn't a simple way to do this with gsutil or object lifecycle management as of today.
That being said, would it be feasible for you to change the naming format for the objects in your bucket? That is, instead of uploading them all under "gs://mybucket/tmp/", you could append the current date to that prefix, resulting in something like "gs://mybucket/tmp/2017-12-27/". The main advantages to this would be:
Not having to do a date comparison for every object; you could run gsutil ls "gs://mybucket/tmp/" | grep "gs://[^/]\+/tmp/[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}/$" to find those prefixes, then do date comparisons on the last portion of those paths.
Being able to supply a smaller number of arguments on the command line (prefixes, rather than the name of each individual file) to gsutil -m rm -r, thus being less likely to pass in more arguments than your shell can handle.
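To make that concrete, here is a rough (untested) sketch of a nightly cleanup over such date-named prefixes; the bucket name is a placeholder and GNU date -d is assumed:
#!/bin/bash
# Delete every gs://mybucket/tmp/YYYY-MM-DD/ prefix older than yesterday.
cutoff=$(date -d "1 day ago" +%F)
gsutil ls gs://mybucket/tmp/ | grep "gs://[^/]\+/tmp/[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}/$" | while read prefix; do
  day=$(basename "$prefix")          # e.g. 2017-12-26
  if [[ "$day" < "$cutoff" ]]; then  # ISO dates compare correctly as strings
    gsutil -m rm -r "$prefix"
  fi
done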

Related

List and operate on files below a certain size?

How do you erase all files below a certain size with gsutil? We could use a script to filter the output of gsutil ls, but that sounds like overkill.
gsutil doesn't have any direct support (as in command line flags) for operating only on files below a given size. Instead you'd have to use a script, such as:
gsutil ls -l gs://your-bucket | awk '{if ($1 < 1024) print $NF}' | xargs some-command
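For example, a hedged sketch that deletes every object smaller than 1 KiB; the awk condition skips the TOTAL footer and prefix lines, and gsutil -m rm -I reads the object URLs from stdin:
gsutil ls -l gs://your-bucket | awk 'NF == 3 && $1 < 1024 {print $NF}' | gsutil -m rm -I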

how to print the progress of the files being copied in bash [duplicate]

I suppose I could compare the number of files in the source directory to the number of files in the target directory as cp progresses, or perhaps do it with folder size instead? I tried to find examples, but all bash progress bars seem to be written for copying single files. I want to copy a bunch of files (or a directory, if the former is not possible).
You can also use rsync instead of cp like this:
rsync -Pa source destination
This will give you a progress bar and an estimated time of completion. Very handy.
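If you would rather see a single overall progress figure for the whole transfer instead of per-file output, rsync 3.1 and later also support:
rsync -a --info=progress2 source/ destination/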
To show a progress bar while doing a recursive copy of files & folders & subfolders (including links and file attributes), you can use gcp (easily installed in Ubuntu and Debian by running "sudo apt-get install gcp"):
gcp -rf SRC DEST
Here is the typical output while copying a large folder of files:
Copying 1.33 GiB 73% |##################### | 230.19 M/s ETA: 00:00:07
Notice that it shows just one progress bar for the whole operation, whereas if you want a single progress bar per file, you can use rsync:
rsync -ah --progress SRC DEST
You may have a look at the tool vcp. That's a simple copy tool with two progress bars: one for the current file, and one for the overall progress.
EDIT
Here is the link to the sources: http://members.iinet.net.au/~lynx/vcp/
Manpage can be found here: http://linux.die.net/man/1/vcp
Most distributions have a package for it.
Here is another solution: use the tool bar.
You could invoke it like this:
#!/bin/bash
filesize=$(du -sb ${1} | awk '{ print $1 }')
tar -cf - -C ${1} ./ | bar --size ${filesize} | tar -xf - -C ${2}
You have to go through tar, and it will be inaccurate on small files. Also, you must make sure that the target directory exists. But it is a way.
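Usage would look something like this, assuming the script above is saved as bar-copy.sh (a hypothetical name) and made executable:
mkdir -p /path/to/dest
./bar-copy.sh /path/to/src /path/to/dest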
My preferred option is Advanced Copy, as it uses the original cp source files.
$ wget http://ftp.gnu.org/gnu/coreutils/coreutils-8.32.tar.xz
$ tar xvJf coreutils-8.32.tar.xz
$ cd coreutils-8.32/
$ wget --no-check-certificate https://raw.githubusercontent.com/jarun/advcpmv/master/advcpmv-0.8-8.32.patch
$ patch -p1 -i advcpmv-0.8-8.32.patch
$ ./configure
$ make
The new programs are now located in src/cp and src/mv. You may choose to replace your existing commands:
$ sudo cp src/cp /usr/local/bin/cp
$ sudo cp src/mv /usr/local/bin/mv
Then you can use cp as usual, or specify -g to show the progress bar:
$ cp -g src dest
A simple Unix way is to go to the destination directory and run watch -n 5 du -s . (perhaps make it prettier by showing it as a bar). This can help in environments where you have just the standard Unix utilities and no scope for installing additional tools. du -s is the key; watch just repeats it every 5 seconds.
Pros: works on any Unix system. Cons: no progress bar.
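A minimal sketch of that approach, assuming the copy is already running in another terminal and /path/to/destination is where it writes:
watch -n 5 'du -sh /path/to/destination'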
To add another option, you can use cpv. It uses pv to imitate the usage of cp.
It works like pv, but you can use it to recursively copy directories.
You can get it here
There's a tool pv to do this exact thing: http://www.ivarch.com/programs/pv.shtml
There's an Ubuntu version in apt.
How about something like
find . -type f | pv -s $(find . -type f | wc -c) | xargs -i cp {} --parents /DEST/$(dirname {})
It finds all the files in the current directory, pipes them through pv while giving pv an estimated size so the progress meter works, and then pipes them to a cp command with the --parents flag so the DEST path matches the SRC path.
One problem I have yet to overcome is that if you issue this command
find /home/user/test -type f | pv -s $(find . -type f | wc -c) | xargs -i cp {} --parents /www/test/$(dirname {})
the destination path becomes /www/test/home/user/test/....FILES... and I am unsure how to tell the command to get rid of the '/home/user/test' part. That's why I have to run it from inside the SRC directory.
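One untested workaround is to wrap the pipeline in a subshell that first changes into the source directory, so find emits relative paths and --parents only recreates those below the destination:
(cd /home/user/test && find . -type f | pv -s $(find . -type f | wc -c) | xargs -i cp {} --parents /www/test/$(dirname {}))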
Check the source code for progress_bar in my git repository below:
https://github.com/Kiran-Bose/supreme
Also try the custom bash script package supreme to see how the progress bar works with the cp and mv commands.
Functionality overview
(1)Open Apps
----Firefox
----Calculator
----Settings
(2)Manage Files
----Search
----Navigate
----Quick access
|----Select File(s)
|----Inverse Selection
|----Make directory
|----Make file
|----Open
|----Copy
|----Move
|----Delete
|----Rename
|----Send to Device
|----Properties
(3)Manage Phone
----Move/Copy from phone
----Move/Copy to phone
----Sync folders
(4)Manage USB
----Move/Copy from USB
----Move/Copy to USB
There is the command progress, https://github.com/Xfennec/progress, a coreutils progress viewer.
Just run progress in another terminal to see the copy/move progress. For continuous monitoring, use the -M flag.
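For example (a small sketch; the paths are placeholders): start the copy, then monitor it from the same or another shell:
cp -r /big/source /big/destination &
progress -M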

How to count number of file in a bucket-folder with gsutil

Is there an option to count the number of files in bucket-folders?
Like:
gsutil ls -count -recursive gs://bucket/folder
Result: 666 files
I just want a total number of files so I can compare the amount with the sync folder on my server. I couldn't find this in the manual.
The gsutil ls command with options -l (long listing) and -R (recursive listing) will list the entire bucket recursively and then produce a total count of all objects, both files and directories, at the end:
$ gsutil ls -lR gs://pub
104413 2011-04-03T20:58:02Z gs://pub/SomeOfTheTeam.jpg
172 2012-06-18T21:51:01Z gs://pub/cloud_storage_storage_schema_v0.json
1379 2012-06-18T21:51:01Z gs://pub/cloud_storage_usage_schema_v0.json
1767691 2013-09-18T07:57:42Z gs://pub/gsutil.tar.gz
2445111 2013-09-18T07:57:44Z gs://pub/gsutil.zip
1136 2012-07-19T16:01:05Z gs://pub/gsutil_2.0.ReleaseNotes.txt
... <snipped> ...
gs://pub/apt/pool/main/p/python-socksipy-branch/:
10372 2013-06-10T22:52:58Z gs://pub/apt/pool/main/p/python-socksipy-branch/python-socksipy-branch_1.01_all.deb
gs://pub/shakespeare/:
84 2010-05-07T23:36:25Z gs://pub/shakespeare/rose.txt
TOTAL: 144 objects, 102723169 bytes (97.96 MB)
If you really just want the total, you can pipe the output to the tail command:
$ gsutil ls -lR gs://pub | tail -n 1
TOTAL: 144 objects, 102723169 bytes (97.96 MB)
UPDATE
gsutil now has a du command. This makes it even easier to get a count:
$ gsutil du gs://pub | wc -l
232
If you have the option to not use gsutil, the easiest way is to check it on Google Cloud Platform.
Go to Monitoring > Metrics explorer:
Resource type : GCS Bucket
Metric : Object count
Then, in the table below, you can see for each bucket the number of objects it contains.
You want to gsutil ls -count -recursive in gs://bucket/folder?
Alright; gsutil ls gs://bucket/folder/** will list just the full URLs of the files under gs://bucket/folder, without the footer or the lines ending in a colon. Piping that to wc -l will give you the line count of the result.
gsutil ls gs://bucket/folder/** | wc -l
gsutil ls -lR gs://Folder1/Folder2/Folder3/** | tail -n 1
As someone who had 4.5M objects in a bucket, I used gsutil du gs://bucket/folder | wc -l, which took ~24 minutes.
This gist shows how to iterate through all Cloud Storage buckets and list the number of objects in each. Compliments of #vinoaj
for VARIABLE in $(gsutil ls)
do
echo $(gsutil du $VARIABLE | grep -v /$ | wc -l) $VARIABLE
done
To filter buckets, add a grep such as for VARIABLE in $(gsutil ls | grep "^gs://bucketname")
In the console, you can click Activate Cloud Shell in the top right and paste this in to get results. If you save the commands as a bash script, then run chmod u+x program_name so the script can run in the GCP Cloud Shell.
NOTE: When you do gsutil du gs://my-bucket/logs | wc -l, the result includes an "extra" result for each bucket and sub-directory. For example, 3 files in a top-level bucket will count as 4, and 3 files in a sub-directory will count as 5.
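Building on the grep -v /$ filter used in the loop above, a hedged one-liner for a single folder that skips those extra sub-directory results would be:
gsutil du gs://my-bucket/logs | grep -v '/$' | wc -l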
This doesn't work recursively, but you can also get the count of a single large folder from the console. This method has the advantage of being very fast.
Select Sort and filter from the filter menu in your bucket.
Reverse the sort order to let Google Cloud Storage calculate the number of files/folders.
View the count of files/folders in the current folder.

comparing two directories with separate diff output per file

I need to see what has changed between two directories which contain different versions of a software's source code. While I have found a way to get a single .diff file, how can I obtain a separate file for each changed file in the two directories? I need this because the "main" diff is about 6 MB and I wanted something handier.
I came across this problem too, so I ended up with a few lines of shell script. It takes three arguments: the source and destination directories (as used for diff) and a target folder (which should exist) for the output.
It's a bit hacky, but maybe it would be useful for someone. So use with care, especially if your paths have special characters.
#!/bin/sh
DIFFARGS="-wb"
LANG=C
TARGET=$3
SRC=`echo $1 | sed -e 's/\//\\\\\\//g'`
DST=`echo $2 | sed -e 's/\//\\\\\\//g'`
if [ ! -d "$TARGET" ]; then
echo "'$TARGET' is not a directory." >&2
exit 1
fi
diff -rqN $DIFFARGS "$1" "$2" | sed "s/Files $SRC\/\(.*\?\) and $DST\/\(.*\?\) differ/\1/" | \
while read file
do
if [ ! -d "$TARGET/`dirname \"$file\"`" ]; then
mkdir -p "$TARGET/`dirname \"$file\"`"
fi
diff $DIFFARGS -N "$1/$file" "$2/$file" > "$TARGET"/"$file.diff"
done
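For example, assuming the script is saved as dirdiff.sh (a hypothetical name), you would run it like this; the output folder must exist first:
mkdir -p diffs
sh dirdiff.sh old-version new-version diffs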
If you want to compare source code, it is better to commit it to a version control system such as SVN.
After you have done so, do a diff of your uploaded code and pipe it to file.diff:
svn diff --old svn:url1 --new svn:url2 > file.diff
A bash for loop will work for you. The following will diff two directories with C source code and produce a separate diff for each file.
for FILE in $(find <FIRST_DIR> -name '*.[ch]'); do DIFF=<DIFF_DIR>/$(echo $FILE | grep -o '[-_a-zA-Z0-9.]*$').diff; diff -u $FILE <SECOND_DIR>/$FILE > $DIFF; done
Use the correct patch level for the lines starting with +++

How can I tell if a file is on a remote filesystem with Perl?

Is there a quick-and-dirty way to tell programmatically, in shell script or in Perl, whether a path is located on a remote filesystem (nfs or the like) or a local one? Or is the only way to do this to parse /etc/fstab and check the filesystem type?
stat -f -c %T <filename> should do what you want. You might also want -L to follow symlinks.
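For example, a rough shell sketch built on that; the list of network filesystem types here is an assumption and may need extending:
#!/bin/sh
# Print "remote" if the filesystem holding $1 looks network-backed, "local" otherwise.
fstype=$(stat -f -c %T "$1")
case "$fstype" in
    nfs|nfs4|cifs|smb*|fuse.sshfs) echo remote ;;
    *) echo local ;;
esac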
You can use "df -T" to get the filesystem type for the directory, or use the -t option to limit reporting to specific types (like nfs) and if it comes back with "no file systems processed", then it's not one of the ones you're looking for.
df -T $dir | tail -1 | awk '{print $2;}'
You can use df on a directory to get info only about the device it resides on, e.g. for the current directory:
df .
Then, you can just parse the output, e.g.
df . | tail -1 | awk '{print $1}'
to get the device name.
I have tested the following on Solaris 7, 8, 9 & 10 and it seems to be reliable:
/bin/df -g <filename> | tail -2 | head -1 | awk '{print $1}'
This should give you the filesystem type, rather than trying to match a "host:path" in your mount point.
On some systems, the device number is negative for NFS files. Thus,
print "remote" if (stat($filename))[0] < 0