Google Cloud Storage with double slashes in the file path - google-cloud-storage

This issue cost me some time as some files seemed to disappear. I'm not sure if this is the correct behavior when (in my case inadvertently) a file with double slashes is added to Google Cloud Storage.
This is how to recreate the issue:
$ vi test (create a simple file)
$ gsutil cp test gs://xxxxx.appspot.com/test/doubleslash//test
Copying file://test [Content-Type=application/octet-stream]...
Uploading gs://xxxxx.appspot.com/test/doubleslash//test: 15 B/15 B
// attempting gsutil ls command with one slash ... only the folder shows
$ gsutil ls -al gs://xxxxx.appspot.com/test/doubleslash/
gs://xxxxx.appspot.com/test/doubleslash//
// attempting gsutil ls command with two slashes ... only the folder shows
$ gsutil ls -al gs://xxxxx.appspot.com/test/doubleslash//
gs://xxxxx.appspot.com/test/doubleslash//
$
Note: the "test" file isn't shown. However, when the full file path is supplied, it does appear:
$ gsutil ls -al gs://xxxxx.appspot.com/test/doubleslash//test
15 2015-11-27T17:21:24Z gs://xxxxx.appspot.com/test/doubleslash//test#1448644884041000 metageneration=1
TOTAL: 1 objects, 15 bytes (15 B)
Finally, the really odd part: when I add a file (called testInside) into the single-slash folder and then run the "gsutil ls" command with the double slash, the file in the single-slash folder appears, but the one in the double-slash folder does not:
$ gsutil cp test gs://xxxxx.appspot.com/test/doubleslash/testInside
Copying file://test [Content-Type=application/octet-stream]...
Uploading …/xxxxx.appspot.com/test/doubleslash/testInside: 15 B/15 B
$ gsutil ls -al gs://xxxxx.appspot.com/test/doubleslash//
gs://xxxxx.appspot.com/test/doubleslash//
15 2015-11-27T20:14:42Z gs://xxxxx.appspot.com/test/doubleslash/testInside#1448655282555000 metageneration=1
TOTAL: 1 objects, 15 bytes (15 B)
Now, when listing the folder via the app, the files in the double-slash folder do appear, along with any files in the single-slash folder (when specifying one slash):
build.gradle:
compile ('com.google.apis:google-api-services-storage:v1-rev54-1.21.0')
Java code:
Objects response = storageService.objects()
.list("xxxxx.appspot.com")
.setPrefix("test/doubleslash/").execute();
I cannot work out why the file inside the double-slash folder appears from code but not from the gsutil command line. Comments?
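As a side note (an assumption on my part, not something verified above): gsutil's ** wildcard matches across slash boundaries, so a recursive wildcard listing of the prefix should also surface the double-slash object:
$ gsutil ls -al gs://xxxxx.appspot.com/test/doubleslash/**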

Related

How to delete a specific subdirectory with a wildcard in gsutil?

Inspired by this question, I am trying to delete specific folders from my bucket using a wildcard in a gsutil command such as:
gsutil rm -r gs://bucket-name/path/to/**/content
or
gsutil rm -r gs://bucket-name/path/to/*/content
This throws the error:
zsh: no matches found: gs://bucket-name/path/to/**/content
zsh: no matches found: gs://bucket-name/path/to/*/content
Here the * or ** stands in for IDs (thousands of records), and under each ID there are 2 directories: content and content2. I only want to remove the content directory.
Thanks in advance
As per this answer by @Mike Schwartz, you have to use single or double quotes when using wildcards.
zsh is attempting to expand the wildcard before gsutil sees it (and is complaining that you have no local files matching that wildcard). Please try this, to prevent zsh from doing so:
gsutil rm 'gs://bucket/**'
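Applied to the command from the question (a sketch assuming the bucket layout described there), quoting the URL keeps zsh out of the way and lets gsutil expand the wildcard itself; gsutil's * matches a single path component, so only the content directory under each ID should be removed, leaving content2 alone:
gsutil -m rm -r 'gs://bucket-name/path/to/*/content'
The -m flag runs the deletions in parallel, which helps when there are thousands of IDs.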

gsutil command to delete old files from last day

I have a bucket in Google Cloud Storage with a tmp folder in it. Thousands of files are created in this directory each day. I want to delete files that are older than 1 day every night. I could not find a gsutil argument for this job, so I had to use a classic, simple shell script. But the files are being deleted very slowly.
I have 650K files accumulated in the folder, and 540K of them must be deleted. But my shell script ran for a whole day and could only delete 34K files.
The gsutil lifecycle feature is not able to do exactly what I want: it cleans the whole bucket. I just want to regularly delete the files under a certain folder, and at the same time I want the deletion to be faster.
I'm open to your suggestions and your help. Can I do this with a single gsutil command, or with a different method?
A simple script I created for testing (a temporary way to delete files in bulk):
## step 1 - I list the files together with their dates and save them to the file list1.txt.
gsutil -m ls -la gs://mygooglecloudstorage/tmp/ | awk '{print $2,$3}' > /tmp/gsutil-tmp-files/list1.txt
## step 2 - I filter the information saved in the file list1.txt. Based on the current date, I save the old dated files to file list2.txt.
cat /tmp/gsutil-tmp-files/list1.txt | awk -F "T" '{print $1,$2,$3}' | awk '{print $1,$3}' | awk -F "#" '{print $1}' |grep -v `date +%F` |sort -bnr > /tmp/gsutil-tmp-files/list2.txt
## step 3 - After the above process, I prepend the gsutil delete command to each line and turn the list into a shell script.
cat /tmp/gsutil-tmp-files/list2.txt | awk '{$1 = "/root/google-cloud-sdk/bin/gsutil -m rm -r "; print}' > /tmp/gsutil-tmp-files/remove-old-files.sh
## step 4 - I set the script permissions and delete the old lists.
chmod 755 /tmp/gsutil-tmp-files/remove-old-files.sh
rm -rf /tmp/gsutil-tmp-files/list1.txt /tmp/gsutil-tmp-files/list2.txt
## step 5 - I run the shell script and I destroy it after it is done.
/bin/sh /tmp/gsutil-tmp-files/remove-old-files.sh
rm -rf /tmp/gsutil-tmp-files/remove-old-files.sh
There is a very simple way to do this, for example:
gsutil -m ls -l gs://bucket-name/ | grep 2017-06-23 | grep .jpg | awk '{print $3}' | gsutil -m rm -I
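A hedged variant of the same idea for the tmp/ folder from the question (assuming gsutil ls -l's usual size/timestamp/URL columns): drop everything stamped with today's date, keep only the object URLs, and delete the rest in bulk:
gsutil -m ls -l gs://mygooglecloudstorage/tmp/ | grep -v "$(date +%F)" | awk '$3 ~ /^gs:\/\// {print $3}' | gsutil -m rm -I
The awk filter also discards the TOTAL footer and any sub-directory prefix lines, which have no URL in the third column.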
There isn't a simple way to do this with gsutil or object lifecycle management as of today.
That being said, would it be feasible for you to change the naming format for the objects in your bucket? That is, instead of uploading them all under "gs://mybucket/tmp/", you could append the current date to that prefix, resulting in something like "gs://mybucket/tmp/2017-12-27/". The main advantages to this would be:
Not having to do a date comparison for every object; you could run gsutil ls "gs://mybucket/tmp/" | grep "gs://[^/]\+/tmp/[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}/$" to find those prefixes, then do date comparisons on the last portion of those paths.
Being able to supply a smaller number of arguments on the command line (prefixes, rather than the name of each individual file) to gsutil -m rm -r, thus being less likely to pass in more arguments than your shell can handle.
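If you adopt that layout, the nightly cleanup could look roughly like this sketch (assumptions on my part: GNU date, a hypothetical gs://mybucket, and prefixes named tmp/YYYY-MM-DD/ as described above):
cutoff=$(date -d yesterday +%F)
gsutil ls gs://mybucket/tmp/ | grep "gs://[^/]\+/tmp/[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}/$" | while read -r prefix; do
  day=$(basename "$prefix")
  # remove any date-named prefix strictly older than yesterday
  [[ "$day" < "$cutoff" ]] && gsutil -m rm -r "$prefix"
done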

How to print the progress of the files being copied in bash [duplicate]

I suppose I could compare the number of files in the source directory to the number of files in the target directory as cp progresses, or perhaps do it with folder size instead? I tried to find examples, but all bash progress bars seem to be written for copying single files. I want to copy a bunch of files (or a directory, if the former is not possible).
You can also use rsync instead of cp like this:
rsync -Pa source destination
This will give you a progress bar and an estimated time of completion. Very handy.
To show a progress bar while doing a recursive copy of files & folders & subfolders (including links and file attributes), you can use gcp (easily installed in Ubuntu and Debian by running "sudo apt-get install gcp"):
gcp -rf SRC DEST
Here is the typical output while copying a large folder of files:
Copying 1.33 GiB 73% |##################### | 230.19 M/s ETA: 00:00:07
Notice that it shows just one progress bar for the whole operation, whereas if you want a single progress bar per file, you can use rsync:
rsync -ah --progress SRC DEST
You may have a look at the tool vcp. That's a simple copy tool with two progress bars: one for the current file, and one for the overall progress.
EDIT
Here is the link to the sources: http://members.iinet.net.au/~lynx/vcp/
Manpage can be found here: http://linux.die.net/man/1/vcp
Most distributions have a package for it.
Here is another solution: use the tool bar.
You could invoke it like this:
#!/bin/bash
# Copy the contents of directory $1 into directory $2, piping the tar stream
# through bar to get an overall progress display.
filesize=$(du -sb "${1}" | awk '{ print $1 }')
tar -cf - -C "${1}" ./ | bar --size "${filesize}" | tar -xf - -C "${2}"
You have to go through tar, and it will be inaccurate on small files. Also, you must make sure that the target directory exists. But it is a way.
My preferred option is Advanced Copy, as it uses the original cp source files.
$ wget http://ftp.gnu.org/gnu/coreutils/coreutils-8.32.tar.xz
$ tar xvJf coreutils-8.32.tar.xz
$ cd coreutils-8.32/
$ wget --no-check-certificate https://raw.githubusercontent.com/jarun/advcpmv/master/advcpmv-0.8-8.32.patch
$ patch -p1 -i advcpmv-0.8-8.32.patch
$ ./configure
$ make
The new programs are now located in src/cp and src/mv. You may choose to replace your existing commands:
$ sudo cp src/cp /usr/local/bin/cp
$ sudo cp src/mv /usr/local/bin/mv
Then you can use cp as usual, or specify -g to show the progress bar:
$ cp -g src dest
A simple Unix way is to go to the destination directory and run watch -n 5 du -s . You could make it prettier by rendering it as a bar. This can help in environments where you only have the standard Unix utilities and no scope for installing additional tools. du -s is the key; watch just reruns it every 5 seconds.
Pros: works on any Unix system. Cons: no progress bar.
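For example, if the copy target is /path/to/destination (adjust to your own path):
watch -n 5 du -sh /path/to/destination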
To add another option, you can use cpv. It uses pv to imitate the usage of cp.
It works like pv but you can use it to recursively copy directories
You can get it here
There's a tool pv to do this exact thing: http://www.ivarch.com/programs/pv.shtml
There's an Ubuntu version in apt.
How about something like
find . -type f | pv -s $(find . -type f | wc -c) | xargs -i cp {} --parents /DEST/$(dirname {})
It finds all the files in the current directory, pipes that through pv while giving pv an estimated size so the progress meter works, and then pipes that to a cp command with the --parents flag so the DEST path matches the SRC path.
One problem I have yet to overcome is that if you issue this command
find /home/user/test -type f | pv -s $(find . -type f | wc -c) | xargs -i cp {} --parents /www/test/$(dirname {})
the destination path becomes /www/test/home/user/test/....FILES... and I am unsure how to tell the command to get rid of the '/home/user/test' part. That's why I have to run it from inside the SRC directory.
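One way to express that same workaround without permanently changing directory is to run the pipeline in a subshell that cd's into the source first (a sketch reusing the hypothetical paths above):
(cd /home/user/test && find . -type f | pv -s $(find . -type f | wc -c) | xargs -i cp {} --parents /www/test/$(dirname {}))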
Check the source code for progress_bar in my git repository below:
https://github.com/Kiran-Bose/supreme
Also try the custom bash script package supreme to see how the progress bar works with the cp and mv commands.
Functionality overview
(1)Open Apps
----Firefox
----Calculator
----Settings
(2)Manage Files
----Search
----Navigate
----Quick access
|----Select File(s)
|----Inverse Selection
|----Make directory
|----Make file
|----Open
|----Copy
|----Move
|----Delete
|----Rename
|----Send to Device
|----Properties
(3)Manage Phone
----Move/Copy from phone
----Move/Copy to phone
----Sync folders
(4)Manage USB
----Move/Copy from USB
----Move/Copy to USB
There is the command progress (https://github.com/Xfennec/progress), a coreutils progress viewer.
Just run progress in another terminal to see the copy/move progress. For continuous monitoring, use the -M flag.

How to count the number of files in a bucket folder with gsutil

Is there an option to count the number of files in bucket-folders?
Like:
gsutil ls -count -recursive gs://bucket/folder
Result: 666 files
I just want a total number of files to compare against the sync folder on my server.
I can't find this in the manual.
The gsutil ls command with options -l (long listing) and -R (recursive listing) will list the entire bucket recursively and then produce a total count of all objects, both files and directories, at the end:
$ gsutil ls -lR gs://pub
104413 2011-04-03T20:58:02Z gs://pub/SomeOfTheTeam.jpg
172 2012-06-18T21:51:01Z gs://pub/cloud_storage_storage_schema_v0.json
1379 2012-06-18T21:51:01Z gs://pub/cloud_storage_usage_schema_v0.json
1767691 2013-09-18T07:57:42Z gs://pub/gsutil.tar.gz
2445111 2013-09-18T07:57:44Z gs://pub/gsutil.zip
1136 2012-07-19T16:01:05Z gs://pub/gsutil_2.0.ReleaseNotes.txt
... <snipped> ...
gs://pub/apt/pool/main/p/python-socksipy-branch/:
10372 2013-06-10T22:52:58Z gs://pub/apt/pool/main/p/python-socksipy-branch/python-socksipy-branch_1.01_all.deb
gs://pub/shakespeare/:
84 2010-05-07T23:36:25Z gs://pub/shakespeare/rose.txt
TOTAL: 144 objects, 102723169 bytes (97.96 MB)
If you really just want the total, you can pipe the output to the tail command:
$ gsutil ls -lR gs://pub | tail -n 1
TOTAL: 144 objects, 102723169 bytes (97.96 MB)
UPDATE
gsutil now has a du command. This makes it even easier to get a count:
$ gsutil du gs://pub | wc -l
232
If you have the option to not use gsutil, the easiest way is to check it on Google Cloud Platform.
Go to Monitoring > Metrics explorer:
Resource type: GCS Bucket
Metric: Object count
Then, in the table below, you can see for each bucket the number of objects it contains.
You want to gsutil ls -count -recursive in gs://bucket/folder?
Alright: gsutil ls gs://bucket/folder/** will list just the full URLs of the files under gs://bucket/folder, without the footer or the lines ending in a colon. Piping that to wc -l will give you the line count of the result.
gsutil ls gs://bucket/folder/** | wc -l
gsutil ls -lR gs://Folder1/Folder2/Folder3/** | tail -n 1
As someone who had 4.5M objects in a bucket, I used gsutil du gs://bucket/folder | wc -l, which took ~24 min.
This gist shows how to iterate through all Cloud Storage buckets and list the number of objects in each. Compliments of @vinoaj
for VARIABLE in $(gsutil ls)
do
  echo $(gsutil du $VARIABLE | grep -v /$ | wc -l) $VARIABLE
done
To filter buckets, add a grep such as for VARIABLE in $(gsutil ls | grep "^gs://bucketname")
In the console, you can click Activate Cloud Shell in the top right and paste this in to get results. If you save the commands as a bash script, then run chmod u+x program_name so the script can run in the GCP Cloud Shell.
NOTE: When you do gsutil du gs://my-bucket/logs | wc -l, the result includes an "extra" line for each bucket and sub-directory. For example, 3 files in a top-level bucket will count as 4, and 3 files in a sub-directory will count as 5.
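If those extra lines bother you, the same grep -v /$ trick from the loop above should filter out the bucket and sub-directory placeholder lines before counting:
gsutil du gs://my-bucket/logs | grep -v /$ | wc -l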
This doesn't work recursively, but you can also get the count of a single large folder from the console. This method has the advantage of being very fast.
Select Sort and filter from the filter menu in your bucket.
Reverse the sort order to let Google Cloud Storage calculate the number of files/folders.
View the count of files/folders in the current folder.

How to use Rsync to copy only specific subdirectories (same names in several directories)

I have such directories structure on server 1:
data
  company1
    unique_folder1
    other_folder
    ...
  company2
    unique_folder1
    ...
  ...
And I want to duplicate this folder structure on server 2, but copy only the unique_folder1 directories/subdirectories. I.e. the result must be:
data
  company1
    unique_folder1
  company2
    unique_folder1
  ...
I know that rsync is very good for this.
I've tried 'include/exclude' options without success.
E.g. I've tried:
rsync -avzn --list-only --include '*/unique_folder1/**' --exclude '*' -e ssh user@server.com:/path/to/old/data/ /path/to/new/data/
But, as result, I don't see any files/directories:
receiving file list ... done
sent 43 bytes received 21 bytes 42.67 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
What's wrong? Ideas?
Additional information:
I have sudo access to both servers. One idea I have is to use find and cpio together to copy the content I need into a new directory, and after that use rsync. But this is very slow; there are a lot of files, etc.
I've found the reason. It wasn't clear to me that rsync works this way.
So correct command (for company1 directory only) must be:
rsync -avzn --list-only --include 'company1/' --include 'company1/unique_folder1/***' --exclude '*' -e ssh user@server.com:/path/to/old/data/ /path/to/new/data
I.e. we need to include each parent company directory. And of course we cannot manually write all these company directories on the command line, so we save the list to a file and use it.
Final things we need to do:
1. Generate the include file on server 1, so that its content will be (I've used ls and awk):
+ company1/
+ company1/unique_folder1/***
...
+ companyN/
+ companyN/unique_folder1/***
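The ls and awk from step 1 aren't shown above; one hypothetical way to build that include file (run on server 1 from inside /path/to/old/data) could be:
ls -d */ | awk '{ dir=$0; sub(/\/$/, "", dir); print "+ " dir "/"; print "+ " dir "/unique_folder1/***" }' > include.txt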
2. Copy include.txt to server 2 and use a command like this:
rsync -avzn \
--list-only \
--include-from '/path/to/new/include.txt' \
--exclude '*' \
-e ssh user@server.com:/path/to/old/data/ \
/path/to/new/data
If the first matching pattern excludes a directory, then all its descendants will never be traversed. When you want to include a deep directory e.g. company*/unique_folder1/** but exclude everything else *, you need to tell rsync to include all its ancestors too:
rsync -r -v --dry-run \
--include='/' \
--include='/company*/' \
--include='/company*/unique_folder1/' \
--include='/company*/unique_folder1/**' \
--exclude='*'
You can use bash’s brace expansion to save some typing. After brace expansion, the following command is exactly the same as the previous one:
rsync -r -v --dry-run --include=/{,'company*/'{,unique_folder1/{,'**'}}} --exclude='*'
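For completeness, wiring the same filter set up to the source and destination used earlier in this thread would look something like this (dry run; host and paths are the ones assumed in the question):
rsync -r -v --dry-run --include=/{,'company*/'{,unique_folder1/{,'**'}}} --exclude='*' user@server.com:/path/to/old/data/ /path/to/new/data/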
An alternative to Andron's answer that is simpler to both understand and implement in many cases is to use the --files-from=FILE option. For the current problem,
rsync -arv --files-from='list.txt' old_path/data new_path/data
Where list.txt is simply
company1/unique_folder1/
company2/unique_folder1/
...
Note the -r flag must be included explicitly since --files-from turns off this behaviour of the -a flag. It also seems to me that the path construction is different from other rsync commands, in that company1/unique_folder1/ matches but /data/company1/unique_folder1/ does not.
For example, if you only want to sync target/classes/ and target/lib/ to a remote system, do
rsync -vaH --delete --delete-excluded --include='classes/***' --include='lib/***' \
--exclude='*' target/ user#host:/deploy/path/
The important things to watch:
Don't forget the "/" at the end of the paths, or you will get a copy into a subdirectory.
The order of the --include and --exclude options matters.
Contrary to the other answers, starting an include/exclude parameter with "/" is not needed; the patterns are automatically anchored to the source directory (target/ in the example).
To test what exactly will happen, we can use the --dry-run flag, as the other answers say.
--delete-excluded will delete all content in the target directory except the subdirectories we specifically included. It should be used wisely! For this reason, --delete alone is not enough: by default it does not delete the excluded files on the remote side (everything else, yes), so --delete-excluded should be given alongside the ordinary --delete.