How to count the number of files in a bucket folder with gsutil - google-cloud-storage

Is there an option to count the number of files in bucket folders?
Something like:
gsutil ls -count -recursive gs://bucket/folder
Result: 666 files
I just want a total number of files so I can compare it to the
number of files in the sync folder on my server.
I can't find anything like this in the manual.

The gsutil ls command with options -l (long listing) and -R (recursive listing) will list the entire bucket recursively and then produce a total count of all objects, both files and directories, at the end:
$ gsutil ls -lR gs://pub
104413 2011-04-03T20:58:02Z gs://pub/SomeOfTheTeam.jpg
172 2012-06-18T21:51:01Z gs://pub/cloud_storage_storage_schema_v0.json
1379 2012-06-18T21:51:01Z gs://pub/cloud_storage_usage_schema_v0.json
1767691 2013-09-18T07:57:42Z gs://pub/gsutil.tar.gz
2445111 2013-09-18T07:57:44Z gs://pub/gsutil.zip
1136 2012-07-19T16:01:05Z gs://pub/gsutil_2.0.ReleaseNotes.txt
... <snipped> ...
gs://pub/apt/pool/main/p/python-socksipy-branch/:
10372 2013-06-10T22:52:58Z gs://pub/apt/pool/main/p/python-socksipy-branch/python-socksipy-branch_1.01_all.deb
gs://pub/shakespeare/:
84 2010-05-07T23:36:25Z gs://pub/shakespeare/rose.txt
TOTAL: 144 objects, 102723169 bytes (97.96 MB)
If you really just want the total, you can pipe the output to the tail command:
$ gsutil ls -lR gs://pub | tail -n 1
TOTAL: 144 objects, 102723169 bytes (97.96 MB)
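If you want just the bare number, awk can pull it out of that TOTAL line (a small sketch building on the example above):
$ gsutil ls -lR gs://pub | tail -n 1 | awk '{print $2}'
144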
UPDATE
gsutil now has a du command. This makes it even easier to get a count:
$ gsutil du gs://pub | wc -l
232

If you have the option of not using gsutil, the easiest way is to check it in the Google Cloud Console.
Go to Monitoring > Metrics explorer:
Resource type: GCS Bucket
Metric: Object count
Then, in the table below, you get the number of objects each bucket contains.

You want gsutil ls -count -recursive for gs://bucket/folder?
gsutil ls gs://bucket/folder/** will list just the full URLs of the files under gs://bucket/folder, without the footer or the lines ending in a colon. Piping that to wc -l gives you the line count of the result:
gsutil ls gs://bucket/folder/** | wc -l

gsutil ls -lR gs://Folder1/Folder2/Folder3/** | tail -n 1

As someone who had 4.5M objects in a bucket, I used gsutil du gs://bucket/folder | wc -l, which took ~24 minutes.

This gist shows how to iterate through all Cloud Storage buckets and list the number of objects in each. Compliments of #vinoaj
for VARIABLE in $(gsutil ls)
do
  echo $(gsutil du $VARIABLE | grep -v /$ | wc -l) $VARIABLE
done
To filter buckets, add a grep such as for VARIABLE in $(gsutil ls | grep "^gs://bucketname")
In the console, you can click Activate Cloud Shell in the top right and paste this in to get results. If you save the commands as a bash script, then run chmod u+x program_name so the script can run in the GCP Cloud Shell.
NOTE: When you run gsutil du gs://my-bucket/logs | wc -l, the result includes an "extra" line for the bucket and for each sub-directory. For example, 3 files in a top-level bucket count as 4, and 3 files in a sub-directory count as 5.
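If you want a file-only count for a single path, the same grep -v /$ filter used in the loop above should strip those extra bucket and sub-directory lines (a sketch, reusing the hypothetical gs://my-bucket/logs path from the note):
gsutil du gs://my-bucket/logs | grep -v '/$' | wc -l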

This doesn't work recursively, but you can also get the count of a single large folder from the console. This method has the advantage of being very fast.
Select Sort and filter from the filter menu in your bucket.
Reverse the sort order to let Google Cloud Storage calculate the number of files/folders.
View the count of files/folders in the current folder.

Related

How to get the files count in a variable of particular directory in GitHub actions YAML file?

I need to get the exact file count of a particular directory and store the value in a variable. I am using a GitHub Actions YAML file on a Windows machine, so I need to get the result within this workflow.
Could you suggest a solution for this?
Assuming your filenames don't have newlines, you can do this, where $DIR is your directory:
$ find "$DIR" -maxdepth 1 -mindepth 1 | wc -l
13
If you want to store it in an environment variable, you could do this:
COUNT=$(find "$DIR" -maxdepth 1 -mindepth 1 | wc -l)
Note that you will need to use bash for this if you're on Windows.
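If the count needs to be visible to later steps of the workflow, one common approach is to append it to the $GITHUB_ENV file from a step that uses shell: bash (a sketch; DIR and FILE_COUNT are illustrative names, not anything defined in your workflow):
# run: block of a GitHub Actions step with `shell: bash`
COUNT=$(find "$DIR" -maxdepth 1 -mindepth 1 | wc -l)
echo "FILE_COUNT=$COUNT" >> "$GITHUB_ENV"   # later steps can read it as env.FILE_COUNT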

gsutil command to delete old files from last day

I have a bucket in Google Cloud Storage with a tmp folder in it. Thousands of files are created in this directory every day, and I want to delete files that are older than 1 day every night. I could not find a gsutil argument for this job, so I used a classic, simple shell script, but the files are deleted very slowly.
I have 650K files accumulated in the folder, and 540K of them must be deleted. My shell script ran for a whole day and deleted only 34K files.
The gsutil lifecycle feature cannot do exactly what I want: it cleans the whole bucket, whereas I only want to regularly delete the files under a certain folder. I also want the deletion to be faster.
I'm open to your suggestions and your help. Can I do this with a single gsutil command, or a different method?
A simple script I created for testing (I prepared it to delete the files in bulk as a temporary measure):
## step 1 - List the files together with their dates and save the output to list1.txt.
gsutil -m ls -la gs://mygooglecloudstorage/tmp/ | awk '{print $2,$3}' > /tmp/gsutil-tmp-files/list1.txt
## step 2 - Filter the information saved in list1.txt: based on the current date, save the older files to list2.txt.
cat /tmp/gsutil-tmp-files/list1.txt | awk -F "T" '{print $1,$2,$3}' | awk '{print $1,$3}' | awk -F "#" '{print $1}' |grep -v `date +%F` |sort -bnr > /tmp/gsutil-tmp-files/list2.txt
## step 3 - Prepend the gsutil delete command to each line and turn the result into a shell script.
cat /tmp/gsutil-tmp-files/list2.txt | awk '{$1 = "/root/google-cloud-sdk/bin/gsutil -m rm -r "; print}' > /tmp/gsutil-tmp-files/remove-old-files.sh
## step 4 - Set the script permissions and delete the old lists.
chmod 755 /tmp/gsutil-tmp-files/remove-old-files.sh
rm -rf /tmp/gsutil-tmp-files/list1.txt /tmp/gsutil-tmp-files/list2.txt
## step 5 - Run the shell script and delete it when it is done.
/bin/sh /tmp/gsutil-tmp-files/remove-old-files.sh
rm -rf /tmp/gsutil-tmp-files/remove-old-files.sh
There is a very simple way to do this, for example:
gsutil -m ls -l gs://bucket-name/ | grep 2017-06-23 | grep .jpg | awk '{print $3}' | gsutil -m rm -I
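A variation on the same idea, computing yesterday's date instead of hard-coding it (this assumes GNU date and the tmp/ folder from the question, and like the example above it only matches objects stamped with that one date):
gsutil -m ls -l gs://mygooglecloudstorage/tmp/ | grep "$(date -d yesterday +%F)" | awk '{print $3}' | gsutil -m rm -I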
There isn't a simple way to do this with gsutil or object lifecycle management as of today.
That being said, would it be feasible for you to change the naming format for the objects in your bucket? That is, instead of uploading them all under "gs://mybucket/tmp/", you could append the current date to that prefix, resulting in something like "gs://mybucket/tmp/2017-12-27/". The main advantages of this would be:
Not having to do a date comparison for every object; you could run gsutil ls "gs://mybucket/tmp/" | grep "gs://[^/]\+/tmp/[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}/$" to find those prefixes, then do date comparisons on the last portion of those paths.
Being able to supply a smaller number of arguments on the command line (prefixes, rather than the name of each individual file) to gsutil -m rm -r, making it less likely that you pass in more arguments than your shell can handle.
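Building on that layout, yesterday's uploads could then be removed with a single prefix delete rather than one argument per object (a rough sketch, assuming GNU date and the hypothetical gs://mybucket/tmp/YYYY-MM-DD/ prefixes above):
gsutil -m rm -r "gs://mybucket/tmp/$(date -d yesterday +%F)/"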

How to find the max file size in a hdfs directory

I want to find the max size of the files in an HDFS directory. Does anyone have any idea how to find it? I'm on Hadoop 2.6.0.
I found hadoop fs -ls -S /url, which can sort output by file size, in the Hadoop 2.7.0 documentation, but it's not supported in 2.6.0. So is there any similar function that can sort output files by size? Thank you!
You can make use of the hdfs fsck command to get the file sizes.
For e.g., when I execute hdfs fsck /tmp/ -files, then I get the following output:
/tmp <dir>
/tmp/100GB <dir>
/tmp/100GB/Try <dir>
/tmp/100GB/Try/1.txt 5 bytes, 1 block(s): OK
/tmp/100GB/_SUCCESS 0 bytes, 0 block(s): OK
/tmp/100GB/part-m-00000 107374182400 bytes, 800 block(s): OK
/tmp/100GB/part-m-00001._COPYING_ 44163923968 bytes, 330 block(s):
/tmp/10GB <dir>
/tmp/10GB/_SUCCESS 0 bytes, 0 block(s): OK
/tmp/10GB/part-m-00000 10737418300 bytes, 81 block(s): OK
/tmp/1GB <dir>
/tmp/1GB/_SUCCESS 0 bytes, 0 block(s): OK
/tmp/1GB/part-m-00000 1073741900 bytes, 9 block(s): OK
/tmp/1GB/part-m-00001 1073741900 bytes, 9 block(s): OK
It recursively lists all the files under /tmp along with their sizes.
Now, to parse out the file with max size, you can execute the following command:
hdfs fsck /tmp/ -files | grep "/tmp/" | grep -v "<dir>" | gawk '{print $2, $1;}' | sort -n
This command does the following:
hdfs fsck /tmp/ -files - It runs HDFS file system check on the folder /tmp/ and seeks report for each of the files under /tmp/
grep "/tmp/" - It greps for /tmp/ (the folder which we want to search). This will give only files and folders under /tmp/
"grep -v "<dir>"" - This removes the directories from the output (since we only want files)
gawk '{print $2, $1;}' - This prints the file size ($2), followed by the file name ($1)
sort -n - This does a numeric sort on the file size and the last file in the list should be the file with the largest size
You can pipe the output to tail -1 to get the largest file.
For example, I got this output:
107374182400 /tmp/100GB/part-m-00000
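Putting those pieces together, the whole pipeline with tail -1 at the end looks like this:
hdfs fsck /tmp/ -files | grep "/tmp/" | grep -v "<dir>" | gawk '{print $2, $1;}' | sort -n | tail -1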
To find the largest file, try hdfs dfs -ls /path | sort -r -n -k 5 (drop the -h flag so the size column stays a plain number and sorts correctly; the largest file will be on the first line).
Please try the command below.
hadoop fs -du Folder | sort -n -r | head -n 1

List and operate on files above a certain size?

How do you erase all files below a certain size with gsutil? We could use a script to filter the output of gsutil ls, but that sounds like overkill.
gsutil doesn't have any direct support (as in command line flags) for operating only on files below a given size. Instead you'd have to use a script, such as:
gsutil ls -l gs://your-bucket | awk '{if ($1 < 1024) print $NF}' | xargs some-command
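Note that gsutil ls -l ends with a TOTAL summary line, so it should be filtered out before the size test. A rough sketch that deletes objects smaller than 1024 bytes (the threshold and bucket name are only examples), feeding the URLs to gsutil rm -I, which reads them from stdin:
gsutil ls -l gs://your-bucket | grep -v TOTAL | awk '$1 < 1024 {print $NF}' | gsutil -m rm -I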

How can I tell if a file is on a remote filesystem with Perl?

Is there a quick-and-dirty way to tell programmatically, in shell script or in Perl, whether a path is located on a remote filesystem (nfs or the like) or a local one? Or is the only way to do this to parse /etc/fstab and check the filesystem type?
stat -f -c %T <filename> should do what you want. You might also want -L to follow symlinks.
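Building on that, you could branch on the reported type (GNU coreutils stat assumed; the list of "remote" types here is illustrative, not exhaustive):
case "$(stat -f -c %T "$path")" in
  nfs*|cifs|smb*) echo remote ;;
  *)              echo local ;;
esac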
You can use "df -T" to get the filesystem type for the directory, or use the -t option to limit reporting to specific types (like nfs) and if it comes back with "no file systems processed", then it's not one of the ones you're looking for.
df -T $dir | tail -1 | awk '{print $2;}'
If you use df on a directory to get info only of the device it resides in, e.g. for the current directory:
df .
Then, you can just parse the output, e.g.
df . | tail -1 | awk '{print $1}'
to get the device name.
I have tested the following on solaris7,8,9 & 10 and it seems to be reliable
/bin/df -g <filename> | tail -2 | head -1 | awk '{print $1}'
This should give you the fs type, rather than trying to match a "host:path" in your mount point.
On some systems, the device number is negative for NFS files. Thus,
print "remote" if (stat($filename))[0] < 0