How to calculate filtered file size in google cloud storage - google-cloud-storage

I have a folder hierarchy Bucket/folder/year/month/date/files.ext, e.g. 2021/12/31/abc.html and 2022/1/1/file1.html. The folder contains millions of HTML files and images. I only want to calculate the total size of the files with the .html extension, for years 2019 through 2022, broken down by month and date.
Right now I'm using:
gsutil du gs://Bucket/folder/*/*/*/*.html | wc -l
I couldn't find a better solution; it takes too long and ends with "The connection to your Google Cloud Shell was lost." The second thing is that I also want to delete all the HTML files under a given date, e.g. 2019/1/1/file1.html.

Unfortunately, I think you're already looking at the right answer. GCS doesn't provide any sort of index that'll quickly calculate total file size by file type.
Cloud Shell will time out after some minutes of inactivity, or after 24 total hours, so if you have millions of files and need this to complete, I would suggest starting a small GCE instance and running the command from there, or running gsutil from your own machine.
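If you'd rather avoid the long gsutil wildcard expansion, a minimal sketch with the google-cloud-storage Python client might look like the following (the bucket and folder names come from the question; everything else is an assumption to adapt). It still has to list every object under the prefix, so run it somewhere that won't time out:
from collections import defaultdict
from google.cloud import storage

client = storage.Client()
totals = defaultdict(int)  # (year, month, day) -> total size of .html files in bytes

for blob in client.list_blobs("Bucket", prefix="folder/"):
    if not blob.name.endswith(".html"):
        continue
    parts = blob.name.split("/")  # folder/year/month/day/file.html
    if len(parts) < 5:
        continue
    year, month, day = parts[1], parts[2], parts[3]
    if "2019" <= year <= "2022":
        totals[(year, month, day)] += blob.size

for (year, month, day), size in sorted(totals.items()):
    print(f"{year}/{month}/{day}: {size / 1024**2:.1f} MiB")

# Deleting the HTML files for a given day would follow the same pattern:
# for blob in client.list_blobs("Bucket", prefix="folder/2019/1/1/"):
#     if blob.name.endswith(".html"):
#         blob.delete()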

Related

How to query the github file size in Google BigQuery?

I need to get size statistics for the files in the GitHub open source repositories.
For example: the number of files less than 1 MiB is XXX, or 70% of the total files.
I found that the files in [bigquery-public-data.github_repos.contents] are all less than 1 MiB (though I don't know why), so I decided to use [githubarchive:month.202005] or another month instead.
But I didn't find a "file size" field in [githubarchive:month.202005]. So I would like to ask: how can I query the size of a file in [githubarchive:month.202005]? Then I can use the method in this to get the results by size.
I am new to BigQuery, and the question may be silly, but I really need a solution, or statistics/literature I can cite that gives size statistics for files on GitHub. The documentation for [bigquery-public-data.github_repos.contents] does not mention why only files less than 1 MiB were selected.
I think this is a misinterpretation: the bigquery-public-data.github_repos.contents public table holds text file data in the content column only for items under 1 MiB on the HEAD branch; for the others you'll find just NULL values:
SELECT id, size, content FROM `bigquery-public-data.github_repos.contents` WHERE size > 1048576 LIMIT 100
Therefore, you are not limited to files under 1 MiB when analyzing the file inventory (the size column is populated even where content is NULL), if I understand your point correctly.
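If it helps, a rough sketch of the size-distribution query with the BigQuery Python client could look like this (note that it scans the whole public contents table, so it will use a noticeable amount of query quota):
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT
  COUNTIF(size < 1024 * 1024) AS files_under_1mib,
  COUNT(*)                    AS total_files,
  ROUND(100 * COUNTIF(size < 1024 * 1024) / COUNT(*), 2) AS pct_under_1mib
FROM `bigquery-public-data.github_repos.contents`
"""
for row in client.query(query).result():
    print(row.files_under_1mib, row.total_files, f"{row.pct_under_1mib}%")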

How to process multiple files in Talend one after another when the files are very large?

I want to process multiple files using Talend, one after another. The files are large, and if another file arrives in the directory while one file is being processed, that file has to be processed as well.
Is there any way to do this? Could you please suggest an approach?
You can use the tFileList component, which will iterate over all the files in a given directory.
You can check the component's functionality here.
The simple concept would be:
When there is a file in a directory, say Folder1, move that file to another location, say Folder2.
After processing the file in Folder2, check Folder1 again to see whether any new files have arrived.
If one has, move that file to Folder2 as well and process it.
If there is no new file, end the job.
A great way to do this in Talend is to set up a file watcher job, which is simple to do. Talend provides the tWaitForFile component, which will watch a directory for files. You can configure the maximum number of iterations for which it will look for a file and the time between polls/scans. Since you said you are loading large files, leave enough time between scans to avoid DB concurrency issues.
In my example below I am watching a directory for new files, scanning every 60 seconds over an 8-hour period. You would want to schedule the job in either the TAC or whatever scheduling tool you use. In my example I simply join to a tJavaRow and display the information about the file that was found.
You can see the output from my tJavaRow here, which shows the file info:

Google nearline pricing on overwrites

I have Google Nearline storage set up and working fine via gcloud/gsutil.
So far I have been using rsync to back up some databases, e.g.:
gsutil rsync -d -R /sourcedir/db_dir gs://backup_bucket/
Currently the files are date-stamped in the filename, so we get a different filename every day.
I've just spotted the mention of early deletion charges (I'm currently on the trial).
I'm assuming that whenever I delete a file with -d, I will get charged for that file for up to 30 days? If so, there's no point deleting it before then (but I will still get charged).
But if I keep the filename the same and overwrite the file with the latest day's backup, the docs say:
"if you create an object in a bucket configured for Nearline, and 10 days later you overwrite it, the object is considered an early deletion and you will be charged for the remaining 20 days of storage."
So I'm a bit unclear: if I have a file and overwrite it with a new version, am I then charged again for each file/day every time it's updated, as well as for the new file?
E.g., for one file, backed up daily via rsync (assuming the same filename this time), over 30 days:
day1 myfile is created
day2 myfile is updated
day3 myfile is updated
... and so on
Am I now being charged (filespace day 1 * 30 days) + (filespace day 2 * 29 days) + (filespace day 3 * 28 days) and so on, just for the one file (rather than filespace * 30 days)?
Or does it just mean that if I create a 10 GB file and overwrite it with a 2 MB file, I will be charged for the 10 GB for the full 30 days (ignoring the 2 MB file's costs)?
If so, are there any best practices for rsync that keep charges down?
Overwriting an object in GCS is equivalent to deleting the old object and inserting a new object in its place. You are correct that overwriting an object does incur the early delete charge, and so if you were to overwrite the same file every day, you would be charged for 30 days of storage every day.
Nearline storage is primarily meant for objects that will be retained for a long time and infrequently read or modified, and it's priced accordingly. If you want to modify an object on a daily basis, Standard or Durable Reduced Availability storage would likely be a cheaper option.
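As a back-of-the-envelope illustration (the per-GB rate below is a placeholder, not the current price list), overwriting the same Nearline object every day costs roughly 30x what leaving it untouched for the month would:
RATE_PER_GB_MONTH = 0.01   # hypothetical Nearline $/GB/month; check current pricing
FILE_GB = 10               # size of the daily backup
DAYS = 30

# Overwriting the same object daily: every overwrite is an early delete,
# so each day's version is billed for the full 30-day minimum.
overwrite_daily = DAYS * FILE_GB * RATE_PER_GB_MONTH

# Writing the object once and leaving it for the month: billed once.
keep_for_month = FILE_GB * RATE_PER_GB_MONTH

print(f"overwrite daily: ~${overwrite_daily:.2f}, keep for 30 days: ~${keep_for_month:.2f}")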

Fastest way to get Google Storage bucket size?

I'm currently doing this, but it's VERY slow since I have several terabytes of data in the bucket:
gsutil du -sh gs://my-bucket-1/
And the same for a sub-folder:
gsutil du -sh gs://my-bucket-1/folder
Is it possible to somehow obtain the total size of a complete bucket (or a sub-folder) elsewhere or in some other fashion which is much faster?
The visibility for Google Cloud Storage here is pretty poor.
The fastest way is actually to pull the Stackdriver metrics and look at the total size in bytes.
Unfortunately there is practically no filtering you can do in Stackdriver. You can't wildcard the bucket name, and the (almost useless) bucket resource labels are not aggregatable in Stackdriver metrics.
Also, this is bucket level only, not per prefix.
The Stackdriver metrics are updated daily, so unless you can wait a day you can't use this to get the current size right now.
UPDATE: Stackdriver metrics now support user metadata labels, so you can label your GCS buckets and aggregate those metrics by the custom labels you apply.
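As a sketch, pulling that metric with the Cloud Monitoring (Stackdriver) Python client might look like the following; the metric type and label names are taken from the public metric list, and the project ID is a placeholder:
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project-id"  # placeholder project ID

now = int(time.time())
interval = monitoring_v3.TimeInterval({
    "end_time": {"seconds": now},
    "start_time": {"seconds": now - 2 * 24 * 3600},  # metric is only written ~daily
})

results = client.list_time_series(request={
    "name": project_name,
    "filter": 'metric.type = "storage.googleapis.com/storage/total_bytes"',
    "interval": interval,
    "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
})

for series in results:
    bucket = series.resource.labels["bucket_name"]
    latest_bytes = series.points[0].value.double_value  # newest point first
    print(f"{bucket}: {latest_bytes / 1024**3:.1f} GiB")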
Edit
I want to add a word of warning if you are creating monitors off of this metric. There is a really nasty bug with this metric right now.
GCP occasionally has platform issues that cause this metric to stop getting written. I think it's tenant-specific (maybe?), so you also won't see it on their public health status pages. It also seems poorly documented for their internal support staff, because every time we open a ticket to complain they seem to think we are lying, and it takes some back and forth before they even acknowledge it's broken.
I think this happens if you have many buckets and something crashes on their end and stops writing metrics to your projects. While it does not happen all the time, we see it several times a year.
For example, it just happened to us again. This is what I'm seeing in Stackdriver right now across all our projects:
Response from GCP support
Just adding the last response we got from GCP support during this most recent metric outage. I'll add that all our buckets were accessible; it was just that this metric was not being written:
The product team concluded their investigation stating that this was indeed a widespread issue, not tied to your projects only. This internal issue caused unavailability for some GCS buckets, which was affecting the metering systems directly, thus the reason why the "GCS Bucket Total Bytes" metric was not available.
Unfortunately, no. If you need to know what size the bucket is right now, there's no faster way than what you're doing.
If you need to check on this regularly, you can enable bucket logging. Google Cloud Storage will generate a daily storage log that you can use to check the size of the bucket. If that would be useful, you can read more about it here: https://cloud.google.com/storage/docs/accesslogs#delivery
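If you do enable the daily storage log, a small parsing sketch might look like this (it assumes the documented two-column CSV with "bucket" and "storage_byte_hours" fields; the log file name is a placeholder):
import csv

def average_bucket_bytes(storage_log_path: str) -> float:
    """Average number of bytes stored that day, from one daily storage log."""
    with open(storage_log_path, newline="") as f:
        row = next(csv.DictReader(f))
    return float(row["storage_byte_hours"]) / 24  # byte-hours over a 24h window

print(average_bucket_bytes("my-bucket_storage_log.csv") / 1024**3, "GiB")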
If the daily storage log you get from enabling bucket logging (per Brandon's suggestion) won't work for you, one thing you could do to speed things up is to shard the du request. For example, you could do something like:
gsutil du -s gs://my-bucket-1/a* > a.size &
gsutil du -s gs://my-bucket-1/b* > b.size &
...
gsutil du -s gs://my-bucket-1/z* > z.size &
wait
awk '{sum+=$1} END {print sum}' *.size
(assuming your subfolders are named starting with letters of the English alphabet; if not, you'd need to adjust how you run the above commands).
Use the built-in dashboard
Operations -> Monitoring -> Dashboards -> Cloud Storage
The graph at the bottom shows the bucket size for all buckets, or you can select an individual bucket to drill down.
Note that the metric is only updated once per day.
With python you can get the size of your bucket as follows:
from google.cloud import storage

storage_client = storage.Client()
blobs = storage_client.list_blobs(bucket_or_name='name_of_your_bucket')

blobs_total_size = 0
for blob in blobs:
    blobs_total_size += blob.size  # size in bytes

print(blobs_total_size / (1024 ** 3))  # size in GiB
Google Console
Platform -> Monitoring -> Dashboard -> Select the bucket
Scroll down and you can see the object size for that bucket.
I found that using the CLI it was frequently timing out, but that may be because I was reviewing Coldline storage.
For a GUI solution, look at CloudBerry Explorer.
GUI view of storage
For me, the following command helped:
gsutil ls -l gs://{bucket_name}
It then gives output like this after listing all files:
TOTAL: 6442 objects, 143992287936 bytes (134.1 GiB)

How do you deal with lots of small files?

A product that I am working on collects several thousand readings a day and stores them as 64 KB binary files on an NTFS partition (Windows XP). After a year in production there are over 300,000 files in a single directory, and the number keeps growing. This has made accessing the parent/ancestor directories from Windows Explorer very time consuming.
I have tried turning off the indexing service but that made no difference. I have also contemplated moving the file content into a database/zip files/tarballs but it is beneficial for us to access the files individually; basically, the files are still needed for research purposes and the researchers are not willing to deal with anything else.
Is there a way to optimize NTFS or Windows so that it can work with all these small files?
NTFS actually will perform fine with many more than 10,000 files in a directory as long as you tell it to stop creating alternative file names compatible with 16-bit Windows platforms. By default NTFS automatically creates an '8 dot 3' file name for every file that is created. This becomes a problem when there are many files in a directory, because Windows looks at the files in the directory to make sure the name it is creating isn't already in use. You can disable '8 dot 3' naming by setting the NtfsDisable8dot3NameCreation registry value to 1. The value is found in the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\FileSystem registry path. It is safe to make this change, as '8 dot 3' names are only required by programs written for very old versions of Windows.
A reboot is required before this setting will take effect.
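If you'd rather script the change than edit the registry by hand, a minimal sketch using Python's standard winreg module (run as Administrator; the reboot is still required) could be:
import winreg

key_path = r"System\CurrentControlSet\Control\FileSystem"
with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path, 0, winreg.KEY_SET_VALUE) as key:
    # 1 disables '8 dot 3' short-name creation
    winreg.SetValueEx(key, "NtfsDisable8dot3NameCreation", 0, winreg.REG_DWORD, 1)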
NTFS performance severely degrades after 10,000 files in a directory. What you do is create an additional level in the directory hierarchy, with each subdirectory having 10,000 files.
For what it's worth, this is the approach that the SVN folks took in version 1.5. They used 1,000 files as the default threshold.
The performance issue is being caused by the huge amount of files in a single directory: once you eliminate that, you should be fine. This isn't a NTFS-specific problem: in fact, it's commonly encountered with user home/mail files on large UNIX systems.
One obvious way to resolve this issue is moving the files to folders with names based on the file name. Assuming all your files have file names of similar length, e.g. ABCDEFGHI.db, ABCEFGHIJ.db, etc., create a directory structure like this:
ABC\
    DEF\
        ABCDEFGHI.db
    EFG\
        ABCEFGHIJ.db
Using this structure, you can quickly locate a file based on its name. If the file names have variable lengths, pick a maximum length, and prepend zeroes (or any other character) in order to determine the directory the file belongs in.
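As a sketch of that layout in Python (the chunk length, nesting depth, and padding character are assumptions you would tune for your own file names):
from pathlib import Path

def nested_path(root: Path, filename: str, chunk: int = 3, levels: int = 2) -> Path:
    stem = Path(filename).stem.rjust(chunk * levels, "0")  # prepend zeroes for short names
    parts = [stem[i * chunk:(i + 1) * chunk] for i in range(levels)]
    return root.joinpath(*parts, filename)

print(nested_path(Path("data"), "ABCDEFGHI.db"))  # data/ABC/DEF/ABCDEFGHI.db
print(nested_path(Path("data"), "ABCEFGHIJ.db"))  # data/ABC/EFG/ABCEFGHIJ.db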
I have seen vast improvements in the past from splitting the files up into a nested hierarchy of directories by, e.g., first then second letter of filename; then each directory does not contain an excessive number of files. Manipulating the whole database is still slow, however.
I have run into this problem lots of times in the past. We tried storing by date, zipping files below the date so you don't have lots of small files, etc. All of them were bandaids to the real problem of storing the data as lots of small files on NTFS.
You can go to ZFS or some other file system that handles small files better, but still stop and ask if you NEED to store the small files.
In our case we eventually went to a system where all of the small files for a certain date were appended in a TAR-type fashion with simple delimiters to parse them. The disk files went from 1.2 million to under a few thousand. They actually loaded faster, because NTFS can't handle small files very well and the drive was better able to cache a 1 MB file anyway. In our case the access and parse time to find the right part of the file was minimal compared to the actual storage and maintenance of the stored files.
You could try using something like Solid File System.
This gives you a virtual file system that applications can mount as if it were a physical disk. Your application sees lots of small files, but just one file sits on your hard drive.
http://www.eldos.com/solfsdrv/
If you can calculate names of files, you might be able to sort them into folders by date, so that each folder only have files for a particular date. You might also want to create month and year hierarchies.
Also, could you move files older than say, a year, to a different (but still accessible) location?
Finally (and again, this requires you to be able to calculate names), you'll find that directly accessing a file is much faster than trying to open it via Explorer. For example, running
notepad.exe "P:\ath\to\your\filen.ame"
from the command line should actually be pretty quick, assuming you know the path of the file you need without having to get a directory listing.
One common trick is to simply create a handful of subdirectories and divvy up the files.
For instance, Doxygen, an automated code documentation program which can produce tons of html pages, has an option for creating a two-level deep directory hierarchy. The files are then evenly distributed across the bottom directories.
Aside from placing the files in sub-directories...
Personally, I would develop an application that keeps the interface to that folder the same, i.e. all files are displayed as individual files. Then, in the background, the application actually takes these files and combines them into larger files (and since the sizes are always 64 KB, getting the data you need should be relatively easy), to get rid of the mess you have.
So you can still make it easy for them to access the files they want, but it also gives you more control over how everything is structured.
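A minimal sketch of that idea (the container path and helper names are made up for illustration): because every reading is exactly 64 KB, record i always lives at offset i * RECORD_SIZE in one container file, so no per-file index is needed.
RECORD_SIZE = 64 * 1024  # every reading is exactly 64 KB

def append_reading(container_path: str, data: bytes) -> int:
    """Append one reading and return its index within the container file."""
    assert len(data) == RECORD_SIZE
    with open(container_path, "ab") as f:
        f.seek(0, 2)      # position explicitly at end of file
        offset = f.tell()
        f.write(data)
    return offset // RECORD_SIZE

def read_reading(container_path: str, index: int) -> bytes:
    with open(container_path, "rb") as f:
        f.seek(index * RECORD_SIZE)
        return f.read(RECORD_SIZE)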
Having hundreds of thousands of files in a single directory will indeed cripple NTFS, and there is not really much you can do about that. You should reconsider storing the data in a more practical format, like one big tarball or in a database.
If you really need a separate file for each reading, you should sort them into several subdirectories instead of having all of them in the same directory. You can do this by creating a hierarchy of directories and putting the files in different ones depending on the file name. This way you can still store and load your files knowing just the file name.
The method we use is to take the last few letters of the file name, reverse them, and create one-letter directories from that. Consider the following files, for example:
1.xml
24.xml
12331.xml
2304252.xml
you can sort them into directories like so:
data/1.xml
data/24.xml
data/1/3/3/12331.xml
data/2/5/2/4/0/2304252.xml
This scheme will ensure that you will never have more than 100 files in each directory.
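A sketch of that mapping in Python (it assumes names like the examples above, where every character except the first two of the name, taken in reverse order, becomes a one-letter directory):
from pathlib import Path

def reversed_suffix_path(root: Path, filename: str) -> Path:
    stem = Path(filename).stem
    # every character except the first two, reversed, as one-letter directories
    levels = list(reversed(stem))[: max(len(stem) - 2, 0)]
    return root.joinpath(*levels, filename)

print(reversed_suffix_path(Path("data"), "1.xml"))        # data/1.xml
print(reversed_suffix_path(Path("data"), "12331.xml"))    # data/1/3/3/12331.xml
print(reversed_suffix_path(Path("data"), "2304252.xml"))  # data/2/5/2/4/0/2304252.xml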
Consider pushing them to another server that uses a filesystem friendlier to massive quantities of small files (Solaris w/ZFS for example)?
If there are any meaningful, categorical aspects of the data, you could nest them in a directory tree. I believe the slowdown is due to the number of files in one directory, not the sheer number of files itself.
The most obvious, general grouping is by date, and gives you a three-tiered nesting structure (year, month, day) with a relatively safe bound on the number of files in each leaf directory (1-3k).
Even if you are able to improve the filesystem/file browser performance, it sounds like this is a problem you will run into again in another 2 or 3 years... just looking at a list of 0.3-1 million files is going to incur a cost, so it may be better in the long term to find ways to only look at smaller subsets of the files.
Using tools like 'find' (under cygwin, or mingw) can make the presence of the subdirectory tree a non-issue when browsing files.
Rename the folder each day with a time stamp.
If the application is saving the files into C:\Readings, then set up a scheduled task to rename Readings at midnight and create a new empty folder.
Then you will get one folder for each day, each containing several thousand files.
You can extend the method further to group by month. For example, C:\Readings becomes C:\Archive\September\22.
You have to be careful with your timing to ensure you are not trying to rename the folder while the product is saving to it.
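A minimal sketch of the midnight step (the paths follow the answer; the month/day layout and the scheduling itself, via Task Scheduler, are assumptions):
import os
from datetime import date, timedelta

yesterday = date.today() - timedelta(days=1)
src = r"C:\Readings"
dst = rf"C:\Archive\{yesterday:%B}\{yesterday.day}"  # e.g. C:\Archive\September\22

os.makedirs(os.path.dirname(dst), exist_ok=True)
os.rename(src, dst)  # move the finished day's readings aside...
os.makedirs(src)     # ...and recreate an empty folder for the product to write into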
To create a folder structure that will scale to a large unknown number of files, I like the following system:
Split the filename into fixed-length pieces, and then create nested folders for each piece except the last.
The advantage of this system is that the depth of the folder structure only grows as deep as the length of the filename. So if your files are automatically generated in a numeric sequence, the structure is only as deep as it needs to be.
12.jpg -> 12.jpg
123.jpg -> 12\123.jpg
123456.jpg -> 12\34\123456.jpg
This approach does mean that folders contain files and sub-folders, but I think it's a reasonable trade off.
And here's a beautiful PowerShell one-liner to get you going!
$s = '123456'
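# Break $s into 2-character pieces, keep every piece except the last as a directory level, then re-append the full name: '123456' -> '12\34\123456'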
-join (( $s -replace '(..)(?!$)', '$1\' -replace '[^\\]*$','' ), $s )