Google nearline pricing on overwrites - google-cloud-storage

I have Google Nearline storage set up and working fine via gcloud/gsutil.
So far I have been using gsutil rsync to back some databases up, e.g.:
gsutil rsync -d -R /sourcedir/db_dir gs://backup_bucket/
Currently the files are date-stamped in the filename, so we get a different filename every day.
I've just spotted the mention of early deletion charges (currently on trial).
I'm assuming that whenever I delete a file with -d, I will be charged for that file for up to 30 days? If so, there's no point deleting it before then (but I'll get charged either way).
But if I keep the filename the same and overwrite the file with the latest day's backup, the documentation says:
"if you create an object in a bucket configured for Nearline, and 10 days later you overwrite it, the object is considered an early deletion and you will be charged for the remaining 20 days of storage."
So I'm a bit unclear: if I have a file and overwrite it with a new version, am I then charged again for each file/day every time it's updated, as well as for the new file?
For example: for one file, backed up daily via rsync (assuming the same filename this time), over 30 days...
day1 myfile is created
day2 myfile is updated
day3 myfile is updated
... and so on
Am I now being charged (filespace_day1 * 30 days) + (filespace_day2 * 29 days) + (filespace_day3 * 28 days) and so on... just for the one file (rather than filespace * 30 days)?
Or does it just mean that if I create a 10 GB file and overwrite it with a 2 MB file, I will be charged for the 10 GB for the full 30 days (and the 2 MB file's costs are ignored)?
If so, are there any best practices for rsync and keeping charges down?

Overwriting an object in GCS is equivalent to deleting the old object and inserting a new object in its place. You are correct that overwriting an object does incur the early delete charge, and so if you were to overwrite the same file every day, you would be charged for 30 days of storage every day.
Nearline storage is primarily meant for objects that will be retained for a long time and infrequently read or modified, and it's priced accordingly. If you want to modify an object on a daily basis, Standard or Durable Reduced Availability storage would likely be a cheaper option.
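To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The per-GB monthly rates are placeholder assumptions to be replaced with the figures from the current pricing page, and it looks at storage charges only:

# Back-of-the-envelope comparison for one backup file that is
# overwritten every day for a month (storage charges only).
# The rates are placeholders - substitute the current prices
# from the GCS pricing page.
NEARLINE_RATE = 0.010   # $/GB/month, placeholder
STANDARD_RATE = 0.026   # $/GB/month, placeholder

file_gb = 10.0           # size of the backup in GB
overwrites_per_month = 30

# Nearline: each overwrite is an early delete, so every daily version
# is billed for the full 30-day minimum, i.e. a whole month of storage.
nearline_cost = overwrites_per_month * file_gb * NEARLINE_RATE

# Standard: no minimum duration, so only the single current copy
# is billed for the month.
standard_cost = file_gb * STANDARD_RATE

print(f"Nearline, overwritten daily: ${nearline_cost:.2f}/month")
print(f"Standard, overwritten daily: ${standard_cost:.2f}/month")

In other words, with daily overwrites you end up paying for roughly 30 full copies at the Nearline rate, versus a single copy at the Standard rate.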

Related

How to calculate filtered file size in google cloud storage

I have a folder hierarchy Bucket/folder/year/month/date/files.ext, e.g. 2021/12/31/abc.html and 2022/1/1/file1.html, etc. The folder contains millions of HTML files and images. I only want to calculate the sum of sizes, filtered by the .html extension only, for each month and date, with the year running from 2019 to 2022.
Right now this is what I'm using:
gsutil du gs://Bucket/folder/*/*/*/*.html | wc -l
I couldn't find any better solution; it is taking too long and gives "The connection to your Google Cloud Shell was lost." And the second thing is that I want to delete all the HTML files like 2019/1/1/file1.html.
Unfortunately, I think you're already looking at the right answer. GCS doesn't provide any sort of index that'll quickly calculate total file size by file type.
Cloud Shell will time out after some minutes of inactivity, or after 24 total hours, so if you have millions of files and need this to complete, I would suggest starting a small GCE instance and running the command from there, or running gsutil from your own machine.
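If gsutil keeps timing out, one alternative is to page through the listing yourself and aggregate sizes per year/month/day. The sketch below uses the google-cloud-storage Python client; the bucket and prefix names are placeholders, and with millions of objects it will still take a while to run:

from collections import defaultdict
from google.cloud import storage  # pip install google-cloud-storage

# Placeholder names - substitute your own bucket and top-level folder.
BUCKET = "Bucket"
PREFIX = "folder/"

client = storage.Client()
totals = defaultdict(int)  # (year, month, day) -> bytes

for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    if not blob.name.endswith(".html"):
        continue
    # Expected layout: folder/year/month/date/file.html
    parts = blob.name.split("/")
    if len(parts) >= 5:
        year, month, day = parts[1], parts[2], parts[3]
        totals[(year, month, day)] += blob.size

for (year, month, day), size in sorted(totals.items()):
    print(f"{year}/{month}/{day}: {size / 1024**2:.1f} MiB of .html")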

Can someone give me a rough guideline for how long it will take to delete a Nearline storage bucket?

4 million JPG files, approximately 30 TB in size. I deleted it via the web interface, and it has been saying "Deleting 1 bucket" for an hour now.
I'm just after someone's experience for a rough estimate of how long this operation will take - another hour? A day? A week?!
Region: europe-west1, if that makes a difference.
Thank you!
According to this documentation on the deletion request timeline, step 2 says that:
Once the deletion request is made, data is typically marked for deletion immediately and our goal is to perform this step within a maximum period of 24 hours.
A couple of points that should also be considered:
This timeline will vary depending on the number of files, so your case might take longer than that.
If your files are organized in different folders, it will take longer to delete them, since the system has to go into each directory to delete its contents.
One thing that you could do to speed up the deletion process is to use this command for parallel deletion:
gsutil -m rm -r gs://bucket
NOTE: I don't think the fact that your storage class is Nearline has any effect on the deletion timeline, but I could not find any confirmation of that in the documentation.
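If the console delete still looks stuck after that window, another option is to drive the deletion yourself. The sketch below uses the google-cloud-storage Python client; the bucket name is a placeholder, and with 4 million objects you should still expect it to run for quite a while:

from google.cloud import storage  # pip install google-cloud-storage

BUCKET = "my-nearline-bucket"  # placeholder bucket name

client = storage.Client()
bucket = client.bucket(BUCKET)

# Work through the listing one page at a time (up to 1,000 objects per
# page) so we never hold the whole multi-million-object listing in memory.
deleted = 0
for page in client.list_blobs(BUCKET).pages:
    blobs = list(page)
    bucket.delete_blobs(blobs)
    deleted += len(blobs)
    print(f"deleted {deleted} objects so far")

bucket.delete()  # finally remove the now-empty bucket itself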

GRIB files with incremental updates

Folks,
I am new to dealing with the GRIB format and seek your advice on the following question:
We have an application where we plan to receive data at 6-hour intervals. The forecast will be for the next 10 to 15 days.
There is a requirement that, to reduce the download size, the system should only download incremental changes, meaning the new GRIB files will only contain data which has changed.
So all the previously downloaded GRIB files should still display data, and for the parts where there was a change (assuming clients will know), the client will download and display the GRIB file which has this incremental update.
Is this kind of incremental change to GRIB supported by the standard?
I suspect this option is not supported by GRIB files. As the data in GRIB files is packed, you cannot know which variables have changed and which have not.
In addition, most of the parameters will most likely show only a slight, insignificant change between forecasts (I mean the forecast for, say, 07:00 made at 00:00 and the one made at 06:00 will differ for most of the parameters, but the differences can be on the order of 10^-x, meaning they are insignificant). Some parameters or regions might of course have larger differences that you would like to highlight.
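If you control the producing side, one pragmatic workaround is to compare consecutive runs yourself and only ship the fields that changed beyond some tolerance. The sketch below uses the third-party pygrib package; the file names and threshold are placeholder assumptions, and since GRIB itself has no notion of a delta, the bookkeeping for merging the shipped fields back into the client's cached copy would live entirely in your application:

import numpy as np
import pygrib  # pip install pygrib (third-party GRIB reader)

OLD_RUN = "run_00z.grib2"   # placeholder file names
NEW_RUN = "run_06z.grib2"
TOLERANCE = 1e-3            # placeholder threshold for "insignificant"

# Index the previous run by (parameter, level, valid time).
old = {}
for msg in pygrib.open(OLD_RUN):
    old[(msg.name, msg.level, msg.validDate)] = msg.values

# A field from the new run is worth re-sending only if it is new, or
# if some grid point moved by more than the tolerance.
changed = []
for msg in pygrib.open(NEW_RUN):
    key = (msg.name, msg.level, msg.validDate)
    prev = old.get(key)
    if prev is None or np.max(np.abs(msg.values - prev)) > TOLERANCE:
        changed.append(key)

print(f"{len(changed)} fields changed enough to re-send")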

DB2 AS400/IBM ISeries Triggers/On File Change

Looking for best practices to get DELTAs of data over time.
No timestamps available, cannot program timestamps!
GOAL: To get differences in all files for all fields over time. I only need the primary key as output. Also, I need this at 15-minute intervals of data changes.
Example:
The Customer file has 50 columns/fields; if any field changes, I want another file to record the primary key, or anything to record the occurrence of a change in the Customer file.
Issue:
I am not sure if triggers are the way to go since there is a lot of overhead associated with triggers.
Can anyone suggest best practices for DB2 deltas over time with consideration to overhead and performance?
I'm not sure why you think there is a lot of overhead associated with triggers; they are very fast in my experience. But as David suggested, you can journal the files you want to track and then analyze the journal receivers.
To turn on Journaling you need to perform three steps:
Create a receiver using CRTJRNRCV
Create a journal for the receiver using CRTJRN
Start journaling on the files using STRJRNPF. You will need to keep *BEFORE and *AFTER images to detect a change on update, but you can omit open/close (*OPNCLO) entries to save some space.
Once you do this, you can also use commitment control to manage transactions! But you will now have to manage those receivers, as they use a lot of space. You can do that by using MNGRCV(*SYSTEM) on the CRTJRN command. I suspect that you will want to prevent the system from deleting the old receivers automatically, as that could cause you to miss some changes when the system changes receivers. But that means you will have to delete old receivers on your own when you are done with them. I suggest waiting a day or two to delete old receivers; that can be an overnight process.
To read the journal receiver, you will need to use RTVJRNE (Retrieve Journal Entries), which lets you retrieve journal entries into variables, or DSPJRN (Display Journal), which lets you send journal entries to the display, a printer file, or an *OUTFILE. The *OUTFILE can then be read using ODBC, SQL, or however you want to process it. You can filter the journal entries that you want to receive by file and by type.
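As a rough illustration of that last step (a sketch only: the DSN, library, outfile name, and the assumption that the primary key sits in the first 10 bytes of the record image are all placeholders for your environment), the *OUTFILE could be read over ODBC like this:

import pyodbc  # pip install pyodbc; needs the IBM i Access ODBC driver

# Assumptions (adjust for your system): DSPJRN was run with
# OUTPUT(*OUTFILE) into MYLIB/JRNOUT, and the Customer file's primary
# key occupies the first 10 bytes of the record image in JOESD.
conn = pyodbc.connect("DSN=MYISERIES;UID=myuser;PWD=mypassword")
cur = conn.cursor()

# JOENTT is the journal entry type: PT/PX = insert, UP = update
# after-image, DL = delete. JOESD holds the record image itself.
cur.execute("""
    SELECT DISTINCT SUBSTR(JOESD, 1, 10) AS CUST_KEY
    FROM MYLIB.JRNOUT
    WHERE JOENTT IN ('PT', 'PX', 'UP', 'DL')
""")

for (cust_key,) in cur.fetchall():
    print(cust_key.strip())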
Have you looked at journalling the files and evaluating the journal receivers?

Is it possible to take incremental back up of Database in Openerp?

I am able to take a daily backup of the PostgreSQL database in OpenERP by using a cron job. Every day the database dump comes to around 50 MB. Taking it daily means another 50 MB each day, which will consume a large amount of hard disk space, and I don't want that to happen. I want to take an incremental database backup every day. Can anyone help me? Thanks in advance.
Check the format of the backups to ensure OpenERP is compressing them. If not, you can do it manually or use pg_dump or pgadmin3 to do the backups.
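For the manual route, here is a minimal, cron-able sketch that writes a compressed custom-format dump with pg_dump; the database name, user, and output path are placeholders:

#!/usr/bin/env python3
"""Nightly compressed dump of the OpenERP database (all names are placeholders)."""
import datetime
import subprocess

DB_NAME = "openerp_prod"             # placeholder database name
DB_USER = "openerp"                  # placeholder role
BACKUP_DIR = "/var/backups/openerp"  # placeholder output directory

stamp = datetime.date.today().isoformat()
outfile = f"{BACKUP_DIR}/{DB_NAME}-{stamp}.dump"

# -Fc writes pg_dump's custom format, which is compressed by default
# and can be restored selectively with pg_restore.
subprocess.run(
    ["pg_dump", "-Fc", "-U", DB_USER, "-h", "localhost", "-f", outfile, DB_NAME],
    check=True,
)
print(f"wrote {outfile}")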
If you want point-in-time backups, then you will need to get tricky with Postgres, using checkpoints and log shipping, and I would think carefully before going down that road.
The other thing to note is that OpenERP stores attachments in the database, so if you have a lot of attached documents or emails with attachments, these will be in the ir_attachments table and cause your backups to grow quickly.