Google Storage - Backup Bucket with Different Key For Each File - google-cloud-storage

I need to back up a bucket, in which every file is encrypted with a different key, to a different bucket on Google Storage.
I want to create a daily snapshot of the data so that if the data is ever deleted I can easily recover it.
My Research:
1. Using gsutil cp -r - does not work, because every file is encrypted with a different key.
2. Using the Google Cloud Storage Transfer Service - does not work on such buckets for the same reason.
3. Listing all the files in the bucket, fetching all the keys from the database, and copying each file one by one - this will probably be very expensive, because I have a lot of files and I want to do it daily.
4. Object versioning - does not cover the case where the bucket itself has been completely deleted.
Are there any other solutions for that problem?

Unfortunately, as you mentioned, the only option would indeed be to follow your number 3 choice. As you said, and as clarified in the official documentation, downloading customer-encrypted data is a restricted feature, so you won't be able to download/snapshot the data without fetching the keys and then copying the files.
Indeed, this will probably have a big impact on your quota and pricing, since you will be performing many operations every day, across many files, which affects several aspects of the pricing. However, this seems to be the only available way right now. In addition, I would recommend raising a Feature Request in Google's Issue Tracker, so they can look into the possibility of implementing this in the future.
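For reference, a rough sketch of what option 3 could look like with gsutil, which accepts customer-supplied keys via its encryption_key/decryption_key boto options. The bucket names are placeholders and lookup_key is a hypothetical helper that prints the base64 key for an object from your database:
gsutil ls 'gs://source-bucket/**' | while read -r object; do
  key="$(lookup_key "$object")"            # hypothetical helper: prints the base64 CSEK for this object
  gsutil -o "GSUtil:decryption_key1=$key" \
         -o "GSUtil:encryption_key=$key" \
         cp "$object" "gs://backup-bucket/${object#gs://source-bucket/}"
done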
Let me know if this clarified your doubts!

Related

What is the quickest way to delete many blobs from GCS?

I have a bucket containing many millions of blobs that I want to delete; however, I can't simply delete the bucket itself. This is the best method I have come up with to delete millions of blobs in the quickest time possible:
gsutil ls gs://bucket/path/to/dir/ | xargs gsutil -m rm -r
For what I want to do (which involves removing about 30 million blobs) it still takes many hours to run, partly, I guess, because it's at the mercy of the speed of my broadband connection.
Does anyone know of a quicker way of achieving this? I had rather hoped it would be an instantaneous operation, since in the backend the location could simply be marked as deleted - clearly not.
Google recommends using the console to do this:
The Cloud Console can bulk delete up to several million objects and does so in the background. The Cloud Console can also be used to bulk delete only those objects that share a common prefix, which appear as part of a folder when using the Cloud Console.
https://cloud.google.com/storage/docs/best-practices#deleting
That said (personal opinion here), using the console might be quicker, but you have no clue how far it has got. At least with the CLI option you do.
Another alternative is using lifecycle management to delete based on rules:
Delete objects in bulk
If you want to bulk delete a hundred thousand or more objects, avoid using gsutil, as the process takes a long time to complete. Instead, use the Google Cloud console, which can delete up to several million objects, or Object Lifecycle Management, which can delete any number of objects.
To bulk delete objects in your bucket using Object Lifecycle Management, set a lifecycle configuration rule on your bucket where the condition has Age set to 0 days, and the action is set to delete.
From: https://cloud.google.com/storage/docs/deleting-objects#delete-objects-in-bulk
However, this won't work if you're in a rush:
After you have added or edited a rule, it may take up to 24 hours to take effect.
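If the delay is acceptable, a minimal sketch of such a rule, using the JSON format that gsutil lifecycle set expects (the bucket name is a placeholder):
cat > delete-everything.json <<'EOF'
{"rule": [{"action": {"type": "Delete"}, "condition": {"age": 0}}]}
EOF
gsutil lifecycle set delete-everything.json gs://bucket
# To turn lifecycle management off again afterwards, set an empty configuration ({}) the same way.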

Is there a less costly option than gsutil rsync for backup to cloud storage?

We are currently using the new Coldline storage to back up files off site; the storage part is super cost-effective. We are using gsutil rsync once a day to make sure our Coldline storage is up to date.
The problem is that gsutil rsync creates a massive number of class A requests, which are quite expensive. In this case they would cost at least 5x the amount of the Coldline storage itself, making it no longer a good deal.
Are we going to have to write a custom solution to avoid these charges, is there a better option for this type of backup, or is there some way to get rsync to not generate so many requests?
I think there is one pricing trick you can use if the expense is from the storage.objects.list operation. From the GCS operations pricing page: "When an operation applies to a bucket, such as listing the objects in a bucket, the default storage class set for that bucket determines the operation cost." So the trick is:
Set the bucket's default storage class to STANDARD
On upload, set the object's storage class to ARCHIVE (using gsutil cp -s ARCHIVE, or by setting the appropriate option on the upload request when using the API).
As far as I understand: this now means you will be charged at the STANDARD rate for listing the bucket ($0.05/10k operations instead of $0.50/10k operations).
The one challenge is that gsutil rsync does not support the -s ARCHIVE flag that is supported by gsutil cp, so you won't be able to use it. You might want to look at other tools, like possibly rclone.
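A minimal sketch of the trick with gsutil (bucket and file names are placeholders); rclone appears to expose a similar per-object setting via a --gcs-storage-class flag, but verify that against its current documentation:
gsutil defstorageclass set STANDARD gs://backup-bucket    # listing operations are now billed at the STANDARD rate
gsutil cp -s ARCHIVE backup.tar gs://backup-bucket/       # the object itself is still stored as ARCHIVE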

Is it better to store 1 email/file in Google Cloud Storage or multiple emails in one large file?

I am trying to do analytics on emails for some users. To achieve this, I am trying to store the emails in Cloud Storage so I can run Hadoop jobs on them. (Earlier I tried the App Engine Datastore, but it was having a hard time scaling over that many users' data: I was hitting various resource limits, etc.)
Is it better to store one email per file in Cloud Storage, or all of a user's emails in one large file? In many examples about Cloud Storage, I see folks operating on large files, but it seems more logical to keep one file per email.
From a GCS scaling perspective there's no advantage to storing everything in one object vs many objects. However, listing the objects in a bucket is an eventually consistent operation. So, if your computation would proceed by first uploading (say) 1 million objects to a bucket, and then immediately starting a computation that lists the objects in the bucket and computing over their content, it's possible the listing would be incomplete. You could address that problem by maintaining a manifest of objects you upload and passing the manifest to the computation instead of having the computation list the objects in the bucket. Alternatively, if you load all the emails into a single file and upload it, you wouldn't need to perform a bucket listing operation.
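A rough sketch of the manifest idea, assuming one .eml file per email and placeholder local paths and bucket names:
gsutil -m cp emails/*.eml gs://email-bucket/emails/                       # upload in parallel
ls emails/*.eml | sed 's|^emails/|gs://email-bucket/emails/|' > manifest.txt   # record exactly what was uploaded
gsutil cp manifest.txt gs://email-bucket/manifest.txt                     # the computation reads this instead of listing the bucket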
If you plan to upload the data once and then run a variety of analytics computations (or rev a single computation and run it a number of times), uploading a large number of objects and depending on listing the bucket from your analytics computation would not be a problem, since the eventual consistency problem really only impacts you in the case where you list the bucket shortly after uploading.

Simple version-control systems or versioning file system or versioning database

I am looking for a simple versioning system for a large number of records or files (~50 million, ~100GB unpacked, ~20MB packed). The files are only a few Kilobytes each, and have unique IDs, so I don't mind whether they are stored in a flat structure (table, directory...) or not. On average, each record is changed once a month, but most changes have diffs less than a Kilobyte so it should be easy to compress versions. However, a naive database with one entry for each version would grow too quickly. I need the following operations:
basic CRUD operations: create, read, update, delete
quick listing of recent changes
quick listing of recent changes of a particular record
query for changes in a given period of time
query for changes by a given user (each edit is associated to some user id and optionally has a commit message as comment)
For write operations there must be a commit hook to validate and reject ill-formed records.
In short, I am looking for a Wiki-like software for simple records or files.
I thought about possible solutions:
Put files in a version control system. This gives me replication and many available access tools, so it is my preferred solution. But the amount of data is too large for distributed systems like git. Is anyone using Subversion for a similar task with success?
Implement my own versioning in a database or in a file system. I would probably need to store only compressed records and diffs; it would be more work, but I would learn something. This would be my preferred solution if it were just for fun.
Use a versioning file system. This would make setup, replication and access more difficult. Probably I would need to implement my own access API above the file system.
Use a versioning database system. Can you suggest some?
Use some other existing data store with versioning (MediaWiki?, Amazon Cloud Drive?, ...)
Obviously there are many paths. Which paths have been used by others with success for similar or larger amounts of data?
If you're not averse to having a raw copy of each file on your client (which I imagine is OK, if you're considering svn) then git is probably quite a good solution to your problem. The underlying repository storage will use binary diffs between files as well as between versions, so you should have close to optimal compression there.
With a bare repo and some scripting, you may even be able to get away with not having the current revision checked out: objects are available from the command line and you can create new commits without a checkout.
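For illustration, a rough sketch of that plumbing-only workflow against a bare repository; the paths, branch name, and record ID are placeholders, and it assumes one file per record:
cd /srv/records.git                                    # bare repository, no working tree
blob=$(git hash-object -w /tmp/record-12345)           # store the new file content as a blob
git read-tree master                                   # load the current tree into the repo's index
git update-index --add --cacheinfo 100644 "$blob" records/12345
tree=$(git write-tree)
commit=$(git commit-tree "$tree" -p master -m "update record 12345 by user 42")
git update-ref refs/heads/master "$commit"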

Storing millions of log files - Approx 25 TB a year

As part of my work we get approx. 25 TB worth of log files annually; currently they are saved on an NFS-based filesystem. Some are archived as zipped/tar.gz files, while others reside in pure text format.
I am looking for alternatives to an NFS-based system. I have looked at MongoDB and CouchDB. The fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be converted to JSON to be stored in the DB, which is something I am not willing to do. I need to retain the log file content as is.
As for usage, we intend to put a small REST API in front and allow people to get file listings, the latest files, and the ability to fetch a file.
The proposed solutions/ideas need to be some form of distributed database or filesystem at the application level, where one can store log files and scale horizontally and effectively by adding more machines.
Ankur
Since you don't want querying features, you can use Apache Hadoop.
I believe HDFS and HBase will be a nice fit for this.
You can see a lot of huge-storage stories on Hadoop's PoweredBy page.
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad-core HP ProLiant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.
Have you tried looking at gluster? It is scalable, provides replication and many other features. It also gives you standard file operations so no need to implement another API layer.
http://www.gluster.org/
I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern is going to be a linear scan. One problem that you will run into is retention. Most of the "NoSQL" storage systems use logical deletes, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them - your index will be very large.
Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
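A minimal sketch of loading the existing files into HDFS unchanged (the NFS path and target directory are placeholders; the 64 MB block size is simply the HDFS default):
hdfs dfs -mkdir -p /logs/2012
hdfs dfs -put /mnt/nfs/logs/2012/*.tar.gz /logs/2012/   # files go in as-is, no format conversion
hdfs dfs -setrep -w 3 /logs/2012                        # 3-way replication, as suggested above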
If you are to choose a document database:
With CouchDB you can use the _attachment API to attach the file as is to a document; the document itself could contain only metadata (like timestamp, locality, etc.) for indexing. You would then have a REST API for the documents and the attachments.
A similar approach is possible with Mongo's GridFS, but there you would build the API yourself.
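A rough sketch of the CouchDB route using its standalone attachment API; the database name, document ID, and file name are placeholders:
curl -X PUT http://localhost:5984/logs                                     # create the database
curl -X PUT http://localhost:5984/logs/web1-2012-06-01 -d '{"host": "web1", "date": "2012-06-01"}'   # metadata-only document
# Use the _rev value returned by the previous call in place of <rev>.
curl -X PUT "http://localhost:5984/logs/web1-2012-06-01/raw.log?rev=<rev>" \
     -H "Content-Type: text/plain" --data-binary @web1-2012-06-01.log      # attach the raw log file as is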
Also HDFS is a very nice choice.