Google Cloud cloud storage operation costs - google-cloud-storage

I am looking into using Google Cloud cloud storage buckets as a cheaper alternative to compute engine snapshots to store backups.
However, I am a bit confused about the costs per operation, specifically the insert operation. If I understand the documentation correctly, it doesn't matter how large the file you insert is; it always counts as 1 operation.
So if I upload a single 20 TB file using one insert to a standard storage class bucket, wait 14 days, then retrieve it again, and all this within the same region, I practically only pay for storing it for 14 days?
Doesn't that mean that even the standard storage class bucket is a more cost effective option for storing backups compared to snapshots, as long as you can get your whole thing into a single file?

That's not fully accurate, and it all depends on what the costs are for you.
First of all, the maximum size of an object in Cloud Storage is 5 TiB, so you can't store a single 20 TB file; you'd need at least 4 objects. In the end, though, the principle is the same.
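To make the arithmetic concrete, a quick sketch (the 5 TiB limit is per object; the 20 TB figure is taken from the question):

```python
import math

# Cloud Storage caps a single object at 5 TiB, so a 20 TB backup
# must be split into several objects before upload.
TB = 10**12   # decimal terabyte (the unit of the 20 TB figure)
TiB = 2**40   # binary tebibyte (the unit of the 5 TiB object limit)

backup_bytes = 20 * TB
object_limit = 5 * TiB

num_objects = math.ceil(backup_bytes / object_limit)
print(num_objects)  # -> 4
```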
Persistent disk snapshots are a very powerful feature:
A snapshot doesn't consume your VM's CPU, unlike your solution.
A snapshot doesn't consume your VM's network bandwidth, unlike your solution.
A snapshot can be taken at any time, on the fly.
A snapshot can be restored to the current VM, or you can create a new VM from a snapshot, for example to investigate an issue.
You can take incremental snapshots, which saves money (cheaper than a full image snapshot).
You don't need additional space on your persistent disk (compared to your solution, where you need to create an archive before sending it to Cloud Storage).
In your scenario, snapshots seem like the best solution in terms of time efficiency. Now, is using Cloud Storage a cheaper solution? Probably, as it is listed as the most affordable storage option, but in the end you will have to calculate the cost-benefit yourself.
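As a starting point for that calculation, here is a rough sketch. The per-GB prices below are placeholders, not current GCP list prices, and the sketch ignores snapshots' incremental savings and any operation or retrieval fees:

```python
# All prices below are illustrative assumptions; check the GCP pricing
# pages for real numbers before deciding.
size_gb = 20_000                     # the 20 TB backup, in GB
days_stored = 14

standard_price = 0.020               # assumed $/GB/month, standard storage
snapshot_price = 0.026               # assumed $/GB/month, snapshot storage

def storage_cost(price_per_gb_month: float) -> float:
    # Pro-rate one month (30 days) down to the 14-day retention window.
    return size_gb * price_per_gb_month * days_stored / 30

print(round(storage_cost(standard_price), 2))   # -> 186.67
print(round(storage_cost(snapshot_price), 2))   # -> 242.67
```

With real prices plugged in, the same two lines give you the raw storage side of the comparison; the snapshot side shrinks further once incremental snapshots are taken into account.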

Related

Is the storage and compute decoupled in modern cloud data warehouses?

In Redshift, Snowflake, and Azure SQL DW, do we have storage and compute decoupled?
If they are decoupled, is there any use of "External Tables" still or they are gone?
When compute and storage were tightly coupled, and we wanted to scale, we scaled both compute and storage. But under the hood, was it a virtual machine, and we scaled the compute and the VM's disks? Do you have any readings on this?
Massive thanks, I am confused now and it would be a blessing if someone could jump in to explain!
You have reason to be confused as there is a heavy layer of marketing being applied in a lot of places. Let's start with some facts:
All databases need local disk to operate. This disk can store permanent versions of tables (classic locally stored tables) and is needed to store the local working set of data for the database to operate. Even in cases where no tables are permanently stored on local disk, the size of the local disks is critical, as it allows data fetched from remote storage to be worked on and cached.
Remote storage of permanent tables comes in 2 "flavors" - defined external tables and transparent remote tables. While there are lots of differences in how these flavors work and how each different database optimizes them they all store the permanent version of the table on disks that are remote from the database compute system(s).
Remote permanent storage comes with pros and cons. "Decoupling" is the most often cited advantage of remote permanent storage: it simply means that you cannot fill up the local disks with "cold" data, as only "in use" data is stored locally. To be clear, you can still fill up (or brown out) the local disks even with remote permanent storage if the working set of data is too large. The downside of remote permanent storage is that the data is remote. Being across a network from some flexible storage solution means that getting to the data takes more time (each database system has its own methods to hide this in as many cases as possible). It also means that coherency control for the data happens (in some aspect) across the network, which comes with its own impacts.
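The working-set point can be pictured with a toy sketch: local disk acting as a bounded cache in front of remote permanent storage. The names and the LRU policy here are illustrative, not any specific database's design:

```python
from collections import OrderedDict

class LocalDiskCache:
    """Toy model: local disk holds only the working set of remote blocks."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()            # block_id -> data, LRU order

    def read(self, block_id, fetch_remote):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # hit: refresh recency
            return self.blocks[block_id]
        data = fetch_remote(block_id)          # miss: slow network round trip
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the coldest block
        return data

cache = LocalDiskCache(capacity_blocks=2)
fetches = []
remote = lambda b: fetches.append(b) or f"data-{b}"
for b in ["a", "b", "a", "c", "b"]:
    cache.read(b, remote)
print(fetches)  # blocks that actually hit the network: ['a', 'b', 'c', 'b']
```

If the working set ("a", "b", "c" together) exceeds the local capacity, every access risks a network round trip, which is exactly the brown-out scenario described above.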
External tables and transparent remote tables are both permanently stored remotely but there are differences. An external table isn't under the same coherency structure that a fully-owned table is under (whether local or remote). Transparent remote just implies that the database is working with the remote table "as if" it is locally owned.
VMs don't change the local disk situation. A portion of the box's physical disks is allocated to each VM. The disks are still local; it's just that only a portion of them is addressable by any one VM.
So leaving fact and moving to opinion. While marketing will tell you why one type of database storage is better than the other in all cases this just isn't true. Each has advantages and disadvantages and which is best for you will depend on what your needs are. The database providers that offer only one data organization will tell you that this is the best option, and it is for some.
Local table storage will always be faster for applications where speed of access to data is critical and caching doesn't work. However, this means that DBAs will need to do the work to ensure the on-disk data is optimized and fits in the available local storage (for the compute size needed). This is real work and takes time and energy. What you gain in moving remote is the reduction of this work, but it comes at the cost of some combination of database cost, hardware cost, and/or performance. Sometimes the tradeoff is worth it, sometimes not.
When it comes to the concept of separating (or de-coupling) Cloud Compute vs. Cloud Storage, the concepts can become a little confusing. In short, true decoupling generally requires object level storage vs. faster traditional block storage (traditionally on-premises and also called local storage). The main reason for this is that object storage is flat, without a hierarchy and therefore scales linearly with the amount of data you add. It therefore winds up also being cheaper as it is extremely distributed, redundant, and easily re-distributed and duplicated.
This is all important because, in order to decouple storage from compute in the cloud or any large distributed computing paradigm, you need to shard (split) your data (storage) amongst your compute nodes. As your storage grows, object storage, being flat, allows that to happen without any penalty in performance, while you can (practically) instantly "remaster" your compute nodes so that they evenly distribute the workload again as you scale compute up or down, or to withstand network/node failures.
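A minimal sketch of that sharding idea, assuming simple hash-modulo placement (real systems typically use consistent hashing or range partitioning to limit how many keys move on a remaster):

```python
import hashlib

def shard_for(key: str, num_nodes: int) -> int:
    # Deterministic placement: hash the object key, mod the node count.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"object-{i}" for i in range(8)]

before = {k: shard_for(k, 3) for k in keys}   # 3 compute nodes
after = {k: shard_for(k, 4) for k in keys}    # "remastered" to 4 nodes

# The objects themselves never move in the flat object store;
# only the key -> node mapping is recomputed.
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys remapped")
```

The key property is that remastering touches only the mapping, not the data: the object store stays flat and untouched while compute nodes pick up new slices of it.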

Cloud SQL disk size is much larger than actual database

Cloud SQL reports that I've used ~4TB of SSD storage, but my database is only ~225 GB. What explains this discrepancy? Is there something I can delete to free up space? If I moved it to a different instance, would the required storage go down?
There are a couple of possible reasons why your Cloud SQL storage has increased:
- Did you enable point-in-time recovery? PITR uses write-ahead logs, and if you enabled this feature, that could be the reason for the increase.
- Have you used temporary tables without deleting them?
If none of the above applies to you, I highly recommend opening a case with the GCP support team so that they can take a look at your Cloud SQL instance.
On the other hand, you can open a case to have the disk decreased to a smaller size, so it won't be necessary to create a new instance and copy all the data over; the shrinking is done on Google's end, keeping the effort on your side as low as possible.
A maintenance window can be scheduled for Google to carry out this task, and you may want to pick one that minimizes the impact of the downtime. For this, it is necessary to know the new disk size and when you would like the operation performed.
Finally, if you prefer the migration method, you would export the DB, create the new instance, import the DB, and synchronize the old instance with the new one so both hold all the data; those four steps can take several hours to complete.
You do not specify what kind of database. In my case, for a MySQL database, several hundred GB were taken up by binary logs (controlled by a MySQL flag).
You could check with:
SHOW BINARY LOGS;
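To gauge how much disk the binary logs occupy, you can total the File_size column from that statement's output. A small sketch, assuming the output was saved as tab-separated text (the file names and sizes here are made up):

```python
# Hypothetical tab-separated dump of `SHOW BINARY LOGS` output.
sample = """Log_name\tFile_size
binlog.000001\t52428800
binlog.000002\t104857600"""

def total_binlog_bytes(show_output: str) -> int:
    rows = show_output.strip().splitlines()[1:]   # skip the header row
    return sum(int(row.split("\t")[1]) for row in rows)

print(total_binlog_bytes(sample) / 2**20, "MiB")  # -> 150.0 MiB
```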

Is MongoDB a good choice for storing a huge set of text files?

I'm currently building a system (with GCP) for storing large set of text files of different sizes (1kb~100mb) about different subjects. One fileset could be more than 10GB.
For example:
dataset_about_some_subject/
- file1.txt
- file2.txt
...
dataset_about_another_subject/
- file1.txt
- file2.txt
...
The files are for NLP and, since the pre-processed data is saved separately, they will not be accessed frequently after pre-processing. So saving all the files in MongoDB seems unnecessary.
I'm considering:
saving all files into some cloud storage,
saving file information like name and path to MongoDB as JSON.
The above folders turn to:
{
    name: dataset_about_some_subject,
    path: path_to_cloud_storage,
    files: [
        {
            name: file1.txt
            ...
        },
        ...
    ]
}
When any fileset is needed, search its name in MongoDB and read the files from cloud storage.
Is this a valid way? Will there be any I/O speed problem?
Or is there any better solution for this?
And I've read about Hadoop. Maybe this is a better solution?
Or maybe not. My data is not that big.
As far as I remember, MongoDB has a maximum document size of 16 MB, which is below the maximum size of your files (100 MB). This means that, unless you split them, storing the original files as plaintext JSON strings would not work.
The approach you describe, however, is sensible. Storing the files on cloud storage such as S3 or Azure is common, not very expensive, and does not require a lot of maintenance compared to having your own HDFS cluster. I/O will be best if you perform the computations on machines of the same provider and make sure the machines are in the same region as the data.
Note that document stores, in general, are very good at handling large collections of small documents. Retrieving file metadata in the collection would thus be most efficient if you store the metadata of each file in a separate object (rather than in an array of objects in the same document), and have a corresponding index for fast lookup.
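A toy illustration of that layout, with plain dictionaries standing in for MongoDB documents (the field names are illustrative, not a prescribed schema):

```python
# One metadata document per file; a compound key (dataset, name) plays
# the role of a MongoDB compound index for fast lookups.
per_file_docs = {
    ("dataset_about_some_subject", "file1.txt"):
        {"path": "gs://bucket/dataset_about_some_subject/file1.txt", "size": 1024},
    ("dataset_about_some_subject", "file2.txt"):
        {"path": "gs://bucket/dataset_about_some_subject/file2.txt", "size": 2048},
}

def find_file(dataset: str, name: str):
    # Equivalent to an indexed find_one() instead of scanning an array
    # embedded inside a single large dataset document.
    return per_file_docs.get((dataset, name))

print(find_file("dataset_about_some_subject", "file2.txt")["size"])  # -> 2048
```

With one document per file, a lookup touches exactly one small document; with the array-in-one-document design from the question, the server must load and scan the whole (potentially large) dataset document.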
Finally, there is another aspect to consider, namely, whether your NLP scenario will process the files by scanning them (reading them all entirely) or whether you need random access or lookup (for example, a certain word). In the first case, which is throughput-driven, cloud storage is a very good option. In the latter case, which is latency-driven, there are document stores like Elasticsearch that offer good fulltext search functionality and can index text out of the box.
I recommend storing large files using one of the storage services below. They also support multi-regional access through a CDN to ensure fast file access.
AWS S3: https://aws.amazon.com/tw/s3/
Azure Blob: https://azure.microsoft.com/zh-tw/pricing/details/storage/blobs/
GCP Cloud Storage: https://cloud.google.com/storage
You can rest assured that, for the metadata storage you propose in MongoDB, speed will not be a problem.
However, for storing the files themselves, you have various options to consider:
Cloud storage: fast setup, low initial cost, medium cost over time (compare vendor prices), data transfer over the public network for every access (might be a performance problem)
MongoDB GridFS: already in place, operation cost varies, data transfer is just as fast as from Mongo itself
Hadoop cluster: high initial hardware and setup cost, lower cost over time. Data transfer over the local network (provided you build it on-premise). Specialized administration skills needed. Possibility to use the cluster for parallel calculations (i.e. this is not only storage, it is also computing power). (As a rule of thumb: if you are not going to store more than 500 TB, this is not worthwhile.)
If you are not sure about the amount of data you will handle, and just want to get started, I recommend starting out with GridFS, but encapsulate it in a way that lets you easily swap out the storage backend.
I have another answer: as you say, 10GB is really not big at all. You may want to also consider the option of storing it on your local computer (or locally on one single machine in the cloud), simply on your regular file system, and executing in parallel on your cores (Hadoop, Spark will do this too).
One way of doing it is to save the metadata as a single large text file (or JSON Lines, Parquet, CSV...), the metadata for each file on a separate line, then have Hadoop or Spark parallelize over this metadata file, and thus process the actual files in parallel.
Depending on your use case, this might turn out to be faster than on a cluster, or not exceedingly slower, especially if your execution is CPU-heavy. A cluster has clear benefits when the problem is that you cannot read from the disk fast enough, and for workloads executed occasionally, this is a problem that one starts having from the TB range.
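A small sketch of that single-machine approach, using the standard library's thread pool in place of Hadoop/Spark (the file names and the processing step are made up):

```python
import json
from concurrent.futures import ThreadPoolExecutor

# One metadata record per line (JSON Lines); each points at a real file.
metadata_lines = [
    '{"name": "file1.txt", "path": "/data/file1.txt"}',
    '{"name": "file2.txt", "path": "/data/file2.txt"}',
]

def process(line: str) -> str:
    meta = json.loads(line)
    # A real job would open meta["path"] and run the NLP step here.
    return meta["name"].upper()

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, metadata_lines))
print(results)  # -> ['FILE1.TXT', 'FILE2.TXT']
```

Swapping the pool for a Spark RDD over the same metadata file is a mechanical change; the per-file processing function stays the same.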
I recommend this excellent paper by Frank McSherry:
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

Data Factory Copy Activity Blob -> ADLS

I have files that accumulate in Blob Storage on Azure and are moved each hour to ADLS with Data Factory. There are around 1000 files per hour, each 10 to 60 KB.
what is the best combination of:
"parallelCopies": ?
"cloudDataMovementUnits": ?
and also,
"concurrency": ?
to use?
Currently I have all of these set to 10, and each hourly slice takes around 5 minutes, which seems slow?
Could ADLS or Blob be getting throttled? How can I tell?
There won't be a one-size-fits-all solution when it comes to optimizing a copy activity. However, there are a few things you can check to find a balance. A lot depends on the pricing tiers, the type of data being copied, and the type of source and sink.
I am pretty sure you will have come across this article:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance
This is a reference performance sheet; the values will definitely differ depending on the pricing tiers of your source and destination.
Parallel Copy :
This happens at the file level, so it is beneficial if your source files are big, as the data is chunked (from the article):
Copy data between file-based stores Between 1 and 32. Depends on the size of the files and the number of cloud data movement units (DMUs) used to copy data between two cloud data stores, or the physical configuration of the Self-hosted Integration Runtime machine.
The default value is 4.
The behavior of the copy is important: if it is set to mergeFile, then parallel copy is not used.
Concurrency :
This is simply how many instances of the same activity you can run in parallel.
Other considerations :
Compression :
Codec
Level
Bottom line: you can pick and choose the compression. Faster compression will increase network traffic, slower compression will increase the time consumed.
Region :
The location or region of the data factory, source, and destination might affect performance and especially the cost of the operation. Having them all in the same region might not be feasible depending on your business requirements, but it is definitely something you can explore.
Specific to Blobs
https://learn.microsoft.com/en-us/azure/storage/common/storage-performance-checklist#blobs
This article gives you a good number of metrics to improve performance; however, when using Data Factory, I don't think there is much you can do at this level. You can use application monitoring to check throughput while your copy is running.
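As a quick sanity check on whether raw bandwidth is even the bottleneck in the scenario above (the 35 KB average is an assumed midpoint of the 10-60 KB range):

```python
files_per_slice = 1000
avg_kb = 35                     # assumed midpoint of the 10-60 KB range
slice_seconds = 5 * 60          # the observed ~5 minutes per hourly slice

total_mb = files_per_slice * avg_kb / 1024
print(round(total_mb, 1), "MB moved per slice")
print(round(total_mb / slice_seconds, 3), "MB/s effective throughput")
```

An effective rate this low suggests per-file overhead (one copy operation per tiny file) dominates, rather than network throughput or store throttling, which is why batching small files tends to help more than raising parallelism.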

Is it better to store 1 email/file in Google Cloud Storage or multiple emails in one large file?

I am trying to do analytics on emails for some users. To achieve this, I am trying to store the emails in Cloud Storage so I can run Hadoop jobs on them. (Earlier I tried App Engine Datastore, but it had a hard time scaling over that many users' data: hitting various resource limits, etc.)
Is it better to store 1 email/file in Cloud Storage or all of a user's emails in one large file? In many examples about cloud storage, I see folks operating on large files, but it seems more logical to keep 1 file/email.
From a GCS scaling perspective there's no advantage to storing everything in one object vs many objects. However, listing the objects in a bucket is an eventually consistent operation. So, if your computation would proceed by first uploading (say) 1 million objects to a bucket, and then immediately starting a computation that lists the objects in the bucket and computing over their content, it's possible the listing would be incomplete. You could address that problem by maintaining a manifest of objects you upload and passing the manifest to the computation instead of having the computation list the objects in the bucket. Alternatively, if you load all the emails into a single file and upload it, you wouldn't need to perform a bucket listing operation.
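A minimal sketch of that manifest idea (the object names are made up; in practice the manifest would itself be written to GCS after the uploads finish):

```python
# Record each object name as it is uploaded, then hand the manifest to
# the computation instead of having it list the bucket.
manifest = []

def record_upload(object_name: str) -> None:
    manifest.append(object_name)

for i in range(3):
    record_upload(f"emails/user1/msg-{i}.eml")

print(manifest)
# -> ['emails/user1/msg-0.eml', 'emails/user1/msg-1.eml', 'emails/user1/msg-2.eml']
```

The computation then iterates the manifest's exact object names, so an incomplete bucket listing can never silently drop inputs.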
If you plan to upload the data once and then run a variety of analytics computations (or rev a single computation and run it a number of times), uploading a large number of objects and depending on listing the bucket from your analytics computation would not be a problem, since the eventual consistency problem really only impacts you in the case where you list the bucket shortly after uploading.