How to organize a large number of objects in cloud storage? - google-cloud-storage

I'm looking for suggestions on how to organize a large number of objects.
Assume the incoming rate is about 60,000,000 files per day and I would like to keep them for 180 days.
With hourly partitions, there will be 4,320 (24 * 180) directories at the top level, and each directory will contain ~2,500,000 files on average.
If I only need to fetch files individually by their full paths, and I never need to list the contents of a directory, is there any issue with leaving all 2,500,000 of them at the same level?
Or should I hash the filenames and spread them across multiple subdirectories (the way it's typically done on a traditional file system)?

There's no limit on the number of objects you can store in a bucket, and breaking objects into more "subdirectories" doesn't make any scalability or performance difference. To the Google Cloud Storage service all object names are flat: the "/" in the path just looks like any other character in the object name.
Mike Schwartz, Google Cloud Storage Team
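
To make the flat-namespace point above concrete, here is a minimal sketch with the google-cloud-storage Python client (the bucket and object names are made-up examples): fetching an object by its full name is a single keyed lookup and never requires listing a "directory", no matter how many objects share the same prefix.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-ingest-bucket")

# An hour-partitioned, flat object name such as "2024-01-01-13/0123abcd.json"
blob = bucket.blob("2024-01-01-13/0123abcd.json")
data = blob.download_as_bytes()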

Related

Is there an optimal way for writing lots of tiny files with PySpark?

I have a job that requires having to write a single JSON file to s3 for each row in a Spark dataframe (which then gets picked up by another process).
from pyspark.sql.functions import col

df.repartition(col("id")).write.mode("overwrite").partitionBy("id").json(
    "s3://bucket/path/to/file"
)
These datasets often consist of 100k rows (sometimes 1M+) and take a very long time to write. I understand that large numbers of small files are not great for read performance, but is this also the case for writes? Or is there something that can be done with partitioning to speed things up?
Please don't do this; you will only suffer pain. S3 was designed to be cheap long-term storage optimized for large files. It was designed so that the key "prefix" (the directory-like part of the path) maps to an internal partition that serves the files. If you want to optimize reads and writes, you want several partitions serving your requests at the same time. That means putting the part of the path with the most variation early in the prefix, so that your traffic spreads across more partitions.
Example of multiple files being written behind the same prefix:
S3:/mydrive/mystuff/2020-12-31
S3:/mydrive/mystuff/2020-12-30
S3:/mydrive/mystuff/2020-12-29
They all share the same prefix: S3:/mydrive/mystuff/
What if instead you flipped the part that changes? Now the prefixes differ, so different partitions serve the writes:
S3:2020-12-31/mydrive/mystuff/
S3:2020-12-30/mydrive/mystuff/
S3:2020-12-29/mydrive/mystuff/
This change will help read/write speed because different partitions are used. It does not solve the underlying problem: S3 doesn't actually use directories to locate files. As I said, a prefix is just part of the key; S3 still has to look the key up among everything you have written to find your file. This is why tons of small files make things worse: the lookup overhead grows as you write more and more files. Because that lookup is expensive, it is much faster to write larger files and keep the lookup cost to a minimum.
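
If you control the downstream consumer, one way to act on this advice is to write a bounded number of larger JSON files instead of one file per row. A rough PySpark sketch, assuming the DataFrame df from the question; the partition count and output path are illustrative choices, not recommendations:

(
    df.repartition(200)                     # cap the number of output files
      .write
      .mode("overwrite")
      .json("s3://bucket/path/to/output/")  # placeholder path
)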

What is "Globally" Unique for GCS Bucket Names?

Question
When Google Cloud Storage says that all GCS Buckets share a common namespace (paragraph 2 here)
When you create a bucket, you specify a globally-unique name, a geographic location where the bucket and its contents are stored, and a default storage class. The default storage class you choose applies to objects added to the bucket that don't have a storage class specified explicitly.
and (bullet 1 here)
Bucket names reside in a single Cloud Storage namespace.
This means that:
Every bucket name must be unique.
Bucket names are publicly visible.
If you try to create a bucket with a name that already belongs to an existing bucket, Cloud Storage responds with an error message. However, once you delete a bucket, you or another user can reuse its name for a new bucket. The time it takes a deleted bucket's name to become available again is typically on the order of seconds; however, if you delete the project that contains the bucket, which effectively deletes the bucket as well, the bucket name may not be released for weeks or longer.
does "single namespace" and "globally" literally mean that across the entire Google Cloud regardless of your organization, project, or region you cannot create any bucket that shares a name with another bucket anywhere else on the entire planet for all existing buckets at any given time?
I have only ever worked on GCP within one organization, and we prefix our buckets with the organization name lots of times but also sometimes we don't. I am not concerned with running out of names or anything like that, I am more just curious what is meant by those things "globally" and "single namespace" and if it means what I think it does.
Given that most times buckets are only referenced by their name with the gs:// prefix I can see how having literal global uniqueness is important for ensuring consistent access experiences without needing to know things like project/organization IDs. Can anybody find a source that confirms this?
Odd Implication/thought experiment
If this is the case, here is something I wonder about, given that...
there appears to be no cost associated with creating empty buckets you don't use, up to 5k buckets a month; beyond that it is $0.05 per 10k buckets (source), and
there is no limit to the number of buckets you can create in a project (source),
... what is to stop me from creating free-tier project(s) and iterating over ALL possible GCS bucket names (obviously it would take forever and be quite impractical), in theory occupying every bucket name (or at least every human-readable one) and selling those names to organizations that want them, wherever the name is not already taken? I suppose the number of possibilities is astronomical, so for this to be profitable one would need to know what fraction of the available names would actually be bought in order to price them rationally, and even at $0.05 per 10k (with the first 5k a month free) there is not enough money in the world to create all of the buckets at once. Still, I think about these things.
Yes, "single namespace" and "globally" mean what you said: All GCS buckets must have unique names, regardless of organization, project, and region.

Is MongoDB a good choice for storing a huge set of text files?

I'm currently building a system (with GCP) for storing large sets of text files of different sizes (1 KB~100 MB) about different subjects. One fileset could be more than 10 GB.
For example:
dataset_about_some_subject/
- file1.txt
- file2.txt
...
dataset_about_another_subject/
- file1.txt
- file2.txt
...
The files are for NLP and, since the pre-processed data are saved separately, the originals will not be accessed frequently after pre-processing. So keeping all the files in MongoDB seems unnecessary.
I'm considering:
saving all files to some cloud storage, and
saving file information like name and path in MongoDB as JSON.
The above folders then become:
{
    "name": "dataset_about_some_subject",
    "path": "path_to_cloud_storage",
    "files": [
        {
            "name": "file1.txt",
            ...
        },
        ...
    ]
}
When any fileset is needed, search its name in MongoDB and read the files from cloud storage.
Is this a valid way? Will there be any I/O speed problem?
Or is there any better solution for this?
And I've read about Hadoop. Maybe this is a better solution?
Or maybe not. My data is not that big.
As far as I remember, MongoDB has a maximum document size of 16 MB, which is below the maximum size of your files (100 MB). This means that, unless you split them, storing the original files as plain JSON strings would not work.
The approach you describe, however, is sensible. Storing the files in cloud storage such as S3 or Azure is common, not very expensive, and requires much less maintenance than running your own HDFS cluster. I/O will be best if you perform the computations on machines from the same provider and make sure the machines are in the same region as the data.
Note that document stores are, in general, very good at handling large collections of small documents. Retrieving file metadata will therefore be most efficient if you store the metadata of each file in a separate document (rather than in an array of objects inside one document) and create a corresponding index for fast lookup.
Finally, there is another aspect to consider: whether your NLP scenario will process the files by scanning them (reading them entirely) or whether you need random access or lookup (for example, of a certain word). The first case is throughput-driven, and cloud storage is a very good option there. The latter case is latency-driven, and document stores like Elasticsearch offer good full-text search functionality and can index text out of the box.
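
A minimal sketch of the per-file metadata document suggested above, using pymongo (the database, collection, bucket URI, and field names are illustrative assumptions):

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
files = client["nlp"]["files"]

# One small document per file, pointing at the object in cloud storage.
files.insert_one({
    "dataset": "dataset_about_some_subject",
    "name": "file1.txt",
    "uri": "gs://my-bucket/dataset_about_some_subject/file1.txt",
    "size_bytes": 12345,
})

# Compound index so lookups by (dataset, name) never scan the collection.
files.create_index([("dataset", ASCENDING), ("name", ASCENDING)], unique=True)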
I recommend storing large files using one of the object storage services below. They also support multi-regional access through a CDN to keep file access fast.
AWS S3: https://aws.amazon.com/tw/s3/
Azure Blob: https://azure.microsoft.com/zh-tw/pricing/details/storage/blobs/
GCP Cloud Storage: https://cloud.google.com/storage
You can rest assured that speed will not be a problem for the metadata storage you propose in MongoDB.
However, for storing the files themselves, you have various options to consider:
Cloud storage: fast setup, low initial cost, medium cost over time (compare vendor prices); data transfer over the public network for every access (which might be a performance problem).
MongoDB GridFS: already in place; operating cost varies; data transfer is just as fast as from MongoDB itself.
Hadoop cluster: high initial hardware and setup cost, lower cost over time; data transfer over the local network (provided you build it on-premise); specialized administration skills needed; possibility to use the cluster for parallel computations (i.e., it is not only storage but also computing power). (As a rule of thumb: if you are not going to store more than 500 TB, this is not worthwhile.)
If you are not sure about the amount of data you will handle and just want to get started, I recommend starting out with GridFS, but encapsulate the storage layer so that you can easily swap it out later.
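
The encapsulation suggestion above could look something like this sketch (the class and method names are my own illustrative choices, not an established API): callers only see FileStore, so GridFS can later be swapped for cloud storage without touching them.

from abc import ABC, abstractmethod

class FileStore(ABC):
    """Hide the storage backend behind one small interface."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class GridFSStore(FileStore):
    def __init__(self, db):
        import gridfs
        self._fs = gridfs.GridFS(db)          # db is a pymongo Database

    def put(self, key: str, data: bytes) -> None:
        self._fs.put(data, filename=key)

    def get(self, key: str) -> bytes:
        return self._fs.get_last_version(filename=key).read()

class CloudStore(FileStore):
    def __init__(self, bucket):
        self._bucket = bucket                 # a google.cloud.storage Bucket

    def put(self, key: str, data: bytes) -> None:
        self._bucket.blob(key).upload_from_string(data)

    def get(self, key: str) -> bytes:
        return self._bucket.blob(key).download_as_bytes()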
I have another answer: as you say, 10 GB is really not big at all. You may also want to consider storing the data on your local computer (or locally on a single machine in the cloud), simply on your regular file system, and processing it in parallel across your cores (Hadoop and Spark will do this too).
One way of doing this is to save the metadata as a single large text file (or JSON Lines, Parquet, CSV, ...), with the metadata for each file on a separate line, then have Hadoop or Spark parallelize over this metadata file and thus process the actual files in parallel.
Depending on your use case, this might turn out to be faster than a cluster, or not much slower, especially if your execution is CPU-heavy. A cluster has clear benefits when the problem is that you cannot read from disk fast enough; for workloads that run only occasionally, that problem tends to start in the TB range.
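
A rough sketch of the metadata-file idea above (the paths, field names, and the trivial word count are illustrative assumptions): Spark parallelizes over a JSON Lines metadata file, and each task reads and processes its own text file.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metadata-driven-nlp").getOrCreate()

# One JSON record per file, e.g. {"name": "file1.txt", "path": "/data/f1.txt"}
meta = spark.read.json("file:///data/metadata.jsonl")

def process(row):
    # Each task opens and processes its own file; a trivial token count here.
    with open(row.path, encoding="utf-8") as f:
        return (row.name, len(f.read().split()))

results = meta.rdd.map(process).collect()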
I recommend this excellent paper by Frank McSherry:
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

Google Cloud Storage cost effectiveness for small files?

I have a lot of small unstructured JSON files (less than 1 KB each) that I want to store on Google Cloud Storage somehow (using streaming). I would prefer to avoid putting them into zip files (I think), since I'm thinking of using Apache Drill to run queries against them. Would it be more cost effective to merge multiple JSON documents together rather than storing them one by one? (I assume that writing the files in batches would be a good thing regardless of whether they're merged or stored separately.)
Well...maybe. It depends on your usage pattern.
GCS does not have a per-object charge. Instead, it charges per Gigabyte stored per month. Breaking the files up won't affect that at all.
However, GCS also charges a per-operation fee. At the time of writing, every 10,000 downloads will cost you a penny, and every 10,000 uploads will cost you a dime. If you only have a few thousand files, or only access a few files at a time, this might not make a big difference. But if you need to download or replace all of the files frequently, and you're doing millions or billions of separate operations per day, using a few big files instead could suddenly save you a lot of money.
If you can estimate how many downloads and uploads you'll be doing under each scenario, Google provides a calculator to let you know what it will cost: https://cloud.google.com/products/calculator/
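
For a sense of scale, here is a back-of-the-envelope sketch in Python using the prices quoted above; the daily object count is a made-up assumption.

objects_per_day = 100_000_000

upload_cost = objects_per_day / 10_000 * 0.10      # $1,000 per day
download_cost = objects_per_day / 10_000 * 0.01    # $100 per day

# Merging every 1,000 small documents into one object divides both operation
# counts, and therefore both per-operation costs, by roughly 1,000.
print(upload_cost, download_cost)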

Is it better to store 1 email/file in Google Cloud Storage or multiple emails in one large file?

I am trying to do analytics on emails for some users. To achieve this, I am trying to store the emails on Cloud Storage so I can run Hadoop jobs on them. (Earlier I tried App Engine Datastore, but it had a hard time scaling over that many users' data: hitting various resource limits, etc.)
Is it better to store 1 email/file in Cloud Storage or all of a user's emails in one large file? In many examples about cloud storage, I see folks operating on large files, but it seems more logical to keep 1 file/email.
From a GCS scaling perspective there's no advantage to storing everything in one object vs many objects. However, listing the objects in a bucket is an eventually consistent operation. So, if your computation would proceed by first uploading (say) 1 million objects to a bucket, and then immediately starting a computation that lists the objects in the bucket and computing over their content, it's possible the listing would be incomplete. You could address that problem by maintaining a manifest of objects you upload and passing the manifest to the computation instead of having the computation list the objects in the bucket. Alternatively, if you load all the emails into a single file and upload it, you wouldn't need to perform a bucket listing operation.
If you plan to upload the data once and then run a variety of analytics computations (or rev a single computation and run it a number of times), uploading a large number of objects and depending on listing the bucket from your analytics computation would not be a problem, since the eventual consistency problem really only impacts you in the case where you list the bucket shortly after uploading.
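
A minimal sketch of the manifest approach mentioned above, using the google-cloud-storage Python client (the bucket name, object naming scheme, and the `emails` iterable are illustrative assumptions):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-email-bucket")

manifest = []
for i, email_bytes in enumerate(emails):       # `emails` is supplied elsewhere
    name = f"emails/user-123/{i:08d}.eml"
    bucket.blob(name).upload_from_string(email_bytes)
    manifest.append(name)

# Store the manifest itself as one object; the analytics job reads this list
# instead of listing the bucket, sidestepping the listing issue.
bucket.blob("manifests/user-123.txt").upload_from_string("\n".join(manifest))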