How to set the block (chunk) size for a Google Cloud Storage bucket? - google-cloud-storage

I am reading this article. It says, "To demonstrate this, let’s take an 8MB file, and fetch it through different sized chunks. In the graph below, we can see that as the block size gets larger, performance improves. As the chunk size decreases, the overhead per-transaction increases, and performance slows down." in the section "Set your optimal fetch size".
However, when I check the Google Cloud Storage settings, there is no "block size", "chunk size", or "fetch size" option. Where is this setting?

Part of the answer is in the article itself, in this parameter:
GSUtil:sliced_object_download_max_components=8
It allows you to download an object in a number of slices. In the given example it is applied to an 8 MB file, so you can deduce the size of each chunk. The exact chunk size isn't what matters in the article; the HTTP overhead is: the more chunks you have, the more HTTP requests you make, and therefore the more HTTP overhead you pay (in time).
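For what it's worth, there is no such setting on the bucket itself: the chunk (fetch) size is chosen by whichever client performs the download, either via the gsutil parameter above or in client-library code. As a rough sketch of the latter, assuming the google-cloud-storage Java client and placeholder bucket/object names, ReadChannel.setChunkSize controls how much data each underlying request fetches:

import com.google.cloud.ReadChannel;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.IOException;
import java.nio.ByteBuffer;

public class ChunkedDownload {
    public static void main(String[] args) throws IOException {
        // "my-bucket" and "my-object" are placeholders; substitute your own names.
        Storage storage = StorageOptions.getDefaultInstance().getService();
        try (ReadChannel reader = storage.reader(BlobId.of("my-bucket", "my-object"))) {
            // The "fetch size" lives on the client: each underlying request
            // will ask for up to 8 MiB of the object at a time.
            reader.setChunkSize(8 * 1024 * 1024);
            ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
            while (reader.read(buffer) != -1) {
                buffer.flip();
                // ... consume buffer.remaining() bytes here ...
                buffer.clear();
            }
        }
    }
}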

Related

Firebase Storage ~ max write rate per document?

For Firestore, I found the following info under the 'soft limits' section in the docs:
Maximum sustained write rate to a document - 1 per second
Sustaining a write rate above once per second increases latency and causes contention errors. This is not a hard limit, and you can surpass the limit in short bursts.
I currently have a rather big document (~800 KB) in Firestore that I write to quite frequently, which gives me a warning (not as often as once per second, but I think that might be due to the size...), and I'm wondering whether it would be better to switch to Storage. I can't find any such info for Storage, though. Is it more 'robust', with no such restrictions to worry about?

Chunk could not split when the chunk size is greater than the specified chunk size

Here is the situation:
There is a chunk with the shard key range [10001, 100030], but currently only one key (e.g. 10001) holds any data; the key range [10002, 10030] is simply empty. The chunk's data is beyond 8 MB, and then we set the current chunk size to 8 MB.
After we fill in data for the key range [10002, 10030], this chunk starts to split and stops at a key range like [10001, 10003]; it now has two keys, and we wonder whether this is OK or not.
From the documentation on the official site, we thought that a chunk should NOT contain more than ONE key.
So, could you please help us confirm whether this is OK or not?
What we want is to split the data into as many chunks as possible, so as to make sure the data is balanced.
There is a notion called jumbo chunks: every chunk which exceeds its specified size or has more documents than the configured maximum is considered a jumbo chunk.
Since MongoDB usually splits a chunk when about half the chunk size is reached, I take jumbo chunks as a sign that there is something rather wrong with the cluster.
The most likely reason for jumbo chunks is that one or more config servers weren't available for a time.
Metadata updates need to be written to all three config servers (they do not form a replica set), so metadata updates cannot be made while one of the config servers is down. Both chunk splitting and chunk migration need a metadata update. So when a config server is down, a chunk may not be split early enough; it grows in size and ultimately becomes a jumbo chunk.
Jumbo chunks aren't automatically split, even when all three config servers are available. The reason for this is... well, IMHO MongoDB plays it a little safe here. And jumbo chunks aren't moved either. The reason for that is rather obvious: moving data which in theory can be of any size > 16 MB is simply too costly an operation.
Proceed at your own risk! You have been warned!
Since you can identify the jumbo chunks, they are pretty easy to deal with.
Simply identify the key range of the chunk and use it within
sh.splitFind("database.collection", query)
This will identify the chunk in question and split it in half, which is quite important. Please, please read Split Chunks in a Sharded Cluster and make sure you understand all of it and its implications before trying to split chunks manually.
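For reference, sh.splitFind is a shell helper around the split admin command, which has to be run against a mongos. Here is a minimal sketch of the same operation through the MongoDB Java driver; the connection string, namespace and shard key field name are placeholders for your own cluster:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class SplitChunk {
    public static void main(String[] args) {
        // Placeholder mongos address; point this at your own router.
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            // Equivalent of sh.splitFind("database.collection", { shardKey: 10001 }):
            // locate the chunk containing a document that matches "find" and
            // split that chunk at its median point.
            Document result = client.getDatabase("admin").runCommand(
                    new Document("split", "database.collection")
                            .append("find", new Document("shardKey", 10001)));
            System.out.println(result.toJson());
        }
    }
}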

Performance benchmarks for attaching read-only disks to google compute engine

Has anyone benchmarked the performance of attaching a singular, read-only disk to multiple Google Compute Engine instances (i.e., the same disk in read-only mode)?
The Google documentation ( https://cloud.google.com/compute/docs/disks/persistent-disks#use_multi_instances ) indicates that it is OK to attach multiple instances to the same disk, and personal experience has shown it to work at a small scale (5 to 10 instances), but soon we will be running a job across 500+ machines (GCE instances). We would like to know how performance scales out as the number of parallel attachments grows, and as the bandwidth of those attachments grows. We currently pull down large blocks of data (read-only) from Google Cloud Storage buckets, and are wondering about the merits of switching to a Standard Persistent Disk configuration. This involves terabytes of data, so we don't want to change course willy-nilly.
One important consideration: It is likely that code on each of the 500+ machines will try to access the same file (400MB) at the same time. How do buckets and attached drives compare in that case? Maybe the answer is obvious - and it would save having to set up a rigorous benchmarking system (across 500 machines) ourselves. Thanks.
Persistent disks on GCE should have consistent performance. Currently that is 12 MB/s and 30 IOPS per 100 GB of volume size for a standard persistent disk:
https://cloud.google.com/compute/docs/disks/persistent-disks#pdperformance
Using it on multiple instances should not change the disk's overall performance. It will, however, make it easier to reach those limits, since you don't need to worry about a single instance's maximum read speed. However, accessing the same data many times at once might. I do not know how either persistent disks or GCS handle that kind of contention.
If it is only a 400 MB file that is in contention, it may make sense to just benchmark the fastest method of delivering it separately. One possible solution is to make duplicates of your critical file and pick which copy you access at random; this should cause fewer nodes to contend for each file.
Duplicating the critical file means a bigger disk, which in turn also contributes to your I/O performance, since a standard persistent disk's limits scale with volume size. If you already intended to increase your volume size for better performance, the copies are effectively free.
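As a very rough sketch of the random-copy idea (the copy count and file paths below are made up for illustration), each worker just picks one of the duplicates before opening it:

import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ThreadLocalRandom;

public class ReplicaPicker {
    // Hypothetical layout: /mnt/rodisk/data-0.bin ... /mnt/rodisk/data-7.bin,
    // eight identical copies of the 400 MB file on the read-only disk.
    private static final int COPIES = 8;

    static Path pickReplica() {
        int copy = ThreadLocalRandom.current().nextInt(COPIES);
        return Paths.get("/mnt/rodisk", "data-" + copy + ".bin");
    }

    public static void main(String[] args) {
        // Each of the 500+ workers opens a randomly chosen copy, spreading
        // contention across the duplicates instead of piling onto one file.
        System.out.println("Reading from " + pickReplica());
    }
}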

Optimal setting for socket send buffer for sending image files on Netty

I'm looking into implementing an HTTP image server in Netty and I was wondering what the optimal send buffer size should be:
i.e. bootstrap.setOption("sendBufferSize", 1048576);
I've read How to write a high performance Netty Client, but I was wondering what the consequences are of setting this value too small or too large.
The images I serve are mostly around 100 KB to 5 MB (avg 1 MB). I'm thinking a large (1 MB) value would cause OS memory to fill up with TCP-buffered data, but is there a performance penalty to setting it to a small value (e.g. 8192 bytes)?
You might find this article useful.
You might also need to adjust the peer's receive buffer size. In this case, make sure you set it on the server socket; otherwise, you'll be capped at 64 KiB.
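To make the two knobs concrete, here is a minimal sketch of where they live; it assumes the Netty 4 ServerBootstrap API rather than the Netty 3 setOption style used in the question, and the port and pipeline contents are placeholders:

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.http.HttpServerCodec;

public class ImageServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);
        EventLoopGroup workers = new NioEventLoopGroup();
        try {
            ServerBootstrap b = new ServerBootstrap()
                    .group(boss, workers)
                    .channel(NioServerSocketChannel.class)
                    // Receive buffer on the listening socket: it must be set before
                    // connections are accepted so that TCP window scaling is
                    // negotiated and you are not capped at 64 KiB.
                    .option(ChannelOption.SO_RCVBUF, 1024 * 1024)
                    // Send buffer on each accepted (child) connection.
                    .childOption(ChannelOption.SO_SNDBUF, 1024 * 1024)
                    .childHandler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            ch.pipeline().addLast(new HttpServerCodec());
                            // ... add the handler that streams the image bytes ...
                        }
                    });
            b.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}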

What constitutes a "large amount of write activity" for MongoDB?

I am currently working on an online ordering application using MongoDB as the backend. Looking into sharding, the Mongo docs say that you should consider sharding if
"your system has a large amount of write activity, a single MongoDB instance cannot write data fast enough to meet demand, and all other approaches have not reduced contention."
So my question is: what constitutes a large amount of write activity? Are we talking thousands of writes per second? Hundreds?
I know that sharding introduces a level of infrastructure complexity that I'd rather not get into if I don't have to.
thanks!
R
The "large amount of write activity" is not defined in terms of a specific number .. but rather when your common usage pattern exceeds the resources of your server hardware. For example, when average I/O flush time or iowait indicates that I/O has become a significant limiting factor.
You do have other options to consider before sharding:
if your working set is larger than RAM and you have significant page faults, upgrade your RAM
if your disk I/O isn't keeping up, consider upgrading to faster disks, RAID, or SSD
review and adjust your readahead settings
look into optimization of slow or inefficient queries
review your indexes and remove unnecessary ones
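If you want rough numbers before deciding, one option is to poll serverStatus and watch the write counters and page faults over time. Below is a minimal sketch using the MongoDB Java driver; the connection string is a placeholder, and which sections of the output exist depends on your storage engine and server version:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class WritePressureCheck {
    public static void main(String[] args) {
        // Placeholder connection string; point it at your mongod.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            Document status = client.getDatabase("admin")
                    .runCommand(new Document("serverStatus", 1));

            // Rough indicators of write pressure; guard against missing sections.
            Document opcounters = status.get("opcounters", Document.class);
            if (opcounters != null) {
                System.out.println("inserts=" + opcounters.get("insert")
                        + " updates=" + opcounters.get("update"));
            }
            Document extraInfo = status.get("extra_info", Document.class);
            if (extraInfo != null) {
                System.out.println("page_faults=" + extraInfo.get("page_faults"));
            }
        }
    }
}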