How can I choose the right key-value store for my use case? - nosql

I will describe the data and case.
record {
customerId: "id", <---- indexed
binaryData: "data" <---- not indexed
}
Expectations:
customerId is random 10 digit number
Average size of binary record data - 1-2 kilobytes
There may be up to 100 records per one customerId
Overall number of records - 500M
Write pattern #1: insert one record at a time
Write pattern #2: batch, possibly in parallel, at a rate of at least 20M records per hour
Search pattern #1: find all records by customerId
Search pattern #2: find all records for a group of customerIds, in parallel, at a rate of at least 10M customerIds per hour
Data is not too important, we can trade some aspects of reliability for speed
We plan to run in AWS or GCP; ideally the key-value store is a service managed by the cloud provider
We want to spend no more than 1K USD per month on cloud costs for this solution
What we have tried:
We implemented this approach in a relational database, AWS RDS MariaDB. The server has 32 GB RAM, a 2 TB GP2 SSD, and 8 CPUs. I found that IOPS usage was high and insert speed was not satisfactory. After investigating, I concluded that, because customerId values are random, there is a high rate of scattered writes to the index. I then did the following:
input data is sorted by customerId ASC
An additional trade-off was made to reduce index size at the cost of a small degradation in single-record read speed. I grouped records into buckets, so that records 1111111185 and 1111111186 go to the same "bucket" 11111111. A bucket cannot contain more than 100 customerIds, so read speed stays acceptable, and write speed improves.
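A minimal sketch of the bucketing described above, assuming the bucket is simply the customerId with its last two digits dropped (the names `bucket_of` and `group_by_bucket` are hypothetical, not taken from the actual implementation):

```python
from collections import defaultdict

# Assumption: 100 consecutive customerIds share one bucket,
# i.e. the bucket key is the id without its last two digits.
BUCKET_WIDTH = 100


def bucket_of(customer_id: int) -> int:
    """Return the bucket key for a 10-digit customerId."""
    return customer_id // BUCKET_WIDTH


def group_by_bucket(customer_ids):
    """Group ids by bucket so rows of one bucket can be written together."""
    groups = defaultdict(list)
    for cid in sorted(customer_ids):
        groups[bucket_of(cid)].append(cid)
    return dict(groups)


if __name__ == "__main__":
    print(group_by_bucket([1111111186, 1111111185, 1111111200]))
```

Reading all records of one customerId then means looking up its bucket by index and filtering the rows inside that bucket.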
Even so, I could not get more than 1-3M record writes per hour. I tested different write concurrencies; the current value is 4 concurrent writers. After all these modifications it is not clear what else we can improve:
IOPS is not at its limit (~4K per second),
CPU use is not high,
Network is not fully utilized,
Write and read throughputs are not capped.
Apparently, ACID principles are holding us back. I am looking for a horizontally scalable key-value store and will be glad to hear any ideas and rough estimates.

So if I understand you...
2 KB * 500M records ≈ 1 TB of data
20M writes/hr ≈ 5.6k writes/sec
That's quite doable in NoSQL.
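As a rough check of these figures, here is a small sketch that redoes the arithmetic from the numbers in the question. It assumes the upper end of the record size (2,000 bytes) and decimal units; the variable names are only illustrative.

```python
RECORDS = 500_000_000          # overall number of records
RECORD_BYTES = 2_000           # assumption: upper end of the 1-2 KB range
WRITES_PER_HOUR = 20_000_000   # write pattern #2
LOOKUPS_PER_HOUR = 10_000_000  # search pattern #2 (customerIds per hour)

raw_tb = RECORDS * RECORD_BYTES / 1e12
writes_per_sec = WRITES_PER_HOUR / 3600
lookups_per_sec = LOOKUPS_PER_HOUR / 3600

print(f"raw data:  {raw_tb:.1f} TB (before replication and compaction headroom)")
print(f"writes:    {writes_per_sec:,.0f} per second")
print(f"lookups:   {lookups_per_sec:,.0f} customerIds per second")
```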
The scale is not the issue. It's your cost.
$1k a month for 1 TB of data sounds like a reasonable goal. I just don't think that the public clouds are quite there yet.
Let me give an example with my recommendation: Scylla Cloud and Scylla Open Source. (Disclosure: I work for ScyllaDB.)
I will caution you that your $1k/month cap on costs might force you to make some tradeoffs.
As is typical in high availability deployments, to ensure data redundancy in case of node failure, you could use 3x i3.2xlarge instances on AWS (can store 1.9 TB per instance).
You want the extra capacity to run compactions. We use incremental compaction, which saves on space amplification, but you don't want to go with the i3.xlarge (0.9 TB each), which is under your 1 TB of data, unless really pressed for costs. In that case you'll have to do some sort of data eviction (like a TTL) to keep your data under about 600 GB.
Even with annual reserved pricing for Scylla Cloud (see here: https://www.scylladb.com/product/scylla-cloud/#pricing) of $764.60/server, to run the three i3.2xlarge would be $2,293.80/month. More than twice your budget.
Now, if you eschew managed services, and want to run self-service, you could go Scylla Open Source, and just look at the on-demand instance pricing (see here: https://aws.amazon.com/ec2/pricing/on-demand/). For 3x i3.2xlarge, you are running each at $0.624/hour. That's a raw on-demand cost of $449.28 per instance per month (720 hours), which doesn't include incidentals like backups, data transfer, etc. But you could get three instances for $1,347.84 a month. Open Source. Not managed.
Still over your budget, but closer. If you could get reserved pricing, that might just make it.
Edit: Found the reserve pricing:
3x i3.2xlarge is going to cost you
At monthly pricing $312.44 x 3 = $937.32, or
1 year up-front $3,482 annual/12 = $290.17/month/server x 3 = $870.50.
So, again, backups, monitoring, and other costs are above that. But you should be able to bring the raw server cost <$1,000 to meet your needs using Scylla Open Source.
But the admin burden is on your team (and their time isn't exactly zero cost).
For example, if you want monitoring on your system, you'll need to set up something like Prometheus, Grafana or Datadog. That will be other servers or services, and they aren't free. (The cost of backups and monitoring by our team are covered with Scylla Cloud. Part of the premium for the service.)
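The pricing comparison above can be reproduced with a few lines. This is only a sketch using the prices quoted in this answer (which will have changed since), and it assumes a 720-hour month, which is what $0.624/hour × 720 = $449.28 implies.

```python
HOURS_PER_MONTH = 720   # assumption: 30-day month, matches $449.28 above
NODES = 3               # 3x i3.2xlarge for 3x replication
BUDGET = 1_000.00       # USD per month, from the question

# Monthly cost per node for each option quoted in the answer.
monthly_per_node = {
    "Scylla Cloud, annual reserved": 764.60,
    "EC2 on-demand": 0.624 * HOURS_PER_MONTH,
    "EC2 reserved, monthly": 312.44,
    "EC2 reserved, 1 year up-front": 3_482 / 12,
}


def cluster_cost(per_node: float, nodes: int = NODES) -> float:
    """Raw monthly server cost, excluding backups, monitoring, transfer."""
    return round(per_node * nodes, 2)


for name, per_node in monthly_per_node.items():
    total = cluster_cost(per_node)
    verdict = "within" if total <= BUDGET else "over"
    print(f"{name}: ${total:,.2f}/month ({verdict} budget)")
```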
Another way to save money is to only do 2x replication, but that puts your data at real risk if you lose a server. It is not recommended.
All of this was based on maximal assumptions about your data: that your records are all around 2k (not 1k), and that you're not getting much utility out of data compression, which ScyllaDB has built in; see part one (https://www.scylladb.com/2019/10/04/compression-in-scylla-part-one/) and part two (https://www.scylladb.com/2019/10/07/compression-in-scylla-part-two/).
To my mind, you should be able to squeak through with your $1k/month budget if you go reserved pricing and open source. Though adding on monitoring and backups and other incidental costs (which I haven't calculated here) may put you back over that number again.
Otherwise, $2.3k/month in a fully-managed-cloud enterprise package and you can sleep easy at night.

Related

Manage almost 3.5 PB/3500 TB of data [closed]

I am looking at an opportunity wherein we will have to manage almost 3.5 PB/3500 TB of data. From what I have read so far, Greenplum seems to be a good choice. Having said that, I am struggling to find a good resource to give me an idea of the hardware requirements. The following are the base inputs on which we are building this:
1). Data would be coming in at rate of 2 GBps (Giga BYTE)
2). The data is fairly simple with only one large table with 15 odd columns. Each column/record would be close to 2 KB
3). I would need indexing on 6 columns. Each of these columns is a varchar/string column
4). The use case is more write intensive and less of read intensive. The idea would be to process a set of 15-20 batch jobs per day. Real/Near real time analytics is not a necessity. This is more from a reporting purpose.
5). Data is time series data and would be required for a month. So data older than a month would be purged.
What I have understood so far is that Greenplum recommends 2x8 cores (Threads) and 256 GB RAM per host. Also, each host should typically have 24 hard disk slots. If I look at ESAS with 4 TB each, I should be able to host 96 TB per host. If I assume a simple linear extrapolation, I would be looking at 37 nodes (3500 / 96, rounded up).
Now I know it's not that simple or linear a calculation. Hence, I wanted to understand if there is any calculator/resource/guideline to size the database cluster. Also, I wanted to know if it is OK to not give dedicated disks to the server and rather use a single SAN storage. Each server can have 2x10G links to ensure easy data transfers between the nodes and the SAN.
Many Thanks.
Abhi
You likely won't need indexes because of the architecture of Greenplum. You will just need to use good partitioning which your design suggests.
Leveraging the cluster for data transformation is also a great idea and a common use case for Greenplum.
"What I have understood so far is that Greenplum recommends 2x8 cores (Threads) and 256 GB RAM per host."
By "threads", I'm assuming you mean "Segments" and your statement isn't completely accurate. The number of Segments per host depends on how much RAM, cores, and disk performance each Segment host will have in addition to the level of concurrency.
I would use a minimum of 8GB of RAM, 4 cores, and 100 MB/s of disk performance per segment. And 100 MB/s of disk performance is definitely on the low end. It will be a balancing act of getting the number of segments per host right.
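As a starting point before benchmarking, the per-segment minimums above can be turned into an upper bound on segments per host. This is a sketch only: the function name is hypothetical, the host figures come from the question (256 GB RAM, 2x8 cores), and the 2,000 MB/s disk throughput is an assumed value that would need to be measured.

```python
def max_segments_per_host(ram_gb, cores, disk_mb_per_s,
                          ram_per_seg_gb=8, cores_per_seg=4,
                          disk_per_seg_mb_s=100):
    """Upper bound on segments per host: the scarcest resource wins.

    Defaults are the per-segment minimums suggested in the answer.
    """
    return int(min(ram_gb // ram_per_seg_gb,
                   cores // cores_per_seg,
                   disk_mb_per_s // disk_per_seg_mb_s))


if __name__ == "__main__":
    # Host from the question; disk throughput is an assumption.
    print(max_segments_per_host(ram_gb=256, cores=16, disk_mb_per_s=2000))
```

With these inputs the cores are the limiting resource, not RAM or disk. The result is only a first guess; expected concurrency can push the right number lower.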
One way to do this is with the TPC-DS benchmark. https://github.com/pivotalguru/tpc-ds
You can run the benchmark, get the results, reinitialize the cluster to use more/less segments per host, and then run the test again. You can also set the level of concurrency in the test that best matches what you are expecting.
"Also I wanted to know if it is ok to not give dedicated disks to the server and rather use a single SAN storage. Each server can have 2x10G links to make sure easy data transfers between nodes and SAN."
SAN is usually configured with IOPS in mind and not throughput, which is what Greenplum needs. With that much data, you are usually better off with DAS. Some cloud vendors provide pretty good throughput to their SAN solutions but you end up having to run many small nodes to get the overall throughput you want.

MongoDB: Disk I/O % utilization on Data Partition has gone

Recently I got an alert from MongoDB Atlas:
Disk I/O % utilization on Data Partition has gone above 70 on nvme2n1
But I have no idea how to locate the problematic query, index, piece of code, or collection.
How can I analyze this to find the root cause?
This is not an answer, but I have seen that many people have faced a similar problem.
In my case the root cause was this: we had a collection with huge documents, each containing an array of data (in fact, a list of coordinates with some metadata), and we updated a document once for every coordinate added to it, plus some additional operations.
As far as I know, MongoDB cannot fetch just part of a document; it fetches the full document. When we fetched many different large documents, they did not fit into MongoDB's in-memory cache, so each access went to disk, which led to this issue.
So we split each document into several smaller ones, and this fixed the issue. While we need frequent access to update or add this data, we keep it in separate documents; after processing is done, we gather all these documents back into one big document for "history check" purposes.
Recently, we got this alert on MongoDB Atlas, "Disk I/O % utilization on Data Partition has gone above 90", after an instance reboot during maintenance. After a discussion with the Atlas support team, we now clearly understand this metric.
Understanding Disk I/O % Utilization
The definitions of Disk I/O % Utilization and Disk I/O % utilization on Data Partition, per the docs:
Disk I/O % Utilization alerts indicate that the percentage of time during which requests are being issued reaches a specified threshold.
Disk I/O % utilization on Data Partition occurs if the percentage of time during which requests are being issued to any partition that contains the MongoDB collection data meets or exceeds the threshold.
Two traps in iostat: %util and svctm
Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
This means if there was even just one I/O operation in progress for a given time period, the operating system would report 100% Disk Util, as the disk was in use 100% of that time.
Thus, the disk utilization percentage by itself is NOT an indicator of stress on the disk relative to its maximum IOPS capacity.
Having disk utilization at 100% does not in itself imply there is an issue. Disk utilization is the percentage of time requests are issued to any partition containing the MongoDB collection data. This includes requests from any process, not just MongoDB processes. Modern disk storage can sustain multiple I/O operations simultaneously, so having a ~100% utilization is not unusual, because it just means that the disk is constantly processing at least one operation during the 100% interval.
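To illustrate why the percentage alone says little about saturation, here is a small sketch that computes utilization as "fraction of time with at least one request in flight". The function name and the request timings are made up for illustration, and requests are assumed to lie entirely inside the measurement window.

```python
def busy_fraction(requests, window_ms):
    """Fraction of the window during which >= 1 request was in flight.

    `requests` is a list of (start_ms, end_ms) pairs.
    """
    busy = 0
    cur_start = cur_end = None
    for start, end in sorted(requests):
        if cur_end is None or start > cur_end:
            # Gap before this request: close the previous busy interval.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps or touches the current busy interval: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / window_ms


# 10 requests, one at a time, back to back over one second.
serial = [(t, t + 100) for t in range(0, 1000, 100)]
# 80 requests, always 8 in flight at once, over the same second.
parallel = [(t, t + 100) for t in range(0, 1000, 100) for _ in range(8)]
# 2 requests with idle time in between.
sparse = [(0, 100), (500, 600)]

for name, reqs in [("serial", serial), ("parallel", parallel), ("sparse", sparse)]:
    print(f"{name}: {len(reqs)} requests, {busy_fraction(reqs, 1000):.0%} utilization")
```

The serial and parallel workloads report the same utilization even though one completes eight times as many requests, which is the point made above about devices that serve requests in parallel.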
Conclusion
We should look at a combination of all the available disk-related metrics, as well as IOWait in the System CPU when diagnosing potential disk performance-related issues.
Possible actions to help resolve Disk Utilization % alerts
Optimize your queries
Create an Index to Support Read Operations
Pay attention to Query Selectivity and Covered Query
Use the Atlas Performance Advisor to view slow queries and suggested indexes.
Review Indexing Strategies for possible further indexing improvements.
Analyze Query Performance to review how your queries are using your indexes.
Analyze Profile to optimize the long execution time query
Increase hardware resources, such as instance size and IOPS on Atlas
Source: Mongo Doc
As the alert says, it is due to high utilization of the disk. The most common cause is unoptimized queries with a poor Query Targeting ratio, or simply reading/writing a lot of documents from/to the disk in a relatively short time window.
In order to identify these queries, start with the Profiler and look for the operations with a poor Examined:Returned ratio. You can also refer to the Performance Advisor to see if it suggests any indexes on the inefficient operations. Since Profiler's window is limited to the last 24 hours, you can also refer to your logs to identify the Slow Queries.
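If you export profiler or slow-query entries, the Examined:Returned ratio can also be ranked offline. The sketch below works on plain dictionaries and assumes entries shaped like database profiler output, with `docsExamined`, `nreturned`, `ns` and `millis` fields; the sample entries and the function name are invented for illustration.

```python
def rank_by_targeting(profile_docs, top=3):
    """Rank operations by docsExamined / nreturned (higher is worse)."""
    ranked = []
    for doc in profile_docs:
        examined = doc.get("docsExamined", 0)
        returned = doc.get("nreturned", 0)
        ratio = examined / max(returned, 1)  # avoid division by zero
        ranked.append((ratio, doc.get("ns", "?"), doc.get("millis", 0)))
    ranked.sort(reverse=True)
    return ranked[:top]


# Made-up sample entries, not real profiler output.
sample = [
    {"ns": "app.tracks", "docsExamined": 250_000, "nreturned": 10, "millis": 840},
    {"ns": "app.users", "docsExamined": 1, "nreturned": 1, "millis": 2},
    {"ns": "app.events", "docsExamined": 5_000, "nreturned": 500, "millis": 35},
]

for ratio, ns, millis in rank_by_targeting(sample):
    print(f"{ns}: examined/returned = {ratio:,.0f}, {millis} ms")
```

A ratio close to 1 means the query reads roughly what it returns; a very large ratio usually points to a missing or unused index.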
Ultimately, the effort to solve this is three-pronged:
Optimize query execution with efficient indexing and filtering strategies
Keep a check on the volume of data being read/written in one go
Increase the IOPS of the cluster
For official reference, check out the documentation here.

Azure Service Fabric reliable collections and memory

Let's say I'm running a Service Fabric cluster on 5 D1 class (1 core, 3.5GB RAM, 50GB SSD) VMs, and that I'm running 2 reliable services on this cluster, one stateless and one stateful. Let's assume that the replica target is 3.
How do I calculate how much my reliable collections can hold?
Let's say I add one or more stateful services. Since I don't really know how the framework distributes services, do I need to take the most conservative approach and assume that a single node may run all of my stateful services, and that their cumulative memory needs to be below the RAM available on a single machine?
TLDR - Estimating the expected capacity of a cluster is part art, part science. You can likely get a good lower bound which you may be able to push higher, but for the most part deploying things, running them, and collecting data under your workload's conditions is the best way to answer this question.
1) In general, the collections on a given machine are bounded by the amount of available memory or the amount of available disk space on a node, whichever is lower. Today we keep all data in the collections in memory and also persist it to disk. So the maximum amount that your collections across the cluster can hold is generally (Amount of available memory in the cluster) / (Target Replica Set Size).
Note that "Available Memory" is whatever is left over from other code running on the machines, including the OS. In your above example though you're not running across all of the nodes - you'll only be able to get 3 of them. So, (unrealistically) assuming 0 overhead from these other factors, you could expect to be able to put about 3.5 GB of data into that stateful service replica before you ran out of memory on the nodes on which it was running. There would still be 2 nodes in the cluster left empty.
Let's take another example. Let's say that it is about the same as your example above, except in this case you set up the stateful service to be partitioned. Let's say you picked a partition count of 5. So now on each node, you have a primary replica and 2 secondary replicas from other partitions. In this case, each partition would only be able to hold a maximum of around 1.16 GB of state, but now overall you can pack 5.83 GB of state into the cluster (since all nodes can now be utilized fully). Incidentally, just to prove out the math works, that's (3.5 GB of memory per node * 5 nodes in the cluster) [17.5] / (target replica set size of 3) = 5.83.
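The two examples above follow from the same formula, (available memory) / (target replica set size). A minimal sketch, with a hypothetical function name and the same unrealistic zero-overhead assumption as the text:

```python
def state_capacity_gb(nodes_used, mem_per_node_gb, replica_set_size):
    """Upper bound on unique state, assuming zero overhead on every node."""
    return nodes_used * mem_per_node_gb / replica_set_size


# One unpartitioned service with 3 replicas only ever touches 3 nodes.
single = state_capacity_gb(nodes_used=3, mem_per_node_gb=3.5, replica_set_size=3)

# 5 partitions x 3 replicas spread over all 5 nodes.
partitions = 5
partitioned = state_capacity_gb(nodes_used=5, mem_per_node_gb=3.5, replica_set_size=3)
per_partition = partitioned / partitions

print(f"single partition: {single:.2f} GB")
print(f"5 partitions:     {partitioned:.2f} GB total, {per_partition:.3f} GB each")
```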
In all of these examples, we've also assumed that memory consumption for all partitions and all replicas is the same. A lot of the time that turns out to not be true (at least temporarily) - some partitions can end up with more or less work to do and hence have uneven resource consumption. We also assumed that the secondaries were always the same as the primaries. In the case of the amount of state, it's probably fair to assume that these will track fairly evenly, though for other resource consumption it may not (just something to keep in mind). In the case of uneven consumption, this is really where the rest of Service Fabric's Cluster Resource Management will help, since we can come to know about the consumption of different replicas and pack them efficiently into the cluster to make use of the available space. Automatic reporting of consumption of resources related to state in the collections is on our radar and something we want to do, so in the future, this would be automatic but today you'd have to report this consumption on your own.
2) By default, we will balance the services according to the default metrics (more about metrics is here). So by default, replicas of those two different services could end up on the same machine; in your example, you'll end up with 4 nodes with 1 replica from a service on each and 1 node with two replicas from the two different services. This means that each service (each with 1 partition as per your example) would only be able to consume 1.75 GB of memory, for a total of 3.5 GB in the cluster. This is again less than the total available memory of the cluster since there are some portions of nodes that you're not utilizing.
Note that this is the maximum possible consumption, and it presumes no consumption outside the service itself. Taking this as your maximum is not advisable. You'll want to reduce it for several reasons, but the most practical reason is to ensure that, in the presence of upgrades and failures, there's sufficient available capacity in the cluster. As an example, let's say that you have 5 Upgrade Domains and 5 Fault Domains. Now let's say that a fault domain's worth of nodes fails while you have an upgrade going on in an upgrade domain. This means that (a little less than) 40% of your cluster capacity can be gone at any time, and you probably want enough room left over on the remaining nodes to continue. This means that if your cluster previously could hold 5.83 GB of state (from our prior calculations), in reality you probably don't want to put more than about 3.5 GB of state in it, since with more than that the service may not be able to get back to 100% healthy (note also that we don't build replacement replicas immediately, so the nodes would have to be down for your ReplicaRestartWaitDuration before you ran into this case). There's a bunch more information about metrics, capacity, buffered capacity (which you can use to ensure that room is left on nodes for the failure cases), and fault and upgrade domains in this article.
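The headroom reasoning above can be sketched as follows. The function name is hypothetical, and the sketch treats the failed fault domain and the upgrading upgrade domain as not overlapping, which is why it uses the full 40% rather than "a little less".

```python
def safe_state_gb(max_state_gb, fault_domains, upgrade_domains):
    """State you can hold while one FD is down and one UD is upgrading.

    Assumes the two domains do not share nodes (pessimistic).
    """
    lost_fraction = 1 / fault_domains + 1 / upgrade_domains
    return max_state_gb * (1 - lost_fraction)


if __name__ == "__main__":
    # 5.83 GB is the partitioned-cluster maximum computed earlier.
    print(f"{safe_state_gb(5.83, fault_domains=5, upgrade_domains=5):.2f} GB")
```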
There are some other things that practically will limit the amount of state you'll be able to store. You'll want to do several things:
Estimate the size of your data. You can make a reasonable estimate up-front of how big your data is by calculating the size of each field your objects hold. Be sure to take into consideration 64-bit references. This will give you a lower-bound starting point.
Storage overhead. Each object you store in a collection will come with some overhead for storing that object. In the reliable collections, depending on the collection and the operations currently in flight (copy, enumerations, updates, etc.), this overhead can range from 100 to around 700 bytes per item (row) stored in the collections. Do know also that we're always looking for ways to reduce the amount of overhead we introduce.
We also strongly recommend running your service over some period of time and measuring actual resource consumption via performance counters. Simulating some sort of real workload and then measuring the actual usage of the metrics you care about will serve you pretty well. The reason we recommend this in particular is that you will be able to see consumption from things like which CLR object heap your objects end up placed in, how often GC is running, if there's leaks, or other things like this which will impact the amount of memory you can actually utilize.
I know that this has been a long answer but I hope you find it helpful and complete.

Efficiently checking for a rare occurrence

I have to process many millions of data records. Each data record has a record-type string at its beginning. Processing is record-type-dependent but does not require an 'if'/'elsif' chain on the type, just selecting an array-slice mask from a hash.
However, on the order of once per million records I might encounter a record type that requires a totally different kind of processing.
I hate to insert an 'if' testing for this record type that will return 'true' so rarely.
Any suggestions?
Thanks
Meir
The answer is: Don't worry about it.
The speed of your CPU is considerably higher than that of your disk IO, so an if test is just not going to make a lot of difference - even if you ignored e.g. branch prediction algorithms.
An SSD will do about 1500 IO operations per second, and to quote Borodin from the comments:
A reasonable average disk read speed is 100MB per second. Say your records are 100 bytes each, that means you can read 1 million records per second, or 1μs per record. A 2011 Intel Core i5 processor runs at 83,000 MIPS, and so can execute 83,000 instructions in the time taken to read one record. It is pointless to avoid a few test and branch instructions amongst all that.
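The quoted estimate can be checked with a few lines. The figures are the ones from the quote; the cost of the extra test (2 instructions) is an assumption made only to put the overhead in proportion.

```python
DISK_BYTES_PER_SEC = 100e6   # 100 MB/s, from the quote
RECORD_BYTES = 100           # from the quote
CPU_MIPS = 83_000            # millions of instructions per second, from the quote
EXTRA_TEST_INSTRUCTIONS = 2  # assumption: one compare plus one branch

records_per_sec = DISK_BYTES_PER_SEC / RECORD_BYTES
seconds_per_record = 1 / records_per_sec
instructions_per_record = CPU_MIPS * 1e6 * seconds_per_record
overhead = EXTRA_TEST_INSTRUCTIONS / instructions_per_record

print(f"{records_per_sec:,.0f} records/s, {seconds_per_record * 1e6:.0f} microsecond per record")
print(f"{instructions_per_record:,.0f} instructions available per record")
print(f"extra test costs about {overhead:.4%} of that budget")
```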
Basically this is true in any code - your IO to storage is almost always your limiting factor, because CPUs have followed Moore's law, but the actual rotational speed of a spinning disk hasn't really changed in 15+ years. SSDs are something of a revolutionary change, but they're still too expensive to use as bulk storage options (and even if that wasn't true, they're still going to be the bottleneck on a sustained data transfer/processing operation).

Does auto-sharding in MongoDB work on shards with many small collections/small databases

In the MongoDB documentation for auto-sharding it says: "Sharding is performed on a per-collection basis. Small collections need not be sharded."
Our business has many databases (~100), with many small collections (~30), each with a document count of 1 - 3000. Our DB system is looking at approximately 100,000,000 page views per month.
In that scenario will sharding ever activate since the collections are never big enough even though the DB usage and site traffic is certainly high enough to require load balancing. From the docs I can't seem to find a clear answer.
Whether it makes sense to shard depends a little bit on whether you have mostly writes or reads to the database. Sharding is primarily used for write-scaling, but if you are not doing a lot of writes, then simply using replicasets with "slaveOkay" for the reads might work just as well.
From the numbers that you provided you seem to get about 9 million documents, but are they large documents? If they easily fit in memory, then there is most likely not even going to be a need for replicasets besides for failover capabilities.
This is hard to answer without knowing more about your use case, but I'll give it a shot.
Are you sure sharding is what you need? What does your insert rate look like?
If you are going to have a static set of data, or even a relatively static set, then you probably don't need to shard, you could simply use more secondaries and enable slaveOK reads. The reads will be distributed to the various secondaries and scale up your read capacity.
If that is not the case, and you do need to shard, then there are options. But first, to explain briefly and at a high level how automatic sharding works:
The mongos process is responsible for splitting and migrating chunks in general. These are two separate operations - splitting and balancing.
Splits occur when the mongos sees that a certain portion of the maximum chunk size has been written; it then initiates a split if there is in fact enough data to warrant it. Over time, with enough data written, the number of chunks grows.
Balancing occurs when there is an imbalance of chunks (currently 8 in 2.0, though moving to a more dynamic heuristic in 2.2). The balancer migrates the chunks around the shards until a balance is achieved.
So, you need to be writing enough data relative to the max chunk size (default is 64MB in 2.0) to generate the chunks needed for the balancer to move them around appropriately. If that is not going to happen with your data, then you can look at:
Decreasing the chunk size (has drawbacks too - http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-ChunkSizeConsiderations)
Manually splitting/moving the chunks
For the manual instructions see:
http://www.mongodb.org/display/DOCS/Splitting+Shard+Chunks
http://www.mongodb.org/display/DOCS/Moving+Chunks
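To see why small collections rarely trigger automatic balancing, here is a deliberately simplified model of the behaviour described above. It assumes one chunk per 64 MB of data and a fixed imbalance threshold of 8 chunks (the 2.0 figures from this answer); the real mongos heuristics are more involved, and both function names as well as the 1 KB document size are hypothetical.

```python
import math

DEFAULT_CHUNK_BYTES = 64 * 1024**2  # default max chunk size in 2.0, per the answer


def estimated_chunks(data_bytes, chunk_bytes=DEFAULT_CHUNK_BYTES):
    """Rough chunk count: one chunk per full or partial chunk of data."""
    return max(1, math.ceil(data_bytes / chunk_bytes))


def balancer_would_migrate(chunks_per_shard, threshold=8):
    """Simplified rule: migrate once the chunk-count gap reaches the threshold."""
    return max(chunks_per_shard) - min(chunks_per_shard) >= threshold


if __name__ == "__main__":
    # Assumption: 3,000 documents of about 1 KB each in one collection.
    small = estimated_chunks(3_000 * 1024)
    print(small, balancer_would_migrate([small, 0]))
```

Under these assumptions a 3,000-document collection stays in a single chunk, so there is nothing for the balancer to move unless the chunk size is lowered or chunks are split and moved by hand.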