Sharding key, chunkSize and pre-splitting - mongodb

I have set up a sharded cluster on a single machine, following the steps mentioned here:
http://www.mongodb.org/display/DOCS/A+Sample+Configuration+Session
But I don't understand the '--chunkSize' option:
$ ./mongos --configdb localhost:20000 --chunkSize 1 > /tmp/mongos.log &
With N shards, each shard is supposed to have 1/N number of documents, dividing the shard-key's range into N almost equal parts, right? This automatically fixes the chunkSize/shard-size. Which chunk is the above command then dealing with?
Also, there's provision to split a collection manually at a specific value of the key and then migrate a chunk to any other shard you want. This can be done manually and is even handled automatically by a 'balancer'. Doesn't it clash with the sharding settings and confuse the config servers, or are they informed of any such movement immediately?
Thanks for any help.

You might be confusing a few things. The --chunkSize parameter sets the chunk size used when doing splits. Look at the document with _id "chunksize" in the "settings" collection of the "config" database to see the current value, if set. The --chunkSize option will only set this value, or make changes to the system, if there is no value set already; otherwise it will be ignored.
The chunk size is the size in megabytes that the system tries to keep each chunk under. This is checked in two places: 1) when writes pass through the mongos instances and 2) prior to moving chunks to another shard during balancing. As such, it does not follow from the "data size / shard count" formula. Your example of 1 MB per chunk is almost always a bad idea.
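To make the point about the "formula" concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not how MongoDB actually decides on splits: it assumes chunks fill up to the configured maximum (real chunks are often smaller, so real counts tend to be higher), and the function names are made up for illustration.

```python
import math

def estimated_chunks(data_mb, chunk_size_mb):
    """Rough lower bound on the chunk count: data size divided by max chunk size."""
    return max(1, math.ceil(data_mb / chunk_size_mb))

def chunks_per_shard(total_chunks, shard_count):
    """The even spread a balancer aims for; leftovers go to the first shards."""
    base, extra = divmod(total_chunks, shard_count)
    return [base + (1 if i < extra else 0) for i in range(shard_count)]

# 10 GB of data on 3 shards: the chunk count depends on the chunk size,
# not on the number of shards.
with_64mb = estimated_chunks(10 * 1024, 64)
with_1mb = estimated_chunks(10 * 1024, 1)
print(with_64mb, chunks_per_shard(with_64mb, 3))
print(with_1mb, chunks_per_shard(with_1mb, 3))
```

With a 1 MB chunk size the same data ends up in thousands of tiny chunks, which means far more splits and migrations for the cluster to track.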
You can indeed split and move chunks manually and although that might result in a less than ideal chunk distribution it will never confuse or break the config meta data and the balancer. The reason is relatively straightforward; the balancer uses the same commands and follows the same code paths. From MongoDB's perspective there is no significant difference between a balancer process splitting and moving chunks and you doing it.
There are a few valid use cases for manually splitting and moving chunks though. For example, you might want to do it manually to prepare a cluster for very high peak loads from a cold start -- pre-splitting. Typically you will write a script to do this, or load splits from a performance test which already worked well. You may also watch for hot chunks and split/move those chunks to distribute them more evenly based on "load" as monitored from your application.
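A pre-splitting script could look roughly like the Python sketch below. It only builds the `split` and `moveChunk` command documents; actually sending them (for example with pymongo's `client.admin.command(cmd)` against a mongos) is left out. The namespace, shard key, shard names and the evenly spaced numeric split points are all assumptions for illustration, and the sketch assumes the collection initially lives on the first shard in the list.

```python
def presplit_points(low, high, chunks):
    """Return chunks-1 evenly spaced split points strictly between low and high."""
    step = (high - low) / chunks
    return [low + round(step * i) for i in range(1, chunks)]

def presplit_commands(namespace, key, points, shards):
    """Build split/moveChunk command documents, spreading chunks round-robin.

    Assumes the collection currently lives on shards[0], so chunks that
    should stay there get no moveChunk command.
    """
    commands = []
    for i, point in enumerate(points):
        # Split so that `point` becomes the lower bound of a new chunk.
        commands.append({"split": namespace, "middle": {key: point}})
        target = shards[(i + 1) % len(shards)]
        if target != shards[0]:
            # `find` selects the chunk that contains this key value.
            commands.append({"moveChunk": namespace, "find": {key: point}, "to": target})
    return commands

# Hypothetical example: 4 chunks over user_id 0..1000 on two shards.
points = presplit_points(0, 1000, 4)
for cmd in presplit_commands("test.people", "user_id", points, ["shard0000", "shard0001"]):
    print(cmd)
```

In practice you would take the split points from real key distributions (or from a load test that worked well) rather than spacing them evenly.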
Hope that helps.

Great, thanks! I think I get it now. Correct me if I'm wrong:
I was thinking that if there are N servers, then the first 1/Nth part of the collection (=chunk1) will go to shard1, the second 1/Nth (=chunk2) will go to shard2, and so on. When you said that there's no such "formula", I searched a bit more and found these links:
MongoDB sharding, how does it rebalance when adding new nodes?
How to define sharding range for each shard in Mongo?
From the definition of "chunk" in the documentation, I think it is to be thought of as merely a unit of data migration. When we shard a collection among N servers, the total number of chunks is not necessarily N. And they need not be of equal size either. The maximum size of one chunk is either already set as a default (usually 64MB) in the settings collection of the config database, or can be set manually by specifying a value using the --chunkSize parameter as shown in the above code. Depending on the values of the shard key, one shard may have more chunks than another. But MongoDB uses a balancer process that tries to distribute these chunks evenly among the shards. By even distribution, I mean it tends to split chunks and migrate them to other shards if they grow bigger than their limit or if one particular shard is getting heavily loaded. This can be done manually as well, by following the same set of commands that the balancer process uses.


What are the settings to look out for with Citus PostgreSQL

We are looking into using CitusDB. After reading all the documentation we are not clear on some fundamentals. Hoping somebody can give some directions.
In Citus you specify a shard_count and a shard_max_size, these settings are set on the coordinator according to the docs (but weirdly can also be set on a node).
What happens when you specify 1000 shards and distribute 10 tables with 100 clients?
Does it create a shard for every table (users_1, users_2, shops_1, etc.), so effectively using all 1000 shards?
If we grow by another 100 clients, we have already hit the 1000 limit, so how are these tables partitioned?
The shard_max_size defaults to 1 GB. If a shard is > 1 GB a new shard is created, but what happens when the shard_count is already hit?
Lastly, is it advisable to go for 3000 shards? We read in the docs that 128 is advised for a SaaS. But this seems low if you have 100 clients * 10 tables. (I know it depends.. but..)
Former Citus/current Microsoft employee here, chiming in with some advice.
Citus shards are based on integer hash ranges of the distribution key. When a row is inserted, the value of the distribution key is hashed, the planner looks up what shard was assigned the range of hash values that that key falls into, then looks up what worker the shard lives on, and then runs the insert on that worker. This means that the customers are divided up across shards in a roughly even fashion, and when you add a new customer it'll just go into an existing shard.
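The lookup described above can be sketched in a few lines of Python. This is a toy model only: the hash function is a CRC32 stand-in (not the hash Citus actually uses), the round-robin placement of shards on workers is an assumption, and all names are hypothetical.

```python
import zlib

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def shard_ranges(shard_count):
    """Split the signed 32-bit hash space into shard_count contiguous ranges."""
    step = 2**32 // shard_count
    ranges = []
    for i in range(shard_count):
        lo = INT32_MIN + i * step
        hi = INT32_MAX if i == shard_count - 1 else lo + step - 1
        ranges.append((lo, hi))
    return ranges

def toy_hash(value):
    """Stand-in hash: CRC32 shifted into the signed 32-bit range."""
    return zlib.crc32(str(value).encode()) - 2**31

def shard_for(value, ranges):
    """Find the index of the shard whose hash range contains the value's hash."""
    h = toy_hash(value)
    for idx, (lo, hi) in enumerate(ranges):
        if lo <= h <= hi:
            return idx
    raise ValueError("hash outside the 32-bit range")

def worker_for(shard_idx, workers):
    """Assumed placement: shards assigned to workers round-robin."""
    return workers[shard_idx % len(workers)]

ranges = shard_ranges(128)
workers = ["worker-1", "worker-2", "worker-3"]
for tenant in ["customer-1", "customer-2", "a-brand-new-customer"]:
    shard = shard_for(tenant, ranges)
    print(tenant, "-> shard", shard, "on", worker_for(shard, workers))
```

Note that a new tenant never creates a shard in this model: its key simply hashes into one of the existing ranges.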
It is critically important that all distributed tables that you wish to join to each other have the same number of shards and that their distribution columns have the same type. This lets us perform joins entirely on workers, which is awesome for performance.
If you've got a super big customer (100x as much data as your average customer is a decent heuristic), I'd use the tenant isolation features in advance to give them their own shard. This will make moving them to dedicated hardware much easier if you decide to do so down the road.
The shard_max_size setting has no effect on hash distributed tables. Shards will grow without limit as you keep inserting data, and hash-distributed tables will never increase their shard count under normal operations. This setting only applies to append distribution, which is pretty rarely used these days (I can think of one or two companies using it, but that's about it).
I'd strongly advise against changing the citus.shard_count to 3000 for the use case you've described. 64 or 128 is probably correct, and I'd consider 256 if you're looking at >100TB of data. It's perfectly fine if you end up having thousands of distributed tables and each one has 128 shards, but it's better to keep the number of shards per table reasonable.

GridFS: what it gives us

I'm reading "Seven Databases in Seven Weeks". Could you please explain the text below to me:
One downside of a distributed system can be the lack of a single
coherent filesystem. Say you operate a website where users can upload
images of themselves. If you run several web servers on several
different nodes, you must manually replicate the uploaded image to
each web server’s disk or create some alternative central system.
Mongo handles this scenario by its own distributed filesystem called
GridFS.
Why do you need to manually replicate uploaded images? Do they mean that some of the servers will run Linux and some of them Windows?
Do all replicated data stores tend to implement their own filesystem?
On the need for data distribution and its intricacies
Let us dissect the example in a bit more detail. Say you have a web application where people can upload images. You fire up your server, save the images to the local machine in /home/server/app/uploads, the users use the application. So far, so good.
Now, your application becomes the next big thing, you have tens of thousands of concurrent users and your single server simply can not handle that load any more. Luckily, aside from the fact that you store the images in the local file system, you implemented the application in a way that you could easily put up another instance and distribute the load between them. But now here comes the problem: the second instance of your application would not have access to the images stored on the first instance – bad thing.
There are various ways to overcome that. Let us take NFS as an example. Now your second instance can access the images, and even store new ones, but that puts all the images on one machine, which sooner or later will run out of disk space.
Scaling storage capacity can easily become a very expensive part of an application. And this is where GridFS comes to help. It uses the rather easy means of MongoDB to distribute data across many machines, a process which is called sharding. Basically, it works like this: Instead of accessing the local filesystem, you access GridFS (and the contained files within) via the MongoDB database driver.
As for the OS: Usually, I would avoid mixing different OSes within a deployment, if at all possible. Nowadays, there is little to no reason for the average project to do so. I assume you are referring to the "different nodes" part of that text. This only refers to the fact that you have multiple machines involved, and they can perfectly well run the same OS.
Sharding vs. replication
Note: The following is vastly simplified, because going into details would well exceed the scope of one or more books.
The excerpt you quoted mixes two concepts a bit and is not clear enough on how GridFS works.
Let's first make the two involved concepts a bit more clear.
Replication is roughly comparable to a RAID1: The data is stored on two or more machines, and each machine holds all data.
Sharding (also known as "data partitioning") is roughly comparable to a RAID0: Each machine only holds a subset of the data, albeit you can access the whole data set (files in this case) transparently and the distributed storage system takes care of finding the data you requested (and decides where to store the data when you save a file)
Now, MongoDB allows you to have a mixed form, roughly comparable to RAID10: The data is distributed ("partitioned" or "sharded") between two or more shards, but each shard may (and almost always should) consist of a replica set, which is an odd number of MongoDB instances that all hold the same data. This mixed form is called a "sharded cluster with a replication factor of X", where X denotes the number of non-hidden members per replica set.
The advantage of a sharded cluster is that there is no single point of failure any more:
Depending on your replication factor, one or more replica set members can fail, and the cluster is still working
There are servers which hold the metadata (which part of the data is stored on which shard, for example). Those are called config servers. As of MongoDB version 3.2 (iirc), they form a replica set themselves – not much of a problem if a node fails.
You access a sharded cluster via the mongos sharded cluster query router, of which you usually have one per instance of your application (and most often on the same server as your application instance). But: most drivers can be given multiple mongos instances to connect to. So if one of those mongos instances fails, the driver will happily use the next one you configured.
Another advantage is that in case you need to add additional storage or have more IOPS than your current system can handle, you can add another shard: MongoDB will take care of distributing the existing data between the old shards and the new shard automagically. The details on how this is done are covered in the introduction to Sharding in the MongoDB docs.
The third advantage – and the one that has the most impact, imho – is that you can distribute (and replicate) data on relatively cheap commodity hardware, whereas most other technologies offering the benefits of GridFS on a sharded cluster require you to have specialized and expensive hardware.
A disadvantage is of course that this setup only is feasible if you have a lot of data, since many machines are necessary to set up a sharded cluster:
At least 3 config servers
At least a single shard, which should consist of a replica set. The minimal setup would be two data bearing nodes plus an arbiter
But: in order to use GridFS in general, you do not even need a replica set ;).
To stay within our above example: Both instances of your application could well access the same MongoDB instance holding a GridFS.
Do all replicated data stores tend to implement their own filesystem?
Replicated? Not necessarily. There is DRBD for example, which could be described as "RAID1 over ethernet".
Assuming we have the same mixup of concepts here as we had above: Distributed file systems by their very definition implement a file system.
In this case, IMHO, the author was stating that each web server has its own disk storage, not shared with the others. Given that, the upload path could be /home/server/app/uploads, and since it is part of the server's filesystem, it is not shared at all (as a kind of security measure by the service provider). To populate the other servers we need a script/job which syncs the data to them behind the scenes.
This scenario could be a case to use GridFS with mongo.
How GridFS works:
GridFS divides the file into parts, or chunks, and stores each
chunk as a separate document. By default, GridFS uses a chunk size of
255 kB; that is, GridFS divides a file into chunks of 255 kB with the
exception of the last chunk. The last chunk is only as large as
necessary. Similarly, files that are no larger than the chunk size
only have a final chunk, using only as much space as needed plus some
additional metadata.
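To illustrate the chunking described in that quote, here is a minimal Python sketch. It is not the driver's implementation; it only mimics how a file's bytes would be cut into chunk documents with the `files_id`, `n` and `data` fields, assuming the 255 kB default (255 * 1024 bytes). The file id and contents are made up.

```python
CHUNK_SIZE = 255 * 1024  # 255 kB default chunk size, in bytes

def to_chunks(files_id, data, chunk_size=CHUNK_SIZE):
    """Cut a byte string into GridFS-style chunk documents."""
    return [
        {"files_id": files_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]

def from_chunks(chunks):
    """Reassemble the original bytes by ordering the chunks on their sequence number."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))

image = b"x" * 600_000  # stand-in for an uploaded image
chunks = to_chunks("hypothetical-file-id", image)
print(len(chunks), [len(c["data"]) for c in chunks])
```

Because every chunk is an ordinary document, the chunks collection can be replicated and sharded like any other collection.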
In reply to comment:
BSON is a binary format, and mongo has a special replication mechanism for replicating collection data (GridFS is a special set of 2 collections). It uses the oplog to send diffs to other servers in the replica set. More here
Any comments welcome!

MongoDB Replica Set: Disk size difference in Primary and Secondary Nodes

I just did the MongoDB replica set configuration and all looks good. All data moved to the secondary nodes properly. But when I looked at the data directory, I can see the primary has ~140G of data while the secondary has only ~110G.
Did anyone come across this kind of issue while setting up a replica set? Is that normal behavior?
When you do an initial sync from scratch on a secondary, it writes all the data fresh. This removes padding, empty space (deleted data) etc. As a result, in that respect it is similar to running a repair.
If you ran a repair on the primary (blocking operation, only to be done if absolutely necessary), then the two would be far closer overall.
If you check the output from db.stats() you should see that the various databases have the same object count. The data directory size differences are nothing to be worried about.
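As a small sketch of that check in Python: given the `db.stats()` output from the primary and from a secondary as dictionaries, compare the `objects` count and report the storage difference separately. The numbers below are invented purely for illustration, and fetching the stats from each member is left out.

```python
def compare_stats(primary, secondary):
    """Compare two db.stats() results: object counts should match, sizes may not."""
    return {
        "same_objects": primary["objects"] == secondary["objects"],
        "storage_diff_bytes": primary["storageSize"] - secondary["storageSize"],
    }

# Invented example values, not real measurements.
primary_stats = {"objects": 1_000_000, "storageSize": 140 * 1024**3}
secondary_stats = {"objects": 1_000_000, "storageSize": 110 * 1024**3}
print(compare_stats(primary_stats, secondary_stats))
```

On a busy replica set the counts can differ briefly while writes are still replicating, so compare them at a quiet moment.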

Is MongoDB usable as shared memory for a parallel processing / multiple-instances application?

I'm planning a product that will process updates from multiple data feeds. Input data is guesstimated to be a 100Mbps stream in total, made up of 100-byte messages. These messages contain several data fields that need to be checked for correlation with the existing data set within the application. If an input message correlates with an existing data record, then the input message will update the existing data record; if not, it will create a new record. It is assumed that data are updated every 3 seconds on average.
The correlation process is assumed to be a bottleneck, and thus I intend to make our product able to run balanced in multiple processes if needed (most likely on a separate hardware or VM). Somewhat in the vicinity of Space-based architecture. I'd then like a shared storage between my processes so that all existing data records are visible to all the running processes. The shared storage will have to fetch possible candidates for correlation through a query/search based on some attributes (e.g. elevation). It will have to offer configuring warm redundancy, and a possibility to store snapshots every 5 minutes for logging.
Everything seems to be pointing towards MongoDB, but I'd like a confirmation from you that MongoDB will meet my needs. So do you think it is a go?
-Thank you
NB: I am not considering a relational database because we want to focus all coding in our application, instead of having to make 'stored procedures'/'functions' in a separate environment to optimize the performance of our system. Further, the data is diverse and I don't want to try to normalize it into a schema.
Yes, MongoDB will meet your needs. I think the following aspects of your description are particularly relevant in your DB selection decision:
1. An update happens every 3 seconds
MongoDB has a database-level write lock (usually short-lived) that blocks read operations. This means that you will want to ensure that you have enough memory to fit your working set; if you do, you will generally not run into any write-lock issues. Note that bulk inserts will hold the write lock for longer.
If you are sharding, you will want to consider shard keys that allow for write scaling i.e. distribute writes on different shards.
2. Shared storage for multiple processes
This is a pretty common scenario; in fact, many MongoDB deployments are expected to be accessed from multiple processes concurrently. Unlike the write lock, the read lock does not block other reads.
3. Warm redundancy
Supported through MongoDB replication. If you'd like to read from secondary server(s) you will need to set the Read Preference to secondaryPreferred in your driver.
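One common way to set this is through the connection string, using the `replicaSet` and `readPreference` URI options. The Python sketch below only builds such a URI; the host names and the replica set name are hypothetical, and passing the string to your driver is left out.

```python
from urllib.parse import urlencode

def replica_set_uri(hosts, replica_set, read_preference="secondaryPreferred"):
    """Build a MongoDB connection URI that lists several members of one replica set."""
    options = urlencode({"replicaSet": replica_set, "readPreference": read_preference})
    return "mongodb://" + ",".join(hosts) + "/?" + options

# Hypothetical hosts and replica set name.
uri = replica_set_uri(["db1.example.net:27017", "db2.example.net:27017"], "rs0")
print(uri)
```

Keep in mind that reads from secondaries can be slightly stale, which matters if a process must see its own most recent update.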

Cassandra random read speed

We're still evaluating Cassandra for our data store. As a very simple test, I inserted a value for 4 columns into the Keyspace1/Standard1 column family on my local machine amounting to about 100 bytes of data. Then I read it back as fast as I could by row key. I can read it back at 160,000/second. Great.
Then I put in a million similar records all with keys in the form of X.Y where X in (1..10) and Y in (1..100,000) and I queried for a random record. Performance fell to 26,000 queries per second. This is still well above the number of queries we need to support (about 1,500/sec)
Finally I put ten million records in from 1.1 up through 10.1000000 and randomly queried for one of the 10 million records. Performance is abysmal at 60 queries per second and my disk is thrashing around like crazy.
I also verified that if I ask for a subset of the data, say the 1,000 records between 3,000,000 and 3,001,000, it returns slowly at first and then as they cache, it speeds right up to 20,000 queries per second and my disk stops going crazy.
I've read all over that people are storing billions of records in Cassandra and fetching them at 5-6k per second, but I can't get anywhere near that with only 10mil records. Any idea what I'm doing wrong? Is there some setting I need to change from the defaults? I'm on an overclocked Core i7 box with 6gigs of ram so I don't think it's the machine.
Here's my code to fetch records which I'm spawning into 8 threads to ask for one value from one column via row key:
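// Column to read: column family "Standard1", column "site".
ColumnPath cp = new ColumnPath();
cp.Column_family = "Standard1";
cp.Column = utf8Encoding.GetBytes("site");
// Random key X.Y with X in 1..10 and Y in 1..1,000,000 (Next(n) returns 0..n-1).
string key = (1+sRand.Next(10)) + "." + (1+sRand.Next(1000000));
ColumnOrSuperColumn logline = client.get("Keyspace1", key, cp, ConsistencyLevel.ONE);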
Thanks for any insights
Purely random reads are about the worst-case behavior for the caching that your OS (and Cassandra, if you set up a key or row cache) tries to do.
If you look at contrib/py_stress in the Cassandra source distribution, it has a configurable stdev to perform random reads but with some keys hotter than others. This will be more representative of most real-world workloads.
Add more Cassandra nodes and give them lots of memory (-Xms / -Xmx). The more Cassandra instances you have, the more the data will be partitioned across the nodes, and the more likely it is to be in memory or more easily accessed from disk. You'll be very limited when trying to scale a single workstation-class CPU. Also, check the default -Xms/-Xmx setting. I think the default is 1GB.
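To see why the key distribution matters so much, here is a toy Python simulation. It is not py_stress and not Cassandra's cache: it just replays uniform reads and Gaussian ("some keys hotter than others") reads against a small LRU cache and compares hit rates. The key space, cache size, read count and stdev are arbitrary assumptions.

```python
import random
from collections import OrderedDict

def lru_hit_rate(keys, cache_size):
    """Replay a sequence of key reads against an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for k in keys:
        if k in cache:
            hits += 1
            cache.move_to_end(k)  # mark as most recently used
        else:
            cache[k] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(keys)

def uniform_keys(rng, key_space, reads):
    """Purely random reads: every key is equally likely."""
    return [rng.randrange(key_space) for _ in range(reads)]

def skewed_keys(rng, key_space, reads, stdev):
    """Gaussian reads around the middle of the key space: some keys are hot."""
    mean = key_space / 2
    return [min(max(int(rng.gauss(mean, stdev)), 0), key_space - 1) for _ in range(reads)]

rng = random.Random(42)
uniform = lru_hit_rate(uniform_keys(rng, 100_000, 20_000), 2_000)
skewed = lru_hit_rate(skewed_keys(rng, 100_000, 20_000, 500), 2_000)
print("uniform hit rate:", round(uniform, 3))
print("skewed hit rate: ", round(skewed, 3))
```

With uniform reads the cache can only ever help for roughly the fraction of the key space it holds, which is the situation the benchmark in the question creates.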
It looks like you haven't got enough RAM to store all the records in memory.
If you swap to disk then you are in trouble, and performance is expected to drop significantly, especially if you are random reading.
You could also try benchmarking some other popular alternatives, like Redis or VoltDB.
VoltDB can certainly handle this level of read performance as well as writes and operates using a cluster of servers. As an in-memory solution you need to build a large enough cluster to hold all of your data in RAM.