I have a Spark dataframe with around 43 million records, which I'm trying to write to a MongoDB collection.
When I write it to an unsharded collection, the output record count matches what I'm inserting. But when I write the same data to a sharded collection (hashed), the number of records increases by about 3 million.
What's interesting is that the number of records keeps fluctuating even after my Spark job has completed (with no other connections to it).
When I did the same with a range-sharded collection, the number of records was consistent.
(Edit: even with the range-sharded collection, it started fluctuating after a while.)
Can someone help me understand why this is happening? Also, I'm sharding my collection because I have to write about 300 billion records every day and want to increase my write throughput, so any other suggestions would be appreciated.
I have 3 shards, each replicated across 3 instances.
I'm not using any other options in the Spark MongoDB connector, only ordered=False.
Edit:
The record count seemed to stabilize after a few hours at the correct number; still, it would be great if someone could help me understand why MongoDB exhibits this behaviour.
The confusion comes from the difference between the collection metadata count and the logical documents while balancing is in progress.
The bottom line is you should use db.collection.countDocuments() if you need an accurate count.
Deeper explanation:
When MongoDB shards a collection it assigns a range of the documents (a chunk) to each shard. As you insert documents these ranges will usually grow unevenly, so the balancer process splits ranges into smaller ones when necessary to keep their data sizes about the same.
It also moves these chunks between shards so that each shard has about the same number of chunks.
The process of moving a chunk from one shard to another involves copying all of the documents in that range, verifying they have all been written to the new shard, and then deleting them from the old shard. This means that the documents being moved exist on both shards for a while.
When you submit a query via mongos, each shard performs a filter stage to exclude documents in chunks that have not been fully moved to that shard, or that have not yet been deleted after a chunk was fully moved out.
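You can see this filter in a query plan. As a sketch (the collection name here is hypothetical), run explain() against a sharded collection and look for a SHARDING_FILTER stage in the per-shard winning plans; that stage is the orphan-document filter described above:

mongos> db.orders.find({ status: "A" }).explain()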
To count documents with the benefit of this filter, use db.collection.countDocuments()
Each mongod maintains metadata for each collection it holds, which includes a count of documents. This count is incremented for each insert and decremented for each delete. The metadata count can't exclude orphan documents from incomplete migrations.
The document count returned by db.collection.stats() is based on this metadata. This means that if the balancer is migrating any chunks, the copied-but-not-yet-deleted documents will be reported by both shards, so the overall count will be higher.
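For example, on a hypothetical sharded collection the two counts can disagree while a migration is in flight:

// Metadata-based count: sums each shard's collection statistics, so orphaned
// documents from an in-flight (or aborted) migration are counted on both shards.
db.orders.stats().count

// Filtered count: excludes orphaned documents, so it reflects the number of
// logical documents in the collection.
db.orders.countDocuments({})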
Related
I want to shard my MongoDB database. I have a high insert rate and want to distribute my documents evenly across two shards.
I have considered range-based sharding, because I have range queries, but I cannot find a solution for picking a good shard key.
{
  Timestamp: ISODate("2016-10-02T00:01:00.000Z"),
  Machine_ID: "100",
  Temperature: "50"
}
If this is my document and I have 100,000 different machines, would Machine_ID be a suitable shard key? And if so, how will MongoDB distribute it across the shards, i.e. do I have to specify the shard ranges myself, like putting Machine_ID 0-49,999 on shard A and 50,000-100,000 on shard B?
I think Machine_ID would be a suitable shard key if your queries afterwards will be per machine, i.e. get all the temperatures for a specific machine over a certain time range. You can read more about shard keys here: Choosing shard key
MongoDB has two kinds of sharding, hashed sharding and range sharding, which you can read more about here: Sharding strategies. That said, you don't need to specify the shard ranges yourself; MongoDB takes care of it. In particular, when the time comes to add a new shard, MongoDB will rebalance the chunks onto the new shard.
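As a rough sketch (the database and collection names here are made up), ranged sharding on Machine_ID would be set up like this, with MongoDB managing the chunk ranges itself:

// Enable sharding for the database, index the shard key, and shard the
// collection on Machine_ID. The chunk ranges are created and split by MongoDB.
sh.enableSharding("factory")
db.getSiblingDB("factory").readings.createIndex({ Machine_ID: 1 })
sh.shardCollection("factory.readings", { Machine_ID: 1 })

// Later, inspect how the chunks are distributed across the shards.
sh.status()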
If your cluster has only two shards, then it isn't difficult to design for. However, if your data will continue to grow and you end up having a lot more shards, then the choice of shard key is more difficult.
For example, if some machines have many more records than others (e.g. one machine has 3,000 records, i.e. 3% of the total), that doesn't cause problems with only two shards. But if your data grows so that you need 100 shards, and one machine still has 3% of the total, then Machine_ID is no longer a good choice: all of one machine's records must live in a single chunk and cannot be distributed across several shards.
In that case, a better strategy might be to use a hash of the Timestamp - but it depends on the overall shape of your dataset.
I read the MongoDB sharding guide, but I am not sure what kind of shard key suits my application. Any suggestions are welcome.
My application is a huge database of network events. Each document has a time field and a couple of network-related values like IP address and port number. I have an insert rate of 100-1000 items per second. In my experience with a single mongod, one shard has no problem with this insert rate.
But I make extensive use of the aggregation framework on huge amounts of data. All the aggregations have a time limit, i.e. mostly the recent month or the recent week. I tested the aggregations on a single mongod: a query with a response time of 5 minutes while insertion is off can take as long as two hours when 200 inserts per second are running.
Can I improve mongodb aggregation query response time by sharding?
If yes, I think I have to use time as the shard key, because in my application every query runs over a time range (e.g. top IP addresses in the recent month), and if we can separate the shard where inserts take place from the shard the query is working on, MongoDB could work much faster.
But the documentation says
"If the shard key is a linearly increasing field, such as time, then all requests for a given time range will map to the same chunk, and thus the same shard. In this situation, a small set of shards may receive the majority of requests and the system would not scale very well."
So what shall I do?
Can I improve mongodb aggregation query response time by sharding?
Yes.
If you shard your database across different machines, this provides parallel computing power for aggregations. The important thing here is the distribution of the data across your shards: it should be uniformly distributed.
If you choose a monotonically increasing (or decreasing) field of the document, such as time, as your shard key and use hashed sharding, this will provide uniform distribution over the cluster.
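A minimal sketch of that, assuming a database named netdb and a collection events with a time field (both names are placeholders):

// Hashed sharding on the time field spreads inserts evenly across shards.
// Note that time-range queries then have to be broadcast to all shards.
sh.enableSharding("netdb")
db.getSiblingDB("netdb").events.createIndex({ time: "hashed" })
sh.shardCollection("netdb.events", { time: "hashed" })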
I am trying to confirm whether the sharding and chunking behaviour is correct in my MongoDB instance.
We have 2 shards, each with a replica set, and have:
(a) Enabled sharding for my database using the sh.enableSharding() command
(b) Added a hashed index to a new collection via the db.X.ensureIndex() command
(c) Sharded the collection via the sh.shardCollection() command.
When I run sh.status(), I notice that only one of my two shards contains chunks, implying that my data is not distributed. I have added a couple of documents to test processing, but I still only see 1 chunk. Is this the correct behaviour? Intuitively, I would expect 1..n chunks in each shard.
Thanks in advance,
Steve Westwood
Mongo will only start to split data across chunks when the first chunk has reached a certain size. When there are fewer than 10 chunks it will split when a chunk grows above about 16 MB. When there are more chunks it splits them at 64 MB.
So I expect you don't have enough data to trigger a chunk split.
You can override these chunk size values with the chunkSize option on the mongos, which can be useful for testing.
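For example, the cluster-wide chunk size can also be lowered for testing by writing to the config database through a mongos (the value is in MB; 1 MB is shown purely to trigger splits with very little data):

// Reduce the chunk size so splits happen early. Testing only; the default is 64 MB.
db.getSiblingDB("config").settings.updateOne(
    { _id: "chunksize" },
    { $set: { value: 1 } },
    { upsert: true }
)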
I have a little cluster which consists of several shards, and every shard is a replica set of 2 data-bearing nodes and 1 arbiter. Sharding is enabled on a collection, let's say generator_v1_food.
I've stopped all the programs updating the collection (these programs only perform upsert and find operations, no removes at all). Then the collection count returns results like this (queried at 2-3 second intervals). I've also turned off the balancer. The last lines of the log (on the shard I operated on) were all about the replica set.
mongos> db.generator_v1_food.find().count()
28279890
mongos> db.generator_v1_food.find().count()
28278067
mongos> db.generator_v1_food.find().count()
28278008
...
What is happening behind the scene? Any pointers would be great.
quote:
Just because you set balancer state to "off" does not mean it's not still running, and finishing cleaning up from the last moveChunk that was performed.
You should be able to see in the config DB's changelog collection when the last moveChunk.commit event was - that's when the moveChunk process committed the documents from some chunk being moved to the new (target) shard. But after that, the old shard asynchronously needs to delete the documents that no longer belong to it. Since the "count" is taken from metadata and does not actually query how many documents there are "for real", it will double-count documents "in flight" during balancing rounds (or any that were not properly cleaned up or are left from aborted balance attempts).
Asya
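A quick way to check this is to look for the most recent moveChunk.commit entries in the config database's changelog (run through a mongos):

// Show the five most recent chunk migration commits recorded by the cluster.
db.getSiblingDB("config").changelog
    .find({ what: "moveChunk.commit" })
    .sort({ time: -1 })
    .limit(5)
    .pretty()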
I have a collection where the sharding key is a UUID (hexadecimal string). The collection is huge: 812 million documents, about 9600 chunks on 2 shards. For some reason I initially stored documents which had an integer instead of a UUID in the shard key field. Later I deleted them completely, and now all of my documents are sharded by UUID. But I am now facing a problem with chunk distribution. While I had documents with an integer instead of a UUID, the balancer created about 2700 chunks for these documents and left all of them on one shard. When I deleted all these documents, the chunks were not deleted; they stayed empty and will always be empty because I only use UUIDs now. Since the balancer distributes chunks relying on chunk count per shard, not document count or size, one of my shards takes 3 times more disk space than the other:
--- Sharding Status ---
db.click chunks:
set1 4863
set2 4784 // 2717 of them are empty
set1> db.click.count()
191488373
set2> db.click.count()
621237120
The sad thing here is that MongoDB does not provide commands to remove or merge chunks manually.
My main question is: would any of the following work to get rid of the empty chunks?
Stop the balancer. Connect to each config server and remove from config.chunks the ranges of the empty chunks, also fixing the minKey slice to end at the beginning of the first non-empty chunk. Start the balancer.
It seems risky, but as far as I can see, config.chunks is the only place where chunk information is stored.
Stop the balancer. Start a new mongod instance and connect it as a 3rd shard. Manually move all empty chunks to this new shard, then shut it down forever. Start the balancer.
I'm not sure, but as long as I don't use integer values in the shard key again, all queries should run fine.
Some might read this and think that the empty chunks are occupying space. That's not the case - chunks themselves take up no space - they are logical ranges of shard keys.
However, chunk balancing across shards is based on the number of chunks, not the size of each chunk.
You might want to add your voice to this ticket: https://jira.mongodb.org/browse/SERVER-2487
Since the MongoDB balancer only balances the number of chunks across shards, having too many empty chunks in a collection can cause the shards to be balanced by chunk count but severely unbalanced by data size per shard (e.g., as shown by db.myCollection.getShardDistribution()).
You need to identify the empty chunks and merge them into chunks that have data. This will eliminate the empty chunks. This is all now documented in the MongoDB docs (at least 3.2 and above, maybe even earlier).
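As a rough sketch of that procedure (the namespace, database name and shard key pattern below are placeholders; stop the balancer first and adjust everything to your own collection), you can walk config.chunks, use the dataSize command to find empty chunks, and merge each one into its left-hand neighbour with mergeChunks:

// Placeholder namespace, database and shard key pattern -- adjust to your data.
var ns = "mydb.click";
var dbName = "mydb";
var keyPattern = { u: 1 };

// All chunks for the collection, in shard key order.
var chunks = db.getSiblingDB("config").chunks.find({ ns: ns }).sort({ min: 1 }).toArray();

// Tracks the lower bound of the previous (possibly already merged) chunk.
var prevMin = chunks[0].min;

for (var i = 1; i < chunks.length; i++) {
    var c = chunks[i];

    // dataSize reports how many documents actually fall inside the chunk's range.
    var size = db.getSiblingDB(dbName).runCommand({
        dataSize: ns,
        keyPattern: keyPattern,
        min: c.min,
        max: c.max
    });

    if (size.numObjects === 0) {
        // Merge the empty chunk into the chunk immediately before it.
        // mergeChunks requires the chunks to be contiguous and on the same
        // shard, so some merges may fail until the chunks are moved together.
        var res = db.adminCommand({ mergeChunks: ns, bounds: [prevMin, c.max] });
        printjson(res);
        if (!res.ok) {
            prevMin = c.min;   // merge failed, this chunk keeps its own lower bound
        }
    } else {
        prevMin = c.min;
    }
}

This only merges each empty chunk to its left; an empty first chunk would have to be merged with the chunk that follows it instead.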