What is the best MongoDB sharding key for my schema? - mongodb

I am designing a MongoDB collection to store daily volume statistics.
Here is my DB schema:
mongos> db.arq.findOne()
{
"_id" : ObjectId("553b78637e6962c36d67c728"),
"ip" : NumberLong(635860665),
"ts" : ISODate("2015-04-25T00:00:00Z"),
"values" : {
"07" : 2,
"12" : 1
},
"daily_ct" : 5
}
mongos>
And here are my indexes:
mongos> db.arq.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "Query_Volume.test"
},
{
"v" : 1,
"key" : {
"ip" : 1
},
"name" : "ip_1",
"ns" : "Query_Volume.test"
},
{
"v" : 1,
"key" : {
"ts" : 1
},
"name" : "ts_1",
"expireAfterSeconds" : 15552000,
"ns" : "Query_Volume.test"
}
]
mongos>
Note: I have an index on the timestamp because I need the TTL mechanism.
Do you have any suggestions for the shard key?

You have multiple options:
{ts: 1} Your timestamp. Documents from a given time range are stored together, but the key is monotonically increasing, and I'm not sure whether the TTL index will clean up the emptied chunks. In practice, the write load moves from shard to shard: at any moment one shard takes all the writes for new data while the other shards receive none. This pattern works nicely if you query contiguous time ranges, but it has downsides for writes.
{ts: "hashed"} Hash-based sharding. The data is distributed more or less evenly across the shards. This spreads the write load, but queries for a range of timestamps have to involve (more or less) all shards.
You will need to test what fits your reads and writes best. The right shard key depends on the data structure and on the read/write patterns of your application.
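To make the trade-off between the two options concrete, here is a minimal simulation in plain JavaScript (run with Node.js, not against a cluster). It is only a sketch: the shard names, the 30-day blocks, and toyHash are assumptions made for illustration, and toyHash with a modulo is a stand-in, not the way MongoDB actually hashes keys or assigns chunks.

```javascript
// Sketch: where do the most recent daily timestamps land under ranged vs.
// hashed sharding? Shard names, block size, and toyHash are illustrative
// assumptions, NOT MongoDB internals.
const crypto = require("crypto");

const SHARDS = ["shardA", "shardB", "shardC"]; // hypothetical shard names
const DAY_MS = 24 * 60 * 60 * 1000;

// Ranged model: each shard owns one contiguous 30-day block of time,
// and everything past the last boundary stays on the last shard.
function rangedShard(ts, start) {
  const block = Math.floor((ts - start) / (30 * DAY_MS));
  return SHARDS[Math.min(block, SHARDS.length - 1)];
}

// Hashed model: hash the timestamp, then map the hash onto a shard.
function toyHash(ts) {
  const digest = crypto.createHash("md5").update(String(ts)).digest();
  return digest.readUInt32BE(0);
}
function hashedShard(ts) {
  return SHARDS[toyHash(ts) % SHARDS.length];
}

// Count how many timestamps the given strategy sends to each shard.
function distribution(timestamps, pick) {
  const counts = Object.fromEntries(SHARDS.map((s) => [s, 0]));
  for (const ts of timestamps) counts[pick(ts)] += 1;
  return counts;
}

const start = Date.UTC(2015, 3, 25); // 2015-04-25, as in the sample document
// The newest 10 days of a 90-day window: this is where new writes go.
const recent = Array.from({ length: 10 }, (_, i) => start + (80 + i) * DAY_MS);

const ranged = distribution(recent, (ts) => rangedShard(ts, start));
const hashed = distribution(recent, hashedShard);
console.log("ranged:", ranged); // every recent day falls on the last shard
console.log("hashed:", hashed); // spread depends on the hash values
```

In the ranged model all ten recent days fall into the last block, so one shard takes every new write. The hashed counts depend on the hash values, which is the point: placement no longer follows time order.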

Related

Entity insertions at Orion increasingly slower

When Orion Context Broker already holds a lot of information (in MongoDB), inserting more information becomes increasingly slow.
E.g.: at the moment I have about 3 GB of information in Orion, and sending another 50 MB takes roughly 15 minutes. If I send the same information when Orion is empty, the process finishes in 1 minute.
admin 0.000GB
config 0.000GB
local 0.000GB
orion 2.932GB
Is this normal? I mean, is insertion expected to get slower and slower?
Extra info: Linux VPS with 2 cores and 8 GB RAM.
Index information:
> use orion
switched to db orion
> show collections
entities
> db.entities.getIndexes()
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "orion.entities"
},
{
"v" : 2,
"key" : {
"location.coords" : "2dsphere"
},
"name" : "location.coords_2dsphere",
"ns" : "orion.entities",
"2dsphereIndexVersion" : 3
},
{
"v" : 2,
"key" : {
"expDate" : 1
},
"name" : "expDate_1",
"ns" : "orion.entities",
"expireAfterSeconds" : 0
}
]
To speed up Orion DB operations, you should create the indexes recommended in the performance tuning section of the Orion documentation. In particular:
{_id.servicePath: 1, _id.id: 1, _id.type: 1} (note that this is a compound index and key order matters in this case)
creDate
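As a rough illustration of why key order matters in a compound index, the sketch below models the index-prefix rule in plain JavaScript: an index can serve equality filters only on a leading run of its keys. The helper usablePrefixLength is hypothetical and deliberately simplified; it is not how the MongoDB query planner is implemented.

```javascript
// Simplified model of the compound-index prefix rule (an assumption-level
// sketch, not the real query planner): count how many leading index keys
// are covered by the fields a query filters on.
function usablePrefixLength(indexKeys, queryFields) {
  const wanted = new Set(queryFields);
  let n = 0;
  for (const key of indexKeys) {
    if (!wanted.has(key)) break; // the first missing key ends the prefix
    n += 1;
  }
  return n;
}

// Key order of the recommended compound index.
const recommended = ["_id.servicePath", "_id.id", "_id.type"];

// Filtering on servicePath and id matches the first two index keys.
console.log(usablePrefixLength(recommended, ["_id.servicePath", "_id.id"])); // 2

// Filtering on id and type skips the leading key, so no prefix matches.
console.log(usablePrefixLength(recommended, ["_id.id", "_id.type"])); // 0
```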
It's fast now. I used this (as you said): db.entities.createIndex({"_id.id": 1, "_id.type": 1, "_id.servicePath": 1})
Thank you!

Get shard id for given value of sharding key in mongodb

I have two replica sets in a sharded cluster, with documents sharded by the userId field.
Is there any way to query which shard (replica set) contains a given document (by _id or by the shard key field) without reimplementing the shard key hashing on the client side?
You can use query explain() to identify the shard for a document, by querying based on the shard key.
The winning plan should have a SINGLE_SHARD stage with an equality query similar to the following (with some extra output trimmed for clarity):
> db.users.find({userId:123}).explain().queryPlanner.winningPlan
{
"stage" : "SINGLE_SHARD",
"shards" : [
{
"shardName" : "shard01",
"plannerVersion" : 1,
"namespace" : "test.users",
"indexFilterSet" : false,
"parsedQuery" : {
"userId" : {
"$eq" : 123
}
},
}
]
}
If you only want the shard name, you can use JavaScript notation to reference the full path:
> db.users.find({userId:123}).explain().queryPlanner.winningPlan.shards[0].shardName
shard01
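If the lookup is needed in application code, the same path can be read defensively. The sketch below is plain JavaScript operating on an object shaped like the trimmed explain output above; shardNamesFromExplain is a hypothetical helper, and the assumption is that the plan exposes a shards array as shown (other stages or server versions may lay it out differently).

```javascript
// Collect shard names from an explain()-style result object.
// Assumes the shape shown in the trimmed output above; returns an empty
// array when the expected path is missing instead of throwing.
function shardNamesFromExplain(explainResult) {
  const plan = explainResult?.queryPlanner?.winningPlan;
  const shards = Array.isArray(plan?.shards) ? plan.shards : [];
  return shards.map((s) => s.shardName);
}

// Sample object mirroring the output above (not fetched from a server).
const sample = {
  queryPlanner: {
    winningPlan: {
      stage: "SINGLE_SHARD",
      shards: [{ shardName: "shard01", namespace: "test.users" }],
    },
  },
};

console.log(shardNamesFromExplain(sample)); // [ 'shard01' ]
console.log(shardNamesFromExplain({}));     // []
```

Returning an array rather than shards[0] keeps the helper usable if a query that does not pin down the shard key reports more than one shard.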

mongodb single node performance

I use MongoDB for an internal ADMIN type of application used by my team.
MongoDB is installed on a single box, with no replica set.
The ADMIN application inserts 70K to 100K documents per day, and we keep 4 months of data. The DB holds ~100 million documents at any given time.
When the application was deployed, everything ran fine for a few days. As the data accumulated toward the 4-month limit, I started to see severe performance issues with MongoDB.
I installed MongoDB 3.0.4 as-is on a Linux box and did not fine tune any optimization settings.
Are there any optimization settings I need to adjust?
The ADMIN application has schedulers that run every half hour to insert new data and purge outdated data. The collection below has indexes on createdDate, env, messageId, and sourceSystem, yet I have seen a few queries take 30 minutes to respond.
Sample query: count the documents with a given env and sourceSystem whose createdDate falls within a given date range. The ADMIN app uses Grails, and the query above is generated by GORM. It worked fine in the beginning, but performance degraded over time. I tried restarting the application as well; it didn't help. I suspect that running MongoDB as-is (like a dev mode) might be causing the performance issue. Any suggestions on what to tweak in the settings (perhaps CPU/memory limits)?
{
"_id" : ObjectId("5575e388e4b001976b5e570f"),
"createdDate" : ISODate("2015-06-07T05:00:34.040Z"),
"env" : "prod",
"messageId" : "f684b34d-a480-42a0-a7b8-69d6d18f39e5",
"payload" : "JSON or XML DATA",
"sourceSystem" : "sourceModule"
}
Update:
Indices:
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "admin.Message"
},
{
"v" : 1,
"key" : {
"messageId" : 1
},
"name" : "messageId_1",
"ns" : "admin.Message"
},
{
"v" : 1,
"key" : {
"createdDate" : 1
},
"name" : "createdDate_1",
"ns" : "admin.Message"
},
{
"v" : 1,
"key" : {
"sourceSystem" : 1
},
"name" : "sourceSystem_1",
"ns" : "admin.Message"
},
{
"v" : 1,
"key" : {
"env" : 1
},
"name" : "env_1",
"ns" : "admin.Message"
}
]

Sharding on MongoDB what about the other collections?

I understand that hashed sharding is set up at the collection level, based on the shard key that is passed for that collection.
This ensures that records for that collection are distributed across all the shards.
I understand what happens with one collection. What about the other collections?
Does all the data of the other collections get stored on one shard only?
Does it get replicated across all the shards?
Does it also get split and spread across all the shards?
The other collections will reside on a single shard (known as the primary shard) unless you decide to shard them also. The primary shard is set at the database level rather than collection, so all non-sharded collections in a particular database will all have the same primary shard. You can see the primary for any given database in the sh.status() output, as per the example below:
mongos> sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"version" : 4,
"minCompatibleVersion" : 4,
"currentVersion" : 5,
"clusterId" : ObjectId("54185b2c2a2835b6e47f7984")
}
shards:
{ "_id" : "shard0000", "host" : "localhost:30000" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "shardTest", "partitioned" : true, "primary" : "shard0000" }
shardTest.foo
shard key: { "_id" : 1 }
chunks:
shard0000 1
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)
{ "_id" : "bar", "partitioned" : true, "primary" : "shard0000" }
bar.data
shard key: { "_id" : 1 }
chunks:
shard0000 1
{ "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)
{ "_id" : "foo", "partitioned" : true, "primary" : "shard0000" }
foo.data
shard key: { "_id" : 1 }
chunks:
shard0000 9
In this example there is only one shard (shard0000), and hence it is the primary for all the databases ("primary" : "shard0000") except admin, whose primary is config. That is a special case: it resides on the config servers. The primary shard for a database is chosen when the database is created.
Hence, if you only had one shard, created all your databases first, and added more shards later, all the databases created before the new shards were added will have their primary set to that first shard (there was nothing else to choose). Any database created after you have multiple shards could end up with any shard as its primary. Essentially it is selected using round robin, but each mongos will have its own idea about where it is in that round robin selection.
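To see at a glance which databases share a primary shard, the database entries can be grouped by their "primary" field. The sketch below is plain JavaScript over objects copied from the sh.status() output above; databasesByPrimary is a hypothetical helper, and feeding it documents read from a live cluster is left as an assumption.

```javascript
// Group database entries by their primary shard.
// Input objects mirror the "databases" lines of the sh.status() output above.
function databasesByPrimary(databases) {
  const grouped = {};
  for (const db of databases) {
    if (!grouped[db.primary]) grouped[db.primary] = [];
    grouped[db.primary].push(db._id);
  }
  return grouped;
}

// Entries copied from the example output (not fetched from a server).
const databases = [
  { _id: "admin", partitioned: false, primary: "config" },
  { _id: "shardTest", partitioned: true, primary: "shard0000" },
  { _id: "bar", partitioned: true, primary: "shard0000" },
  { _id: "foo", partitioned: true, primary: "shard0000" },
];

const byPrimary = databasesByPrimary(databases);
console.log(byPrimary);
// Unsharded collections in shardTest, bar, and foo all live on shard0000.
```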

MongoDB Sharding Policy

I am trying to understand the following behavior displayed by my sharding setup. The data seems to only increase on a single shard as I continuously add data. How does MongoDB shard or distribute data across different servers? Am I doing this correctly? MongoDB version 2.4.1 used here on OS X 10.5.
As requested, sh.status() as follows:
mongos> sh.status()
sharding version: {
"_id" : 1,
"version" : 3,
"minCompatibleVersion" : 3,
"currentVersion" : 4,
"clusterId" : ObjectId("52787cc2c10fcbb58607b07f") }
shards:
{ "_id" : "shard0000", "host" : "xx.xx.xx.xxx:xxxxx" }
{ "_id" : "shard0001", "host" : "xx.xx.xx.xxx:xxxxx" }
{ "_id" : "shard0002", "host" : "xx.xx.xx.xxx:xxxxx" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "newdb", "partitioned" : true, "primary" : "shard0001" }
newdb.prov
shard key: { "_id" : 1, "jobID" : 1, "user" : 1 }
chunks:
shard0000 43
shard0001 50
shard0002 43
It looks like you have chosen a very poor shard key. You partitioned on { "_id" : 1, "jobID" : 1, "user" : 1 }, which will not distribute inserts well: you are using ObjectId() values for _id, and those increase monotonically.
You want a shard key that reflects how you access the data. It doesn't make sense to have two more fields after _id: since _id is unique, the other two fields will never be used to partition the data.
Did you perhaps intend to shard on jobID, user? It's hard to know what the best shard key would be in your case, but it's clear that all the inserts are going into the highest chunk (top value through maxKey), since every new _id is higher than the previous one.
Eventually the chunks should be balanced to other shards, but only if the balancer is running, all your config servers are up, and the secondaries are caught up. It is best to pick a better shard key so that inserts are distributed evenly across the cluster from the start.
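The point about the trailing fields can be checked with a small comparison in plain JavaScript. This is a sketch under stated assumptions: small integers stand in for ObjectId values (only their increasing order matters here), the jobID and user values are invented, and compareKeys and decidingField are hypothetical helpers rather than anything MongoDB exposes.

```javascript
// Compound key values are ordered field by field, in key order.
const FIELDS = ["_id", "jobID", "user"];

function compareKeys(a, b) {
  for (const f of FIELDS) {
    if (a[f] < b[f]) return -1;
    if (a[f] > b[f]) return 1;
  }
  return 0;
}

// Name of the first field on which two key values differ (null if equal).
function decidingField(a, b) {
  for (const f of FIELDS) {
    if (a[f] !== b[f]) return f;
  }
  return null;
}

// Integers stand in for ObjectIds; jobID and user are made-up values.
const docs = [
  { _id: 3, jobID: 500, user: "bob" },
  { _id: 1, jobID: 900, user: "zed" },
  { _id: 2, jobID: 100, user: "amy" },
];

const sorted = [...docs].sort(compareKeys);
console.log(sorted.map((d) => d._id)); // [ 1, 2, 3 ]

// Because _id is unique, every pair is decided by _id alone.
const deciders = new Set();
for (let i = 0; i < docs.length; i++) {
  for (let j = i + 1; j < docs.length; j++) {
    deciders.add(decidingField(docs[i], docs[j]));
  }
}
console.log([...deciders]); // [ '_id' ]
```

The ordering never consults jobID or user, so the newest document always sorts last and belongs to the highest range, whatever the other two fields contain.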