I'm trying to understand how sharding will behave in a new MongoDB (v 4.4).
Let's say my data is composed of several groups, each of which can contain devices (there are many more devices than groups).
All these devices will store some time-dependent values over a period of time (3-9 months).
My initial idea is to create a device document for each 24 hours of data (in order to avoid hitting the 16MB/document limit).
This way a device will look something like:
{
_id: ObjectId(...),
grp_id: ObjectId(...), // OR DBRef ?
time_bucket: ISODate(...),
data: {...} // data for the last 24 hours specified by time_bucket
}
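A minimal sketch of how the time_bucket value for a reading could be derived, assuming buckets start at midnight UTC (the post does not say which time zone or boundary is intended). The helper name timeBucket is hypothetical.

```javascript
// Hypothetical helper (not part of any MongoDB API): truncate a reading's
// timestamp to the start of its UTC day so that all readings from the same
// 24 hours map to the same time_bucket value.
function timeBucket(date) {
  return new Date(
    Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate())
  );
}

// Two readings on the same UTC day share a bucket; the next day starts a new one.
const morning = timeBucket(new Date("2020-10-14T06:15:00Z"));
const evening = timeBucket(new Date("2020-10-14T22:40:00Z"));
const nextDay = timeBucket(new Date("2020-10-15T00:00:01Z"));
console.log(morning.toISOString(), evening.toISOString(), nextDay.toISOString());
```

The resulting Date would be stored in time_bucket, and the same function would be used when looking up the document a new reading belongs to.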
Now on to the sharding keys.
To get something non-monotonic, I'm thinking of using {grp_id: "hashed"}: group IDs are user-assigned, so they are not monotonic. Also, devices from the same group will be on the same shard (within chunk limits).
But groups are fairly low cardinality and high frequency, so the sharding key should become: {grp_id: "hashed", _id: 1}.
But this can cause trouble if I insert too much data for a device over time, so refine the key again: {grp_id: "hashed", _id: 1, time_bucket: 1}. This also has the nice property of localizing data: a query for a certain time period will target only the relevant shard.
Now to the question(s):
Will this approach work (in theory)?
If I already included grp_id: "hashed" will having _id: 1 in the key have any negative impact since it is a monotonic key?
Will the data be split more evenly between shards if I hash the _id instead of the grp_id (since grp_id is very high frequency and low cardinality compared to devices)? The key will be {_id: "hashed", grp_id: 1, time_bucket: 1}. In this case, will grp_id help with data locality, so devices from the same group will be placed on the same shard?
More generally: I think I'm a bit confused about how data is distributed when a hashed key is combined with range-based keys, given that for range-based keys you can also define zones (which have to be managed manually).
I just tested the two sharding methods, {grp_id: "hashed", _id: 1, time_bucket: 1} (initial) vs {_id: "hashed", grp_id: 1, time_bucket: 1} (question 3):
The first approach yields imbalanced shards:
db.adminCommand({listDatabases:1, filter: {name: "test1"}})
{
"databases" : [
{
"name" : "test1",
"sizeOnDisk" : 22700032,
"empty" : false,
"shards" : {
"sharded-0" : 6377472,
"sharded-1" : 1425408,
"sharded-2" : 4210688,
"sharded-3" : 10686464
}
}
],
"totalSize" : 22700032,
"totalSizeMb" : 21,
"ok" : 1,
"operationTime" : Timestamp(1602680982, 2),
"$clusterTime" : {
"clusterTime" : Timestamp(1602680982, 2),
"signature" : {
"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
"keyId" : NumberLong(0)
}
}
}
While the second one yields balanced shards:
db.adminCommand({listDatabases:1, filter: {name: "test2"}})
{
"databases" : [
{
"name" : "test2",
"sizeOnDisk" : 21512192,
"empty" : false,
"shards" : {
"sharded-0" : 5152768,
"sharded-1" : 5185536,
"sharded-2" : 5218304,
"sharded-3" : 5955584
}
}
],
"totalSize" : 21512192,
"totalSizeMb" : 20,
"ok" : 1,
"operationTime" : Timestamp(1602682074, 1),
"$clusterTime" : {
"clusterTime" : Timestamp(1602682074, 1),
"signature" : {
"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
"keyId" : NumberLong(0)
}
}
}
Both DBs contain 6 groups, each group has 1000 devices, and each device has 30 instances (data from 30 days).
I wonder what's the reason behind that, but I might ask a separate question.
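One plausible reason is cardinality: with only 6 distinct grp_id values there are only 6 distinct hashes to spread over 4 shards, so at least one shard must hold two whole groups. The sketch below illustrates this with a stand-in hash (MD5 via Node's crypto module, which is not MongoDB's hashed-index function) and four equal hash ranges; the id strings are made up, and real chunk boundaries will differ.

```javascript
const crypto = require("crypto");

// Stand-in hash: first 4 bytes of MD5 as an unsigned 32-bit integer.
// Assumption: this only imitates "some uniform hash", not MongoDB's own.
function hash32(value) {
  return crypto.createHash("md5").update(String(value)).digest().readUInt32BE(0);
}

// Map a key to one of `shards` equal-width ranges of the 32-bit hash space.
function shardFor(value, shards) {
  return Math.floor(hash32(value) / (2 ** 32 / shards));
}

// Count how many keys land on each shard.
function distribution(keys, shards) {
  const counts = new Array(shards).fill(0);
  for (const key of keys) counts[shardFor(key, shards)] += 1;
  return counts;
}

const SHARDS = 4;
const GROUPS = 6;
const DEVICES_PER_GROUP = 1000;
const byGroup = [];
const byDevice = [];
for (let g = 0; g < GROUPS; g++) {
  for (let d = 0; d < DEVICES_PER_GROUP; d++) {
    byGroup.push(`grp-${g}`);            // hashed field is the group id
    byDevice.push(`grp-${g}/dev-${d}`);  // hashed field is a per-device id
  }
}

console.log("hashed on group :", distribution(byGroup, SHARDS));
console.log("hashed on device:", distribution(byDevice, SHARDS));
```

Hashing the group id can never do better than 2 + 2 + 1 + 1 groups per shard here, whereas hashing a per-device value has 6000 distinct hashes to spread.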
In my use case, I want to search a document by a given unique string in MongoDB. However, I want my queries to be fast and searching by _id will add some overhead. I want to know if there are any benefits in MongoDB to search a document by _id over any other unique value?
To my knowledge, an ObjectId is similar to any other unique value in a document [point made for the case of searching only].
As for the overhead, you can assume I am caching the string-to-ObjectId mapping, and the cache is very small and in memory [almost negligible], though the DB is large.
Analyzing your query performance
I advise you to use the .explain() method provided by MongoDB to analyze your query performance.
Let's say we are trying to execute this query
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } )
This would be the result of the query execution
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 3, "item" : "p1", "type" : "paper", "quantity" : 200 }
{ "_id" : 4, "item" : "p2", "type" : "paper", "quantity" : 150 }
If we call .explain() this way
db.inventory.find(
{ quantity: { $gte: 100, $lte: 200 } }
).explain("executionStats")
It will return the following result:
{
"queryPlanner" : {
"plannerVersion" : 1,
...
"winningPlan" : {
"stage" : "COLLSCAN",
...
}
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 3,
"executionTimeMillis" : 0,
"totalKeysExamined" : 0,
"totalDocsExamined" : 10,
"executionStages" : {
"stage" : "COLLSCAN",
...
},
...
},
...
}
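To read such a result quickly, it can help to pull out just the winning stage and the ratio of documents examined to documents returned. The sketch below does that on a plain object shaped like the output above; the function name summarizeExplain is hypothetical and the sample object only copies the fields shown.

```javascript
// Hypothetical helper: condense an explain("executionStats") result into
// the few numbers that show how much work the query did.
function summarizeExplain(explain) {
  const stats = explain.executionStats;
  return {
    stage: explain.queryPlanner.winningPlan.stage,
    returned: stats.nReturned,
    keysExamined: stats.totalKeysExamined,
    docsExamined: stats.totalDocsExamined,
    // Close to 1 is ideal; a large value means many documents were read
    // only to be thrown away.
    docsPerResult:
      stats.nReturned > 0 ? stats.totalDocsExamined / stats.nReturned : null,
  };
}

// Same numbers as the explain output above.
const sampleExplain = {
  queryPlanner: { winningPlan: { stage: "COLLSCAN" } },
  executionStats: { nReturned: 3, totalKeysExamined: 0, totalDocsExamined: 10 },
};

console.log(summarizeExplain(sampleExplain));
```

Here the query read 10 documents to return 3 and examined no index keys, which is what a COLLSCAN looks like.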
More details about this can be found here
How efficient is search by _id and indexes
To answer your question: searching through an index is far more efficient than scanning the whole collection. Indexes are special data structures that store a small portion of the collection's data set in an easy-to-traverse form. MongoDB creates the index on _id by default, which is why lookups by _id are efficient.
Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement.
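The difference can be illustrated without a database. In the rough sketch below an array scan stands in for the collection scan and a Map stands in for an index on a unique field (MongoDB's indexes are B-trees, so the Map is only an analogy for "jump straight to the document"). The field name code and the data are made up.

```javascript
// 1000 made-up documents, each with a unique string field `code`.
const inventory = [];
for (let i = 0; i < 1000; i++) inventory.push({ _id: i, code: `code-${i}` });

// Collection-scan analogue: every document is examined.
function findByScan(code) {
  let examined = 0;
  let found = null;
  for (const doc of inventory) {
    examined++;
    if (doc.code === code) found = doc;
  }
  return { found, examined };
}

// Index analogue: a lookup structure keyed by the unique value.
const codeIndex = new Map(inventory.map((doc) => [doc.code, doc]));
function findByIndex(code) {
  const found = codeIndex.get(code) || null;
  return { found, examined: found ? 1 : 0 };
}

console.log("scan :", findByScan("code-742").examined, "documents examined");
console.log("index:", findByIndex("code-742").examined, "document examined");
```

The same reasoning applies to any unique field that has its own index, not only to _id.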
So, YES, using indexes like _id is better!
You can also create your own indexes by using createIndex()
db.collection.createIndex( <key and index type specification>, <options> )
Optimize your MongoDB query
In case you want to optimize your query, there are multiple ways to do that.
Creating custom indexes to support your queries
Limit the Number of Query Results to Reduce Network Demand
db.posts.find().sort( { timestamp : -1 } ).limit(10)
Use Projections to Return Only Necessary Data
db.posts.find( {}, { timestamp : 1 , title : 1 , author : 1 , abstract : 1} ).sort( { timestamp : -1 } )
Use $hint to Select a Particular Index
db.users.find().hint( { age: 1 } )
Short answer, yes _id is the primary key and it's indexed. Of course it's fast.
But you can use an index on the other fields too and get more efficient queries.
One of my aggregation pipelines is running rather slow.
About the collection
The collection is named Document. Each document can belong to multiple campaigns and be in one of five statuses, 'a' to 'e'. A small portion of documents may belong to no campaigns; their campaigns field is set to null.
Sample document:
{_id:id, campaigns:['c1', 'c2'], status:'a', ...other fields...}
Some collection stats
Number of documents: 2 million only :(
Size: 2GB
Average document size: 980 bytes.
Storage Size: 780MB
Total index size: 134MB
Number of indexes: 12
Number of fields in document: 30-40, may have array or objects.
About the Query
The query is targeting to count the number of documents per campaign per status if its status is in ['a', 'b', 'c']
[
{$match:{campaigns:{$ne:null}, status:{$in:['a','b','c']}}},
{$unwind:'$campaigns'},
{$group:{_id:{campaign:'$campaigns', status:'$status'}, total:{$sum:1}}}
]
It's expected that the aggregation is going to hit almost the whole collection.
Without an index, the aggregation takes around 8 seconds to complete.
I tried to create an index on
{campaigns:1, status:1}
The explain plan shows that the index was scanned, but the aggregation took nearly 11 seconds to complete.
Question
The index contains all the fields the aggregation needs to do the counting. Shouldn't the aggregation hit only the index? The index is only 10MB in size. How could it be slower? If an index is not the answer, is there any other recommendation for tuning the query?
Winning plan shows:
{
"stage" : "FETCH",
"filter" : {"$not" : {"campaigns" : {"$eq" : null}}},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {"campaigns" : 1.0,"status" : 1.0},
"indexName" : "campaigns_1_status_1",
"isMultiKey" : true,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"campaigns" : ["[MinKey, null)", "(null, MaxKey]"],
"status" : [ "[\"a\", \"a\"]", "[\"b\", \"b\"]", "[\"c\", \"c\"]"]
}
}
}
If no index, winning plan:
{
"stage" : "COLLSCAN",
"filter" : {
"$and":[
{"status": {"$in": ["a", "b", "c"]}},
{"$not" : {"campaigns": {"$eq" : null}}}
]
},
"direction" : "forward"
}
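One thing the indexed plan shows is a FETCH stage on top of a multikey IXSCAN. FETCH means the documents themselves were loaded, and a multikey index (one built over an array field such as campaigns) cannot cover a query on that field, so the index alone could not answer it. The sketch below walks a plan object shaped like the one above and reports both facts; inspectPlan is a hypothetical helper and it only follows inputStage, not the inputStages arrays some plans contain.

```javascript
// Hypothetical helper: collect the stage names of a winning plan and note
// whether any index scan in it is multikey.
function inspectPlan(plan, out = { stages: [], multikey: false }) {
  out.stages.push(plan.stage);
  if (plan.stage === "IXSCAN" && plan.isMultiKey) out.multikey = true;
  if (plan.inputStage) inspectPlan(plan.inputStage, out);
  return out;
}

// Trimmed copy of the winning plan with the index.
const indexedPlan = {
  stage: "FETCH",
  inputStage: {
    stage: "IXSCAN",
    indexName: "campaigns_1_status_1",
    isMultiKey: true,
  },
};

const planInfo = inspectPlan(indexedPlan);
const indexOnly =
  !planInfo.stages.includes("FETCH") && !planInfo.stages.includes("COLLSCAN");
console.log(planInfo, "answered from the index alone:", indexOnly);
```

With almost every document matching, this plan reads the index and then still fetches nearly the whole collection, which would be consistent with it being slower than a plain collection scan.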
Update
As requested by #Kevin, here are some details about all the other indexes (sizes in MB).
"indexSizes" : {
"_id_" : 32,
"team_1" : 8, //Single value field of ObjectId
"created_time_1" : 16, //Document publish time in source system.
"parent_1" : 2, //_id of parent document.
"by.id_1" : 13, //_id of author from a different collection.
"feedids_1" : 8, //Array, _id of ETL jobs contributing to sync of this doc.
"init_-1" : 2, //Initial load time of the doc.
"campaigns_1" : 10, //Array, _id of campaigns
"last_fetch_-1" : 13, //Last sync time of the doc.
"categories_1" : 8, //Array, _id of document categories.
"status_1" : 8, //Status
"campaigns_1_status_1" : 10 //Combined index of campaign _id and status.
},
After reading the docs from MongoDB, I found this:
The inequality operator $ne is not very selective since it often matches a large portion of the index. As a result, in many cases, a $ne query with an index may perform no better than a $ne query that must scan all documents in a collection. See also Query Selectivity.
After looking at a few different articles, it seems that using the $type operator might solve the problem.
Could you try this query:
db.data.aggregate([
{$match:{campaigns:{$type:2},status:{$in:["a","b","c"]}}},
{$unwind:'$campaigns'},
{$group:{_id:{campaign:'$campaigns', status:'$status'}, total:{$sum:1}}}])
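To check what this pipeline is expected to return, here is a rough plain-JavaScript emulation of its three stages on a handful of made-up documents. It assumes campaigns is either null or an array, and it mirrors $type: 2 by requiring at least one string element in the array; it says nothing about performance.

```javascript
// Made-up sample documents in the shape described in the question.
const sampleDocs = [
  { campaigns: ["c1", "c2"], status: "a" },
  { campaigns: ["c1"], status: "a" },
  { campaigns: null, status: "b" },   // dropped: no campaigns
  { campaigns: ["c2"], status: "d" }, // dropped: status not in a/b/c
  { campaigns: ["c2"], status: "c" },
];

function countPerCampaignAndStatus(docs) {
  const totals = {};
  for (const doc of docs) {
    // $match: campaigns contains a string, status is one of a, b, c
    const hasString =
      Array.isArray(doc.campaigns) &&
      doc.campaigns.some((c) => typeof c === "string");
    if (!hasString || !["a", "b", "c"].includes(doc.status)) continue;
    // $unwind + $group: one count per (campaign, status) pair
    for (const campaign of doc.campaigns) {
      const key = `${campaign}|${doc.status}`;
      totals[key] = (totals[key] || 0) + 1;
    }
  }
  return totals;
}

console.log(countPerCampaignAndStatus(sampleDocs));
```

Note that if the campaign ids are stored as ObjectId rather than strings, $type: 2 would not match them and a different type would be needed.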
I am designing a MongoDB collection that stores statistics for daily volume.
Here is my DB schema
mongos> db.arq.findOne()
{
"_id" : ObjectId("553b78637e6962c36d67c728"),
"ip" : NumberLong(635860665),
"ts" : ISODate("2015-04-25T00:00:00Z"),
"values" : {
"07" : 2,
"12" : 1
},
"daily_ct" : 5
}
mongos>
And here are my indexes
mongos> db.arq.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "Query_Volume.test"
},
{
"v" : 1,
"key" : {
"ip" : 1
},
"name" : "ip_1",
"ns" : "Query_Volume.test"
},
{
"v" : 1,
"key" : {
"ts" : 1
},
"name" : "ts_1",
"expireAfterSeconds" : 15552000,
"ns" : "Query_Volume.test"
}
]
mongos>
Note: I have an index on the timestamp because I need to use the TTL mechanism.
Are there any suggestions for the shard key?
You have multiple options:
{ts: 1} Your timestamp. Data from a given time range will be located together, but the key is monotonically increasing, and I'm not sure whether the TTL index will clean up shard chunks. This means the write load moves from shard to shard: at any given time one shard takes a high write load while the other shards get no writes for new data. This pattern works nicely if you query contiguous time ranges but has downsides for writes.
{ts: "hashed"} Hash-based sharding. The data will be sharded more or less evenly across the shards. Hash-based sharding distributes the write load but involves all shards (more or less) when querying for data.
You will need to test what fits your reads and writes best. The right shard key depends on the data structure and on the read/write patterns of your application.
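The write-side difference between the two options can be sketched without a cluster. Below, 3000 made-up writes with increasing timestamps are routed once by range and once by hash. The split points, the three-shard layout and the MD5 stand-in hash are all assumptions for illustration; MongoDB's own hashing and chunk boundaries differ.

```javascript
const crypto = require("crypto");

const SHARD_COUNT = 3;
// Hypothetical ranged layout: shard 0 < Apr 1 <= shard 1 < Apr 15 <= shard 2.
const splitPoints = [Date.UTC(2015, 3, 1), Date.UTC(2015, 3, 15)];

// Range-based routing: the shard is decided by where ts falls between split points.
function rangedShard(ts) {
  let shard = 0;
  for (const point of splitPoints) if (ts >= point) shard++;
  return shard;
}

// Hash-based routing with a stand-in hash (not MongoDB's).
function hashedShard(ts) {
  const h = crypto.createHash("md5").update(String(ts)).digest().readUInt32BE(0);
  return Math.floor(h / (2 ** 32 / SHARD_COUNT));
}

// 3000 new writes, one per minute, starting 2015-04-25 (all newer than the last split).
const ranged = new Array(SHARD_COUNT).fill(0);
const hashed = new Array(SHARD_COUNT).fill(0);
const start = Date.UTC(2015, 3, 25);
for (let i = 0; i < 3000; i++) {
  const ts = start + i * 60 * 1000;
  ranged[rangedShard(ts)]++;
  hashed[hashedShard(ts)]++;
}

console.log("ranged {ts: 1}       :", ranged);
console.log('hashed {ts: "hashed"}:', hashed);
```

With the ranged key every new write lands on the shard that owns the newest range, while the hashed key spreads them out at the cost of range queries touching all shards.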
I have a collection named App and need to query those active (active: true) apps that belong to a particular user (user_id) or are available to all users (by their _id). I use query like this
{
"active" : true,
"$or" : [
{
"user_id" : "111111111111111111111111"
},
{
"_id" : {
"$in" : [
ObjectId("222222222222222222222222"),
ObjectId("333333333333333333333333"),
ObjectId("444444444444444444444444")
]
}
}
]
}
However in db.currentOp(true) I see that this query is running very slowly: lockStats.timeLockedMicros.r is about 3000.
How can I optimize performance of this query? I already have the following indexes on App:
> db.App.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "mydb.App"
},
{
"v" : 1,
"key" : {
"active" : 1,
"created_at" : -1
},
"name" : "active_1_created_at_-1",
"ns" : "mydb.App",
"background" : true
},
{
"v" : 1,
"key" : {
"active" : 1,
"user_id" : 1
},
"name" : "active_1_user_id_1",
"ns" : "mydb.App",
"background" : true
}
]
Two issues I see here:
1) You do not need an index on the boolean field active: it has low selectivity and does not benefit query performance.
"If overall selectivity is low, and if MongoDB must read a number of documents to return results, then some queries may perform faster without indexes." source
2) You need an index on user_id, because the user_id clause cannot use the compound index active_1_user_id_1: user_id is not that index's leading field.
Edit: You can always check index efficiency by running explain(true) and looking at which indexes are used for that query.
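The prefix rule behind point 2 can be shown with a small check: a compound index is usable for a set of queried fields when they include the index's leading field. This is a deliberately simplified sketch (the real query planner considers much more, so explain remains the authority), and canUseIndexPrefix is a hypothetical helper.

```javascript
// Hypothetical helper: true when the queried fields include the leading
// (first) field of the index key pattern.
function canUseIndexPrefix(indexKey, queriedFields) {
  const leadingField = Object.keys(indexKey)[0];
  return queriedFields.includes(leadingField);
}

const compound = { active: 1, user_id: 1 }; // active_1_user_id_1
const single = { user_id: 1 };              // the suggested extra index

console.log(canUseIndexPrefix(compound, ["user_id"]));           // false
console.log(canUseIndexPrefix(compound, ["active", "user_id"])); // true
console.log(canUseIndexPrefix(single, ["user_id"]));             // true
```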
I would try to do the following:
Remove all your indexes: your active field has low cardinality (boolean) and does not help you at all, and you are not using created_at, so there is no reason to index it.
Add an index only on the user_id key.
Store numeric values that are currently kept as strings as actual numbers.
I am trying to understand the following behavior displayed by my sharding setup. The data seems to only increase on a single shard as I continuously add data. How does MongoDB shard or distribute data across different servers? Am I doing this correctly? MongoDB version 2.4.1 used here on OS X 10.5.
As requested, sh.status() as follows:
mongos> sh.status()
sharding version: {
"_id" : 1,
"version" : 3,
"minCompatibleVersion" : 3,
"currentVersion" : 4,
"clusterId" : ObjectId("52787cc2c10fcbb58607b07f") }
shards:
{ "_id" : "shard0000", "host" : "xx.xx.xx.xxx:xxxxx" }
{ "_id" : "shard0001", "host" : "xx.xx.xx.xxx:xxxxx" }
{ "_id" : "shard0002", "host" : "xx.xx.xx.xxx:xxxxx" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "newdb", "partitioned" : true, "primary" : "shard0001" }
newdb.prov
shard key: { "_id" : 1, "jobID" : 1, "user" : 1 }
chunks:
shard0000 43
shard0001 50
shard0002 43
Looks like you have chosen a very poor shard key. You partitioned along the values of { "_id" : 1, "jobID" : 1, "user" : 1 }. This will not give a good distribution for inserts, because you are using ObjectId() values for _id and those increase monotonically.
You want to select a shard key that represents how you access the data - it doesn't make sense that you have two more fields after _id - since _id is unique the other two fields are never going to be used to partition the data.
Did you perhaps intend to shard on jobID, user? It's hard to know what the best shard key would be in your case, but it's clear that all the inserts are going into the highest chunk (top value through maxKey) since every new _id is a higher value than the previous one.
Eventually they should be balanced to other shards, but only if the balancer is running, if all your config servers are up and if secondaries are caught up. Best to pick a better shard key and have inserts be distributed evenly across the cluster from the start.
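To see why ObjectId values keep increasing: the first 4 bytes (8 hex characters) of an ObjectId are its creation time in seconds, so a newer id sorts after an older one regardless of the remaining bytes. The sketch below works on hex strings only; the two example ids are made up, and objectIdTime is a hypothetical helper rather than a driver function.

```javascript
// Decode the creation time stored in the first 8 hex characters of an ObjectId.
function objectIdTime(hex) {
  return new Date(parseInt(hex.slice(0, 8), 16) * 1000);
}

// Made-up ids created one second apart. The trailing bytes of the earlier
// one are as large as possible, yet it still sorts first.
const earlierId = "52787cc2" + "ffffffffffffffff";
const laterId = "52787cc3" + "0000000000000000";

console.log(objectIdTime(earlierId).toISOString());
console.log(objectIdTime(laterId).toISOString());
console.log("later id sorts after earlier id:", laterId > earlierId);
```

Because the shard key starts with _id, each new document therefore falls into the chunk that ends at maxKey.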