I am trying to store key-value data in MongoDB.
The key can be any string, and I know nothing more about it before storing; the value can be of any type (int, string, array). I would like to have an index on such a key and value.
I looked at a multikey index over an array of my key-value pairs, but it looks like it can't cover queries over array fields.
Is it possible to have an index on a custom key and value in MongoDB and run queries with operators such as $exists, $eq, $gte, $lte, $and, $or and $in through an IXSCAN stage rather than a COLLSCAN?
Or do I need another database for that?
I may have misunderstood your question but I think that this is precisely where MongoDB's strengths are - dealing with different shapes of documents and data types.
So let's say you have the following two documents:
db.test.insertMany([
{
key: "test",
value: [ "some array", 1 ]
},
{
key: 12.7,
value: "foo"
}
])
and you create a compound index like this:
db.test.createIndex({
"key": 1,
"value": 1
})
then the following query will use that index:
db.test.find({ "key": "test", "value": 1 })
and also more complicated queries will do the same:
db.test.find({ "key": { $exists: true }, "value": { $gt: 0 } })
You can verify this by adding a .explain() to the end of the above queries.
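To make that check less of an eyeball exercise, here is a minimal sketch of a helper that walks a plan tree and lists its stages. It assumes the classic explain shape, where each node has a `stage` name and either an `inputStage` or an `inputStages` array; the `collectStages` name and the sample plan are made up for illustration, and newer server versions may nest the plan differently.

```javascript
// Collect every stage name from an explain() plan tree.
// Assumption: each node has `stage` plus `inputStage` (one child)
// or `inputStages` (several children).
function collectStages(plan, found = []) {
  if (!plan || typeof plan !== "object") return found;
  if (plan.stage) found.push(plan.stage);
  if (plan.inputStage) collectStages(plan.inputStage, found);
  for (const child of plan.inputStages || []) collectStages(child, found);
  return found;
}

// Hypothetical, hand-written plan for illustration; not real server output.
const samplePlan = {
  stage: "FETCH",
  inputStage: { stage: "IXSCAN", indexName: "key_1_value_1" }
};

const stages = collectStages(samplePlan);
console.log(stages, stages.includes("COLLSCAN") ? "collection scan" : "no collection scan");
```

In the shell you would pass it something like `db.test.find({ ... }).explain().queryPlanner.winningPlan` instead of the hand-written sample.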
UPDATE based on your comment:
You don't need the aggregation framework for that. You can simply do something like this:
db.test.distinct("user_id", { "key": { $exists: true } })
This query is going to use the above index. Moreover it can be made even faster by changing the index definition to include the "user_id" field like this:
db.test.createIndex({
"key" : 1.0,
"value" : 1.0,
"user_id" : 1
})
This, again, can be verified by running the following query:
db.test.explain().distinct("user_id", { "key": { $exists: true } })
If your key can be any arbitrary value, then this is impossible. Your best bet is to create an index on some other known field to limit the initial results so that the inevitable collection scan's impact is reduced to a minimum.
I'm using MongoDB version 4.2.0. I have a collection with the following indexes:
{uuid: 1},
{unique: true, name: "uuid_idx"}
and
{field1: 1, field2: 1, _id: 1},
{unique: true, name: "compound_idx"}
When executing this query
aggregate([
{"$match": {"uuid": <uuid_value>}}
])
the planner correctly selects uuid_idx.
When adding this sort clause
aggregate([
{"$match": {"uuid": <uuid_value>}},
{"$sort": {"field1": 1, "field2": 1, "_id": 1}}
])
the planner selects compound_idx, which makes the query slower.
I would expect the sort clause to not make a difference in this context. Why does Mongo not use the uuid_idx index in both cases?
EDIT:
A little clarification, I understand there are workarounds to use the correct index, but I'm looking for an explanation of why this does not happen automatically (if possible with links to the official documentation). Thanks!
Why is this happening?:
Let's look at how Mongo chooses which index to use, as explained here.
A query can often be satisfied by multiple indexes defined on the collection ("satisfied" is used loosely, as Mongo actually considers all possibly relevant indexes).
MongoDB then tests all the applicable indexes in parallel. The first index that can return 101 results is selected by the query planner.
This means that, for that particular query, that index actually wins.
What can we do?:
We can use $hint. A hint forces Mongo to use a specific index; however, this is not recommended, because Mongo will not adapt if the data or the indexes change.
The query:
aggregate(
[
{ $match : { uuid : "some_value" } },
{ $sort : { fld1: 1, fld2: 1, _id: 1 } }
],
)
doesn't use the index "uuid_idx".
There are a couple of options for using indexes on both the match and sort operations:
(1) Define a new compound index: { uuid: 1, fld1: 1, fld2: 1, _id: 1 }
Both the match and match+sort queries will use this index (for both the match and sort operations).
(2) Use the hint on the uuid index (using existing indexes)
Both the match and match+sort queries will use this index for the match operation; the sort is then performed on the matched documents without an index.
aggregate(
[
{ $match : { uuid : "some_value" } },
{ $sort : { fld1: 1, fld2: 1, _id: 1 } }
],
{ hint: "uuid_idx"}
)
If you can use find instead of aggregate, it will use the right index. So this is still a problem in the aggregation pipeline.
Context: I have a MongoDB database populated with a large number of emails. I'd like to search for all emails that include a given email address in any of the following fields: To, From, CC and BCC. The result needs to be sorted by the field Date. We're currently trying the following query:
db.collection.find({ $text : {$search: "\"email#domain.com\""}}).sort({Date:1})
I've tried doing a compound index including the date but it does not work.
With this index...
db.collection.createIndex({Date: 1, From:"text", To:"text", CC:"text", BCC:"text"})
it gives error 17007 as Date should have an equality match as it's a prefix. This is not an option as we'd like all emails regardless of the date.
Also with this other index...
db.collection.createIndex({From:"text", To:"text", CC:"text", BCC:"text", Date:1})
Then it gives error 17144 as it goes over the internal limit for the sort.
We've read the following:
Stackoverflow ref
Stackoverflow ref
mongoDB doc on compound index
In these references and others I'm getting the idea that this is not possible but I don't think what we're trying to do is atypical or so much out of the box.
Are we doing something wrong? Is there a way to do this query with compound index or any other MongoDB feature?
thanks!
Regardless of other compound index keys, you need to include the $meta for the "textScore" in order to get the correct sorting:
db.collection.find(
{ "$text": { "$search": "\"email#domain.com\""}},
{ "score": { "$meta": "textScore" } }
).sort({
"score": { "$meta": "textScore" }, "Date": 1
})
So naturally you want that "score" to sort first, and then by "Date" in order for things to be correctly ranked by relevance of the search.
The order of the index keys does not matter, but of course you can only have "one" text index. So make sure you drop all others before creating:
db.collection.createIndex({
"From": "text",
"To": "text",
"CC":"text",
"BCC": "text",
"Date":1
})
List the current indexes with:
db.collection.getIndexes()
Or just drop everything and start fresh:
db.collection.dropIndexes()
For the data you appear to be searching on though, I would have thought a regular compound index on each field should suit you better. Looking for "email" addresses should be an "exact match", and if you expect multiple items for each field then they should be arrays of strings, like so:
{
"TO": ["bill#example.com"],
"FROM": ["ted#example.com"],
"CC": ["marty#example.com","sarah#example.com"],
"BCC": [],
"Date": ISODate("2015-07-27T13:42:05.535Z")
}
Then you need separate indexes on each field, possibly in compound with "Date" like so:
db.email.createIndex({ "TO": 1, "Date": 1 })
db.email.createIndex({ "FROM": 1, "Date": 1 })
db.email.createIndex({ "CC": 1, "Date": 1 })
db.email.createIndex({ "BCC": 1, "Date": 1 })
And query with an $or condition:
db.email.find({
"$or": [
{ "TO": "sarah#example.com" },
{ "FROM": "sarah#example.com" },
{ "CC": "sarah#example.com" },
{ "BCC": "sarah#example.com" }
],
"Date": { "$lt": new Date() }
})
If you look at the .explain(true) (verbose) output from that, you should see that the winning plan is an "index intersection" of all the specified indexes. This works out to be very efficient as every field ( and index selected ) has an exact match value, and a range match on the indexed date.
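If the address comes from user input, the query document above can be built rather than typed out. This is only a sketch: `buildAddressQuery` and `ADDRESS_FIELDS` are names made up here, and the field names assume the array-of-strings layout shown earlier.

```javascript
// Fields that may contain the address (assumed layout from the sample document).
const ADDRESS_FIELDS = ["TO", "FROM", "CC", "BCC"];

// Build one exact-match $or clause per field, plus a range on Date.
function buildAddressQuery(address, before) {
  return {
    $or: ADDRESS_FIELDS.map((field) => ({ [field]: address })),
    Date: { $lt: before }
  };
}

const addressQuery = buildAddressQuery(
  "sarah#example.com",
  new Date("2015-08-01T00:00:00Z")
);
console.log(JSON.stringify(addressQuery, null, 2));
```

The result can be passed straight to `db.email.find(addressQuery).sort({ "Date": 1 })`.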
That's going to be a lot better for you than the "fuzzy matching" of text searches. Even regular expressions should work better here in general ( for e-mail addresses ) and especially if they are "anchored" ^ to the start of the string.
Text indexes are meant for "word like tokens" to match, but this should not be your data. The $or does not look as nice, but it should do a much better job.
I have read this doc, which states that an index can optimize update operations, so I am adding an index to my collection to optimize the update operation I am using.
Records in the collection have an object as _id, and a timestamp:
{_id: {userId: "sample"}, firstTimestamp: 123, otherField: "abc"}
What I want to do is run an update using the query below:
db.userFirstTimestamp.update(
{_id: {userId: "sample"}, firstTimestamp: {$gt: 100}},
{_id: {userId: "sample"}, firstTimestamp: 100, otherField2: "efg"})
I want to store the 'first document' based on 'firstTimestamp'. The fields of the old and new documents can differ, so it cannot be a $set query; it should rewrite the document instead. In the sample above, "otherField" should no longer exist; "otherField2" should exist instead.
Based on my understanding of the MongoDB doc and this article, I created the index below:
db.sample.createIndex({_id:1, timestamp:1})
Then I benchmarked the query on an isolated experimental node running MongoDB 3.0.4 with the spec below:
MongoDB 3.0.4
Machine is empty, no other operation, only mongo
RAM ~30GB
Disk is RAID 0 striped
Collection has 60 million records
Average object size 1001 bytes
Index size 5.34 gig
When I check the log, many update queries take more than 100ms, and when I run mongotop, the top entry is the write query, which takes ~1000ms. That seems slow for a single query.
When I run mongostat, throughput is only 400-500 queries per second.
Then I tried to explain the query using find (since update does not support explain).
Without a projection, it uses the default index {_id:1}.
With a projection for _id and timestamp only, it uses the {_id:1, timestamp:1} index.
My question is:
Does the index I have created help this update query?
If it does not help, what should the index look like?
Is there any other way to optimize this update query?
Somewhat. But not optimally.
It should really be this, an index on the "element" of the object in the _id key:
db.sample.createIndex({ "_id.userId": 1, "timestamp": 1 })
Use the $set operator and stop overwriting your documents:
db.sample.update(
{
"_id.userId": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": { "otherfield": "cfg" }
}
)
But really your data "should" look like this:
{
"_id": "sample",
"firstTimestamp": 200,
"otherfield2": "sam"
}
And update like:
db.sample.update(
{
"_id": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": {
"firstTimestamp": 100,
"otherfield2": "efg"
}
}
)
Or if you insist that fields other than "_id" and "firstTimestamp" are going to change a lot, then rather do this:
{
"_id": "sample",
"firstTimestamp": 200,
"data": {
"otherfield2": "sam"
}
}
Then if you just want to replace data, do:
db.sample.update(
{
"_id": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": {
"firstTimestamp": 100,
"data": {
"overwritingField": "efg"
}
}
}
)
Since "data" can be replaced as an entire object if you wish, or just update a single key:
db.sample.update(
{
"_id": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": {
"firstTimestamp": 100,
"data.newfield": "efg"
}
}
)
In all cases, try to use the operators rather than replacing the whole object as it typically works out as more traffic and more load to the server.
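The difference between setting "data" as a whole and setting "data.newfield" is easy to get wrong, so here is a small local model of it in plain JavaScript. It is a sketch of the $set behaviour described above, not the server's implementation: `applySet` is a made-up helper, it only handles plain nested objects, and it silently overwrites a non-object value on the path where the real server would reject the update.

```javascript
// Local model of how $set treats a plain key versus a dotted path.
// Assumption: documents are plain JSON-like objects (no arrays on the path, no dates).
function applySet(doc, setSpec) {
  const result = JSON.parse(JSON.stringify(doc)); // deep copy of the stored document
  for (const [path, value] of Object.entries(setSpec)) {
    const parts = path.split(".");
    let target = result;
    // Walk down to the parent of the final key, creating objects as needed.
    for (const part of parts.slice(0, -1)) {
      if (typeof target[part] !== "object" || target[part] === null) target[part] = {};
      target = target[part];
    }
    target[parts[parts.length - 1]] = value;
  }
  return result;
}

const stored = { _id: "sample", firstTimestamp: 200, data: { otherfield2: "sam" } };

// Setting "data" replaces the whole sub-document.
const replaced = applySet(stored, { firstTimestamp: 100, data: { overwritingField: "efg" } });
// Setting "data.newfield" touches a single key and keeps the rest.
const merged = applySet(stored, { firstTimestamp: 100, "data.newfield": "efg" });

console.log(replaced.data); // { overwritingField: 'efg' }
console.log(merged.data);   // { otherfield2: 'sam', newfield: 'efg' }
```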
But overall, what makes sense here is that the "userId" part "should" be the portion of the index that narrows down the results the most. So it definitely goes before the timestamp, of which there should be a lot more possible values.
Compound primary keys are fine, but make sure you actually use them. A singular value would not make any sense and could just be assigned to _id. If you can just query on one field of the key as you are here, then you probably don't need a compound object as the primary key.
Your _id in the update suggests that you are getting exact matches for the _id, therefore it is not a compound field with other keys. That being the case, it should just be a value in the _id itself.
Also a "range" is okay, but again consider that you are trying to match a single document (you don't mention "multi" anywhere), so again question why it is needed, and then either go for an exact match or at "least" an upper limit.
The $set will "only" update the fields that you specify. I think you made a mistake in typing your question though, as the syntax for the "update" portion would not be valid. But use update operators anyway, as they send less traffic by sending a single field, or just the fields you intend to update.
I have to find all the documents which include "p396:branchCode" as a key in MongoDB. The value does not matter; it can be anything. I tried using
{"p396:branchCode": new RegExp(".*")}
in MongoVUE but I found nothing.
My db is deeply nested, and branchCode has the super-key "p396:checkTellersEApproveStatus".
Your key is nested under a super-key, so you need to use the dot-operator:
{"p396:checkTellersEApproveStatus.p396:branchCode": {$exists: true}}
This assumes p396:branchCode is always under p396:checkTellersEApproveStatus. When that's not the case, you have a problem, because MongoDB does not allow queries for unknown keys. When the number of possible super-keys is low, you could query for all of them with the $or operator. Otherwise, your only option is to refactor your objects into arrays. To give an example, a structure like this:
properties: {
prop1: "value1",
prop2: "value2",
prop3: "value3"
}
would be far easier to query for values under arbitrary keys when made to look like this:
properties: [
{ key: "prop1", value:"value1"} ,
{ key: "prop2", value:"value2"},
{ key: "prop3", value:"value3"}
]
because you could just do db.collection.find({"properties.value":"value2"})
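The refactoring itself is mechanical. A minimal sketch in plain JavaScript follows; `toKeyValueArray` is a made-up helper name, and the index spec and $elemMatch query at the end are assumptions about how you would then use the new shape, not something the structure above requires.

```javascript
// Turn { prop1: "value1", ... } into [{ key: "prop1", value: "value1" }, ...].
function toKeyValueArray(properties) {
  return Object.entries(properties).map(([key, value]) => ({ key, value }));
}

const original = {
  properties: { prop1: "value1", prop2: "value2", prop3: "value3" }
};
const converted = { ...original, properties: toKeyValueArray(original.properties) };
console.log(JSON.stringify(converted, null, 2));

// Assumed follow-up: one compound index over the pair, and a query that
// requires key and value to match inside the same array element.
const kvIndexSpec = { "properties.key": 1, "properties.value": 1 };
const kvQuery = { properties: { $elemMatch: { key: "prop2", value: "value2" } } };
```

With that in place, `db.collection.createIndex(kvIndexSpec)` and `db.collection.find(kvQuery)` would be the shell calls to try, checking the plan with .explain().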
If you are actually "mixing types" then that probably is not a good thing. But if all you care about is that the field $exists then that is the operator to use:
db.collection.find({
"p396:checkTellersApproveStatus.p396:branchCode": { "$exists": true }
})
If the values are actually "all" numeric and you have an expected "range" then use $gt and $lt operators instead. This allows an "index" on the field to be used. And a "sparse" index, where the field is not present in all documents, would improve performance:
db.collection.find({
"p396:checkTellersApproveStatus.p396:branchCode": {
"$gt": 0, "$lt": 99999
}
})
In all cases, this is the "child" of the parent "p396:checkTellersApproveStatus", so you use "dot notation" to access the full path to the property.
Sounds like you want to use the $exists operator.
{'p396:branchCode': {$exists: true}}
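This assumes the key sits at the top level of the document. Since it is nested, the query has to include the parent. Writing the parent as an embedded document does not do that, because MongoDB treats it as an exact match on the whole sub-document:
{ 'p396:checkTellersApproveStatus': {'p396:branchCode': {$exists: true}}}
Use dot notation for the full path instead: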
{ 'p396:checkTellersApproveStatus.p396:branchCode': {$exists: true}}
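To see why the top-level form finds nothing, here is a rough local model of an $exists check on a dotted path. `hasPath` is a made-up helper and the document is a hypothetical example; the real server also looks inside arrays along the path, which this sketch ignores.

```javascript
// Rough model of { "<path>": { $exists: true } } for plain nested objects.
function hasPath(doc, path) {
  let current = doc;
  for (const part of path.split(".")) {
    if (current === null || typeof current !== "object" || !(part in current)) {
      return false;
    }
    current = current[part];
  }
  return true;
}

// Hypothetical document with the key nested under its parent.
const teller = {
  "p396:checkTellersApproveStatus": { "p396:branchCode": 42 }
};

console.log(hasPath(teller, "p396:branchCode")); // false
console.log(hasPath(teller, "p396:checkTellersApproveStatus.p396:branchCode")); // true
```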
What I'm trying to do sounds logical to me however I'm not sure.
I am trying to improve part of a MongoDB collection by using Multikeys.
For example: I have multiple documents with the following format:
Document:
{
"_id": ObjectId("528a4177dbcfd00000000013"),
"name": "Shopping",
"tags": [
"retail",
"shop",
"shopping",
"store",
"grocery"
]
}
Query:
Up until now, I have been using the following query to match the tags field.
var tags = Array("store", "shopping", "etc");
db.collection.findOne({ 'tags': { $in: tags } }, { name: true });
This has been working well, however I think Multikeys should be used in this instance to improve speed & performance. Please, correct me if I am wrong!
Indexing:
I issued the following command in an attempt to index the tags.
db.collection.ensureIndex( { tags: 1 }, { safe: true }, function(err, doc) {} );
ensureIndex was successful.
Result:
However when using RockMongo's explain feature on the above query, the result is:
{
"indexOnly": false,
"indexBounds": {
"tags": [
[
"etc",
"etc"
],
[
"shopping",
"shopping"
],
[
"store",
"store"
]
]
}
}
Questions:
Why is indexing not working, is there something else I have to do?
Is Multikey indexing in this case beneficial? (I'm assuming yes.)
Is there another form of indexing that would be more beneficial?
Edit:
I've just noticed that in the RockMongo explain data there is a field:
"isMultiKey": true,
could it be that Multikeys are being used and I've completely misunderstood that it IS being indexed?
As you say in your edit, and as the part of explain you did not post shows, isMultiKey: true along with other information on the cursor indicates that the index is being used. The indexBounds are another indicator.
What indexOnly describes is the fact that your query returns another field, name, which is not part of the index. When the query optimizer sees that all elements of the query can be met by using only the fields within the index, this is referred to as a covered query, and the indexOnly property is set to true.
So in an ideal situation your query and results use the information from the index only, and MongoDB does not also have to look up the entry from the index in the collection in order to return more data.
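As a rule of thumb, the covered-query condition can be written down as a small check. This is a simplified sketch, not the optimizer's logic: `isCovered` is a made-up helper, it only understands inclusion projections, and it assumes (as far as I know this holds) that a multikey index cannot cover a query over the array field itself and that _id is returned unless explicitly excluded.

```javascript
// Simplified model: a query is covered when every queried and returned field
// is an index key and none of those fields is an array (multikey) field.
function isCovered(indexKeys, queryFields, projection, multikeyFields = []) {
  const projected = Object.keys(projection).filter((field) => projection[field]);
  // _id comes back by default unless the projection turns it off.
  const idExcluded = projection._id === 0 || projection._id === false;
  if (!idExcluded && !projected.includes("_id")) projected.push("_id");
  const needed = [...new Set([...queryFields, ...projected])];
  return needed.every(
    (field) => indexKeys.includes(field) && !multikeyFields.includes(field)
  );
}

// The query from the question: index on tags only, name is projected.
console.log(isCovered(["tags"], ["tags"], { name: true }, ["tags"])); // false

// Hypothetical non-array case: both fields indexed and _id excluded.
console.log(isCovered(["category", "name"], ["category"], { name: 1, _id: 0 })); // true
```

So for the tags query, indexOnly: false is expected and not a sign that the index is ignored.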