I'm trying to create a MongoDB index on two string fields, where either field's value may repeat across documents, but the same pair of values cannot. I am familiar with this concept in MySQL, but do not understand it in MongoDB.
I would like to create a unique index on the symbol and date fields of these documents:
db.earnings_quotes.insert({"symbol":"UNF","date":"2017-01-04","quote":{"price": 5000}});
db.earnings_quotes.createIndex({symbol: 'text', date: 'text'}, {unique: true})
db.earnings_quotes.insert({symbol: 'HAL', date: '2018-01-22', quote: { "price": 10000 }});
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 11000,
        "errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: sample.earnings_quotes.$symbol_text_date_text dup key: { : \"01\", : 0.6666666666666666 }"
    }
})
I don't understand the error message here: in this case, neither symbol nor date overlaps with the first record.
A text index actually behaves a bit like a multikey index: it cuts the text into tokens that can then be queried with the dedicated text search operators. Also, the order of the fields in a text index doesn't really matter (unlike in a normal compound index); MongoDB simply goes through all the values in both symbol and date and indexes the resulting tokens separately.
In this case I believe Mongo tokenizes the dates on the hyphens, so both 2017-01-04 and 2018-01-22 produce an "01" token, which is exactly the duplicate key ("01") shown in the error message.
I don't think you really want a text index in your case: it is made for searching through long texts, not fields holding single values.
Also, the multikey nature of the text index makes it really hard to keep unique.
My advice would be to go like this:
db.earnings_quotes.createIndex({symbol: 1, date: 1}, {unique: true})
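With that index in place, each field can still repeat on its own while identical pairs are rejected. A quick sanity check (this assumes the earlier text index has been dropped first, e.g. with db.earnings_quotes.dropIndex('symbol_text_date_text')):
db.earnings_quotes.insert({symbol: 'HAL', date: '2018-01-22', quote: {price: 10000}}); // ok
db.earnings_quotes.insert({symbol: 'HAL', date: '2018-01-23', quote: {price: 10000}}); // ok: same symbol, different date
db.earnings_quotes.insert({symbol: 'HAL', date: '2018-01-22', quote: {price: 9000}});  // fails with E11000 duplicate key error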
By default MongoDB uses _id as a unique key and index, so one solution to your problem is to save your data in the _id field, e.g.:
{
    "_id": {
        "symbol": "xyz",
        "date": "12-12-20"
    }
    // other fields in the collection
}
This will create a composite key.
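For example (note that field order inside the embedded _id matters: {symbol: ..., date: ...} and {date: ..., symbol: ...} are treated as different keys):
db.earnings_quotes.insert({_id: {symbol: 'UNF', date: '2017-01-04'}, quote: {price: 5000}});
db.earnings_quotes.insert({_id: {symbol: 'UNF', date: '2017-01-04'}, quote: {price: 6000}}); // fails with E11000 duplicate key error on _id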
I have a field "productLowerCase" in my mongo documents. I created 2 indices
1. simple
{"productLowerCase" : 1}
2. compound
{"productLowerCase" : 1, "timestamp.milliseconds" : -1}
So if I run a query that uses only productLowerCase, say:
db.coll.find({"productLowerCase" : {$regex : /^cap/}})
Which index will get used?
In this case MongoDB will use the {"productLowerCase" : 1} index, but you can remove that index: with the compound index in place, a query on the first field alone can use it without any performance loss, because that field is a prefix of the compound index.
Besides this, you can use explain() to see how your query is executed.
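For example, to drop the redundant index and confirm the compound index still serves the query (a sketch; the exact plan output depends on your version and data):
db.coll.dropIndex({"productLowerCase": 1});
db.coll.find({"productLowerCase": {$regex: /^cap/}}).explain("executionStats");
// the winningPlan should show an IXSCAN over the compound index rather than a COLLSCAN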
My Query below:
db.chats.find({ bid: 'someID' }).sort({start_time: 1}).limit(10).skip(82560).pretty()
I have indexes on chats collection on the fields in this order
{
"cid" : 1,
"bid" : 1,
"start_time" : 1
}
I am trying to perform the sort, but when I run the query and check the result of explain(), I still get the winningPlan as
{
    "stage" : "SKIP",
    "skipAmount" : 82560,
    "inputStage" : {
        "stage" : "SORT",
        "sortPattern" : {
            "start_time" : 1
        },
        "limitAmount" : 82570,
        "inputStage" : {
            "stage" : "SORT_KEY_GENERATOR",
            "inputStage" : {
                "stage" : "COLLSCAN",
                "filter" : {
                    "ID" : {
                        "$eq" : "someID"
                    }
                },
                "direction" : "forward"
            }
        }
    }
}
I was expecting not to have a SORT stage in the winning plan, as I have indexes created for that collection.
Having no usable index results in the following error:
MongoError: OperationFailed: Sort operation used more than the maximum 33554432 bytes of RAM
However, I managed to make the sort work by increasing the RAM allocation from 32 MB to 64 MB; I am looking for help with adding the indexes properly.
The order of fields in an index matters. To sort query results by a field that is not at the start of the index key pattern, the query must include equality conditions on all the prefix keys that precede the sort keys. The cid field is neither in the query filter nor used for sorting, so you must leave it out. Put the bid field first in the index definition, since you use it in an equality condition, and start_time after it, so it can be used for sorting. The index must look like this:
{"bid" : 1, "start_time" : 1}
See the documentation for further reference.
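A sketch of the fix and the check, using the collection and field names from the question:
db.chats.createIndex({bid: 1, start_time: 1});
db.chats.find({bid: 'someID'}).sort({start_time: 1}).limit(10).skip(82560).explain();
// the winningPlan should now show an IXSCAN on bid_1_start_time_1 and no in-memory SORT stage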
I have created a collection with 100 documents (fields x & y), and created a normal index on field x and a sparse index on field y, as shown below:
for(i=1;i<100;i++)db.coll.insert({x:i,y:i})
db.coll.createIndex({x:1})
db.coll.createIndex({y:1},{sparse:true})
Then, I added a few docs without fields x & y as shown below:
for(i=1;i<100;i++)db.coll.insert({z:"stringggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg"})
Looking at db.coll.stats(), I found the sizes of the indexes:
storageSize: 36864
_id: 32768
x_1: 32768
y_1: 16384
As per the definition of sparse index, only documents containing the indexed field y are considered, hence y_1 occupies less space. But _id & x_1 indexes seem to contain all the documents in them.
If I perform the query db.coll.find({z:99}).explain('executionStats'), it does a COLLSCAN to fetch the record. If this is the case, I am not clear on why MongoDB stores all the documents in the _id & x_1 indexes, as it seems a waste of storage space. Please help me understand; pardon my ignorance if I missed something.
Thank you for your help.
In a "normal" index, missing fields are indexed with a null value. For example, if you have index of {a:1} and you insert {b:10} into the collection, the document will be indexed as a: null.
You can see this behaviour using a unique index:
> db.test.createIndex({a:1}, {unique:true})
{
    "createdCollectionAutomatically" : true,
    "numIndexesBefore" : 1,
    "numIndexesAfter" : 2,
    "ok" : 1
}
> db.test.insert({b:1})
WriteResult({ "nInserted" : 1 })
> db.test.insert({c:1})
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 11000,
        "errmsg" : "E11000 duplicate key error collection: test.test index: a_1 dup key: { : null }"
    }
})
Both {b:1} and {c:1} are indexed as a: null, hence the duplicate key error message.
In your collection, you have 200 documents:
100 documents with {x:..., y:...}
100 documents with {z:...}
And your indexes are:
{x:1} (normal index)
{y:1} (sparse index)
The documents will be indexed as follows:
200 documents will be in the _id index, which is always created by MongoDB
200 documents will be in the {x:1} index, from {x:.., y:..} and {z:..} documents
100 documents will be in the {y:1} index
Note that the index sizes you posted show the same ratio as the numbers above.
Regarding your questions:
The _id index is for MongoDB internal use, see Default _id index. You cannot drop this index, and attempts to remove it could render your database inaccessible.
The x_1 index is there because you told MongoDB to build it. It contains all the documents in your collection because it's a normal index. In the case of your collection, half of the values in the index are null.
The sparse y_1 index is half the size of the x_1 index because only 100 out of 200 documents contain the y field.
The query db.coll.find({z:99}) does not use any index because you don't have an index on the z field, hence it's doing a collection scan.
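You can see the entry counts directly by forcing each index with hint() (a count run through a sparse index deliberately misses the documents that are not in it):
db.coll.find().hint({x: 1}).count()  // 200: every document has an entry, missing x indexed as null
db.coll.find().hint({y: 1}).count()  // 100: the sparse index skips documents without y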
For more information about indexing, please see Create Indexes to Support Your Queries
I want to create a unique index over two columns where the index should allow multiple null values for the second part of the index. But:
db.model.ensureIndex({userId : 1, name : 1},{unique : true, sparse : true});
Throws a duplicate key exception: E11000 duplicate key error index: devmongo.model.$userId_1_name_1 dup key: { : "-1", : null }. I thought that because of the sparse: true option the index would allow this combination? How can I achieve this? I am using MongoDB 2.6.5.
Sparse compound indexes will create an index entry for a document if any of the fields exist, setting the value to null in the index for any fields that do not exist in the document. Put another way: a sparse compound index will only skip a document if all of the index fields are missing from the document.
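You can reproduce the error from the question with just two documents (hypothetical values):
db.model.insert({userId: "-1"});  // no name field: indexed as { userId: "-1", name: null }
db.model.insert({userId: "-1"});  // fails: E11000 dup key { : "-1", : null }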
As of v3.2, partial indexes can be used to accomplish what you're trying to do. You could use:
db.model.createIndex({userId : 1, name : 1}, { partialFilterExpression: { name: { $exists: true } }, unique: true });
which will only index documents that have a name field.
NB: This index cannot be used by Mongo to answer a query on userId alone, as it does not contain all of the documents in the collection. Also, a null in a document is considered a value: a field holding an explicit null exists, so such a document is still indexed.
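A quick sanity check of the partial index behaviour (hypothetical values):
db.model.insert({userId: "-1"});              // ok: no name, not indexed
db.model.insert({userId: "-1"});              // ok again: still not indexed
db.model.insert({userId: "-1", name: "a"});   // ok: indexed
db.model.insert({userId: "-1", name: "a"});   // fails with E11000 duplicate key error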
The compound index should be considered as a whole: unique requires that the (userId, name) pair be unique across the collection, and sparse means a document is only skipped if both userId and name are missing. The error message shows that there are at least two documents whose (userId, name) pairs are equivalent (a missing field is treated as null).
In my case, it turns out field names are case sensitive.
So creating a compound index on {field1 : 1, field2 : 1} is not the same as {Field1 : 1, Field2 : 1}
I have a production mongo database of over 1B documents in a single collection sharded on _id across multiple servers. I'm trying to replicate recently updated records from this collection into Amazon Redshift.
Shard keys:
db.sample_collection.ensureIndex({_id: "hashed"})
sh.shardCollection("sample_collection.sample_object", {_id: "hashed"})
Example 'sample_object' Document
{
"_id" : ObjectId("527a6c9226d6b7770ab05345"),
"p": ISODate("2013-10-27T14:30:18.000Z"),
"a" : {
"ln" : "Doe",
"id" : NumberLong(3),
"fn" : "John",
},
"co" : {
"ct" : 2,
"it" : [
{'t': 'loreum', 'u' : NumberLong(300), 'd': ISODate("2013-10-28T14:30:18.000Z")},
{'t': 'loreum', 'u' : NumberLong(400), 'd': ISODate("2013-10-29T14:30:18.000Z")},
..]
},
"li" : {
"ct" : 2,
"it" : [
{'u' : NumberLong(500), 'd': ISODate("2013-10-30T14:30:18.000Z")},
{'u' : NumberLong(501), 'd': ISODate("2013-10-29T14:30:18.000Z")},
..]
},
}
Option #1:
I'm in the process of analyzing this data and I need to query for documents that were "updated" between a period.
i.e., I want to return all the objects that were p (published), or had an li.it (item) or co.it (item) added, between '2014-07-01' and '2014-07-03'.
What would be the most performant way of doing this?
Option #2:
Another option that I'm evaluating is whether to add a 'u' property holding an updated date, to record when the document was last updated (i.e., when an li or co item was added).
If I make the change to the process to ensure new documents have this property, how would I iterate through existing documents and add this retroactively?
Would filtering on 'u' be more performant than Option #1? I'm looking at this option as using COPY FROM JSON from a mongoexport.
Option #1 (multiple dates)
There isn't a good option to index this: you would ideally want a compound index that includes p (a date) plus the two date arrays (li.it and co.it), but a compound index can include at most one array field. Even if you could build it, the index would be very large given the suggested number of dates, and the query would still have to check multiple fields to infer the last updated date.
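For reference, the Option #1 query would have to combine range conditions over all three date locations, something like the sketch below (dates taken from the question; as noted above, no single compound index can cover all three branches):
db.sample_object.find({
    $or: [
        { "p":       { $gte: ISODate("2014-07-01"), $lt: ISODate("2014-07-04") } },
        { "li.it.d": { $gte: ISODate("2014-07-01"), $lt: ISODate("2014-07-04") } },
        { "co.it.d": { $gte: ISODate("2014-07-01"), $lt: ISODate("2014-07-04") } }
    ]
})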
Option #2 (single updated date)
Adding an indexed u (latest updated date) is definitely a better approach to allow a simple and performant query.
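With that field in place, the index and the query become straightforward:
db.sample_object.createIndex({u: 1});
db.sample_object.find({u: {$gte: ISODate("2014-07-01"), $lt: ISODate("2014-07-04")}});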
If I make the change to the process to ensure new documents have this property, how would I iterate through existing documents and add this retroactively?
You can use the $exists operator to find documents that do not have this field set yet.
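A minimal backfill sketch in the shell (this assumes the new field is named u and derives it from the existing p, li.it.d, and co.it.d dates; on a 1B-document collection you would want to run this in batches):
db.sample_object.find({u: {$exists: false}}).forEach(function (doc) {
    // collect every candidate date in the document
    var dates = [doc.p];
    ((doc.li && doc.li.it) || []).forEach(function (it) { dates.push(it.d); });
    ((doc.co && doc.co.it) || []).forEach(function (it) { dates.push(it.d); });
    // set u to the most recent of them
    var latest = new Date(Math.max.apply(null, dates));
    db.sample_object.update({_id: doc._id}, {$set: {u: latest}});
});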
Caveat on hashed shard key
To elaborate on Neil's comment: a hashed shard key gives you good write distribution at the expense of range queries (all queries become scatter-gather). If your common queries are range-based on a date (and you are concerned about performance) then you could possibly choose a more appropriate shard key to support those queries. However, since shard keys are immutable and you want to query on an "updated" date, it doesn't sound like a change of shard key will help your use case.