MongoDB convert to capped collection - unexpected collection size - mongodb

I have a MongoDB v6.0.3 setup. I am trying to convert a normal, prepopulated collection (with 10 documents) into a capped collection of size 5.
The scripts I used:
db.testCollection.drop();
db.testCollection.insertMany([
{"key": 1},
{"key": 2},
{"key": 3},
{"key": 4},
{"key": 5},
{"key": 6},
{"key": 7},
{"key": 8},
{"key": 9},
{"key": 10},
]);
db.runCommand({"convertToCapped": "testCollection", size: 5});
But when I am verifying the result, I get an output of 8 documents instead of the expected 5 documents:
db.testCollection.countDocuments(); // output: 8
db.testCollection.find(); // output: document with key from 3 to 10
What I have tried:
use another MongoDB v5.0.3 to verify the behaviour: same result
insert some more records to see if it will return to expected 5 documents: same result
db.testCollection.insertOne({"key": 11});
db.testCollection.countDocuments(); // output: 8
db.testCollection.find(); // output: document with key from 4 to 11
changing the capped collection size in v6.0.3 setup: same result
db.runCommand( { collMod: "testCollection", cappedSize: 5 } )
Any explanation for this unexpected behaviour?

The size field represents the maximum size of the collection in bytes, which MongoDB will pre-allocate for the collection. If the size field is less than or equal to 4096, then the collection will have a cap of 4096 bytes. Otherwise, MongoDB will raise the provided size to make it an integer multiple of 256.
In your case you should use max field which specifies a maximum number of documents for the collection.
Note
The size argument is always required, even when you specify max number of documents. MongoDB will remove older documents if a collection reaches the maximum size limit before it reaches the maximum document count.
You query should be somewhat like below:
If you want to create a new capped collection.
db.createCollection("testCollection", { capped: true, size: 4096, max: 5 })
If you want to convert a collection to capped.
db.runCommand({"convertToCapped": "testCollection", size: 4096, max: 5});
If you want to change a capped collection's size.
db.runCommand( { collMod: "testCollection", cappedSize: 4096 } )
If you want to change the maximum number of documents in a capped collection.
db.runCommand( { collMod: "testCollection", cappedMax: 5} )
Note
Queries are supported in Mongodb version 6.0
Reference

Related

MongoDB match on document and subdocuments, what to use as indexes?

I have a lot of documents looking like this:
[{
"title": "Luxe [daagse] [verzorging] # Egypte! Incl. vluchten, transfers & 4* ho",
"price": 433,
"automatic": false,
"destination": "5d26fc92f72acc7a0b19f2c4",
"date": "2020-01-19T00:00:00.000+00:00",
"days": 8,
"arrival_airport": "5d1f5b407ec7385fa2963623",
"departure_airport": "5d1f5adb7ec7385fa2963307",
"board_type": "5d08e1dfff6c4f13f6db1e6c"
},
{
"title": "Luxe [daagse] [verzorging] # Egypte! Incl. vluchten, transfers & 4* ho",
"automatic": true,
"destination": "5d26fc92f72acc7a0b19f2c4",
"prices": [{
"price": 433,
"date_from": "2020-01-19T00:00:00.000+00:00",
"date_to": "2020-01-28T00:00:00.000+00:00",
"day_count": 8,
"arrival_airport": "5d1f5b407ec7385fa2963623",
"departure_airport": "5d1f5adb7ec7385fa2963307",
"board_type": "5d08e1dfff6c4f13f6db1e6c"
},
{
"price": 899,
"date_from": "2020-04-19T00:00:00.000+00:00",
"date_to": "2020-04-28T00:00:00.000+00:00",
"day_count": 19,
"arrival_airport": "5d1f5b407ec7385fa2963623",
"departure_airport": "5d1f5adb7ec7385fa2963307",
"board_type": "5d08e1dfff6c4f13f6db1e6c"
}
]
}
]
As you can see, automatic deals have multiple prices (can be a lot, between 1000 and 4000) and does not have the original fields available.
Now I need to search in the original document as well in the subdocuments to look for a match.
This is the aggregation I use to search through the documents:
[{
"$match": {
"destination": {
"$in": ["5d26fc9af72acc7a0b19f313"]
}
}
}, {
"$match": {
"$or": [{
"prices": {
"$elemMatch": {
"price": {
"$lte": 1500,
"$gte": 400
},
"date_to": {
"$lte": "2020-04-30T22:00:00.000Z"
},
"date_from": {
"$gte": "2020-03-31T22:00:00.000Z"
},
"board_type": {
"$in": ["5d08e1bfff6c4f13f6db1e68"]
}
}
}
}, {
"price": {
"$lte": 1500,
"$gte": 400
},
"date": {
"$lte": "2020-04-30T22:00:00.000Z",
"$gte": "2020-03-31T22:00:00.000Z"
},
"board_type": {
"$in": ["5d08e1bfff6c4f13f6db1e68"]
}
}]
}
}, {
"$limit": 20
}]
I would like to speed things up, because it can be quite slow. I was wondering, what is the best index strategy for this aggregate, what fields do I use? Is this the best way of doing it or is there a better way?
From Mongo's $or docs:
When evaluating the clauses in the $or expression, MongoDB either performs a collection scan or, if all the clauses are supported by indexes, MongoDB performs index scans. That is, for MongoDB to use indexes to evaluate an $or expression, all the clauses in the $or expression must be supported by indexes. Otherwise, MongoDB will perform a collection scan.
So with that in mind in order to avoid a collection scan in this pipeline you have to create a compound index containing both price and prices fields.
Remember that order matters in compound indexes so the order of the field should vary depending on your possible usage of it.
It seems to me that the index you want to create looks something like:
{destination: 1, date: 1, board_type: 1, price: 1, prices: 1}
A compound index including the match filter fields is required to make the aggregation run fast. In aggregation queries, having the $match stage early in the pipeline (preferably, first stage) utilizes indexes, if any are defined on the filter fields. In the posted query it is so, and defining the indexes is all needed for a fast query. But, index on what fields?
The index is going to be compound index; i.e., index on multiple fields of the query criteria. The index prefix starts with the destination field. The remaining index fields are to be determined. What are the remaining fields?
Most of these fields are in the prices array's sub-document fields - price, date_from, date_to and board_type. There is also the date field from the main document. Which of these fields need to be used in the compound index?
Defining indexes on array elements (or fields of sub-documents in an array) creates lots of index keys. This means lots of storage and for using the index the memory (or RAM). This is an important consideration. Indexes on array elements are called as multikey indexes. For an index to be properly utilized, the collection's documents and the index being used by the query (together called as working set) must fit into the RAM.
Another aspect you need to consider is the query selectivity. How many documents gets selected using a filter which uses an index field, is a factor. It is imperative that the filter field with must select a small set of the input documents to be effective. See Create Queries that Ensure Selectivity.
It is difficult to determine what other fields need to be considered (sure some of the fields of the prices) based on the above two factors. So, the index is going to be something like this:
{ destination: 1, fld1: 1, fld2: 1, ... }
The fld1, fld2, ..., are going to be the prices array sub-document fields and / or the date field. I think only one set of date fields can be used with the index. An example index can be one of these:
{ destination: 1, date: 1, "prices.price": 1, "prices.board_type": 1}
{ destination: 1, "prices.price": 1, "prices.date_from": 1, "prices.date_to": 1, "prices.board_type": 1}
Note the index keys order and the necessity of the price, date_from, date_to and board_type is to be determined based upon the two main factors - requirement of the working set and the query selectivity - this is important.
NOTES: On a small sample data set with similar structure showed usage of the compound index with the primary destination field and two fields from the prices (one with equality condition and one with range condition). The query plan using the explain showed an IXSCAN (index scan) on the compound index, and using an index will sure improve the query performance.

Querying range from collection mongodb

I can query the first 20 datapoints from my collection using the following code
db.collections.aggregate([{$project: {"text": 1}}, {$limit:20}])
How do I query a range from my collection? Let's say, from data points 20 through 40?
Below are the possible ways :
Using db..find()
db.collection.find({},{"text": 1}).skip(20).limit(20);
Using aggregation framework
db.collection.aggregate([{$project: {"text": 1}}, { $skip : 20 }, {$limit:20}])

mongodb statistics 5 million doc too slow

5 million mongo doc:
{
_id: xxx,
devID: 123,
logLevel: 5,
logTime: 1468464358697
}
indexes:
devID
my aggregate:
[
{$match: {devID: 123}},
{$group: {_id: {level: "$logLevel"}, count: {$sum: 1}}}
]
aggregate result:
{ "_id" : { "level" : 5 }, "count" : 5175872 }
{ "_id" : { "level" : 1 }, "count" : 200000 }
aggregate explain:
numYields:42305
29399ms
Q:
if mongo without writing(saving) data, it will take 29 seconds
if mongo is writing(saving) data, it will take 2 minutes
my aggregate result need to reply to web, so 29sec or 2min are too long
How can i solve it? preferably 10 seconds or less
Thanks all
In your example, the aggregation query for {devID: 123, logLevel:5} returns a count of 5,175,872 which looks like it counted all the documents in your collection (since you mentioned you have 5 million documents).
In this particular example, I'm guessing that the {$match: {devID: 123}} stage matches pretty much every document, hence the aggregation is doing what is essentially a collection scan. Depending on your RAM size, this could have the effect of pushing your working set out of memory, and slow down every other query your server is doing.
If you cannot provide a more selective criteria for the $match stage (e.g. by using a range of logTime as well as devID), then a pre-aggregated report may be your best option.
In general terms, a pre-aggregated report is a document that contains the aggregated information you require, and you update this document every time you insert into the related collection. For example, you could have a single document in a separate collection that looks like:
{log:
{devID: 123,
levelCount: [
{level: 5, count: 5175872},
{level: 1, count: 200000}
]
}}
where that document is updated with the relevant details every time you insert into the log collection.
Using a pre-aggregated report, you don't need to run the aggregation query anymore. The aggregated information you require is instead available using a single find() query instead.
For more examples on pre-aggregated reports, please see https://docs.mongodb.com/ecosystem/use-cases/pre-aggregated-reports/

Mongo efficient querying log collection by level range

I have a capped collection for storing server logs:
var schema = new mongoose.Schema({
level: { type: Number, required: true },
...
}, { capped: 64 * 1024 * 1024, versionKey: false });
I'm having trouble figuring out how to query logs by level range efficiently. Here's a sample query I want to run:
db.getCollection('logs').find({
level: { $gte: 2, $lte: 6 }
}).sort({ _id: -1 }).limit(500)
Indexing on { _id: 1, level: 1 } doesn't make any sense, as _id is unique and there will be only a single level for each of them, so in worst case whole collection will be checked.
If I index on { level: 1, _id: -1 }, in worst case Mongo pulls all logs for levels 2, 3, 4, 5, 6 joins them and sorts them manually, so performance is horrible. Sometimes it also decides to use { _id: 1 } index, which is terrible too.
It could just walk through these 6 indexes at once and get the result while checking at most 504 documents. Or it could pull only first 500 results from each level, so it would sort at most 2500 documents. But it won't, Mongo is just plain stupid when it comes to range queries.
The fastest solution I can think of is implementing the last mentioned method on the client, so running 5 queries and then merging them manually:
db.getCollection('logs').find({ level: 2 }).sort({ _id: -1 }).limit(500)
db.getCollection('logs').find({ level: 2 }).sort({ _id: -1 }).limit(500)
db.getCollection('logs').find({ level: 3 }).sort({ _id: -1 }).limit(500)
...
Merging can be done in O(n) on the client, there are only 7 log levels so at most 7 queries will be executed and 3500 documents pulled from the database.
Is there a better way?
Since you have only 7 levels, it may worth to consider { level: 1, _id: -1 } index with $or query:
db.logs.find({$or:[
{level: 2},
{level: 3},
{level: 4},
{level: 5},
{level: 6}
]}).sort({_id:-1}).limit(500)
Since it is equality condition, it should make use of the index, but I never tried it on capped collections.
I would give it a try and run explain() to confirm it works, then probably enabled profiler and run few other queries.

MongoDB sparse Index and Array: too many documents indexed

I have a doubt on MongoDB sparse Index.
I have a collection (post) with very little documents (6K the biggest) that could embed a sub-document in this way:
{
"a": "a-val",
"b": "b-val",
"meta": {
"urls": [ "url1", "url2" ... ],
"field1": "value1",
...
}
}
The field "a" and "b" are always presents, but "meta.urls" could be non existent!
Now, I have inserted just one document with "meta.urls" value and then I did
db.post.ensureIndex({"a": 1, "b": 1, "meta.urls": 1}, {sparse: true});
post stats gives me a "strange" result: the index is about 97MB!
How is it possible? Only one document with "meta.urls" inserted, and index size is 97MB ?
So, I tried to create only "meta.urls" index in this way:
db.post.ensureIndex({"meta.urls": 1}, {sparse: true});
I have now "meta.urls_1" index with just 1 document.
But if I explain a simple query like this
db.post.find({"meta.urls": {$exists: true}}).hint("meta.urls_1").explain({verbose: true});
I have another "strange" result:
"n" : 1,
"nscannedObjects" : 5,
"nscanned" : 5,
Why Mongo scans 5 docs, an not just the one in the index?
If I query for a precise match on "meta.urls", the single sparse index will work correctly.
Example:
db.post.find({"meta.urls": "url1"}).hint("meta.old_slugs_1") // 1 document
For your first question: you can use a compound index to search on a prefix of the keys it indexes. For example, your first index would be used if you searched on just a or both a and b. Thus, the sparse will only fail to index docs where a is null.
I don't have an answer for your second question, but you should trying updating MongoDB and trying again - its moving pretty quickly, and sparse indexes have gotten better in the past few months.