Aggregation using $sample - mongodb

With an aggregation using { $sample: { size: 3 } }, I'll get 3 random documents returned.
How can I use a percentage of all documents instead?
Something that'd look like { $sample: { size: 50% } }?

You cannot do that: the size expression passed to $sample must be a positive integer, not a percentage.
If you still need to use $sample, you can first get the total count of documents in the collection, take half of it, and then run $sample:
1) Count the documents in the collection and take half (mongo shell):
var halfDocumentsCount = Math.floor(db.yourCollectionName.count() / 2) // $sample needs an integer
print(halfDocumentsCount) // Replace with console.log() in application code
2) $sample for random documents:
db.yourCollectionName.aggregate([{ $sample: { size: halfDocumentsCount } }])
Note:
If you want half of the documents in the collection (i.e. 50% of documents), $sample might not be a good option - it can become an inefficient query. Also, $sample may return duplicate documents, so you might not actually get a unique 50% of the documents. Read more about it here: $sample

If someone is looking for this solution in PHP, just add this stage to your aggregation pipeline as required (i.e. before the projection) and avoid using limit and sort:
[
    '$sample' => [
        'size' => 30
    ]
]

Starting in Mongo 4.4, you can use the $sampleRate operator:
// { x: 1 }
// { x: 2 }
// { x: 3 }
// { x: 4 }
// { x: 5 }
// { x: 6 }
db.collection.aggregate([ { $match: { $sampleRate: 0.33 } } ])
// { x: 3 }
// { x: 5 }
This matches a random selection of input documents (33%). The number of documents selected approximates the sample rate expressed as a percentage of the total number of documents.
Note that this is equivalent to drawing a random number between 0 and 1 for each document and keeping the document if that value is below 0.33. As a result you may get more or fewer documents in the output, and running this several times won't necessarily give you the same result.
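For illustration, this is roughly equivalent to the following pipeline written with the $rand operator (available since MongoDB 4.4.2); this is just a sketch of the idea behind $sampleRate, not necessarily how the server implements it:
db.collection.aggregate([
    { $match: { $expr: { $lt: [ { $rand: {} }, 0.33 ] } } } // keep a document when its random draw falls below 0.33
])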

Related

Mongodb: Query on the last N documents (some portion) of a collection only

In my collection that has say 100 documents, I want to run the following query:
collection.find({"$text": {"$search": "some_string"}})
Assume that a suitable "text" index already exists and thus my question is : How can I run this query on the last 'n' documents only?
All the questions that I found on the web ask how to get the last n docs, whereas my question is how to search on the last n docs only.
More generally my question is How can I run a mongo query on some portion say 20% of a collection.
What I tried
I'm using pymongo, so I tried to use skip() and limit() to get the last n documents, but I didn't find a way to perform a query on the cursor that those functions return.
After #hhsarh's answer, here's what I tried, to no avail:
# here's what I tried after initial answers
recents = information_collection.aggregate([
    {"$match": {"$text": {"$search": "healthline"}}},
    {"$sort": {"_id": -1}},
    {"$limit": 1},
])
The result still comes from the whole collection instead of just the last record/document, as the above code attempts.
The last document doesn't contain "healthline" in any field, so the intended result of the query should be an empty []. But I get documents back.
Can someone please explain how this is possible?
What you are looking for can be achieved using MongoDB aggregation.
Note: As pointed out by #turivishal, $text won't work if it is not in the first stage of the aggregation pipeline.
collection.aggregate([
    {
        "$sort": {
            "_id": -1
        }
    },
    {
        "$limit": 10 // `n` value, where n is the number of last records you want to consider
    },
    {
        "$match": {
            // All your find query goes here
        }
    },
], { allowDiskUse: true }) // just in case the computation exceeds 100MB
Since _id is indexed by default, the above aggregation query should be fast. However, its performance degrades as the value of n grows.
Note: Replace the last line in the code example with the below line if you are using pymongo
], allowDiskUse=True)
It is not possible with the $text operator, because there is a restriction:
The $match stage that includes a $text must be the first stage in the pipeline.
This means we can't limit documents before the $text operator; read more about the $text operator restriction.
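For illustration, a pipeline like the following (hypothetical collection and search term) is rejected by the server precisely because the $match containing $text is not the first stage:
db.collection.aggregate([
    { "$limit": 10 },
    { "$match": { "$text": { "$search": "healthline" } } } // fails: $match with $text is only allowed as the first pipeline stage
])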
As a second option, this might be possible if you use the $regex regular expression operator instead of the $text operator for searching.
If you need search behaviour similar to the $text operator, you have to modify your search input as below:
let's assume searchInput is your input string
the list of search fields is in searchFields
split the search input string on spaces, loop over the resulting words, and convert each one to a regular expression
loop over the search fields in searchFields and prepare an $in condition
import re

searchInput = "This is search"
searchFields = ["field1", "field2"]
searchRegex = []
searchPayload = []
for s in searchInput.split():
    searchRegex.append(re.compile(s, re.IGNORECASE))
for f in searchFields:
    searchPayload.append({f: {"$in": searchRegex}})
print(searchPayload)
Now searchPayload would look like this (regexes shown in shell notation for readability):
[
    {'field1': {'$in': [/This/i, /is/i, /search/i]}},
    {'field2': {'$in': [/This/i, /is/i, /search/i]}}
]
Use that searchPayload variable with the $or operator in the $match stage at the end of the pipeline:
recents = information_collection.aggregate([
    # 1 = ascending, -1 = descending; -1 sorts newest first, so the limit below keeps the most recent documents
    { "$sort": { "_id": -1 } },
    # use whatever limit you need
    { "$limit": 10 },
    { "$match": { "$or": searchPayload } }
])
print(list(recents))
Note: The $regex regular expression search will cause performance issues.
To improve search performance, you can create a compound index on your search fields, like:
information_collection.createIndex({ field1: 1, field2: 1 });

Get the size of all the documents in a query

Is there a way to get the size of all the documents that meets a certain query in the MongoDB shell?
I'm creating a tool that will use mongodump (see here) with the query option to dump specific data on an external media device. However, I would like to see if all the documents will fit in the external media device before starting the dump. That's why I would like to get the size of all the documents that meet the query.
I am aware of the Object.bsonsize method described here, but it seems that it only returns the size of one document.
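For example, on a single document it can be used like this in the shell (the collection name is just a placeholder):
Object.bsonsize(db.collection.findOne()) // BSON size, in bytes, of one document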
Here's the answer that I've found:
var cursor = db.collection.find(...); // Add your query here.
var size = 0;
cursor.forEach(
    function(doc){
        size += Object.bsonsize(doc);
    }
);
print(size);
This should output the total size of the documents in bytes fairly accurately.
I ran the command twice. The first time, there were 141,215 documents which, once dumped, totalled about 108 MB. The difference between the output of the command and the size on disk was 787 bytes.
The second time I ran the command, there were 35,914,179 documents which, once dumped, totalled about 57.8 GB. This time, the command's output matched the real size on disk exactly.
Starting in Mongo 4.4, $bsonSize returns the size in bytes of a given document when encoded as BSON.
Thus, in order to sum the bson size of all documents matching your query:
// { d: [1, 2, 3, 4, 5] }
// { a: 1, b: "hello" }
// { c: 1000, a: "world" }
db.collection.aggregate([
    { $group: {
        _id: null,
        size: { $sum: { $bsonSize: "$$ROOT" } }
    }}
])
// { "_id" : null, "size" : 177 }
This $groups all matching items together and $sums grouped documents' $bsonSize.
$$ROOT represents the current document from which we get the bsonsize.
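If you only want the size of the documents matching a specific query, a $match stage can simply be placed before the $group; the query below is only a placeholder:
db.collection.aggregate([
    { $match: { /* your query here */ } }, // restrict the sum to matching documents
    { $group: {
        _id: null,
        size: { $sum: { $bsonSize: "$$ROOT" } }
    }}
])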

How do I calculate the average of top 20 percent of a collection in MongoDB Aggregate?

In a collection like books: [{ stars: 10, valid: true }, { stars: 24, valid: false }, { stars: 76, valid: true }, ...], it is simple to calculate the average with:
db.books.aggregate([
    { $match: {
        valid: true
    }},
    { $group: {
        _id: null,
        avg: { $avg: "$stars" } // <- How to calculate the $avg of the top 20%?
    }}
])
But what if I want the average of the top 20 percent of stars instead of the average of all stars?
PS: Ideally without knowing the size of the collection (with valid: true), because unlike my example, I perform a lot of $unwind stages.
OBS:
> db.version()
2.4.10
You need to fire two queries to achieve this.
Get the total count of documents whose valid attribute is true.
var bookCount = db.books.count({"valid":true});
Calculate the number of records that make up the top 20%, for which the average needs to be calculated.
var limit = Math.ceil(.2*bookCount);
Perform the aggregation operation:
Match only those records whose valid attribute is true.
Sort the records based on the stars attribute value, in descending order, so that the top stars come first.
Limit to the top 20% of the records.
Group them and calculate their average.
The Code:
db.books.aggregate([
    { $match: { "valid": true } },
    { $sort: { "stars": -1 } },
    { $limit: limit },
    { $group: { "_id": null, "avg": { $avg: "$stars" } } }
])
I perform a lot of $unwind
Neither your sample data nor your code reflects this.

MongoDB aggregation over a range

I have documents of the following format:
[
{
date:"2014-07-07",
value: 20
},
{
date:"2014-07-08",
value: 29
},
{
date:"2014-07-09",
value: 24
},
{
date:"2014-07-10",
value: 21
}
]
I want to run an aggregation query that gives me results in date ranges. For example:
[
    { sum: 49 },
    { sum: 45 }
]
So these are daily values; I need to know the sum of the value field for the last 7 days, and for the 7 days before that: for example, the sum from May 1 to May 6 and then the sum from May 7 to May 14.
Can I use aggregation with multiple groups and range to get this result in a single mongodb query?
You can use aggregation to group by anything that can be computed from the source documents, as long as you know exactly what you want to do.
Based on your document content and sample output, I'm guessing that you are summing over two-day intervals. Here is how you would write the aggregation to produce this from your sample data:
var range1 = { $and: [ { "$gte": ["$date", "2014-07-07"] }, { "$lte": ["$date", "2014-07-08"] } ] }
var range2 = { $and: [ { "$gte": ["$date", "2014-07-09"] }, { "$lte": ["$date", "2014-07-10"] } ] }
db.range.aggregate(
    { $project: {
        dateRange: { $cond: { if: range1, then: "dateRange1", else: { $cond: { if: range2, then: "dateRange2", else: "NotInRange" } } } },
        value: 1
    }},
    { $group: { _id: "$dateRange", sum: { $sum: "$value" } } }
)
{ "_id" : "dateRange2", "sum" : 45 }
{ "_id" : "dateRange1", "sum" : 49 }
Substitute your own dates for the strings in range1 and range2. Optionally, you can also filter first so that you only operate on documents that already fall within the full ranges you are aggregating over, as sketched below.
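A minimal sketch of that optional pre-filter, reusing range1/range2 and the sample dates from above:
db.range.aggregate(
    { $match: { date: { $gte: "2014-07-07", $lte: "2014-07-10" } } }, // keep only documents inside the overall window
    { $project: {
        dateRange: { $cond: { if: range1, then: "dateRange1", else: { $cond: { if: range2, then: "dateRange2", else: "NotInRange" } } } },
        value: 1
    }},
    { $group: { _id: "$dateRange", sum: { $sum: "$value" } } }
)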

Search in a limited number of records in MongoDB

I want to search in the first 1000 records of my collection, whose name is CityDB. I used the following code:
db.CityDB.find({'index.2':"London"}).limit(1000)
but it does not work: it returns the first 1000 results of the find, whereas I want to search only within the first 1000 records, not all of them. Could you please help me?
Thanks,
Amir
Note that there is no guarantee that your documents are returned in any particular order by a query as long as you don't sort explicitly. Documents in a new collection are usually returned in insertion order, but various things can cause that order to change unexpectedly, so don't rely on it. By the way: auto-generated _ids start with a timestamp, so when you sort by _id, the objects are returned by creation date.
Now about your actual question. When you first want to limit the documents and then perform a filter operation on this limited set, you can use the aggregation pipeline. It allows you to apply the $limit operator first and then the $match operator on the remaining documents.
db.CityDB.aggregate(
    // { $sort: { _id: 1 } }, // <- uncomment when you want the first 1000 by creation time
    { $limit: 1000 },
    { $match: { 'index.2': "London" } }
)
I can think of two ways to achieve this:
1) Keep a global counter: every time you insert data into your collection, add a field count = currentCounter and increase currentCounter by 1. When you need to select your first k elements, you find them this way:
db.CityDB.find({
    'index.2': "London",
    count: {
        '$gte': currentCounter - k
    }
})
This is not atomic and might sometimes give you more than k elements on a heavily loaded system (but it can use indexes). A sketch of how such a counter could be maintained on insert follows.
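A minimal sketch of maintaining that counter atomically on insert, assuming a hypothetical counters collection (the names are illustrative, not part of the original answer):
// Atomically fetch-and-increment the counter, then stamp the new document with it.
var counterDoc = db.counters.findAndModify({
    query: { _id: "cityCounter" },
    update: { $inc: { seq: 1 } },
    new: true,
    upsert: true
});
db.CityDB.insert({
    'index.2': "London",
    count: counterDoc.seq
});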
Here is another approach which works nicely in the shell:
2) Create your dummy data:
var k = 100;
for (var i = 1; i < k; i++) {
    db.a.insert({
        _id: i,
        z: Math.floor(1 + Math.random() * 10)
    })
}
output = [];
And now find in the first k records where z == 3
k = 10;
db.a.find().sort({ $natural: -1 }).limit(k).forEach(function(el) {
    if (el.z == 3) {
        output.push(el);
    }
})
As you can see, output now holds the matching elements:
output
I think it is pretty straightforward to modify my example for your needs.
P.S. Also take a look at the aggregation framework; there might be a way to achieve what you need with it.