MongoDB: find every field (including nested fields) in a collection with 1M documents

I am trying to find a query that lists all the fields/keys in a collection with 1M records. The fields are nested (a field under a field under a field). But I end up getting the error shown below; any suggestions on how to overcome this?
mr = db.runCommand({
    "mapreduce": "MyCollectionName",
    "map": function() {
        // Walk the document recursively, emitting every key,
        // including keys of nested subdocuments
        var f = function() {
            for (var key in this) {
                if (this.hasOwnProperty(key)) {
                    emit(key, null);
                    if (typeof this[key] == 'object') {
                        f.call(this[key]);
                    }
                }
            }
        };
        f.call(this);
    },
    "reduce": function(key, stuff) { return null; },
    "out": "MyCollectionName" + "_keys"
});
print(db[mr.result].distinct("_id"));
I am getting this error:
MongoServerError: distinct too big, 16mb cap
I am fairly new to MongoDB, so forgive my ignorance…

Once you create the new collection containing all the field names, you seem to be running distinct on the "_id" field of that collection. "_id" values are unique by definition, so you can just run db.{collectionName}.find({}) instead of distinct. Also, the MongoDB docs explicitly state that distinct results must not exceed the max BSON size (16 MB):
Results must not be larger than the maximum BSON size. If your results exceed the maximum BSON size, use the aggregation pipeline to retrieve distinct values using the $group operator, as described in Retrieve Distinct Values with the Aggregation Pipeline.
You can use the aggregation pipeline for that, like this:
db.{collectionName}.aggregate([ { $group: { _id: "$fieldName" } } ])
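For example, assuming the map-reduce above wrote its output to MyCollectionName_keys, either of these sketches lists the keys without hitting the 16 MB distinct cap (find streams results through a cursor):
// Every key is stored as a document's _id, so a plain find is enough:
db.MyCollectionName_keys.find({}, { _id: 1 })
// Or the cursor-based aggregation equivalent of distinct on a field:
db.MyCollectionName_keys.aggregate([ { $group: { _id: "$_id" } } ])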

Related

Group MongoDB collection by 5 fields where the collection has more than 5 million records

The Mongo query I used, which works on a smaller chunk of data, is:
var map = function () {
    emit({item: this.item, outlet: this.outlet, batch: this.batch,
          item_name: this.item_name, outlet_name: this.outlet_name},
         {count: 1, _id: this._id});
}
var reduce = function(k, values) {
    var result = {count: 0, _id: []};
    values.forEach(function(value) {
        result.count += value.count;
        // Caution: on a re-reduce, value._id is already an array, so arrays end
        // up nested inside arrays - the likely source of the depth-limit error below
        result._id.push(value._id);
    });
    return result;
}
db.ItemMovementSummary.mapReduce(map, reduce, {out: {replace: "multi_result"}})
db.multi_result.find({'value.count': {$gte: 2}})
The error Mongo is giving me is:
"message" : "map reduce failed:MongoError: Exceeded depth limit of 150
when converting js object to BSON. Do you have a cycle?",
"code" : 17279
Is there any way to solve this? I did try to group them initially, which failed with the error:
Plan executor error during group command :: caused by :: group() can't
handle more than 20000 unique keys
The db.collection.group() method does not work with sharded clusters. Use the aggregation framework or map-reduce in sharded environments.
The result set must fit within the maximum BSON document size.
In version 2.2, the returned array can contain at most 20,000 elements; i.e. at most 20,000 unique groupings. For group by operations that results in more than 20,000 unique groupings, use mapReduce. Previous versions had a limit of 10,000 elements.
http://man.hubwiz.com/docset/MongoDB.docset/Contents/Resources/Documents/docs.mongodb.org/manual/reference/command/mapReduce/index.html#dbcmd.mapReduce
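As a sketch of that suggestion (the stage layout here is an assumption of mine, not the original poster's code), the same five-field grouping can be written as an aggregation pipeline, which sidesteps both the 20,000-key group() limit and the nested-array depth problem in the reduce above:
db.ItemMovementSummary.aggregate([
    { $group: {
        _id: { item: "$item", outlet: "$outlet", batch: "$batch",
               item_name: "$item_name", outlet_name: "$outlet_name" },
        count: { $sum: 1 },    // how many documents share this key
        ids: { $push: "$_id" } // collect the duplicate document ids
    }},
    { $match: { count: { $gte: 2 } } }, // keep only actual duplicates
    { $out: "multi_result" }            // write to a collection, like mapReduce's replace
], { allowDiskUse: true })              // let a large $group stage spill to disk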

Mongodb distinct on a key in a key value pair situation

I have the below document structure in MongoDB; I want to get the distinct keys from the customData column.
So if you look below, I want my result to be: key1, key2, key3, key4.
Doing
db.coll.distinct("customData")
will return the values, not the keys.
{
    "_id": ObjectId("56c4da4f681ec51d32a4053d"),
    "accountUnique": 7356464,
    "customData": {
        "key1": 1,
        "key2": 2
    }
}
{
    "_id": ObjectId("56c4da4f681ec51d32a4054d"),
    "accountUnique": 7356464,
    "customData": {
        "key3": 1,
        "key4": 2
    }
}
It's possible to do this with Map-Reduce, since you have dynamic subdocument keys for which the distinct method will not return a result.
Running the following mapreduce operation will populate a separate collection with all the keys as the _id values:
var myMapReduce = db.runCommand({
    "mapreduce": "coll",
    "map": function() {
        for (var key in this.customData) { emit(key, null); }
    },
    "reduce": function() {},
    "out": "coll_keys"
})
To get a list of all the dynamic keys, run distinct on the resulting collection:
db[myMapReduce.result].distinct("_id")
This will give you the sample output:
["key1", "key2", "key3", "key4"]

Can MongoDB aggregate "top x" results in this document schema?

{
    "_id": "user1_20130822",
    "metadata": {
        "date": ISODate("2013-08-22T00:00:00.000Z"),
        "username": "user1"
    },
    "tags": {
        "abc": 19,
        "123": 2,
        "bca": 64,
        "xyz": 14,
        "zyx": 12,
        "321": 7
    }
}
Given the schema example above, is there a way to query this to retrieve the top "x" tags, e.g., the top 3 "tags" sorted descending?
Is this possible in a single document? e.g., top tags for a user on a given day
What if I have multiple documents that need to be combined together before getting the top? E.g., top tags for a user in a given month.
I know this can be done by using a "document per user per tag per day" or by making "tags" an array, but I'd like to be able to do this as above, as it makes in-place $inc's easier (many more of these happen than reads).
Or do I need to return back the whole document, and defer to the client on the sorting/limiting?
When you use object-keys as tag-names, you are making this kind of reporting very difficult. The aggregation framework has no $unwind-equivalent for objects. But there is always MapReduce.
Have your map-function emit one document for each key/value pair in the tags-subdocument. It should look something like this:
var mapFunction = function() {
    for (var key in this.tags) {
        emit(key, this.tags[key]);
    }
}
Your reduce-function would then sum up the values emitted for the same key.
var reduceFunction = function(key, values) {
    var sum = 0;
    for (var i = 0; i < values.length; i++) {
        sum += values[i];
    }
    return sum;
}
The complete MapReduce command would look something like this:
db.runCommand(
    {
        mapReduce: "yourcollection", // the collection where your data is stored
        query: { _id: "user1_20130822" }, // or however you want to limit the results
        map: mapFunction,
        reduce: reduceFunction,
        out: { inline: 1 } // means that the output is returned directly
    }
)
This will return all tags in unpredictable order. MapReduce has a sort and a limit option, but these only work on a field which has an index in the original collection, so you can't use them on a computed field. To get only the top 3, you would have to sort the results on the application-level. If you insist on doing the sorting and limiting on the database, define an output-collection to store the mapReduce results in (with the out-option set to out: { replace: "temporaryCollectionName" }) and then query that collection with sort and limit afterwards.
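For example, with the hypothetical output collection above (mapReduce stores each reduced sum under the value field):
db.temporaryCollectionName.find().sort({ value: -1 }).limit(3) // top 3 tags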
Keep in mind that when you use an intermediate collection, you must make sure that no two users run MapReduces with different queries into the same collection. When you have multiple users who want to view your top-3 list, you could let them query the output-collection and do the MapReduce in the background at regular intervals.

Find largest document size in MongoDB

Is it possible to find the largest document size in MongoDB?
db.collection.stats() shows average size, which is not really representative because in my case sizes can differ considerably.
You can use a small shell script to get this value.
Note: this will perform a full table scan, which will be slow on large collections.
let max = 0, id = null;
db.test.find().forEach(doc => {
    const size = Object.bsonsize(doc);
    if (size > max) {
        max = size;
        id = doc._id;
    }
});
print(id, max);
If you also want the _id, try this. Given a collection called "requests":
// Creates a sorted list of {size, _id} pairs, then takes the max
db.requests.find().toArray().map(function(request) {
    return { size: Object.bsonsize(request), _id: request._id };
}).sort(function(a, b) { return a.size - b.size; }).pop();
// { "size" : 3333, "_id" : "someUniqueIdHere" }
Note: this will attempt to store the whole result set in memory (from .toArray). Be careful on big data sets, and do not use in production! Abishek's answer has the advantage of working over a cursor instead of an in-memory array.
Starting Mongo 4.4, the new aggregation operator $bsonSize returns the size in bytes of a given document when encoded as BSON.
Thus, in order to find the bson size of the document whose size is the biggest:
// { "_id" : ObjectId("5e6abb2893c609b43d95a985"), "a" : 1, "b" : "hello" }
// { "_id" : ObjectId("5e6abb2893c609b43d95a986"), "c" : 1000, "a" : "world" }
// { "_id" : ObjectId("5e6abb2893c609b43d95a987"), "d" : 2 }
db.collection.aggregate([
    { $group: {
        _id: null,
        max: { $max: { $bsonSize: "$$ROOT" } }
    }}
])
// { "_id" : null, "max" : 46 }
This:
$groups all items together into a single bucket (_id: null)
takes the $max of each document's $bsonSize within that group
$$ROOT represents the current document for which we get the BSON size
Finding the largest documents in a MongoDB collection can be ~100x faster than the other answers if you use the aggregation framework and a tiny bit of knowledge about the documents in the collection. Also, you'll get the results in seconds, vs. minutes with the other approaches (forEach, or worse, getting all documents to the client).
You need to know which field(s) in your document might be the largest ones - which you almost always will know. There are only two practical[1] MongoDB types that can have variable sizes:
arrays
strings
The aggregation framework can calculate the length of each. Note that you won't get the size in bytes for arrays, but the length in elements. However, what matters more typically is which the outlier documents are, not exactly how many bytes they take.
Here's how it's done for arrays. As an example, let's say we have a collection of users in a social network and we suspect the array friends.ids might be very large (in practice you should probably keep a separate field like friendsCount in sync with the array, but for the sake of example, we'll assume that's not available):
db.users.aggregate([
    { $match: {
        'friends.ids': { $exists: true }
    }},
    { $project: {
        sizeLargestField: { $size: '$friends.ids' }
    }},
    { $sort: {
        sizeLargestField: -1
    }},
])
The key is to use the $size aggregation pipeline operator. It only works on arrays though, so what about text fields? We can use the $strLenBytes operator. Let's say we suspect the bio field might also be very large:
db.users.aggregate([
    { $match: {
        bio: { $exists: true }
    }},
    { $project: {
        sizeLargestField: { $strLenBytes: '$bio' }
    }},
    { $sort: {
        sizeLargestField: -1
    }},
])
You can also combine $size and $strLenBytes using $sum to calculate the size of multiple fields. In the vast majority of cases, 20% of the fields will take up 80% of the size (if not 10/90 or even 1/99), and large fields must be either strings or arrays.
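A sketch of that combination, reusing the hypothetical friends.ids and bio fields from above ($ifNull guards against missing fields, which would otherwise make $size and $strLenBytes raise errors):
db.users.aggregate([
    { $project: {
        approxSize: { $sum: [
            { $size: { $ifNull: ["$friends.ids", []] } }, // array length in elements
            { $strLenBytes: { $ifNull: ["$bio", ""] } }   // string length in bytes
        ]}
    }},
    { $sort: { approxSize: -1 } }
])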
[1] Technically, the rarely used binData type can also have variable size.
Well, this is an old question, but I thought I'd share my two cents about it.
My approach: use Mongo's mapReduce function.
First, let's get the size of each document:
db.myCollection.mapReduce(
    function() { emit(this._id, Object.bsonsize(this)) }, // map each document to an id / size pair
    function(key, val) { return val }, // val = document size value (a single value for each document)
    {
        query: {}, // query all documents
        out: { inline: 1 } // just return the result (don't create a new collection for it)
    }
)
This will return all document sizes, although it is worth mentioning that saving the output as a collection is a better approach (the result is an array of results inside the result field).
Second, let's get the max document size by tweaking this query:
db.myCollection.mapReduce(
    function() { emit(0, Object.bsonsize(this)) }, // map a fixed fake key (0) with the document size as value
    function(key, vals) { return Math.max.apply(Math, vals) }, // use Math.max to get the max of vals (each val = a document size)
    { query: {}, out: { inline: 1 } } // same as the first example
)
This will give you a single result whose value equals the max document size.
In short:
you may want to use the first example and save its output as a collection (change the out option to the name of the collection you want) and apply further aggregations on it (max size, min size, etc.; see the sketch after this list)
-OR-
you may want to use a single query (the second option) to get a single stat (min, max, avg, etc.)
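A sketch of the first option (the doc_sizes collection name is illustrative): write the per-document sizes to a collection, then aggregate whatever stats you need from it:
db.myCollection.mapReduce(
    function() { emit(this._id, Object.bsonsize(this)) },
    function(key, val) { return val },
    { query: {}, out: "doc_sizes" } // store sizes in a collection instead of returning them inline
)
db.doc_sizes.aggregate([
    { $group: { _id: null,
                max: { $max: "$value" },
                min: { $min: "$value" },
                avg: { $avg: "$value" } } }
])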
If you're working with a huge collection, loading it all into memory at once will not work, since you'll need more RAM than the size of the entire collection.
Instead, you can process the entire collection in batches using the following package I created:
https://www.npmjs.com/package/mongodb-largest-documents
All you have to do is provide the MongoDB connection string and collection name. The script will output the top X largest documents when it finishes traversing the entire collection in batches.
Inspired by Elad Nana's package, but usable in a MongoDB console:
function biggest(collection, limit=100, sort_delta=100) {
    var documents = [];
    var cursor = collection.find().readPref("nearest");
    while (cursor.hasNext()) {
        var doc = cursor.next();
        var size = Object.bsonsize(doc);
        if (documents.length < limit || size > documents[limit-1].size) {
            documents.push({ id: doc._id.toString(), size: size });
        }
        if (documents.length > (limit + sort_delta) || !cursor.hasNext()) {
            documents.sort(function (first, second) {
                return second.size - first.size;
            });
            documents = documents.slice(0, limit);
        }
    }
    return documents;
};
biggest(db.collection)
Uses a cursor
Gives a list of the limit biggest documents, not just the biggest one
Sorts & trims the output list to limit every sort_delta iterations
Uses nearest as the read preference (you might also want to use rs.slaveOk() on the connection to be able to list collections if you're on a slave node)
As Xavier Guihot already mentioned, a new $bsonSize aggregation operator was introduced in Mongo 4.4, which can give you the size of the object in bytes. In addition to that just wanted to provide my own example and some stats.
Usage example:
// I had an `orders` collection in the following format
[
    {
        "uuid": "64178854-8c0f-4791-9e9f-8d6767849bda",
        "status": "new",
        ...
    },
    {
        "uuid": "5145d7f1-e54c-44d9-8c10-ca3ce6f472d6",
        "status": "complete",
        ...
    },
    ...
];
// and I've run the following query to get documents' size
db.getCollection("orders").aggregate(
[
{
$match: { status: "complete" } // pre-filtered only completed orders
},
{
$project: {
uuid: 1,
size: { $bsonSize: "$$ROOT" } // added object size
}
},
{
$sort: { size: -1 }
},
],
{ allowDiskUse: true } // required as I had huge amount of data
);
As a result, I received a list of documents sorted by size in descending order.
Stats:
For the collection of ~3M records and ~70GB size in total, the query above took ~6.5 minutes.

finding duplicates using map reduce from mongodb

I need to find the duplicates in a MongoDB collection which has around 20,000 documents. The result should give me the key (on which I am grouping) and the count of times it is repeated, but only if the count is greater than 1. The below is not complete; moreover, it gives an error when I run it in the mongo.exe shell:
db.runCommand({ mapreduce: users,
    map: function Map() {
        emit(this.emailId, 1);
    }
    reduce: function Reduce(key, vals) {
        return Array.sum(vals);
    }
    finalize: function Finalize(key, reduced) {
        return reduced
    }
    out: { inline: 1 }
});
SyntaxError: missing } after property list (shell):5
Why is the above error coming?
And how can I get only the ones with a count greater than 1?
I'm not sure if that is an exact copy of the code you've entered, but it looks like you're missing commas between the fields in the object being passed to runCommand. Try:
db.runCommand({ mapreduce: "users", // note: the collection name must be a string
    map: function Map() {
        emit(this.emailId, 1);
    },
    reduce: function Reduce(key, vals) {
        return Array.sum(vals);
    },
    finalize: function Finalize(key, reduced) {
        return reduced;
    },
    out: { inline: 1 }
});
Also note that even when using finalize, you can't actually remove entries from the output document (or collection) in a single pass with Map-Reduce. However, whether you're using out: {inline: 1} or out: "some_collection", it is pretty trivial to filter out results where the count is 1.
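For instance, if the output went to a collection, the filter is a one-liner (mapReduce stores each count under the value field; the collection name is the placeholder from above):
db.some_collection.find({ value: { $gt: 1 } }) // only emailIds that occur more than once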