Query for duplicates in an array without using "aggregate()", only "find()", with MongoDB

I have intermediate experience with MongoDB, and JavaScript is not a language I know well.
I used the solution proposed in this topic, but it is very heavy on my RAM.
I partially found another way to solve my problem.
Inspired by this page and Kamil Naja's answer, I wrote this code:
db.coll.find({
    $where: function() {
        return new Set(this.field).size !== this.field.length;
    }
})
It's more convenient to write and it's faster, but it misses something specific to my problem: I only want to detect duplicates among integer numbers.
For instance, here are two arrays with different contents; both have duplicates, but not of the same type:
[1,2,3,4,5,6,6,7,8,'a','t'], array in the field of document 1
[1,2,3,4,5,6,7,8,'a','a'], array in the field of document 2
With the current code above, the query selects both documents, while I want it to return only document 1, because that one contains duplicate integers.
How can I implement this condition, still using find() and not aggregate()?

We can take only numbers into account with the Array.filter() method:
db.coll.find({
    $where: function() {
        // Filter once, then compare the unique count to the total count.
        var numbers = this.field.filter(x => typeof x === 'number');
        return new Set(numbers).size !== numbers.length;
    }
})
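Note that typeof x === 'number' also matches floats. If you strictly want duplicates among integers, a variant using Number.isInteger (a sketch; assumes your server's JavaScript engine supports it, as modern MongoDB versions do) would be:
db.coll.find({
    $where: function() {
        // Keep only integer elements, then compare unique count to total count.
        var ints = this.field.filter(x => Number.isInteger(x));
        return new Set(ints).size !== ints.length;
    }
})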

Related

Sort by string length in MongoDB/pymongo

I was wondering if anyone knows how to sort a MongoDB find() result by string length.
I have tried something like db.foo.find().sort({'item.length': -1}), but that obviously doesn't work. Can somebody help me, and also suggest a way to do the same thing in pymongo?
There are a lot of things (and basic API features) I would personally love to see in the aggregation framework, such as:
Math functions:
- log (as in logarithm)
- ceil
- floor
Array:
- sum
String:
- length
Just to name a few.
And that is without resorting to obscure usages of the $mod operator or other means in such cases as "ceil" and "floor". But I digress.
Your "string length" falls into this category. Raise a JIRA issue about it. But for now you you can use mapReduce and the existing JavaScript functionality:
db.collection.mapReduce(
    function() {
        emit( this.item.length, this.item );
    },
    function(key, values) {
        return values;
    },
    { "out": { "inline": 1 } }
)
So while that does have the funky mapReduce style of returning a re-shaped document, with everything matching the same length collected in an array, it takes advantage of the nature of mapReduce (not restricted to MongoDB): the emitted "key" value is sorted in the response.
There is now a solution for this in MongoDB 3.4+, using the aggregation framework's $strLenBytes operator. Given the following document:
{_id: 0, name: "Bob"}
We can use
db.mycollection.aggregate([{
    $project: {
        byteLength: { $strLenBytes: "$name" }
    }
}])
Which will return 3 for the number of bytes.
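To actually sort by string length with this operator, you can compute the length and then sort on it; a sketch, assuming the item field from the question:
db.foo.aggregate([
    { $addFields: { itemLength: { $strLenBytes: "$item" } } }, // byte length of each string
    { $sort: { itemLength: -1 } }                              // longest first
])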
No, actually it is not possible.
I was dealing with a similar problem; what I did was store the string length of every object as a property of the object itself. This bypassed the problem.
If you think this should be implemented (I do), I recommend you upvote the issue in JIRA, which for some reason does not have many votes:
https://jira.mongodb.org/browse/SERVER-5319

Fetch records from MongoDB based on type and ancestry fields

In MongoDB, the records are stored like this:
{_id:100,type:"section",ancestry:nil,.....}
{_id:300,type:"section",ancestry:100,.....}
{_id:400,type:"problem",ancestry:100,.....}
{_id:500,type:"section",ancestry:100,.....}
{_id:600,type:"problem",ancestry:500,.....}
{_id:700,type:"section",ancestry:500,.....}
{_id:800,type:"problem",ancestry:100,.....}
I want to fetch the records in this order:
first, the record whose ancestry is nil
then all records whose parent is the first record found and whose type is 'problem'
then all records whose parent is the first record found and whose type is 'section'
Expected output is
{_id:100,type:"section",ancestry:nil,.....}
{_id:400,type:"problem",ancestry:100,.....}
{_id:800,type:"problem",ancestry:100,.....}
{_id:300,type:"section",ancestry:100,.....}
{_id:500,type:"section",ancestry:100,.....}
{_id:600,type:"problem",ancestry:500,.....}
{_id:700,type:"section",ancestry:500,.....}
Try this MongoDB shell command:
db.collection.find().sort({ancestry:1, type: 1})
In languages where ordered dictionaries aren't available, pass a list of 2-tuples as the sort argument. Something like this (Python):
import pymongo
collection.find({}).sort([('ancestry', pymongo.ASCENDING), ('type', pymongo.ASCENDING)])
@vinipsmaker's answer is good. However, it doesn't work properly if the _ids are random numbers or if there are documents that aren't part of the tree structure. In that case, the following code works correctly:
function getSortedItems() {
    var sorted = [];
    var ids = [ null ]; // start from the root (no ancestry)
    while (ids.length > 0) {
        // Breadth-first: fetch the children of the next known parent.
        var cursor = db.Items.find({ ancestry: ids.shift() }).sort({ type: 1 });
        while (cursor.hasNext()) {
            var item = cursor.next();
            ids.push(item._id); // remember this node as a future parent
            sorted.push(item);
        }
    }
    return sorted;
}
Note that this code is not fast, because db.Items.find() is executed n times, where n is the number of documents in the tree structure.
If the tree structure is huge, or you will run the sort many times, you can optimize this by using the $in operator in the query and sorting the result on the client side, as sketched below.
In addition, creating an index on the ancestry field will make the code faster in either case.
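A sketch of that $in optimization, fetching one whole tree level per query and ordering each level on the client (collection and field names as above; untested):
function getSortedItemsByLevel() {
    var sorted = [];
    var ids = [ null ];
    while (ids.length > 0) {
        // Fetch every child of the current level in a single round trip.
        var level = db.Items.find({ ancestry: { $in: ids } }).toArray();
        // Order the level client side: group by parent, then 'problem' before
        // 'section' (plain string comparison), matching the expected output.
        level.sort(function (a, b) {
            if (String(a.ancestry) !== String(b.ancestry)) {
                return String(a.ancestry) < String(b.ancestry) ? -1 : 1;
            }
            return a.type < b.type ? -1 : (a.type > b.type ? 1 : 0);
        });
        sorted = sorted.concat(level);
        ids = level.map(function (item) { return item._id; });
    }
    return sorted;
}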

MongoDB MapReduce - Emit one key/one value doesn't call reduce

So I'm new to MongoDB and mapReduce in general, and I came across this "quirk" (or at least in my mind a quirk).
Say I have objects in my collection like so:
{'key':5, 'value':5}
{'key':5, 'value':4}
{'key':5, 'value':1}
{'key':4, 'value':6}
{'key':4, 'value':4}
{'key':3, 'value':0}
My map function simply emits the key and the value.
My reduce function simply adds the values AND, before returning them, adds 1 (I did this to check whether the reduce function is even called).
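For reference, a minimal sketch of the two functions described (illustrative, not the asker's exact code; Array.sum is available in MongoDB's map-reduce JavaScript environment):
var map = function () {
    emit(this.key, this.value);   // emit each document's key and value
};
var reduce = function (key, values) {
    return Array.sum(values) + 1; // the +1 reveals whether reduce ran at all
};
db.collection.mapReduce(map, reduce, { out: { inline: 1 } });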
My results follow:
{'_id': 3, 'value': 0}
{'_id':4, 'value': 11.0}
{'_id':5, 'value': 11.0}
As you can see, for the keys 4 & 5 I get the expected answer of 11 BUT for the key 3 (with only one entry in the collection with that key) I get the unexpected 0!
Is this natural behavior of mapreduce in general? For MongoDB? For pymongo (which I am using)?
The reduce function combines the values emitted for the same key into one result. If the map function emits only a single value for a particular key (as is the case for key 3), the reduce function is not called.
I realize this is an older question, but I came to it and felt like I still didn't understand why this behavior exists and how to build map/reduce functionality so it's a non-issue.
The reason MongoDB doesn't call the reduce function if there is a single instance of a key is because it isn't necessary (I hope this will make more sense in a moment). The following are requirements for reduce functions:
The reduce function must return an object whose type must be identical to the type of the value emitted by the map function.
The order of the elements in the valuesArray should not affect the output of the reduce function
The reduce function must be idempotent.
The first requirement is very important, and it seems many people overlook it: I've seen people doing their mapping inside the reduce function and then handling the single-key case in the finalize function. That is the wrong way to address the issue, however.
Think about it like this: if there's only a single instance of a key, a simple optimization is to skip the reducer entirely (there's nothing to reduce). Single-key values are still included in the output, but the intent of the reducer is to build an aggregate result for keys with multiple values. If the mapper and reducer output the same type, you should be blissfully unaware of this optimization when looking at the object structure of the output. You shouldn't have to use a finalize function to correct the structure of objects that didn't run through the reducer.
In short, do your mapping in your map function and reduce multi-key values into a single, aggregate result in your reduce functions.
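A sketch of that advice applied to the question's data: map and reduce both produce the same shape, so keys that skip the reducer still come out looking like reduced ones:
var map = function () {
    // The emitted value already has its final shape.
    emit(this.key, { total: this.value });
};
var reduce = function (key, values) {
    var result = { total: 0 };
    values.forEach(function (v) { result.total += v.total; });
    return result; // same { total: ... } shape that map emits
};
db.collection.mapReduce(map, reduce, { out: { inline: 1 } });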
Solution:
- add a new field in map: single: 0
- in reduce, change this field to: single: 1
- in finalize, check this field and take the required actions
$map = new MongoCode("function() {
    var value = {
        time: this.time,
        email_id: this.email_id,
        single: 0
    };
    emit(this.email, value);
}");
$reduce = new MongoCode("function(k, vals) {
    // do any reduce work you need here
    return {
        time: vals[0].time,
        email_id: vals[0].email_id,
        single: 1
    };
}");
$finalize = new MongoCode("function(key, reducedVal) {
    if (reducedVal.single == 0) {
        reducedVal.time = 11111;
    }
    return reducedVal;
}");
"MongoDB will not call the reduce function for a key that has only a single value. The values argument is an array whose elements are the value objects that are “mapped” to the key."
http://docs.mongodb.org/manual/reference/command/mapReduce/#mapreduce-reduce-cmd
Is this natural behavior of mapreduce in general?
Yes.

Ordering a result set randomly in mongo

I've recently discovered that Mongo has no SQL equivalent to "ORDER BY RAND()" in its command syntax (https://jira.mongodb.org/browse/SERVER-533).
I've seen the recommendation at http://cookbook.mongodb.org/patterns/random-attribute/ and, frankly, adding a random attribute to a document feels like a hack. It also won't work for me, because it places an implicit limit on any given query I want to randomize.
The other widely given suggestion is to choose a random index to offset from. Because of the order that my documents were inserted in, that will result in one of the string fields being alphabetized, which won't feel very random to a user of my site.
I have a couple of ideas on how I could solve this in code, but I feel like I'm missing a more obvious and native solution. Does anyone have a thought or idea on how to solve this more elegantly?
I have to agree: the easiest thing to do is to install a random value into your documents. There need not be a tremendously large range of values, either -- the number you choose depends on the expected result size for your queries (1,000 - 1,000,000 distinct integers ought to be enough for most cases).
When you run your query, don't worry about the random field -- instead, index it and use it to sort. Since there is no correspondence between the random number and the document, you should get fairly random results. Note that collisions will likely result in documents being returned in natural order.
While this is certainly a hack, you have a very easy escape route: given MongoDB's schema-free nature, you can simply stop including the random field once there is support for random sort in the server. If size is an issue, you could run a batch job to remove the field from existing documents. There shouldn't be a significant change in your client code if you design it carefully.
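That batch job could be as simple as this shell one-liner (the posts collection name is an assumption):
// Remove the helper field from every document once server-side random sort exists.
db.posts.update({}, { $unset: { rand: "" } }, { multi: true })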
An alternative option would be to think long and hard about the number of results that will be randomized and returned for a given query. It may not be overly expensive to simply do the shuffling in client code (e.g., if you only consider the most recent 10,000 posts).
What you want cannot be done without picking one of the two solutions you mention. Picking a random offset is a horrible idea if your collection grows beyond a few thousand documents, because the skip(n) operation takes O(n) time. In other words, the higher your random offset, the longer the query will take.
Adding a randomized field to the document is, in my opinion, the least hacky solution given MongoDB's current feature set. It provides stable query times and gives you some say over how the collection is randomized (and it allows you to generate a new random value after each query, through a findAndModify for example). I also do not understand how this would impose an implicit limit on queries that make use of randomization.
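A minimal sketch of that random-field pattern in the mongo shell (the posts collection and the rand field name are assumptions):
// Backfill a random value on documents that lack one, then index it.
db.posts.find({ rand: { $exists: false } }).forEach(function (doc) {
    db.posts.update({ _id: doc._id }, { $set: { rand: Math.random() } });
});
db.posts.ensureIndex({ rand: 1 });

// Pick a random pivot and read documents from there, wrapping around if short.
var pivot = Math.random();
var docs = db.posts.find({ rand: { $gte: pivot } }).sort({ rand: 1 }).limit(10).toArray();
if (docs.length < 10) {
    docs = docs.concat(
        db.posts.find({ rand: { $lt: pivot } }).sort({ rand: -1 }).limit(10 - docs.length).toArray()
    );
}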
You can give this a try - it's fast, works with multiple documents, and doesn't require populating the rand field up front; it will eventually populate itself:
- add an index on the .rand field of your collection
- use find and refresh, something like:
// Install packages:
//   npm install mongodb async
// Add an index in mongo:
//   db.mycollection.ensureIndex({ rand: 1 })
var mongodb = require('mongodb')
var async = require('async')

// Find n random documents by using the "rand" field.
function findAndRefreshRand (collection, n, fields, done) {
  var result = []
  var rand = Math.random()

  // Append documents matching criteria to the result;
  // skip the query once we already have n documents.
  var appender = function (criteria, options) {
    return function (done) {
      // Recompute the remaining limit at run time, after earlier steps have run.
      var limit = n - result.length
      if (limit > 0) {
        options.limit = limit
        collection.find(criteria, fields, options).toArray(
          function (err, docs) {
            if (!err && Array.isArray(docs)) {
              Array.prototype.push.apply(result, docs)
            }
            done(err)
          }
        )
      } else {
        async.nextTick(done)
      }
    }
  }

  async.series([
    // Fetch docs with uninitialized .rand.
    // NOTE: You can comment out this step if all docs have initialized .rand = Math.random()
    appender({ rand: { $exists: false } }, {}),
    // Fetch on one side of the random number.
    appender({ rand: { $gte: rand } }, { sort: { rand: 1 } }),
    // Continue fetching on the other side.
    appender({ rand: { $lt: rand } }, { sort: { rand: -1 } }),
    // Refresh the rand value of fetched docs, if any.
    function (done) {
      if (result.length > 0) {
        var batch = collection.initializeUnorderedBulkOp({ w: 0 })
        for (var i = 0; i < result.length; ++i) {
          // Use $set so the rest of the document is preserved.
          batch.find({ _id: result[i]._id }).updateOne({ $set: { rand: Math.random() } })
        }
        batch.execute(done)
      } else {
        async.nextTick(done)
      }
    }
  ], function (err) {
    done(err, result)
  })
}

// Example usage
mongodb.MongoClient.connect('mongodb://localhost:27017/core-development', function (err, db) {
  if (!err) {
    findAndRefreshRand(db.collection('profiles'), 1024, { _id: true, rand: true }, function (err, result) {
      if (!err) {
        console.log(result)
      } else {
        console.error(err)
      }
      db.close()
    })
  } else {
    console.error(err)
  }
})
The other widely given suggestion is to choose a random index to offset from. Because of the order that my documents were inserted in, that will result in one of the string fields being alphabetized, which won't feel very random to a user of my site.
Why? If you have 7,000 documents and you choose three random offsets between 0 and 6999, the chosen documents will be random, even if the collection itself is sorted alphabetically.
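For illustration, picking three random offsets in the shell could look like this (the posts collection is an assumption; note the answer above about skip(n) being O(n) on large collections):
var count = db.posts.count();
var picks = [];
for (var i = 0; i < 3; i++) {
    var offset = Math.floor(Math.random() * count); // uniform offset into the collection
    picks.push(db.posts.find().skip(offset).limit(1).next());
}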
One could insert an id field (the $id field won't work because it's not an actual number) and use modulus math to get a random skip. If you have 10,000 records and you want 10 results, you could randomly pick a modulus between 1 and 1,000, such as 253, and then request documents where mod(id, 253) = 0; this is reasonably fast if id is indexed. Then randomly sort those 10 results client side. Sure, they are evenly spaced out instead of truly random, but it's close to what is desired.
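A sketch of that $mod approach in the shell (assumes a numeric, indexed id field; names are illustrative):
var m = Math.floor(Math.random() * 1000) + 1;  // random modulus, e.g. 253
var r = Math.floor(Math.random() * m);         // random remainder within it
var picks = db.records.find({ id: { $mod: [m, r] } }).limit(10).toArray();
// Shuffle client side so the evenly spaced picks don't look ordered.
picks.sort(function () { return Math.random() - 0.5; });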
Both of the options seem like imperfect hacks to me: the random field will always keep the same value, and skip will return the same records for the same number.
Why not sort by some random field and then skip randomly? I admit it is also a hack, but in my experience it gives a better sense of randomness.

Selecting all the fields in a row using mapReduce

I am using Mongoose with Node.js. I am using mapReduce to fetch data grouped by a field, so all it gives me as a collection is the key (the grouping field only) from every row of the database.
I need to fetch all the fields from the database, grouped by one field and sorted by another. For example: I have a database holding details of places, the fare for travelling to those places, and a few other fields. Now I need to fetch the data grouped by place and sorted by the fare. MapReduce helps me get that, but I cannot get the other fields.
Is there a way to get all the fields using mapReduce, rather than just the two fields as in the above example?
I must admit I'm not sure I understand completely what you're asking.
But maybe one of the following thoughts helps you:
either) When you iterate over your mapReduce results, you could fetch the complete document from MongoDB for each result. That would give you access to all fields in each document, at the cost of some network traffic.
or) The value that you send into emit(key, value) can be an object. So you could construct a value object that contains all your desired fields. Just be sure to use exactly the same object structure for your reduce method's return value.
I'll try to illustrate with an (untested) example.
map = function() {
    emit(this.place, {
        'field1': this.field1,
        'field2': this.field2,
        'count' : 1
    });
}

reduce = function(key, values) {
    var result = {
        'field1': values[0].field1,
        'field2': values[0].field2,
        'count' : 0
    };
    for (var v in values) {
        result.count += values[v].count;
    }
    return result; // must return the same structure that map emits
}