Is there a way to project the type of a field - mongodb

Suppose we had something like the following document, but we wanted to return only the fields that had numeric information:
{
"_id" : ObjectId("52fac254f40ff600c10e56d4"),
"name" : "Mikey",
"list" : [ 1, 2, 3, 4, 5 ],
"people" : [ "Fred", "Barney", "Wilma", "Betty" ],
"status" : false,
"created" : ISODate("2014-02-12T00:37:40.534Z"),
"views" : 5
}
Now I know that we can query for fields that match a certain type by use of the $type operator. But I'm yet to stumble upon a way to $project this as a field value. So if we looked at the document in the "unwound" form you would see this:
{
"_id" : ObjectId("52fac254f40ff600c10e56d4"),
"name" : 2,
"list" : 16,
"people" : 2
"status" : 8,
"created" : 9,
"views" : 16
}
The final objective would be list only the fields that matched a certain type, let's say compare to get the numeric types and filter out the fields, after much document mangling, to produce a result as follows:
{
"_id" : ObjectId("52fac254f40ff600c10e56d4"),
"list" : [ 1, 2, 3, 4, 5 ],
"views" : 5
}
Does anyone have an approach to handle this.

There are a few issues that make this not practical:
Since the query is a distinctive parameter from the ability to do a projection, this isn't possible from a single query alone, as the projection cannot be influenced by the results of the query
As there's no way with the aggregation framework to iterate fields and check type, that's also not an option
That being said, there's a slightly whacky way of using a Map-Reduce that does get similar answers, albeit in a Map-Reduce style output that's not awesome:
map = function() {
function isNumber(n) {
return !isNaN(parseFloat(n)) && isFinite(n);
}
var numerics = [];
for(var fn in this) {
if (isNumber(this[fn])) {
numerics.push({f: fn, v: this[fn]});
}
if (Array.isArray(this[fn])) {
// example ... more complex logic needed
if(isNumber(this[fn][0])) {
numerics.push({f: fn, v: this[fn]});
}
}
}
emit(this._id, { n: numerics });
};
reduce = function(key, values) {
return values;
};
It's not complete, but the results are similar to what you wanted:
"_id" : ObjectId("52fac254f40ff600c10e56d4"),
"value" : {
"n" : [
{
"f" : "list",
"v" : [
1,
2,
3,
4,
5
]
},
{
"f" : "views",
"v" : 5
}
]
}
The map is just looking at each property and deciding whether it looks like a number ... and if so, adding to an array that will be stored as an object so that the map-reduce engine won't choke on array output. I've kept it simple in the example code -- you could improve the logic of numeric and array checking for sure. :)
Of course, it's not live like a find or aggregation, but as MongoDB wasn't designed with this in mind, this may have to do if you really wanted this functionality.

Related

How to Find the Equivalent of Broken "Foreign Key" Relationships?

I have two collections which are in what we would call a "one to one relationship" if this were a relational database. I do not know why one is not nested within the other, but the fact is that for every document in collection "A", there is meant to be a document in collection "B" and vice versa.
Of course, in the absence of foreign key constraints, and in the presence of bugs, sometimes there is a document in "A" which does not have a related document in "B" (or vice versa).
I am new to MongoDB and I'm having trouble creating a query or script which will find me all the documents in "A" which do not have a related document in "B" (and vice versa). I guess I could use some sort of loop, but I don't yet know how that would work - I've only just started using simple queries on the RoboMongo command line.
Can anyone get me started with scripts for this? I have seen "Verifying reference (foreign key) integrity in MongoDB", but that doesn't help me. A bug has caused the "referential integrity" to break down, and I need the scripts in order to help me track down the bug. I also cannot redesign the database to use embedding (though I expect I'll ask why one document is not nested within the other).
I have also seen "How to find items in a collections which are not in another collection with MongoDB", but it has no answers.
Pseudo-Code for a Bad Technique
var misMatches = [];
var collectionB = db.getCollection('B');
var allOfA = db.getCollection('A').find();
while (allOfA.hasNext()) {
var nextA = allOfA.next();
if (!collectionB.find(nextA._id)) {
misMatches.push(nextA._id);
}
}
I don't know if this scales well, but ...
... given this sample date set:
> db.a.insert([{a:1},{a:2},{a:10} ])
> db.b.insert([ {b:2},{b:10},{b:20}])
// ^^^^^ ^^^^^^
// inconsistent 1-to-1 relationship
You could use map-reduce to collect the set of key in a and merge it with the set of key from b:
mapA=function() {
emit(this.a, {col: ["a"]})
}
mapB=function() {
emit(this.b, {col: ["b"]})
}
reduce=function(key, values) {
// merge both `col` arrays; sort the result
return {col: values.reduce(
function(a,b) { return a.col.concat(b.col) }
).sort()}
}
Producing:
> db.a.mapReduce(mapA, reduce, {out:{replace:"result"}})
> db.b.mapReduce(mapB, reduce, {out:{reduce:"result"}})
> db.result.find()
{ "_id" : 1, "value" : { "col" : [ "a" ] } }
{ "_id" : 2, "value" : { "col" : [ "a", "b" ] } }
{ "_id" : 10, "value" : { "col" : [ "a", "b" ] } }
{ "_id" : 20, "value" : { "col" : [ "b" ] } }
Then it is quite easy to find all id that wasn't found in collection a and b. In addition you should be able to spot duplicate keys in one or the other collection:
> db.result.find({"value.col": { $ne: [ "a", "b" ]}})
{ "_id" : 1, "value" : { "col" : [ "a" ] } }
{ "_id" : 20, "value" : { "col" : [ "b" ] }

Resolving MongoDB DBRef array using Mongo Native Query and working on the resolved documents

My MongoDB collection is made up of 2 main collections :
1) Maps
{
"_id" : ObjectId("542489232436657966204394"),
"fileName" : "importFile1.json",
"territories" : [
{
"$ref" : "territories",
"$id" : ObjectId("5424892224366579662042e9")
},
{
"$ref" : "territories",
"$id" : ObjectId("5424892224366579662042ea")
}
]
},
{
"_id" : ObjectId("542489262436657966204398"),
"fileName" : "importFile2.json",
"territories" : [
{
"$ref" : "territories",
"$id" : ObjectId("542489232436657966204395")
}
],
"uploadDate" : ISODate("2012-08-22T09:06:40.000Z")
}
2) Territories, which are referenced in "Map" objects :
{
"_id" : ObjectId("5424892224366579662042e9"),
"name" : "Afghanistan",
"area" : 653958
},
{
"_id" : ObjectId("5424892224366579662042ea"),
"name" : "Angola",
"area" : 1252651
},
{
"_id" : ObjectId("542489232436657966204395"),
"name" : "Unknown",
"area" : 0
}
My objective is to list every map with their cumulative area and number of territories. I am trying the following query :
db.maps.aggregate(
{'$unwind':'$territories'},
{'$group':{
'_id':'$fileName',
'numberOf': {'$sum': '$territories.name'},
'locatedArea':{'$sum':'$territories.area'}
}
})
However the results show 0 for each of these values :
{
"result" : [
{
"_id" : "importFile2.json",
"numberOf" : 0,
"locatedArea" : 0
},
{
"_id" : "importFile1.json",
"numberOf" : 0,
"locatedArea" : 0
}
],
"ok" : 1
}
I probably did something wrong when trying to access to the member variables of Territory (name and area), but I couldn't find an example of such a case in the Mongo doc. area is stored as an integer, and name as a string.
I probably did something wrong when trying to access to the member variables of Territory (name and area), but I couldn't find an example
of such a case in the Mongo doc. area is stored as an integer, and
name as a string.
Yes indeed, the field "territories" has an array of database references and not the actual documents. DBRefs are objects that contain information with which we can locate the actual documents.
In the above example, you can clearly see this, fire the below mongo query:
db.maps.find({"_id":ObjectId("542489232436657966204394")}).forEach(function(do
c){print(doc.territories[0]);})
it will print the DBRef object rather than the document itself:
o/p: DBRef("territories", ObjectId("5424892224366579662042e9"))
so, '$sum': '$territories.name','$sum': '$territories.area' would show you '0' since there are no fields such as name or area.
So you need to resolve this reference to a document before doing something like $territories.name
To achieve what you want, you can make use of the map() function, since aggregation nor Map-reduce support sub queries, and you already have a self-contained map document, with references to its territories.
Steps to achieve:
a) get each map
b) resolve the `DBRef`.
c) calculate the total area, and the number of territories.
d) make and return the desired structure.
Mongo shell script:
db.maps.find().map(function(doc) {
var territory_refs = doc.territories.map(function(terr_ref) {
refName = terr_ref.$ref;
return terr_ref.$id;
});
var areaSum = 0;
db.refName.find({
"_id" : {
$in : territory_refs
}
}).forEach(function(i) {
areaSum += i.area;
});
return {
"id" : doc.fileName,
"noOfTerritories" : territory_refs.length,
"areaSum" : areaSum
};
})
o/p:
[
{
"id" : "importFile1.json",
"noOfTerritories" : 2,
"areaSum" : 1906609
},
{
"id" : "importFile2.json",
"noOfTerritories" : 1,
"areaSum" : 0
}
]
Map-Reduce functions should not be and cannot be used to resolve DBRefs in the server side.
See what the documentation has to say:
The map function should not access the database for any reason.
The map function should be pure, or have no impact outside of the
function (i.e. side effects.)
The reduce function should not access the database, even to perform
read operations. The reduce function should not affect the outside
system.
Moreover, a reduce function even if used(which can never work anyway) will never be called for your problem, since a group w.r.t "fileName" or "ObjectId" would always have only one document, in your dataset.
MongoDB will not call the reduce function for a key that has only a
single value

mongodb $elemMatch in query return all sub docs

db.aaa.insert({"_id":1, "m":[{"_id":1,"a":1},{"_id":2,"a":2}]})
db.aaa.find({"_id":1,"m":{$elemMatch:{"_id":1}}})
{ "_id" : 1, "m" : [ { "_id" : 1, "a" : 1 }, { "_id" : 2, "a" : 2 } ] }
Using $elemMatch as query operator, it return all sub docs in 'm' !! Strange!
Use it as project operator:
db.aaa.find({"_id":1},{"m":{$elemMatch:{"_id":1}}})
{ "_id" : 1, "m" : [ { "_id" : 1, "a" : 1 } ] }
This is OK. Following this logic, use it as query operator in update will change all sub docs in 'm'. So I do:
db.aaa.update({"_id":1,"m":{$elemMatch:{"_id":1}}},{$set:{"m.$.a":3}})
db.aaa.find()
{ "_id" : 1, "m" : [ { "_id" : 1, "a" : 3 }, { "_id" : 2, "a" : 2 } ] }
It works in manner of as second example(project operator). This really confuse me.
Give me a explain
It Isn't strange, it's how it works.
You are using $elemMatch to match an element within an array contained in your document. That means it mactches the "document" and not the "array element", so it does not just selectively display only the array element that was matched.
What you can do, and how you used it in with the $set operator, is use a positional $ operator to indicate the matched "position" from your query side:
db.aaa.find({"_id":1},{"m":{$elemMatch:{"_id":1}}},{ "m.$": 1 })
And that will show you only one element of the array. But it is of course *still an array in the result shown, and you cannot cast it to a different type.
The other part of the usage is that this will only match once. And only the first match will be assigned to the positional operator.
So perhaps the most succinct explaination is you matching the "document that contains" the properties of the sub-document your specified in your query, and not just the "sub-document" itself.
See the documentation for more:
http://docs.mongodb.org/manual/reference/operator/projection/positional/
http://docs.mongodb.org/manual/reference/operator/query/elemMatch/

Listing, counting factors of unique Mongo DB values over all keys

I'm preparing a descriptive "schema" (quelle horreur) for a MongoDB I've been working with.
I used the excellent variety.js to create a list of all keys and show coverage of each key. However, in cases where the values corresponding to the keys have a small set of values, I'd like to be able to list the entire set as "available values." In R, I'd be thinking of these as the "factors" for the categorical variable, ie, gender : ["M", "F"].
I know I could just use R + RMongo, query each variable, and basically do the same procedure I would to create a histogram, but I'd like to know the proper Mongo.query()/javascript/Map,Reduce way to approach this. I understand the db.collection.aggregate() functions are designed for exactly this.
Before asking this, I referenced:
http://docs.mongodb.org/manual/reference/aggregation/
http://docs.mongodb.org/manual/reference/method/db.collection.distinct/
How to query for distinct results in mongodb with python?
Get a list of all unique tags in mongodb
http://cookbook.mongodb.org/patterns/count_tags/
But can't quite get the pipeline order right. So, for example, if I have documents like these:
{_id : 1, "key1" : "value1", "key2": "value3"}
{_id : 2, "key1" : "value2", "key2": "value3"}
I'd like to return something like:
{"key1" : ["value1", "value2"]}
{"key2" : ["value3"]}
Or better, with counts:
{"key1" : ["value1" : 1, "value2" : 1]}
{"key2" : ["value3" : 2]}
I recognize one problem with doing this will be any values that have a wide range of different values---so, text fields, or continuous variables. Ideally, if there were more than x different possible values, it would be nice to truncate, say to no more than 20 unique values. If I find it's actually more, I'd query that variable directly.
Is this something like:
db.collection.aggregate(
{$limit: 20,
$group: {
_id: "$??varname",
count: {$sum: 1}
}})
First, how can I reference ??varname? for the name of each key?
I saw this link which had 95% of it:
Binning and tabulate (unique/count) in Mongo
with...
input data:
{ "_id" : 1, "age" : 22.34, "gender" : "f" }
{ "_id" : 2, "age" : 23.9, "gender" : "f" }
{ "_id" : 3, "age" : 27.4, "gender" : "f" }
{ "_id" : 4, "age" : 26.9, "gender" : "m" }
{ "_id" : 5, "age" : 26, "gender" : "m" }
This script:
db.collection.aggregate(
{$project: {gender:1}},
{$group: {
_id: "$gender",
count: {$sum: 1}
}})
Produces:
{"result" :
[
{"_id" : "m", "count" : 2},
{"_id" : "f", "count" : 3}
],
"ok" : 1
}
But what I don't understand is how could I do this generically for an unknown number/name of keys with a potentially large number of return values? This sample knows the key name is gender, and that the response set will be small (2 values).
If you already ran a script that outputs the names of all keys in the collection, you can generate your aggregation framework pipeline dynamically. What that means is either extending the variety.js type script or just writing your own.
Here is what it might look like in JS if passed an array called "keys" which has several non-"_id" named fields (I'm assuming top level fields and that you don't care about arrays, embedded documents, etc).
keys = ["key1", "key2"];
group = { "$group" : { "_id" : null } } ;
keys.forEach( function(f) {
group["$group"][f+"List"] = { "$addToSet" : "$" + f }; } );
db.collection.aggregate(group);
{
"result" : [
{
"_id" : null,
"key1List" : [
"value2",
"value1"
],
"key2List" : [
"value3"
]
}
],
"ok" : 1
}

is map/reduce appropriate for finding the median and mode of a set of values for many records?

I have a set of objects in Mongodb that each have a set of values embedded in them, e.g.:
[1.22, 12.87, 1.24, 1.24, 9.87, 1.24, 87.65] // ... up to about 150 values
Is a map/reduce the best solution for finding the median (average) and mode (most common value) in the embedded arrays? The reason that I ask is that the map and the reduce both have to return the same (structurally) set of values. It looks like in my case I want to take in a set of values (the array) and return a set of two values (median, mode).
If not, what's the best way to approach this? I want it to run in a rake task, if that's relevant. It'd be an overnight data crunching kind of thing.
There's a key question here regarding the expected output. It's not 100% clear from your question which one you want.
Do you want (A):
{ _id: "document1", value: { mode: 1.0, median: 10.0 } }
{ _id: "document2", value: { mode: 5.0, median: 150.0 } }
... one for each document
... or do you want (B), the mode and median across all the combination of all arrays.
If the answer is (A), then Map/Reduce will work.
If the answer is (B), then Map/Reduce will probably not work.
If you plan to do (A), please read the M/R documentation carefully and understand the limitations. While option (A) can be a Map/Reduce, it can also just be a big for loop with an upsert on the "summary" collection or even back into the original collection. This may be even more efficient.
I would start by read this http://www.mongovue.com/2010/11/03/yet-another-mongodb-map-reduce-tutorial/.
I think you want
A map stage to generate your key and single data element,
A reduce stage to place all the data elements into
a data array for each key,
A finalize stage to perform your mean, median and mode
operations on the entire collection.
Finalize Function
A finalize function may be run after reduction.
Such a function is optional and is not necessary for many map/reduce
cases. The finalize function takes a key and a value, and returns a
finalized value.
function finalize(key, value) -> final_value
Your reduce function may
be called multiple times for the same object. Use finalize when
something should only be done a single time at the end; for example
calculating an average.
Taken from http://www.mongodb.org/display/DOCS/MapReduce
I assume you want to find the mode & median of each document, you can do this with map reduce. In this case you calculate median & mode in the map function and reduce will return the map result untouched
map = function() {
var res = 0;
for (i = 0; i < this.marks.length; i++) {
res = res + this.marks[i];
}
var median = res/this.marks.length;
emit(this._id,{marks:this.marks,median:median});
}
reduce = function (k, values) {
values.forEach(function(value) {
result = value;
});
return result;
}
and for this collection
{ "_id" : ObjectId("4f02be1f1ae045175f0eb9f1"), "name" : "ram", "marks" : [ 1.22, 12.87, 1.24, 1.24, 9.87, 1.24, 87.65 ] }
{ "_id" : ObjectId("4f02be371ae045175f0eb9f2"), "name" : "sam", "marks" : [ 1.32, 11.87, 12.4, 4.24, 9.37, 3.24, 7.65 ] }
{ "_id" : ObjectId("4f02be4c1ae045175f0eb9f3"), "name" : "pam", "marks" : [ 3.32, 10.17, 11.4, 2.24, 2.37, 3.24, 30.65 ] }
you can get the median by
db.test.mapReduce(map,reduce,{out: { inline : 1}})
{
"results" : [
{
"_id" : ObjectId("4f02be1f1ae045175f0eb9f1"),
"value" : {
"marks" : [
1.22,
12.87,
1.24,
1.24,
9.87,
1.24,
87.65
],
"median" : 16.475714285714286
}
},
{
"_id" : ObjectId("4f02be371ae045175f0eb9f2"),
"value" : {
"marks" : [
1.32,
11.87,
12.4,
4.24,
9.37,
3.24,
7.65
],
"median" : 7.155714285714285
}
},
{
"_id" : ObjectId("4f02be4c1ae045175f0eb9f3"),
"value" : {
"marks" : [
3.32,
10.17,
11.4,
2.24,
2.37,
3.24,
30.65
],
"median" : 9.055714285714286
}
}
],
"timeMillis" : 1,
"counts" : {
"input" : 3,
"emit" : 3,
"reduce" : 0,
"output" : 3
},
"ok" : 1,
}