How do I translate this SQL query to MongoDB?
Select * From Users Where type = "S" and registration_token = username;
I've tried this
$users = User::model()->findAll(array(
"type" => "S",
"registration_token" => "username"
));
but no joy...
This is a relational query, so please keep in mind that MongoDB isn't geared towards this kind of operation like SQL is. This type of query will usually be significantly slower than in an SQL database. I'd also consider this type of 'meta-logic' bad design, because the fact that two seemingly independent values match shouldn't mean anything.
That said, you still have two options. You can use the JavaScript-based $where:
db.coll.find({$and : [
{"type" : "S"},
{"$where": function() { return this.registration_token === this.username; } }]})
However, this approach is slow because it needs to fire up JavaScript for each object it finds (i.e. for all those with type == 'S'). It might also have security implications if any of the data in the $where comes from the end user.
Alternatively, you can use the aggregation pipeline:
> db.coll.aggregate([{ "$project": {
"username": "$username",
"type": "$type",
"registration_token" : "$registration_token",
"match": { "$eq": ["$username","$registration_token"]} }},
{ "$match": { "match": true } } ])
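On MongoDB 3.6 or later, the same field-to-field comparison can probably be written more directly with $expr inside an ordinary find filter. The following is only a sketch and not part of the original answer: it builds the filter object, then applies the equivalent predicate to a few made-up in-memory documents (the sample users are hypothetical) so the intent is visible without a server.

```javascript
// Filter that could be passed to db.coll.find(...) on MongoDB 3.6+.
// $expr lets a query compare two fields of the same document.
const filter = {
  type: "S",
  $expr: { $eq: ["$registration_token", "$username"] }
};

// Hypothetical sample data, only to illustrate which documents qualify.
const users = [
  { username: "alice", type: "S", registration_token: "alice" },
  { username: "bob", type: "S", registration_token: "x9f2" },
  { username: "carol", type: "T", registration_token: "carol" }
];

// Plain JavaScript equivalent of the filter above.
const matches = users.filter(
  u => u.type === "S" && u.registration_token === u.username
);

console.log(JSON.stringify(filter));
console.log(matches.map(u => u.username)); // only "alice" satisfies both conditions
```

Because the equality on type is a regular query condition, it can still use an index; only the field comparison is evaluated per document.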
Related
I'm in the process of updating some legacy software that is still running on Mongo 2.4. The first step is to upgrade to the latest 2.6 then go from there.
Running the db.upgradeCheckAllDBs(); gives us the DollarPrefixedFieldName: $id is not valid for storage. errors and indeed we have some older records with legacy $id, $ref fields. We have a number of collections that look something like this:
{
"_id" : "1",
"someRef" : {"$id" : "42", "$ref" : "someRef"}
},
{
"_id" : "2",
"someRef" : DBRef("someRef", "42")
},
{
"_id" : "3",
"someRef" : DBRef("someRef", "42")
},
{
"_id" : "4",
"someRef" : {"$id" : "42", "$ref" : "someRef"}
}
I want to script this to convert the older {"$id" : "42", "$ref" : "someRef"} objects to DBRef("someRef", "42") objects but leave the existing DBRef objects untouched. Unfortunately, I haven't been able to differentiate between the two types of objects.
Using typeof and $type simply say they are objects.
Both have $id and $ref fields.
In our groovy console when you pull one of the old ones back and one of the new ones getClass() returns DBRef for both.
We have about 80k records with this legacy format out of millions of total records. I'd hate to have to brute force it and modify every record whether it needs it or not.
This script will do what I need it to do but the find() will basically return all the records in the collection.
var cursor = db.someCollection.find({"someRef.$id" : {$exists: true}});
while(cursor.hasNext()) {
var rec = cursor.next();
db.someCollection.update({"_id": rec._id}, {$set: {"someRef": DBRef(rec.someRef.$ref, rec.someRef.$id)}});
}
Is there another way that I am missing that can be used to find only the offending records?
Update
As described in the accepted answer the order matters which made all the difference. The script we went with that corrected our data:
var cursor = db.someCollection.find(
{
$where: "function() { return this.someRef != null && Object.keys(this.someRef)[0] == '$id'; }"
}
);
while(cursor.hasNext()) {
var rec = cursor.next();
db.someCollection.update(
{"_id": rec._id},
{$set: {"someRef": DBRef(rec.someRef.$ref, rec.someRef.$id)}}
);
}
We did have a collection with a larger number of records that needed to be corrected where the connection timed out. We just ran the script again and it got through the remaining records.
There's probably a better way to do this. I would be interested in hearing about a better approach. For now, this problem is solved.
DBRef is a client-side convention. http://docs.mongodb.org/manual/reference/database-references/#dbrefs states it clearly:
The order of fields in the DBRef matters, and you must use the above sequence when using a DBRef.
Drivers rely on the fact that BSON preserves field order to recognise a DBRef, so you can do the same:
db.someCollection.find({ $expr: {
$let: {
vars: {firstKey: { $arrayElemAt: [ { $objectToArray: "$someRef" }, 0] } },
in: { $eq: [{ $substr: [ "$$firstKey.k", 1, 2 ] } , "id"]}
}
} } )
This returns the documents whose someRef fields are not in the order the driver expects, i.e. those whose first key is $id rather than $ref.
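The same first-key test can be sketched in plain JavaScript, which is roughly what the $where script in the question's update does per document. The function name isLegacyRef and the two sample objects are hypothetical, and plain objects stand in for the stored sub-documents.

```javascript
// Returns true when the sub-document stores "$id" before "$ref",
// i.e. the legacy layout described in the question.
function isLegacyRef(ref) {
  if (ref === null || typeof ref !== "object") return false;
  const keys = Object.keys(ref);
  return keys.length > 0 && keys[0] === "$id";
}

// Hypothetical plain-object stand-ins for the stored sub-documents.
const legacy = { "$id": "42", "$ref": "someRef" };
const proper = { "$ref": "someRef", "$id": "42" };

console.log(isLegacyRef(legacy)); // true
console.log(isLegacyRef(proper)); // false
```

Object.keys reports string keys in insertion order, which is why the position of "$id" is enough to tell the two layouts apart.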
An example document looks like this
{
"_id":ObjectId("562e7c594c12942f08fe4192"),
"Type": "f",
"runTime": ISODate("2016-12-21T13:34:00.000+0000"),
"data" : {
"PRICES SPOT" : [
{
"value" : 29.64,
"timeStamp" : ISODate("2016-12-21T23:00:00.000+0000")
},
{
"value" : 29.24,
"timeStamp" : ISODate("2016-12-22T00:00:00.000+0000")
},
{
"value" : 29.81,
"timeStamp" : ISODate("2016-12-22T01:00:00.000+0000")
},
{
"value" : 30.2,
"timeStamp" : ISODate("2016-12-22T02:00:00.000+0000")
},
{
"value" : 29.55,
"timeStamp" : ISODate("2016-12-22T03:00:00.000+0000")
}
]
}
}
My MongoDB collection holds documents of different Type values. I'd like to get a cursor for all of the documents within a time range that are of type "f" and whose PRICES SPOT data actually exists. There are some documents in the database that broke the code I had previously (which did not check whether PRICES SPOT existed).
I saw in the documentation that I can use $and and $exists. However, I am having trouble setting it up because of the range and the nesting. I am using PyMongo as my Python driver and also noticed that I have to wrap $and and $exists in quotes.
My code
def grab_forecast_cursor(self, model_dt_from, model_dt_till):
# create cursor with all items that actually exist
cursor = self._collection.find(
{
"$and":[
{'Type': 'f', 'runTime': {"$gte": model_dt_from, "$lte": model_dt_till}
['data']['PRICES SPOT': "$exists": true]}
]})
return cursor
This results in a Key Error: it cannot find data. A sample document that has no PRICES SPOT looks exactly like the one I posted at the beginning, just without that field.
In short, can someone help me set up a query that gives me a cursor over all the documents of a certain type that actually have the expected contents nested inside?
Update
I added a comma after the model_dt_till and have now a syntax error.
def grab_forecast_cursor(self, model_dt_from, model_dt_till):
# create cursor with all items that actually exist
cursor = self._collection.find(
{
"$and":[
{'Type': 'f', 'runTime': {"$gte": model_dt_from, "$lte": model_dt_till},
['data']['PRICES SPOT': "$exists": true]}
]})
return cursor
You're trying to use Python syntax to denote the path into a data structure, but the database expects the key to be written in its own "dot notation" syntax:
cursor = self._collection.find({
"Type": "f",
"runTime": { "$gte": model_dt_from, "$lte": model_dt_till },
"data.PRICES SPOT.0": { "$exists": True }
})
You also don't need to write $and like that, because all MongoDB query conditions are already combined with AND. Part of your statement was relying on that implicitly anyway, so make it consistent.
Also, the check for a non-empty array is 'data.PRICES SPOT.0', with the added bonus that you know not only that the field exists but also that it has at least one item to process.
Python and JavaScript are almost identical in terms of object/dict construction, so you really should be able to just follow the general documentation and the many samples here that are predominantly JavaScript.
I personally even try to notate answers here with valid JSON, so it could be picked up and "parsed" by users of any language. But here, python is just identical to what you could enter into the mongo shell. Except for True of course.
See "Dot Notation" for an overview of the syntax with more information at Query on Embedded / Nested Documents
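As a rough illustration of why the ".0" suffix also rules out empty arrays, here is a simplified in-memory sketch of dotted-path resolution. It is written in JavaScript, which the answer notes is nearly identical to Python for this purpose. The helper names resolvePath and pathExists and the three sample documents are hypothetical, and the sketch ignores details of the server's real matching rules such as implicit traversal of arrays of sub-documents.

```javascript
// Walk a dotted path through nested objects and arrays.
// Returns undefined as soon as any step is missing.
function resolvePath(doc, path) {
  return path.split(".").reduce(
    (node, key) => (node == null ? undefined : node[key]),
    doc
  );
}

// True when the path leads to a value, roughly what { "$exists": true } tests.
function pathExists(doc, path) {
  return resolvePath(doc, path) !== undefined;
}

// Hypothetical documents: populated array, empty array, missing field.
const withPrices = { Type: "f", data: { "PRICES SPOT": [{ value: 29.64 }] } };
const emptyPrices = { Type: "f", data: { "PRICES SPOT": [] } };
const noPrices = { Type: "f", data: {} };

console.log(pathExists(withPrices, "data.PRICES SPOT.0"));  // true
console.log(pathExists(emptyPrices, "data.PRICES SPOT.0")); // false
console.log(pathExists(noPrices, "data.PRICES SPOT.0"));    // false
```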
I have startTime and endTime for all records like this:
{
startTime : 21345678,
endTime : 31345678
}
I am trying to find number of all the conflicts. For example if there are two records and they overlap the number of conflict is 1. If there are three records and two of them overlap the conflict is 1. If there are three records and all three overlap the conflicts is 3 i.e [(X1, X2), (X1, X3), (X2, X3)]
As an algorithm I am thinking of sorting the data by start time and, for each sorted record, checking the end time and finding the records with a start time less than that end time. This will be O(n^2) time. A better approach would be to use an interval tree, inserting each record into the tree and counting when overlaps occur. This will be O(n log n) time.
I have not used mongoDB much so what kind of query can I use to achieve something like this?
As you correctly mention, there are different approaches with varying complexity. The following covers how each is done; which one you implement depends on which best suits your data and use case.
Current Range Match
MongoDB 3.6 $lookup
The most simple approach can be employed using the new syntax of the $lookup operator with MongoDB 3.6 that allows a pipeline to be given as the expression to "self join" to the same collection. This can basically query the collection again for any items where the starttime "or" endtime of the current document falls between the same values of any other document, not including the original of course:
db.getCollection('collection').aggregate([
  { "$lookup": {
    "from": "collection",
    // The variable is named "id" rather than "_id" because user variable
    // names must begin with a lowercase letter.
    "let": {
      "id": "$_id",
      "starttime": "$starttime",
      "endtime": "$endtime"
    },
    "pipeline": [
      { "$match": {
        "$expr": {
          "$and": [
            { "$ne": [ "$$id", "$_id" ] },
            { "$or": [
              { "$and": [
                { "$gte": [ "$$starttime", "$starttime" ] },
                { "$lte": [ "$$starttime", "$endtime" ] }
              ]},
              { "$and": [
                { "$gte": [ "$$endtime", "$starttime" ] },
                { "$lte": [ "$$endtime", "$endtime" ] }
              ]}
            ]}
          ]
        }
      }},
      { "$count": "count" }
    ],
    "as": "overlaps"
  }},
  { "$match": { "overlaps.0": { "$exists": true } } }
])
The single $lookup performs the "join" on the same collection allowing you to keep the "current document" values for the "_id", "starttime" and "endtime" values respectively via the "let" option of the pipeline stage. These will be available as "local variables" using the $$ prefix in subsequent "pipeline" of the expression.
Within this "sub-pipeline" you use the $match pipeline stage and the $expr query operator, which allows you to evaluate aggregation framework logical expressions as part of the query condition. This allows the comparison between values as it selects new documents matching the conditions.
The conditions simply look for the "processed documents" where the "_id" field is not equal to that of the "current document", $and where either the "starttime" $or "endtime" value of the "current document" falls between the same properties of the "processed document". Note that these, as well as the respective $gte and $lte operators, are the "aggregation comparison operators" and not the "query operator" form, because the result evaluated by $expr must be boolean in context. This is what the aggregation comparison operators actually do, and it's also the only way to pass in values for comparison.
Since we only want the "count" of the matches, the $count pipeline stage is used to do this. The result of the overall $lookup will be a "single element" array where there was a count, or an "empty array" where there was no match to the conditions.
An alternate case would be to "omit" the $count stage and simply allow the matching documents to return. This allows easy identification, but as an "array embedded within the document" you do need to be mindful of the number of "overlaps" that will be returned as whole documents and that this does not cause a breach of the BSON limit of 16MB. In most cases this should be fine, but for cases where you expect a large number of overlaps for a given document this can be a real case. So it's really something more to be aware of.
The $lookup pipeline stage in this context will "always" return an array in result, even if empty. The name of the output property "merging" into the existing document will be "overlaps" as specified in the "as" property to the $lookup stage.
Following the $lookup, we can then do a simple $match with a regular query expression employing the $exists test for the 0 index value of output array. Where there actually is some content in the array and therefore "overlaps" the condition will be true and the document returned, showing either the count or the documents "overlapping" as per your selection.
Other versions - Queries to "join"
The alternate case where your MongoDB lacks this support is to "join" manually by issuing the same query conditions outlined above for each document examined:
db.getCollection('collection').find().map( d => {
var overlaps = db.getCollection('collection').find({
"_id": { "$ne": d._id },
"$or": [
{ "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
{ "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
]
}).toArray();
return ( overlaps.length !== 0 )
? Object.assign(
d,
{
"overlaps": {
"count": overlaps.length,
"documents": overlaps
}
}
)
: null;
}).filter(e => e != null);
This is essentially the same logic except we actually need to go "back to the database" in order to issue the query to match the overlapping documents. This time it's the "query operators" used to find where the current document values fall between those of the processed document.
Because the results are already returned from the server, there is no BSON limit restriction on adding content to the output. You might have memory restrictions, but that's another issue. Simply put we return the array rather than cursor via .toArray() so we have the matching documents and can simply access the array length to obtain a count. If you don't actually need the documents, then using .count() instead of .find() is far more efficient since there is not the document fetching overhead.
The output is then simply merged with the existing document. The other important distinction is that, since these are "multiple queries", there is no way of providing the condition that they must "match" something. So there will be results where the count (or array length) is 0, and all we can do at this point is return a null value which we can later .filter() from the result array. Other methods of iterating the cursor employ the same basic principle of "discarding" results we do not want. But nothing stops the query being run on the server, and this filtering is "post processing" in one form or another.
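If the goal is the number of conflicting pairs asked about in the question rather than a per-document list, the counting can also be sketched client-side once the records have been fetched. This is an illustrative sketch, not the method of either approach above: the function names overlaps and countConflicts and the sample arrays are hypothetical, it assumes numeric starttime and endtime with endtime >= starttime, and it uses a common symmetric form of the overlap test that also covers one interval fully containing another.

```javascript
// Two closed intervals overlap when each starts no later than the other ends.
function overlaps(a, b) {
  return a.starttime <= b.endtime && b.starttime <= a.endtime;
}

// Count overlapping pairs: sort by start, then compare each record only
// with the later records that begin before it ends.
function countConflicts(records) {
  const sorted = [...records].sort((a, b) => a.starttime - b.starttime);
  let pairs = 0;
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j < sorted.length; j++) {
      // Sorted by start, so once one record misses, all later ones miss too.
      if (!overlaps(sorted[i], sorted[j])) break;
      pairs++;
    }
  }
  return pairs;
}

// Hypothetical data matching the cases described in the question.
const allThree = [
  { starttime: 1, endtime: 5 },
  { starttime: 2, endtime: 6 },
  { starttime: 4, endtime: 8 }
];
const onlyTwo = [
  { starttime: 1, endtime: 3 },
  { starttime: 2, endtime: 4 },
  { starttime: 10, endtime: 12 }
];

console.log(countConflicts(allThree)); // 3
console.log(countConflicts(onlyTwo));  // 1
```

The early break keeps the work close to the sort cost when overlaps are rare, although the worst case remains quadratic when most records overlap each other.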
Reducing Complexity
So the above approaches work with the structure as described, but of course the overall complexity requires that for each document you must essentially examine every other document in the collection in order to look for overlaps. Therefore whilst using $lookup allows for some "efficiency" in reduction of transport and response overhead, it still suffers the same problem that you are still essentially comparing each document to everything.
A better solution, "where you can make it fit", is to instead store a "hard value" representative of the interval on each document. For instance, we could "presume" that there are solid "booking" periods of one hour within a day, for a total of 24 booking periods. This "could" be represented something like:
{ "_id": "A", "booking": [ 10, 11, 12 ] }
{ "_id": "B", "booking": [ 12, 13, 14 ] }
{ "_id": "C", "booking": [ 7, 8 ] }
{ "_id": "D", "booking": [ 9, 10, 11 ] }
With data organized like that where there was a set indicator for the interval the complexity is greatly reduced since it's really just a matter of "grouping" on the interval value from the array within the "booking" property:
db.booking.aggregate([
{ "$unwind": "$booking" },
{ "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
{ "$match": { "docs.1": { "$exists": true } } }
])
And the output:
{ "_id" : 10, "docs" : [ "A", "D" ] }
{ "_id" : 11, "docs" : [ "A", "D" ] }
{ "_id" : 12, "docs" : [ "A", "B" ] }
That correctly identifies that for the 10 and 11 intervals both "A" and "D" contain the overlap, whilst "B" and "A" overlap on 12. Other intervals and documents matching are excluded via the same $exists test except this time on the 1 index ( or second array element being present ) in order to see that there was "more than one" document in the grouping, hence indicating an overlap.
This simply employs the $unwind aggregation pipeline stage to "deconstruct/denormalize" the array content so we can access the inner values for grouping. This is exactly what happens in the $group stage where the "key" provided is the booking interval id and the $push operator is used to "collect" data about the current document which was found in that group. The $match is as explained earlier.
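To make the three stages concrete, the following sketch mimics them in plain JavaScript over the four sample documents from above. It is only an in-memory imitation of what the pipeline does, not how the server implements it, and the variable names are arbitrary.

```javascript
const bookings = [
  { _id: "A", booking: [10, 11, 12] },
  { _id: "B", booking: [12, 13, 14] },
  { _id: "C", booking: [7, 8] },
  { _id: "D", booking: [9, 10, 11] }
];

// $unwind: one row per (document, interval) pair.
const unwound = bookings.flatMap(d =>
  d.booking.map(interval => ({ _id: d._id, booking: interval }))
);

// $group with $push: collect the document ids seen for each interval.
const groups = new Map();
for (const row of unwound) {
  if (!groups.has(row.booking)) groups.set(row.booking, []);
  groups.get(row.booking).push(row._id);
}

// $match on "docs.1": keep intervals claimed by more than one document.
const overlapping = [...groups]
  .map(([interval, docs]) => ({ _id: interval, docs }))
  .filter(g => g.docs.length > 1);

console.log(overlapping);
```

Running this yields the same three groups as the pipeline output shown earlier: intervals 10 and 11 shared by "A" and "D", and interval 12 shared by "A" and "B".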
This can even be expanded for alternate presentation:
db.booking.aggregate([
{ "$unwind": "$booking" },
{ "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
{ "$match": { "docs.1": { "$exists": true } } },
{ "$unwind": "$docs" },
{ "$group": {
"_id": "$docs",
"intervals": { "$push": "$_id" }
}}
])
With output:
{ "_id" : "B", "intervals" : [ 12 ] }
{ "_id" : "D", "intervals" : [ 10, 11 ] }
{ "_id" : "A", "intervals" : [ 10, 11, 12 ] }
It's a simplified demonstration, but where the data you have would allow it for the sort of analysis required then this is the far more efficient approach. So if you can keep the "granularity" to be fixed to "set" intervals which can be commonly recorded on each document, then the analysis and reporting can use the latter approach to quickly and efficiently identify such overlaps.
Essentially, this is how you would implement what you basically mentioned as a "better" approach anyway, and the first being a "slight" improvement over what you originally theorized. See which one actually suits your situation, but this should explain the implementation and the differences.
I have a document structure like so
{
"_id" : "3:/content/somepath/test.txt",
"_revisions" : {
"r152f47f1daf-0-2" : "c",
"r152f48413c1-0-2" : "c",
"r152f4851bf7-0-1" : "c"
}
}
My task is to find all documents with the following conditions:
The "_id" needs to start with "5:"
The number of revisions needs to be strictly greater than 3
The first part is easy, I have solved it with
db.nodes.find( {'_id': /^5:/} )
But I am struggling with the second part; I am supposed to use $gt.
Since I am new to MongoDB, I was first looking at $size, but _revisions is not an array, it is a subdocument, right?
Was also looking at $unwind and then counting the results, but that does not make sense either, since my result need to be the documents that match the above two conditions.
Any pointers highly appreciated.
Using the $where operator.
db.nodes.find(function() {
return (/^5:/.test(this._id) && Object.keys(this._revisions).length > 3 );
})
The problem with this as mentioned in the documentation is that:
$where evaluates JavaScript and cannot take advantage of indexes. Therefore, query performance improves when you express your query using the standard MongoDB operators (e.g., $gt, $in).
You should definitely consider changing the _revisions field to an array of sub-documents like this:
{
"_id" : "3:/content/somepath/test.txt",
"_revisions" : [
{
"rev": "r152f47f1daf-0-2",
"value": "c"
},
{
"rev": "r152f48413c1-0-2",
"value": "c"
},
{
"rev": "r152f4851bf7-0-1",
"value": "c"
}
]
}
And use the $exists operator.
db.nodes.find({ "_id": /^5:/, "_revisions.3": { "$exists": true } } )
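The restructuring itself is a simple per-document transformation. The sketch below shows one possible way to express it in plain JavaScript; the function name revisionsToArray is hypothetical, and how the converted documents are written back to the collection is left open.

```javascript
// Convert the keyed "_revisions" sub-document into an array of
// { rev, value } entries, leaving the other fields untouched.
function revisionsToArray(doc) {
  const revisions = Object.entries(doc._revisions || {}).map(
    ([rev, value]) => ({ rev, value })
  );
  return { ...doc, _revisions: revisions };
}

const original = {
  _id: "3:/content/somepath/test.txt",
  _revisions: {
    "r152f47f1daf-0-2": "c",
    "r152f48413c1-0-2": "c",
    "r152f4851bf7-0-1": "c"
  }
};

const converted = revisionsToArray(original);
console.log(JSON.stringify(converted, null, 2));

// Equivalent of testing "_revisions.3" with $exists: more than 3 entries?
console.log(converted._revisions[3] !== undefined); // false, there are only 3
```

Index 3 is the fourth element, so its existence means "more than three revisions", which is the condition the question asks for.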
Is there a way to use a user-defined function saved as db.system.js.save(...) in pipeline or mapreduce?
Any function you save to system.js is available to the "JavaScript" processing statements such as the $where operator and mapReduce, and can be referenced by the _id value it was assigned.
db.system.js.save({
"_id": "squareThis",
"value": function(a) { return a*a }
})
And some data inserted to "sample" collection:
{ "_id" : ObjectId("55aafd2bacbed38e06f9eccf"), "a" : 1 }
{ "_id" : ObjectId("55aafea6acbed38e06f9ecd0"), "a" : 2 }
{ "_id" : ObjectId("55aafeabacbed38e06f9ecd1"), "a" : 3 }
Then:
db.sample.mapReduce(
function() {
emit(null, squareThis(this.a));
},
function(key,values) {
return Array.sum(values);
},
{ "out": { "inline": 1 } }
);
Gives:
"results" : [
{
"_id" : null,
"value" : 14
}
],
Or with $where:
db.sample.find(function() { return squareThis(this.a) == 9 })
{ "_id" : ObjectId("55aafeabacbed38e06f9ecd1"), "a" : 3 }
But in "neither" case can you use globals such as the database db reference or other functions. Both $where and mapReduce documentation contain information of the limits of what you can do here. So if you thought you were going to do something like "look up data in another collection", then you can forget it because it is "Not Allowed".
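To see where the value 14 in the mapReduce result comes from, the map, emit and reduce flow can be imitated locally. This is a deliberately simplified sketch: localMapReduce is a hypothetical helper, emit is passed as an argument rather than being a global as it is on the server, and details such as reduce being skipped for single-value keys are ignored.

```javascript
// Same logic as the function stored in system.js above.
const squareThis = a => a * a;

// Minimal local imitation of map, emit, group by key, then reduce.
function localMapReduce(docs, mapFn, reduceFn) {
  const emitted = new Map();
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  // Call the map function with each document bound to "this".
  docs.forEach(doc => mapFn.call(doc, emit));
  return [...emitted].map(([key, values]) => ({
    _id: key,
    value: reduceFn(key, values)
  }));
}

const results = localMapReduce(
  [{ a: 1 }, { a: 2 }, { a: 3 }],
  function (emit) { emit(null, squareThis(this.a)); },
  (key, values) => values.reduce((sum, v) => sum + v, 0)
);

console.log(results); // [ { _id: null, value: 14 } ]
```

Every document emits its square under the single key null, so the reducer receives [1, 4, 9] and returns their sum.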
Every MongoDB command action is actually a call to a "runCommand" action "under the hood" anyway. But unless what that command is actually doing is "calling a JavaScript processing engine" then the usage becomes irrelevant. There are only a few commands anyway that do this, being mapReduce, group or eval, and of course the find operations with $where.
The aggregation framework does not use JavaScript in any way at all. Like others before you, you might be misreading a statement like this, which does not do what you think it does:
db.sample.aggregate([
{ "$match": {
"a": { "$in": db.sample.distinct("a") }
}}
])
So that is "not running inside" the aggregation pipeline; rather, the "result" of that .distinct() call is "evaluated" before the pipeline is sent to the server, much as happens with an external variable:
var items = [1,2,3];
db.sample.aggregate([
{ "$match": {
"a": { "$in": items }
}}
])
Both essentially send to the server in the same way:
db.sample.aggregate([
{ "$match": {
"a": { "$in": [1,2,3] }
}}
])
So it is "not possible" to "call" any JavaScript function in the aggregation pipeline, nor is there really any point in "passing in" results in general from something saved in system.js. The "code" needs to be "loaded to the client" and only a JavaScript engine can actually do anything with it.
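The client-side evaluation can be demonstrated without a server. In the sketch below, distinctStub is a hypothetical stand-in for db.sample.distinct("a"); the point is only that the call has already run by the time the pipeline object exists, so what gets serialised contains plain values.

```javascript
// Stand-in for db.sample.distinct("a"); a real call would ask the server.
function distinctStub(docs, field) {
  return [...new Set(docs.map(d => d[field]))];
}

const sample = [{ a: 1 }, { a: 2 }, { a: 3 }];

// The function call runs here, while the pipeline object is being built.
const pipeline = [
  { $match: { a: { $in: distinctStub(sample, "a") } } }
];

// What would be serialised and sent contains only the resulting values.
console.log(JSON.stringify(pipeline));
// [{"$match":{"a":{"$in":[1,2,3]}}}]
```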
With the aggregation framework, all of the "operators" available are actually natively coded functions as opposed to the "free form" JavaScript interpretation provided for mapReduce. So instead of writing "JavaScript", you use the operators themselves:
db.sample.aggregate([
{ "$group": {
"_id": null,
"squared": { "$sum": {
"$multiply": [ "$a", "$a" ]
}}
}}
])
{ "_id" : null, "squared" : 14 }
So there are limitations on what you can do with functions saved in system.js, and the chances are that what you want to do is either:
Not allowed, such as accessing data from another collection
Not really required as the logic is generally self contained anyway
Or probably better implemented in client logic or other different form anyway
Just about the only practical use I can really think of is that you have a number of "mapReduce" operations that cannot be done any other way and you have various "shared" functions that you would rather just store on the server than maintain within every mapReduce function call.
But then again, the 90% reason for mapReduce over the aggregation framework is usually that the "document structure" of the collections has been poorly chosen and the JavaScript functionality is "required" to traverse the document for search and analysis.
So you can use it under the allowed constraints, but in most cases you probably should not be using this at all, but fixing the other issues that caused you to believe you needed this feature in the first place.