MongoDB aggregate with extra info

I have a mongo collection containing docs such as this:
{
"_id" : ObjectId("57697321c22d3917acd66513"),
"parent" : "AlphaNumericID",
"signature" : "AnotherAlphaNumericID",
"price" : 1638,
"url" : "http://www.thecompany.com/path/to/page1",
"date" : ISODate("2016-06-21T17:02:20.352Z"),
"valid" : true
}
What I am trying to do is run a single query that groups on the signature field and returns the min and max price AND the corresponding url for each:
{
"signature" : "AnotherAlphaNumericID",
"min_price" : 1504,
"min_rent_listing" : "http://www.thecompany.com/path/to/page1",
"max_price" : 1737,
"max_price_listing" : "http://www.thecompany.com/path/to/page2",
}
Running a $group on the signature field to obtain $min and $max is straightforward, but to get the actual urls I split the work in two: the first query returns the docs for a signature sorted by price from min to max, and then (in Python code) I take the first and last elements. This works fine, but it would be nice to have one query.
Thoughts?
p.s.
Also 'toyed' with running one query for min and one for max and 'zipping' the results.

You can use a trick with $sort, $group and $project. Assuming the dataset is:
{
"_id" : ObjectId("57db28dc705af235a826873a"),
"parent" : "AlphaNumericID",
"signature" : "AnotherAlphaNumericID",
"price" : 1638.0,
"url" : "http://www.thecompany.com/path/to/page1",
"date" : ISODate("2016-06-21T17:02:20.352+0000"),
"valid" : true
}
{
"_id" : ObjectId("57db28dc705af235a826873b"),
"parent" : "AlphaNumericID",
"signature" : "AnotherAlphaNumericID",
"price" : 168.0,
"url" : "http://www.thecompany.com/path/to/page2",
"date" : ISODate("2016-06-21T17:02:20.352+0000"),
"valid" : true
}
{
"_id" : ObjectId("57db28dc705af235a826873c"),
"parent" : "AlphaNumericID",
"signature" : "AnotherAlphaNumericID",
"price" : 163.0,
"url" : "http://www.thecompany.com/path/to/page3",
"date" : ISODate("2016-06-21T17:02:20.352+0000"),
"valid" : true
}
{
"_id" : ObjectId("57db28dc705af235a826873d"),
"parent" : "AlphaNumericID",
"signature" : "AnotherAlphaNumericID",
"price" : 1680.0,
"url" : "http://www.thecompany.com/path/to/page4",
"date" : ISODate("2016-06-21T17:02:20.352+0000"),
"valid" : true
}
Try the following query in the shell:
db.collection.aggregate([
    {$sort: {price: 1}},
    {$group: {
        _id: "$signature",
        _first: {$first: "$url"},
        _last: {$last: "$url"},
        _min: {$first: "$price"},
        _max: {$last: "$price"}
    }},
    {$project: {
        _id: 0,
        min: {url: "$_first", price: "$_min"},
        max: {url: "$_last", price: "$_max"}
    }}
])
The output contains the minimum and maximum price, each with its corresponding url:
{
"min" : {
"url" : "http://www.thecompany.com/path/to/page3",
"price" : 163.0
},
"max" : {
"url" : "http://www.thecompany.com/path/to/page4",
"price" : 1680.0
}
}
What I changed from the original answer:
_min:{$min:"$price"}, --> now uses $first
_max:{$max:"$price"}} --> now uses $last
Reason: the documents enter the $group stage sorted by price in ascending order, so within each group the first document has the minimum price and the last document has the maximum.
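To make the reasoning concrete, here is a minimal plain-JavaScript sketch (no MongoDB needed) of what the $sort and $group stages do with the sample documents. The helper name minMaxBySignature is made up for illustration, and unlike the $project above it keeps the signature in the output, since the question asked for it.

```javascript
// Plain JavaScript emulation of $sort + $group with $first/$last.
// minMaxBySignature is a hypothetical helper, not a MongoDB API.
function minMaxBySignature(docs) {
  // $sort: { price: 1 } (copy first so the input array is not mutated)
  const sorted = [...docs].sort((a, b) => a.price - b.price);
  const groups = new Map();
  for (const doc of sorted) {
    let group = groups.get(doc.signature);
    if (!group) {
      // First document seen for this signature is the cheapest one ($first)
      group = { signature: doc.signature, min: { url: doc.url, price: doc.price } };
      groups.set(doc.signature, group);
    }
    // Overwritten on every document, so it ends up holding the last one ($last)
    group.max = { url: doc.url, price: doc.price };
  }
  return [...groups.values()];
}

const docs = [
  { signature: "AnotherAlphaNumericID", price: 1638, url: "http://www.thecompany.com/path/to/page1" },
  { signature: "AnotherAlphaNumericID", price: 168, url: "http://www.thecompany.com/path/to/page2" },
  { signature: "AnotherAlphaNumericID", price: 163, url: "http://www.thecompany.com/path/to/page3" },
  { signature: "AnotherAlphaNumericID", price: 1680, url: "http://www.thecompany.com/path/to/page4" },
];

const result = minMaxBySignature(docs);
console.log(JSON.stringify(result, null, 2));
```

For the sample data this gives page3 (163) as the min and page4 (1680) as the max, the same pair the aggregation returns.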

Related

How create a MongoDB query

I have a collection with some documents like below:
{
"_id" : ObjectId("1"),
"className" : "model.MyClass",
"createdOn" : ISODate("2018-10-23T11:00:00.000+01:00"),
"status" : "A"
}
{
"_id" : ObjectId("2"),
"className" : "model.MyClass",
"createdOn" : ISODate("2018-10-23T11:01:00.000+01:00"),
"status" : "B"
}
{
"_id" : ObjectId("3"),
"className" : "model.MyClass",
"createdOn" : ISODate("2018-10-23T11:02:00.000+01:00"),
"status" : "C"
}
{
"_id" : ObjectId("4"),
"className" : "model.MyClass",
"createdOn" : ISODate("2018-10-23T11:03:00.000+01:00"),
"status" : "D"
}
Given a specific ID, how can I get the closest previous document whose status is not one of a given set of statuses?
For example, given ID 4, I would like to get the most recent earlier document whose status is neither B nor C. In this case, that is the document with ID 1.
How do I write this query?
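You could try this:
db.yourcollection.find( {"_id":{"$lt":givenId}, "status":{"$nin":["B","C"]}}
).sort({_id:-1}).limit(1);
Here givenId stands for the ID you start from (4 in the example). The $lt condition keeps only the documents before it; without it, the query would return document 4 itself, since its status D is not excluded. The "not in" operator, $nin, then drops the unwanted statuses, the sort puts the remaining documents in descending _id order, and limit(1) keeps only the closest one. This assumes _id values increase with creation order, as they do in the example.
See the MongoDB documentation on the $nin operator and on sort for details.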
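As a rough in-memory sketch of the intended lookup (plain JavaScript, no MongoDB): keep only the documents that come before the given ID and whose status is not excluded, sort them in descending order, and take the first. The helper name previousWithAllowedStatus and the numeric IDs are simplifications for illustration.

```javascript
// Hypothetical helper: emulates filter ($lt + $nin), sort descending, limit(1).
function previousWithAllowedStatus(docs, givenId, excludedStatuses) {
  const candidates = docs
    .filter(d => d._id < givenId && !excludedStatuses.includes(d.status))
    .sort((a, b) => b._id - a._id); // descending by _id
  return candidates.length > 0 ? candidates[0] : null;
}

// Simplified version of the sample data, with numeric ids instead of ObjectIds
const docs = [
  { _id: 1, status: "A" },
  { _id: 2, status: "B" },
  { _id: 3, status: "C" },
  { _id: 4, status: "D" },
];

const found = previousWithAllowedStatus(docs, 4, ["B", "C"]);
console.log(found);
```

Dropping the `d._id < givenId` condition makes the sketch return document 4 itself, which is why the restriction to earlier documents matters.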

How to get Documents filtered by the top-most and sort embedded result

Assume I have the following document structure:
{
"_id" : "ehbpidnopgpgcghgakiiiallielefonk",
"users" : 11.87,
"ratingTopMost" : {
"average" : NumberInt("4"),
"users" : NumberInt("3174"),
},
"reviews" : {
"average" : NumberInt("5"),
"count" : NumberInt("51"),
"comments" : [
{
"_id" : ObjectId("5b87b1a07267ad001da7e733"),
"name" : "Batista1395",
"comment" : "Great Stackoverflow",
"datum" : ISODate("2017-01-16T00:00:00.000+01:00"),
"rating" : NumberInt("3")
},
{
"_id" : ObjectId("5b87b1a07267ad001da7e732"),
"name" : "Tim Ace",
"comment" : "I Like it.",
"datum" : ISODate("2016-08-17T00:00:00.000+02:00"),
"bewertung" : NumberInt("5")
},
]
},
}
I need to query the top 20 documents, ranked by the following criteria:
1. Highest value for ratingTopMost.average
2. Highest value for ratingTopMost.users
3. Highest value for reviews.count
4. Highest value for reviews.average
5. Highest value for users
The priority should run from top to bottom (1 to 5).
Nice to have: sort the embedded comments by date, descending.
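One way to read the requirement is as a multi-key descending sort, where a later criterion only breaks ties in the earlier ones. The plain-JavaScript sketch below illustrates that reading in memory; it is not a MongoDB query, and the helper name topDocuments and the tiny data set are made up for the example.

```javascript
// Sort keys in priority order (1 to 5), all descending.
const SORT_KEYS = [
  d => d.ratingTopMost.average,
  d => d.ratingTopMost.users,
  d => d.reviews.count,
  d => d.reviews.average,
  d => d.users,
];

// Hypothetical helper: rank documents, keep the top `limit`,
// and sort each document's embedded comments by date, newest first.
function topDocuments(docs, limit = 20) {
  return [...docs]
    .sort((a, b) => {
      for (const key of SORT_KEYS) {
        const diff = key(b) - key(a); // descending
        if (diff !== 0) return diff;  // later keys only break ties
      }
      return 0;
    })
    .slice(0, limit)
    .map(d => ({
      ...d,
      reviews: {
        ...d.reviews,
        comments: [...d.reviews.comments].sort((x, y) => y.datum - x.datum),
      },
    }));
}

// Made-up data using the field names from the question
const docs = [
  { _id: "a", users: 5, ratingTopMost: { average: 4, users: 100 },
    reviews: { average: 5, count: 10, comments: [
      { name: "old", datum: new Date("2016-08-17") },
      { name: "new", datum: new Date("2017-01-16") },
    ] } },
  { _id: "b", users: 1, ratingTopMost: { average: 5, users: 10 },
    reviews: { average: 3, count: 1, comments: [] } },
  { _id: "c", users: 2, ratingTopMost: { average: 4, users: 200 },
    reviews: { average: 4, count: 2, comments: [] } },
];

const ranked = topDocuments(docs);
console.log(ranked.map(d => d._id));
```

Here "b" wins on ratingTopMost.average, and "c" beats "a" only because the second key, ratingTopMost.users, breaks their tie on the first.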

MongoDB - How to remove duplicates

I have a collection which has many duplicates due to the routines that populated it in the first place. How can I dedupe these?
e.g.
{ "_id" : ObjectId("531a5fe448757e00244096fa"), "code" : "ap", "name" : "[Almost Perfect]", "value" : "[u'*']" }
{ "_id" : ObjectId("531a731148757e17587a6e04"), "code" : "ap", "name" : "[Almost Perfect]", "value" : "[u'*']" }
{ "_id" : ObjectId("531a7bb848757e1f7c0ca702"), "code" : "ap", "name" : "[Almost Perfect]", "value" : "[u'*']" }
I want it to be just one (don't care which objectID gets picked)
{ "_id" : ObjectId("531a5fe448757e00244096fa"), "code" : "ap", "name" : "[Almost Perfect]", "value" : "[u'*']" }
You should create a unique index on your code field:
db.<collection>.ensureIndex({'code' : 1}, {unique : true, dropDups : true})
unique ensures that no new duplicates can be inserted.
dropDups deletes the duplicate documents when the ensureIndex operation runs. Note that the dropDups option was removed in MongoDB 3.0, so this only works on older versions.
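The effect on the data can be sketched in plain JavaScript: for each value of the key, keep the first document seen and drop the rest. This is only an in-memory illustration of the idea, assuming the first occurrence is the one that survives; dropDuplicatesBy is a hypothetical helper, not a MongoDB call.

```javascript
// Hypothetical helper: keep the first document for each distinct key value.
function dropDuplicatesBy(docs, key) {
  const seen = new Set();
  return docs.filter(doc => {
    if (seen.has(doc[key])) return false; // a later duplicate: drop it
    seen.add(doc[key]);
    return true;                          // first occurrence: keep it
  });
}

// The three duplicate documents from the question, with _id as plain strings
const docs = [
  { _id: "531a5fe448757e00244096fa", code: "ap", name: "[Almost Perfect]", value: "[u'*']" },
  { _id: "531a731148757e17587a6e04", code: "ap", name: "[Almost Perfect]", value: "[u'*']" },
  { _id: "531a7bb848757e1f7c0ca702", code: "ap", name: "[Almost Perfect]", value: "[u'*']" },
];

const deduped = dropDuplicatesBy(docs, "code");
console.log(deduped);
```

Note that duplicates are decided by the code field alone, so documents that share a code but differ in other fields would be dropped as well.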

MongoDB query to break ties and remove duplicates

I have documents which have a Version, URL, and DateAdded field (among others but these are the relevant ones).
I'd like to find all documents where the Version is "5.5" and the DateAdded is less than or equal to January 1, 2013. That's pretty straightforward, but I also want the following behavior:
If two or more documents have the same URL, only return the one with the most recent DateAdded (provided, again, that it is less than or equal to January 1, 2013). It would be great if all of this could be expressed in a single query (but my main concern is performance).
I've been doing this last bit of filtering in my client code (outside of MongoDB) but this ends up being inefficient, not to mention inelegant.
I've also tried using Mongo's MapReduce functionality to accomplish the same thing but this is extremely slow, as it appears to copy much of my collection to another collection.
Is there a performant solution?
This should do the trick.
Example data:
db.foo.insert({ "_id" : ObjectId("528bd5bded29286a62959513"), "Version" : "5.3", "URL" : "foo.bar.com/asdfwoaef", "DateAdded" : ISODate("2012-10-05T00:00:00Z") })
db.foo.insert({ "_id" : ObjectId("528bd5e8ed29286a62959514"), "Version" : "5.6", "URL" : "foo.bar.com/asdfwoaef", "DateAdded" : ISODate("2012-12-05T00:00:00Z") })
db.foo.insert({ "_id" : ObjectId("528bd621ed29286a62959515"), "Version" : "5.5", "URL" : "foo.bar.com/aafoobbb", "DateAdded" : ISODate("2012-11-04T00:00:00Z") })
db.foo.insert({ "_id" : ObjectId("528bd629ed29286a62959516"), "Version" : "5.5", "URL" : "foo.bar.com/aafoobbb", "DateAdded" : ISODate("2012-11-05T00:00:00Z") })
db.foo.insert({ "_id" : ObjectId("528bd642ed29286a62959517"), "Version" : "5.5", "URL" : "foo.bar.com/aafoobbb", "DateAdded" : ISODate("2013-01-02T00:00:00Z") })
db.foo.insert({ "_id" : ObjectId("528bd744ed29286a62959518"), "Version" : "5.5", "URL" : "foo.bar.com/ccbarcc", "DateAdded" : ISODate("2013-01-02T00:00:00Z") })
db.foo.insert({ "_id" : ObjectId("528bd780ed29286a62959519"), "Version" : "5.5", "URL" : "foo.bar.com/ccbarcc", "DateAdded" : ISODate("2012-04-05T00:00:00Z") })
Pipeline:
pipeline = [
    {
        "$match" : {
            "Version" : "5.5",
            "DateAdded" : {
                "$lte" : ISODate("2013-01-01T00:00:00Z")
            }
        }
    },
    {
        "$sort" : {
            "URL" : 1,
            "DateAdded" : -1
        }
    },
    {
        "$group" : {
            "_id" : "$URL",
            "doc" : {
                "$first" : {
                    "id" : "$_id",
                    "DateAdded" : "$DateAdded"
                }
            }
        }
    }
]
db.foo.aggregate(pipeline)
And here is the result:
{
"result" : [
{
"_id" : "foo.bar.com/ccbarcc",
"doc" : {
"id" : ObjectId("528bd780ed29286a62959519"),
"DateAdded" : ISODate("2012-04-05T00:00:00Z")
}
},
{
"_id" : "foo.bar.com/aafoobbb",
"doc" : {
"id" : ObjectId("528bd629ed29286a62959516"),
"DateAdded" : ISODate("2012-11-05T00:00:00Z")
}
}
],
"ok" : 1
}
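The three stages can be followed step by step in plain JavaScript. The sketch below is an in-memory emulation on the example data, not a replacement for the aggregation. The helper name latestPerUrl is made up, the _id values are plain strings, and the cutoff uses <= to match the "less than or equal" wording in the question.

```javascript
// Hypothetical helper emulating $match, $sort and $group/$first in memory.
function latestPerUrl(docs, version, cutoff) {
  // $match: version equals, DateAdded at or before the cutoff
  const matched = docs.filter(d => d.Version === version && d.DateAdded <= cutoff);
  // $sort: URL ascending, then DateAdded descending
  matched.sort((a, b) => a.URL.localeCompare(b.URL) || b.DateAdded - a.DateAdded);
  // $group with $first: the first document per URL is the most recent one
  const firstByUrl = new Map();
  for (const d of matched) {
    if (!firstByUrl.has(d.URL)) {
      firstByUrl.set(d.URL, { id: d._id, DateAdded: d.DateAdded });
    }
  }
  return firstByUrl;
}

const docs = [
  { _id: "528bd5bded29286a62959513", Version: "5.3", URL: "foo.bar.com/asdfwoaef", DateAdded: new Date("2012-10-05T00:00:00Z") },
  { _id: "528bd5e8ed29286a62959514", Version: "5.6", URL: "foo.bar.com/asdfwoaef", DateAdded: new Date("2012-12-05T00:00:00Z") },
  { _id: "528bd621ed29286a62959515", Version: "5.5", URL: "foo.bar.com/aafoobbb", DateAdded: new Date("2012-11-04T00:00:00Z") },
  { _id: "528bd629ed29286a62959516", Version: "5.5", URL: "foo.bar.com/aafoobbb", DateAdded: new Date("2012-11-05T00:00:00Z") },
  { _id: "528bd642ed29286a62959517", Version: "5.5", URL: "foo.bar.com/aafoobbb", DateAdded: new Date("2013-01-02T00:00:00Z") },
  { _id: "528bd744ed29286a62959518", Version: "5.5", URL: "foo.bar.com/ccbarcc", DateAdded: new Date("2013-01-02T00:00:00Z") },
  { _id: "528bd780ed29286a62959519", Version: "5.5", URL: "foo.bar.com/ccbarcc", DateAdded: new Date("2012-04-05T00:00:00Z") },
];

const latest = latestPerUrl(docs, "5.5", new Date("2013-01-01T00:00:00Z"));
console.log([...latest.entries()]);
```

The sort is what makes $first meaningful: without a defined order inside each URL group, "first" would not reliably be the most recent document.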

mongodb $maxScan didn't equals limit

This is my first question on Stack Overflow, and I am looking forward to your answers. My question is:
When I use MongoDB query selectors, I want to limit the results, but $maxScan does not work the way I expect.
---------This is the result I want:
db.post.find({query:{status:"publish"},$orderby:{date:-1}},{status:1,name:1,date:1,$slice:2}).limit(3)
{ "_id" : ObjectId("519262580cf21fb1647fb765"), "date" : ISODate("2013-05-14T16:12:08.600Z"), "status" : "publish", "name" : "关于多说" }
{ "_id" : ObjectId("519254ad0cf2f064f6ecef82"), "date" : ISODate("2013-05-14T15:13:49.017Z"), "status" : "publish", "name" : "回顾<蜗居>的100句经典台词" }
{ "_id" : ObjectId("519254690cf2f064f6ecef81"), "date" : ISODate("2013-05-14T15:12:41.462Z"), "status" : "publish", "name" : "女人脱光了是什么" }
-----------This is the result I get with $maxScan:
db.post.find({query:{status:"publish"},$maxScan:3,$orderby:{date:-1}},{status:1,name:1,date:1})
{ "_id" : ObjectId("518e6c690cf21a363df2956e"), "date" : ISODate("2013-05-11T16:06:01.341Z"), "status" : "publish", "name" : "淘宝新店,充值任务" }
It seems that $maxScan does not behave like limit(): it first limits the documents taken from the collection and only then runs the query, which is not what I want. Am I doing something wrong? Please help. Thanks.
--------------All results:
db.post.find({query:{},$orderby:{date:-1}},{status:1,name:1,date:1})
{ "_id" : ObjectId("519262580cf21fb1647fb765"), "date" : ISODate("2013-05-14T16:12:08.600Z"), "status" : "publish", "name" : "关于多说" }
{ "_id" : ObjectId("519254ad0cf2f064f6ecef82"), "date" : ISODate("2013-05-14T15:13:49.017Z"), "status" : "publish", "name" : "回顾<蜗居>的100句经典台词" }
{ "_id" : ObjectId("519254690cf2f064f6ecef81"), "date" : ISODate("2013-05-14T15:12:41.462Z"), "status" : "publish", "name" : "女人脱光了是什么" }
{ "_id" : ObjectId("518ee61a0cf22bd326d60215"), "date" : ISODate("2013-05-12T00:45:14.295Z"), "status" : "publish", "name" : "JSTL日期格式化用法(转载)" }
{ "_id" : ObjectId("518e6c690cf21a363df2956e"), "date" : ISODate("2013-05-11T16:06:01.341Z"), "status" : "publish", "name" : "淘宝新店,充值任务" }
{ "_id" : ObjectId("518e21c90cf21a363df2956d"), "date" : ISODate("2013-05-11T10:47:37.803Z"), "status" : "draft", "name" : "一夜没睡" }
{ "_id" : ObjectId("518df75d0cf21a363df2956c"), "date" : ISODate("2013-05-11T07:46:37.726Z"), "status" : "draft", "name" : "飞娥入侵" }
{ "_id" : ObjectId("518d80630cf21a363df2956b"), "date" : ISODate("2013-05-10T23:18:59.323Z"), "status" : "publish", "name" : "Java的日期格式化常用方法" }
To return only the top results, you should use limit(), which limits the number of results returned from the cursor. This is commonly used with skip() to paginate the results.
It's not explained very clearly in the docs, but $maxScan, as the name suggests, limits the number of documents the query will examine. Presumably your query is examining some documents which don't meet the criteria (with status != publish) and then discarding them.
Do you have an index on status? It's possible that could help the query return the results you want while scanning fewer documents, but I still think limit() is what you want.
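The difference can be shown with a simplified in-memory model in plain JavaScript: capping the documents examined happens before the filter, while limit() caps the results after it. This ignores sorting and indexes, the function names are made up, and the data is invented for the example.

```javascript
// Simplified model: $maxScan caps how many documents are examined...
function findWithMaxScan(docs, predicate, maxScan) {
  return docs.slice(0, maxScan).filter(predicate);
}

// ...while limit() caps how many matching documents are returned.
function findWithLimit(docs, predicate, limit) {
  return docs.filter(predicate).slice(0, limit);
}

// Invented data in scan order
const posts = [
  { name: "p1", status: "draft" },
  { name: "p2", status: "publish" },
  { name: "p3", status: "draft" },
  { name: "p4", status: "publish" },
  { name: "p5", status: "publish" },
  { name: "p6", status: "publish" },
];
const isPublished = post => post.status === "publish";

const scanned = findWithMaxScan(posts, isPublished, 3);
const limited = findWithLimit(posts, isPublished, 3);
console.log(scanned.map(p => p.name), limited.map(p => p.name));
```

With a cap of 3, the scan-capped version examines p1 to p3 and finds only one published post, while the limit version returns three.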