Mongo query: array of objects where a key's value is repeated - mongodb

I am new to Mongo. I am posting this question because I am not sure how to search for this on Google.
I have book documents like the one below:
{
bookId: 1,
title: 'some title',
publicationDate: 'DD-MM-YYYY',
editions: [{
editionId: 1
},{
editionId: 2
}]
}
and another one like this
{
bookId: 2,
title: 'some title 2',
publicationDate: 'DD-MM-YYYY',
editions: [{
editionId: 1
},{
editionId: 1
}]
}
I want to write a query db.books.find({}) which would return only those books where editions.editionId has been duplicated for a book.
So in this example, for bookId: 2 there are two editions with the editionId:1.
Any suggestions?

You can use the aggregation framework. Specifically, after unwinding the editions array, you can use the $group stage to group the records by book and edition id and count how many times each combination occurs: if the count is greater than 1, you have found a duplicate.
Here is an example:
db.books.aggregate([
{$unwind: "$editions"},
{$group: {"_id": {"_id": "$_id", "editionId": "$editions.editionId"}, "count": {$sum: 1}}},
{$match: {"count" : {"$gt": 1}}}
])
Note that this does not return the entire book records, but it does return their identifiers; you can then use these in a subsequent query to fetch the entire records, or do some de-duplication for example.
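To check what the pipeline should return without a running server, the same steps can be mirrored in plain JavaScript. This is only a sketch: the sample data is hypothetical, the function name findDuplicatedEditions is made up for illustration, and it keys on bookId because the sample documents carry no _id.

```javascript
// Hypothetical sample data shaped like the documents in the question.
const books = [
  { bookId: 1, title: 'some title', editions: [{ editionId: 1 }, { editionId: 2 }] },
  { bookId: 2, title: 'some title 2', editions: [{ editionId: 1 }, { editionId: 1 }] },
];

// In-memory mirror of: $unwind editions, $group by (book, editionId), $match count > 1.
function findDuplicatedEditions(docs) {
  const counts = new Map(); // key: JSON of [bookId, editionId] -> occurrences
  for (const doc of docs) {
    for (const edition of doc.editions || []) {          // $unwind
      const key = JSON.stringify([doc.bookId, edition.editionId]);
      counts.set(key, (counts.get(key) || 0) + 1);       // $group with {$sum: 1}
    }
  }
  const result = [];
  for (const [key, count] of counts) {
    if (count > 1) {                                     // $match {count: {$gt: 1}}
      const [bookId, editionId] = JSON.parse(key);
      result.push({ _id: { bookId, editionId }, count });
    }
  }
  return result;
}

const duplicates = findDuplicatedEditions(books);
console.log(JSON.stringify(duplicates));
```

The bookId values collected this way can then be passed to a follow-up find with $in to fetch the full documents.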

Related

mongodb aggregation select specific document in group

I need a bit of help with a MongoDB aggregation.
First I have a $match to filter some specific documents.
Then I group by the field I need them grouped by.
Within each group I need to select the document where a field has a specific value and use that document as the main data.
{"$match": {"$and": [
{chain: chain},
{dex: dex}
]}},
{$group: {
_id: "$pairAddress",
allChange: {"$push": "$$ROOT"},
baseToken: {$last: '$baseToken'},
txCount: /* wanted: txCount of the document in this group whose timeframe is 86400 */
}},
{$sort: {txCount: -1}},
{$skip: 0},
{$limit: 100}
Each group consists of documents with different timeframes. I need to select a specific timeframe and add fields from that document to the group. For example, each timeframe has a different txCount; after grouping I want to sort by txCount, limit the number of results, and use skip for pagination.
The problem is selecting the document with the specific timeframe from within the group.
If anyone could point me in the right direction, that would be awesome.
Here is an example of how the data is stored in the database and what I would like the result to be.
const document = {
_id: '3567356735672467',
pairAddress: '0x45jk6v34jy5634jkh5v6kj4h5v62j4h56', // group by pair address
baseToken: '0x456jn345k6hb4k5h6b3khb65k3hb56k3h4b6',
resolution: 86400, // a pair address has 6 documents, each with its own timeframe: 300, 900, 1800, 3600, 43200, 86400
base0: true,
txCount: 26,
buyCount: 10,
sellCount:16,
buyVolume: '2342354.345',
sellVolume: '1234.34',
volume: '1232352.345',
change: '12.34',
positive: true,
time: 1676865981,
chain: 'ETH',
dex: 'SUS',
price: '12.45',
};
const result = [
{
_id: "0x45jk6v34jy5634jkh5v6kj4h5v62j4h56",
allChange: {"$push": "$$ROOT"}, // array of all documents/timeframes for a pairAddress
selectedTxAmount: 26, // txCount of the document with the selected timeframe (for example 86400), taken from the group for this pairAddress
}
];
Maybe it is possible to change the aggregation to make it work and be faster:
match all timeframes, dex and chain.
sort by txCount.
skip X amount.
limit to 100.
and return all documents with a field containing all timestamps per pairAddress left after the aggregation.
Currently, thanks to #1sina1, I have this and it works.
{"$match": {"$and": [
{"chain": chain},
{"dex": dex}
]}},
{$group: {
_id: "$pairAddress",
allChange: {"$push": "$$ROOT"},
baseToken: {$last: '$baseToken'},
txCount: {
"$push": {
"$cond": {
"if": {
"$eq": [
"$resolution",
43200
]
},
"then": "$txCount",
"else": "$$REMOVE"
}
}
}
}},
{$sort: {txCount: -1}},
{$skip: parseInt(page) * 100},
{$limit: 100},
But I think there might be a better way. Right now we first group everything (about 20k documents), while I am only interested in 100 of them. So maybe first match on the timeframe/resolution, then sort, skip and limit, and then, for those 100 pairAddress values, fetch all the corresponding timeframes/resolutions as a field allChange.
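One way the idea above could be sketched is a pipeline that filters on the ranking resolution first and only then joins the remaining timeframes back in with $lookup. This is an untested sketch under assumptions: the collection name 'pairs', the helper name buildTopPairsPipeline and the pageSize parameter are all hypothetical, and the $lookup as written joins on pairAddress only, so it would need an extra filter if the same pairAddress can appear under several chains or dexes.

```javascript
// Hypothetical helper that builds the "filter first, join later" pipeline.
// Assumption: the documents live in a collection named 'pairs'.
function buildTopPairsPipeline(chain, dex, resolution, page, pageSize = 100) {
  return [
    // Keep only the one timeframe used for ranking, so at most one document per pair remains.
    { $match: { chain: chain, dex: dex, resolution: resolution } },
    { $sort: { txCount: -1 } },
    { $skip: page * pageSize },
    { $limit: pageSize },
    // Self-join: attach every timeframe document of each remaining pair as allChange.
    { $lookup: {
        from: 'pairs',
        localField: 'pairAddress',
        foreignField: 'pairAddress',
        as: 'allChange',
    } },
  ];
}

const topPairsPipeline = buildTopPairsPipeline('ETH', 'SUS', 86400, 2);
console.log(JSON.stringify(topPairsPipeline, null, 2));
```

With this shape only the 100 documents of the requested page reach the join, instead of grouping all documents before sorting.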

MongoDB find closest match

I'm wondering if it is possible to access a document in MongoDB via closest match.
e.g. my search query always contains:
name
country
city
Following rules are in place:
1. name always has to match
2. if either country or city is present, country has a higher priority
3. if country or city does not match, only consider the document if the non-matching field has the default value (e.g. for String: "")
example Query:
name = "Test"
country = "USA"
city = "Seattle"
Documents:
db.stuff.insert([
{
name:"Test",
country:"",
city:"Seattle"
},{
name:"Test3",
country:"USA",
city:"Seattle"
},{
name:"Test",
country:"USA",
city:""
},{
name:"Test",
country:"Germany",
city:"Seattle"
},{
name:"Test",
country:"USA",
city:"Washington"
}
])
It should return the 3rd document
thanks!
Considering uncertain requirements and contradicting updates, the answer is rather a guideline addressing the "Is it possible at all" part.
The example should be adjusted to meet expectation.
db.stuff.aggregate([
{$match: {name: "Test"}}, // <== the fields that should always match
{$facet: {
matchedBoth: [
{$match: {country: "USA", city: "Seattle"}}, // <== bull's-eye
{$addFields: {weight: 10}} // <== 10 stones
],
matchedCity: [
{$match: {country: "", city: "Seattle"}}, // <== the $match may need to be improved, see below
{$addFields: {weight: 5}}
],
matchedCountry: [
{$match: {country: "USA", city: ""}},
{$addFields: {weight: 0}} // <== weightless, yet still a match
]
// add more rules here, if needed
}},
// get them together. Should list all rules from above
{$project: {doc: {$concatArrays: ["$matchedBoth", "$matchedCity", "$matchedCountry"]}}},
{$unwind: "$doc"}, // <== split them apart
{$sort: {"doc.weight": -1}}, // <== and order by weight, desc
// reshape to retrieve documents in its original format
{$project: {_id: "$doc._id", name: "$doc.name", country: "$doc.country", city: "$doc.city"}}
]);
The least explained part of the question affects how we build up the facets, e.g.
{$match: {country: "", city: "Seattle"}}
matches all documents where country is explicitly present and is an empty string.
It very well might be
{$match: {country: {$ne: "USA"}, city: "Seattle"}}
to get all documents with matching name and city and any country/no country, or even
{$match: {$and: [{$or: [{country: null}, {country: ""}]}, {city: "Seattle"}]}}
etc.
Here is a query
db.collection.aggregate([
{$match: {name:"Test"}},
{$project: {
name:"$name",
country: "$country",
city:"$city",
countryMatch: {$cond: [{$eq:["$country", "USA"]}, true, false]},
cityMatch: {$cond:[{$eq:["$city", "Seattle"]}, true, false]}
}},
{$match: {$and: [
{$or:[{countryMatch:true},{country:""}]},
{$or:[{cityMatch:true},{city:""}]}
]}},
{$sort: {countryMatch:-1, cityMatch:-1}},
{$project: {name:"$name", country:"$country", city:"$city"}}
])
Explanation:
The first $match filters out documents whose name does not match (rule #1: name always has to match).
The next projection keeps the document fields and adds flags indicating whether country and city match. We need them to filter and sort the documents further.
The second $match filters out documents where country or city neither matches nor holds the default value (rule #3).
The sort puts country matches before city matches, as rule #2 states. The last projection selects the required fields.
Output:
{
_id: 3,
name : "Test",
country : "USA",
city : ""
},
{
_id: 1,
name : "Test",
country : "",
city : "Seattle"
}
You can limit the query results to get only the closest match, for example by appending a {$limit: 1} stage to the pipeline.
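The three rules can also be checked outside the database with a small in-memory sketch. The function name closestMatch is hypothetical, and the empty string is assumed to be the default value, as in the question's sample data.

```javascript
// The five sample documents from the question.
const stuff = [
  { name: 'Test', country: '', city: 'Seattle' },
  { name: 'Test3', country: 'USA', city: 'Seattle' },
  { name: 'Test', country: 'USA', city: '' },
  { name: 'Test', country: 'Germany', city: 'Seattle' },
  { name: 'Test', country: 'USA', city: 'Washington' },
];

// Returns the best candidate or null. Assumes '' is the default value (rule #3).
function closestMatch(items, query) {
  const candidates = items
    .filter((d) => d.name === query.name)                                   // rule #1
    .filter((d) => d.country === query.country || d.country === '')         // rule #3
    .filter((d) => d.city === query.city || d.city === '')                  // rule #3
    .map((d) => ({
      doc: d,
      countryMatch: d.country === query.country,
      cityMatch: d.city === query.city,
    }));
  // Rule #2: country matches rank above city matches (true - false === 1 in JavaScript).
  candidates.sort((a, b) => (b.countryMatch - a.countryMatch) || (b.cityMatch - a.cityMatch));
  return candidates.length > 0 ? candidates[0].doc : null;
}

const best = closestMatch(stuff, { name: 'Test', country: 'USA', city: 'Seattle' });
console.log(best);
```

Taking only the first candidate plays the same role as limiting the aggregation result to one document.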

Can I use populate before aggregate in mongoose?

I have two models, one is user
userSchema = new Schema({
userID: String,
age: Number
});
and the other is the score recorded several times everyday for all users
ScoreSchema = new Schema({
userID: {type: String, ref: 'User'},
score: Number,
created_date: Date,
....
})
I would like to run some queries/calculations on the scores of users meeting a specific requirement; say, I would like to calculate the day-by-day average score for all users older than 20.
My thought is to first populate the Scores with the users' ages and then run the aggregate.
Something like
Score.
populate('userID','age').
aggregate([
{$match: {'userID.age': {$gt: 20}}},
{$group: ...},
{$group: ...}
], function(err, data){});
Is it OK to use populate before aggregate? Or should I first find all the userIDs meeting the requirement, save them in an array, and then use $in to match the score documents?
No you cannot call .populate() before .aggregate(), and there is a very good reason why you cannot. But there are different approaches you can take.
The .populate() method works "client side" where the underlying code actually performs additional queries ( or more accurately an $in query ) to "lookup" the specified element(s) from the referenced collection.
In contrast .aggregate() is a "server side" operation, so you basically cannot manipulate content "client side", and then have that data available to the aggregation pipeline stages later. It all needs to be present in the collection you are operating on.
A better approach is available with MongoDB 3.2 and later, via the $lookup aggregation pipeline stage. It is also probably best to start from the User collection in this case, in order to narrow down the selection:
User.aggregate(
[
// Filter first
{ "$match": {
"age": { "$gt": 20 }
}},
// Then join
{ "$lookup": {
"from": "scores",
"localField": "userID",
"foreignField": "userID",
"as": "score"
}},
// More stages
],
function(err,results) {
}
)
This is basically going to include a new field "score" within the User object as an "array" of items that matched on "lookup" to the other collection:
{
"userID": "abc",
"age": 21,
"score": [{
"userID": "abc",
"score": 42,
// other fields
}]
}
The result is always an array, as the general expected usage is a "left join" of a possible "one to many" relationship. If no result is matched then it is just an empty array.
To use the content, just work with an array in any way. For instance, you can use the $arrayElemAt operator in order to just get the single first element of the array in any future operations. And then you can just use the content like any normal embedded field:
{ "$project": {
"userID": 1,
"age": 1,
"score": { "$arrayElemAt": [ "$score", 0 ] }
}}
If you don't have MongoDB 3.2 available, then your other option to process a query limited by the relations of another collection is to first get the results from that collection and then use $in to filter on the second:
// Match the user collection
User.find({ "age": { "$gt": 20 } },function(err,users) {
// Get id list
var userList = users.map(function(user) {
return user.userID;
});
Score.aggregate(
[
// use the id list to select items
{ "$match": {
"userID": { "$in": userList }
}},
// more stages
],
function(err,results) {
}
);
});
So in earlier releases, getting the list of valid users from the other collection to the client and then feeding that list to a query on the second collection is the only way to make this happen.

mongodb/meteor: how to I get the value of one field corresponding to the $max value of another field?

I have a collection of messages with the following fields: _id, senderId, receiverId, dateSubmittedMs, message. For a given user I want to return the latest message to him from all other users. So, for example, if there are users Alex, Barb, Chuck, Dora, I would like to return the most recent message between Alex and each of Barb, Chuck and Dora. What is the best way to do this? Can I do it in one step using aggregation?
The aggregation examples in the official online documentation (http://docs.mongodb.org/manual/reference/aggregation/min/) show how to find the lowest age over groups within a collection, but what I need is something analogous to finding the name of the youngest person over groups of people.
Here is my current approach:
Step 1: Find the highest value for dateSubmitted over all messages sent and received by Alex, grouping over the other users:
var M = Messages.aggregate(
{$match:
{$or: [{senderId: 'Alex'}, {receiverId: 'Alex'}]}
},
{$group: {_id: "$receiverId", lastestSubmitted: {$max: "$submitted"} }}).fetch();
Step 2: Create an array of these highest values of dateSubmitted:
var MIds = _.pluck(M,'lastestSubmitted');
Step 3: Find these messages, by senderId, receiverId, and latestSubmitted:
return Messages.find(
{submitted: {$in: MIds}, $or: [{senderId: 'Alex'}, {receiverId: 'Alex'}]},
{sort: {submitted: 1}}
);
There are two problems with this:
Can it be done in one step instead of three? Perhaps through a mapReduce or Aggregate command?
Instead of grouping only over the receiverId: 'Alex', is there a way to group over something like: $or [{receiverId: 'Alex', senderId: 'Barb'}, {senderId: 'Alex', receiverId: 'Barb'}]? (but for EACH of the other users) This would allow me to get the latest message in a conversation between any two participants that Alex conversed with. So for example:
Any suggestions?
The only thing you have to change is the _id in the $group stage. You can use a document for this purpose, not just a single field, and you can apply the sort in the same pipeline. However, nothing changes if you group only by receiverId.
var M = Messages.aggregate(
{$match:
{$or: [{senderId: 'Alex'}, {receiverId: 'Alex'}]}
},
{$group:
{_id: {
receiverId: "$receiverId",
senderId: "$senderId"},
lastestSubmitted: {$max: "$submitted"} }
},
{$sort: {lastestSubmitted: -1} // after $group only _id and lastestSubmitted exist
},
{$limit: 1}
).fetch();
The example above only adds the possibility of checking the second or third most recently pinged connection. If you only want the most recent message, you do not even need to group. Just run this:
var M = Messages.aggregate(
{$match:
{$or: [{senderId: 'Alex'}, {receiverId: 'Alex'}]}
},
{$sort: {submitted: -1}
},
{$limit: 1}
).fetch();
The answer above is for getting the single most recent message between one user and the others. To get the most recent message for every pair involving a user, read on.
Based on the comments, I misinterpreted the question a bit, but the part above is useful anyway. The correct solution to your problem is below. The difficulty is getting a key for the pair to group on that treats the swapped pair as identical (bob -> tom == tom -> bob). You can use a condition and ordering to identify the swaps. (It is certainly a much more difficult question.) The code looks like this:
var M = Messages.aggregate(
{$match:
{$or: [{senderId: 'Alex'}, {receiverId: 'Alex'}]}
},
{$project:
{'part1':{$cond:[{$gt:['$senderId','$receiverId']},'$senderId','$receiverId']},
'part2':{$cond:[{$gt:['$senderId','$receiverId']},'$receiverId','$senderId']},
'message':1,
'submitted':1
}
},
{$sort: {submitted: -1}},
{$group:
{_id: {
part1: "$part1",
part2: "$part2"},
lastestSubmitted: {$first: "$submitted"},
message: {$first: "$message"} }
}
).fetch();
If you are not familiar with some of the operators used above, like $cond or $first, check the MongoDB aggregation documentation.
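The pair-key trick can be tried out in plain JavaScript before running it against the collection. The sketch below uses hypothetical sample messages and a made-up function name, latestPerConversation; it orders the two ids the same way the $cond/$gt projection does, so both directions of a conversation fall into one group.

```javascript
// Hypothetical sample messages; `submitted` is a numeric timestamp.
const messages = [
  { senderId: 'Alex', receiverId: 'Barb', submitted: 100, message: 'hi Barb' },
  { senderId: 'Barb', receiverId: 'Alex', submitted: 200, message: 'hi Alex' },
  { senderId: 'Alex', receiverId: 'Chuck', submitted: 150, message: 'hi Chuck' },
  { senderId: 'Dora', receiverId: 'Chuck', submitted: 300, message: 'not for Alex' },
];

function latestPerConversation(all, user) {
  const latest = new Map(); // pair key -> most recent message seen so far
  for (const m of all) {
    if (m.senderId !== user && m.receiverId !== user) continue;   // $match
    // Put the larger id first so (a, b) and (b, a) yield the same key ($cond with $gt).
    const pair = m.senderId > m.receiverId
      ? [m.senderId, m.receiverId]
      : [m.receiverId, m.senderId];
    const key = JSON.stringify(pair);
    const current = latest.get(key);
    // Equivalent to sorting by submitted descending and taking $first per group.
    if (!current || m.submitted > current.submitted) latest.set(key, m);
  }
  return Array.from(latest.values());
}

const conversations = latestPerConversation(messages, 'Alex');
console.log(conversations);
```

Each returned entry is the newest message of one conversation, regardless of who sent it.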

Facet search using MongoDB

I am considering using MongoDB for my next project. One of the core requirements for this application is to provide facet search. Has anyone tried using MongoDB to achieve a facet search?
I have a product model with various attributes like size, color, brand etc. On searching a product, this Rails application should show facet filters on sidebar. Facet filters will look something like this:
Size:
XXS (34)
XS (22)
S (23)
M (37)
L (19)
XL (29)
Color:
Black (32)
Blue (87)
Green (14)
Red (21)
White (43)
Brand:
Brand 1 (43)
Brand 2 (27)
I think you get more flexibility and performance using Apache Solr or ElasticSearch, but this is supported using the Aggregation Framework.
The main problem with MongoDB is that you have to query it N times: first to get the matching results and then once per group, while with a full-text search engine you get it all in one query.
Example
//'tags' filter simulates the search
//this query gets the products
db.products.find({tags: {$all: ["tag1", "tag2"]}})
//this query gets the size facet
db.products.aggregate([
{$match: {tags: {$all: ["tag1", "tag2"]}}},
{$group: {_id: "$size", count: {$sum: 1}}},
{$sort: {count: -1}}
])
//this query gets the color facet
db.products.aggregate([
{$match: {tags: {$all: ["tag1", "tag2"]}}},
{$group: {_id: "$color", count: {$sum: 1}}},
{$sort: {count: -1}}
])
//this query gets the brand facet
db.products.aggregate([
{$match: {tags: {$all: ["tag1", "tag2"]}}},
{$group: {_id: "$brand", count: {$sum: 1}}},
{$sort: {count: -1}}
])
Once the user filters the search using facets, you have to add this filter to both the query predicate and the $match predicate, as follows.
//user clicks on "Brand 1" facet
db.products.find({tags: {$all: ["tag1", "tag2"]}, brand: "Brand 1"})
db.products.aggregate([
{$match: {tags: {$all: ["tag1", "tag2"]}, brand: "Brand 1"}},
{$group: {_id: "$size", count: {$sum: 1}}},
{$sort: {count: -1}}
])
db.products.aggregate([
{$match: {tags: {$all: ["tag1", "tag2"]}, brand: "Brand 1"}},
{$group: {_id: "$color", count: {$sum: 1}}},
{$sort: {count: -1}}
])
db.products.aggregate([
{$match: {tags: {$all: ["tag1", "tag2"]}, brand: "Brand 1"}},
{$group: {_id: "$brand", count: {$sum: 1}}},
{$sort: {count: -1}}
])
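Since every facet query shares the same predicate and differs only in the grouped field, the pipelines can be generated rather than written out by hand. A minimal sketch, assuming a hypothetical helper named facetCountPipeline:

```javascript
// Hypothetical helper: builds the count-per-value pipeline for one facet field.
function facetCountPipeline(predicate, field) {
  return [
    { $match: predicate },                                   // same predicate as the find()
    { $group: { _id: '$' + field, count: { $sum: 1 } } },    // one bucket per distinct value
    { $sort: { count: -1 } },
  ];
}

// The search predicate after the user clicked the "Brand 1" facet.
const predicate = { tags: { $all: ['tag1', 'tag2'] }, brand: 'Brand 1' };
const facetPipelines = ['size', 'color', 'brand'].map((field) => facetCountPipeline(predicate, field));
console.log(JSON.stringify(facetPipelines[0]));
```

Each generated array can be passed to db.products.aggregate as is, which keeps the find predicate and the $match predicate from drifting apart.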
MongoDB 3.4 introduced faceted search:
The $facet stage allows you to create multi-faceted aggregations which
characterize data across multiple dimensions, or facets, within a
single aggregation stage. Multi-faceted aggregations provide multiple
filters and categorizations to guide data browsing and analysis.
Input documents are passed to the $facet stage only once.
Now you don't need to query N times to retrieve aggregations on N groups.
$facet enables various aggregations on the same set of input documents,
without needing to retrieve the input documents multiple times.
A sample query for the OP use-case would be something like
db.products.aggregate( [
{
$facet: {
// $sortByCount groups by the given field and counts the documents per distinct value
// ($bucket would additionally require explicit boundaries)
"categorizedByColor": [
{ $match: { color: { $exists: 1 } } },
{ $sortByCount: "$color" }
],
"categorizedBySize": [
{ $match: { size: { $exists: 1 } } },
{ $sortByCount: "$size" }
],
"categorizedByBrand": [
{ $match: { brand: { $exists: 1 } } },
{ $sortByCount: "$brand" }
]
}
}
])
A popular option for more advanced search with MongoDB is to use ElasticSearch in conjunction with the community supported MongoDB River Plugin. The MongoDB River plugin feeds a stream of documents from MongoDB into ElasticSearch for indexing.
ElasticSearch is a distributed search engine based on Apache Lucene, and features a RESTful JSON interface over http. There is a Facet Search API and a number of other advanced features such as Percolate and "More like this".
You can do the query; the question is whether it is fast or not, i.e. something like:
find( { size:'S', color:'Blue', Brand:{$in:[...]} } )
The question then is how it performs. There isn't any special facility for faceted search in the product yet. Down the road there might be some set-intersection-like query plans that help, but that is TBD/future.
If your properties are a predefined set and you know what they are, you could create an index on each of them. Only one of the indexes will be used in the current implementation, so this will help but only get you so far: if the data set is medium plus in size it might be fine.
You could use compound indexes which combine two or more of the properties. If you have a small number of properties this might work pretty well. The index need not cover all the variables queried on, but for the query above a compound index on any two of the three is likely to perform better than an index on a single item.
If you don't have too many SKUs, brute force would work; e.g. if you have 1MM SKUs, a table scan in RAM might be fast enough. In this case I would make a collection with just the facet values, make it as small as possible, and keep the full SKU docs in a separate collection, e.g.:
facets_collection:
{sz:1,brand:123,clr:'b',_id:}
...
If the number of facet dimensions isn't too high, you could instead make a highly compound index of the facet dimensions and you would get the equivalent of the above without the extra work.
If you create quite a few indexes, it is probably best not to create so many that they no longer fit in RAM.
Given that the query runs and this is a performance question, one might just start with Mongo and, if it isn't fast enough, then bolt on Solr.
The faceted solution (count based) depends on your application design.
db.product.insert(
{
tags :[ 'color:green','size:M']
}
)
However, if you are able to feed in data in the above format, where each facet and its value are joined together to form a consistent tag, then you can use the query below:
db.productcolon.aggregate(
[
{ $unwind : "$tags" },
{
$group : {
_id : '$tags',
count: { $sum: 1 }
}
}
]
)
See the result output below
{
"_id" : "color:green",
"count" : NumberInt(1)
}
{
"_id" : "color:red",
"count" : NumberInt(1)
}
{
"_id" : "size:M",
"count" : NumberInt(3)
}
{
"_id" : "color:yellow",
"count" : NumberInt(1)
}
{
"_id" : "height:5",
"count" : NumberInt(1)
}
Beyond this step, your application server can do a color/size grouping before sending back to the client.
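The application-side grouping could look roughly like the sketch below. It assumes the "facet:value" tag format shown above and splits on the first colon only; the function name groupFacets is hypothetical, and the input rows are copied from the result output.

```javascript
// Rows as returned by the $unwind/$group query above.
const tagCounts = [
  { _id: 'color:green', count: 1 },
  { _id: 'color:red', count: 1 },
  { _id: 'size:M', count: 3 },
  { _id: 'color:yellow', count: 1 },
  { _id: 'height:5', count: 1 },
];

// Turns flat "facet:value" rows into { facet: { value: count } }.
function groupFacets(rows) {
  const facets = {};
  for (const row of rows) {
    const sep = row._id.indexOf(':');
    if (sep === -1) continue;                  // skip tags that are not in facet:value form
    const facet = row._id.slice(0, sep);
    const value = row._id.slice(sep + 1);      // values may themselves contain ':'
    if (!facets[facet]) facets[facet] = {};
    facets[facet][value] = row.count;
  }
  return facets;
}

const facets = groupFacets(tagCounts);
console.log(JSON.stringify(facets));
```

The resulting object maps directly onto a sidebar with one section per facet and a count next to each value.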
Note: combining each facet with its value gives you all facet values aggregated in one query, which avoids the main problem described in Garcia's answer: having to query MongoDB N times (first to get the matching results and then once per group), whereas a full-text search engine returns it all in one query.