Facet search using MongoDB

I am contemplating using MongoDB for my next project. One of the core requirements for this application is to provide facet search. Has anyone tried using MongoDB to achieve a facet search?
I have a product model with various attributes like size, color, brand etc. When searching for a product, this Rails application should show facet filters in the sidebar. Facet filters will look something like this:
Size:
XXS (34)
XS (22)
S (23)
M (37)
L (19)
XL (29)
Color:
Black (32)
Blue (87)
Green (14)
Red (21)
White (43)
Brand:
Brand 1 (43)
Brand 2 (27)

I think you get more flexibility and performance using Apache Solr or ElasticSearch, but faceting is supported in MongoDB using the Aggregation Framework.
The main problem with using MongoDB is that you have to query it N times: first to get the matching results, and then once per facet group; with a full-text search engine you get it all in one query.
Example
//'tags' filter simulates the search
//this query gets the products
db.products.find({tags: {$all: ["tag1", "tag2"]}})

//this query gets the size facet
db.products.aggregate([
    {$match: {tags: {$all: ["tag1", "tag2"]}}},
    {$group: {_id: "$size", count: {$sum: 1}}},
    {$sort: {count: -1}}
])

//this query gets the color facet
db.products.aggregate([
    {$match: {tags: {$all: ["tag1", "tag2"]}}},
    {$group: {_id: "$color", count: {$sum: 1}}},
    {$sort: {count: -1}}
])

//this query gets the brand facet
db.products.aggregate([
    {$match: {tags: {$all: ["tag1", "tag2"]}}},
    {$group: {_id: "$brand", count: {$sum: 1}}},
    {$sort: {count: -1}}
])
Once the user filters the search using facets, you have to add that filter to both the query predicate and each $match predicate, as follows.
//user clicks on the "Brand 1" facet
db.products.find({tags: {$all: ["tag1", "tag2"]}, brand: "Brand 1"})

db.products.aggregate([
    {$match: {tags: {$all: ["tag1", "tag2"]}, brand: "Brand 1"}},
    {$group: {_id: "$size", count: {$sum: 1}}},
    {$sort: {count: -1}}
])

db.products.aggregate([
    {$match: {tags: {$all: ["tag1", "tag2"]}, brand: "Brand 1"}},
    {$group: {_id: "$color", count: {$sum: 1}}},
    {$sort: {count: -1}}
])

db.products.aggregate([
    {$match: {tags: {$all: ["tag1", "tag2"]}, brand: "Brand 1"}},
    {$group: {_id: "$brand", count: {$sum: 1}}},
    {$sort: {count: -1}}
])

MongoDB 3.4 introduces faceted search.
The $facet stage allows you to create multi-faceted aggregations which
characterize data across multiple dimensions, or facets, within a
single aggregation stage. Multi-faceted aggregations provide multiple
filters and categorizations to guide data browsing and analysis.
Input documents are passed to the $facet stage only once.
Now you don't need to query N times to retrieve aggregations on N groups.
$facet enables various aggregations on the same set of input documents,
without needing to retrieve the input documents multiple times.
A sample query for the OP's use case would be something like:
db.products.aggregate([
    {
        $facet: {
            //$sortByCount groups by the expression and counts each group;
            //it replaces the original $bucket stages here because $bucket
            //requires explicit "boundaries", which don't apply to categorical
            //values like color, size or brand
            "categorizedByColor": [
                { $match: { color: { $exists: 1 } } },
                { $sortByCount: "$color" }
            ],
            "categorizedBySize": [
                { $match: { size: { $exists: 1 } } },
                { $sortByCount: "$size" }
            ],
            "categorizedByBrand": [
                { $match: { brand: { $exists: 1 } } },
                { $sortByCount: "$brand" }
            ]
        }
    }
])
Each sub-pipeline emits documents of the form { _id: <value>, count: <n> }, sorted by count descending.

A popular option for more advanced search with MongoDB is to use ElasticSearch in conjunction with the community-supported MongoDB River Plugin. The MongoDB River plugin feeds a stream of documents from MongoDB into ElasticSearch for indexing.
ElasticSearch is a distributed search engine based on Apache Lucene, and features a RESTful JSON interface over HTTP. There is a Facet Search API and a number of other advanced features such as Percolate and "More like this".

You can do the query; the question is whether it is fast or not, i.e., something like:
find( { size: 'S', color: 'Blue', brand: { $in: [...] } } )
The question is then how the performance is. There isn't any special facility for faceted search in the product yet. Down the road there might be some set-intersection-like query plans that are good, but that is TBD/future.
If your properties are a predefined set and you know what they are, you could create an index on each of them. Only one of the indexes will be used in the current implementation, so this will help but only get you so far: if the data set is medium-plus in size, it might be fine.
You could use compound indexes which compound two or more of the properties. If you have a small number of properties this might work pretty well. The index need not include all the variables queried on, but in the query above a compound index on any two of the three is likely to perform better than an index on a single item.
If you don't have too many SKUs, brute force would work; e.g., if you have 1MM SKUs, a table scan in RAM might be fast enough. In this case I would make a collection with just the facet values, make it as small as possible, and keep the full SKU docs in a separate collection, e.g.:
facets_collection:
{ sz: 1, brand: 123, clr: 'b', _id: ... }
...
If the number of facet dimensions isn't too high, you could instead make a highly compound index of the facet dimensions, and you would get the equivalent of the above without the extra work.
If you create quite a few indexes, it is probably best not to create so many that they no longer fit in RAM.
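To make this concrete, index creation for the attributes above might look like this (a sketch; the field and collection names are assumptions carried over from the examples in this thread):
//one single-field index per facet attribute
db.products.createIndex({ size: 1 })
db.products.createIndex({ color: 1 })
db.products.createIndex({ brand: 1 })

//a compound index on two of the three properties
db.products.createIndex({ size: 1, color: 1 })

//the "highly compound" variant on the slimmed-down facets collection
db.facets_collection.createIndex({ sz: 1, brand: 1, clr: 1 })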
Given that the query runs and this is a performance question, one might just start with MongoDB, and if it isn't fast enough, bolt on Solr.

The faceted (count-based) solution depends on your application design.
db.product.insert(
    {
        tags: ['color:green', 'size:M']
    }
)
If you are able to feed data in the above format, where a facet and its value are joined together to form a consistent tag, then you can use the query below:
db.product.aggregate(
    [
        { $unwind: "$tags" },
        {
            $group: {
                _id: '$tags',
                count: { $sum: 1 }
            }
        }
    ]
)
See the result output below
{
    "_id" : "color:green",
    "count" : NumberInt(1)
}
{
    "_id" : "color:red",
    "count" : NumberInt(1)
}
{
    "_id" : "size:M",
    "count" : NumberInt(3)
}
{
    "_id" : "color:yellow",
    "count" : NumberInt(1)
}
{
    "_id" : "height:5",
    "count" : NumberInt(1)
}
Beyond this step, your application server can do a color/size grouping before sending the results back to the client (see the sketch below).
Note: combining a facet and its values into a single tag gives you all facet values aggregated in one query, so you avoid the problem described in Garcia's answer: "The main problem using MongoDB is you have to query it N times: first to get matching results and then once per group; while using a full-text search engine you get it all in one query."
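A minimal sketch of that server-side regrouping in JavaScript, assuming the aggregation output is available as a results array shaped like the documents above:
//turn [{_id: "color:green", count: 1}, ...] into {color: {green: 1}, size: {M: 3}, ...}
const facets = {};
for (const doc of results) {
    const [facet, value] = doc._id.split(':');
    if (!facets[facet]) {
        facets[facet] = {};
    }
    facets[facet][value] = doc.count;
}
//facets now looks like { color: { green: 1, red: 1, yellow: 1 }, size: { M: 3 }, height: { '5': 1 } }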

Why is sorting documents by _id slower with $match than without in MongoDB?

So, I tried this query:
db.collection('collection_name').aggregate([
    {
        $match: { owner_id: '5be9b2f03ef77262c2bd49e6' }
    },
    {
        $sort: { _id: -1 }
    }
])
The query above takes 20s, but if I try this query:
db.collection('collection_name').aggregate([{ $sort: { _id: -1 } }])
it only takes 0.7s.
Why is the one without $match actually faster than the one with $match?
Update:
When I try this query:
db.getCollection('callbackvirtualaccounts').aggregate([
    {
        $match: { owner_id: '5860457640b4fe652bd9c3eb' }
    },
    {
        $sort: { created: -1 }
    }
])
it only takes 0.781s.
Why is sorting by _id slower than sorting by the created field?
Note: I'm using MongoDB v3.0.0.
db.collection('collection_name').aggregate([
    {
        $match: { owner_id: '5be9b2f03ef77262c2bd49e6' }
    },
    {
        $sort: { _id: -1 }
    }
])
This collection probably doesn't have an index on owner_id; try the index creation queries below and rerun your previous code.
db.collection('collection_name').createIndex({ owner_id: 1 }) //simple index
or
db.collection('collection_name').createIndex({ owner_id: 1, _id: -1 }) //compound index
Note: if you don't know how to create a compound index yet, you can create simple indexes individually on all keys used in either the match or the sort, and that should make the query efficient as well.
Query speed depends on a lot of factors: the size of the collection, the size of the documents, the indexes defined on the collection (and whether the queries use them properly), the hardware (CPU, RAM, network), and other processes running while the query runs.
For further analysis, you have to tell us what indexes are defined on the collection in question; this command will retrieve them: db.collection.getIndexes()
Note the unique index on the _id field is created by default, and cannot be modified or deleted.
(i)
But if I try the query db.collection.aggregate([ { $sort: { _id: -1 } } ]), it only takes 0.7s.
The query is faster because there is an index on the _id field and it is used in the sort process. Aggregation queries can use indexes for the sort stage when the sort happens early in the pipeline. You can verify whether the index is used by generating a query plan (use explain with executionStats mode); there will be an index scan (IXSCAN) in the generated plan.
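For example, on MongoDB 3.0 the aggregation's plan can be inspected with the explain option (a sketch; this reports the chosen plan so you can check for IXSCAN vs COLLSCAN):
//returns the query plan instead of the results
db.getCollection('callbackvirtualaccounts').aggregate(
    [
        { $match: { owner_id: '5be9b2f03ef77262c2bd49e6' } },
        { $sort: { _id: -1 } }
    ],
    { explain: true }
)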
(ii)
db.collection.aggregate([
    {
        $match: { owner_id: '5be9b2f03ef77262c2bd49e6' }
    },
    {
        $sort: { _id: -1 }
    }
])
The query above takes 20s.
When I try this query it only takes 0.781s:
db.collection.aggregate([
    {
        $match: { owner_id: '5860457640b4fe652bd9c3eb' }
    },
    {
        $sort: { created: -1 }
    }
])
Why is sorting by _id slower than by the created field?
We cannot come to any conclusions with the available information. In general, $match and $sort stages that appear early in the aggregation pipeline can use any indexes created on the fields used in those operations.
Generating a query plan will reveal what the issues are.
Please run explain in executionStats mode and post the query plan details for all the queries in question. There is documentation for MongoDB v3.0.0 on generating query plans using explain: db.collection.explain()
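For the corresponding find form of these queries, executionStats can be requested directly on the cursor (a sketch):
//executionStats shows keys examined, documents examined and execution time
db.getCollection('callbackvirtualaccounts')
    .find({ owner_id: '5860457640b4fe652bd9c3eb' })
    .sort({ created: -1 })
    .explain('executionStats')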

MongoDB aggregation/map-reduce

I'm new to MongoDB and I need to do an aggregation that seems quite difficult to me. A document looks something like this:
{
    "_id" : ObjectId("568192aef8bd6b0cd0f649c6"),
    "conference" : "IEEE International Conference on Acoustics, Speech and Signal Processing",
    "prism:aggregationType" : "Conference Proceeding",
    "children-id" : [
        "SCOPUS_ID:84948148564",
        "SCOPUS_ID:84927603733",
        "SCOPUS_ID:84943521758",
        "SCOPUS_ID:84905234683",
        "SCOPUS_ID:84876113709"
    ],
    "dc:identifier" : "SCOPUS_ID:84867598678"
}
The example contains just the fields I need for the aggregation. prism:aggregationType can have 5 different values (Conference Proceeding, Book, Journal, etc.). children-id lists the documents that cite this document (SCOPUS_ID is a unique ID for each document).
What I want to do is group first by conference; then, for each conference, I want to know how many citing documents there are for each prism:aggregationType.
For example, let's say there are 100 documents with the conference above, and these 100 documents are cited by 250 documents. I want to know how many of those 250 documents have "prism:aggregationType" : "Conference Proceeding", how many have "prism:aggregationType" : "Journal", and so on.
An output could look like this:
{
    "conference" : "IEEE International Conference on Acoustics, Speech and Signal Processing",
    "aggregationTypes" : [{"Conference Proceeding" : 50}, {"Journal" : 200}]
}
It does not matter whether this is done with the aggregation pipeline or map-reduce.
EDIT
Is there any way to combine these two into one aggregation?
db.articles.aggregate([
    { $match: {
        conference: { $ne: null }
    }},
    { $unwind: '$children-id' },
    { $group: {
        _id: { conference: '$conference' },
        'cited-by': { $push: { 'dc:identifier': '$children-id' } }
    }}
]);

db.articles.find(
    { 'dc:identifier': { $in: [ 'SCOPUS_ID:84943302953', 'SCOPUS_ID:84927603733' ] } },
    { 'prism:aggregationType': 1 }
);
In the second query I want to replace the array passed to $in with the array created by $push.
Please try this aggregation:
db.articles.aggregate([
    // 1. get the size of the `children-id` array through $project
    { $project: {
        conference: 1,
        'prism:aggregationType': 1,
        'children-id': { $size: '$children-id' }
    }},
    // 2. group by `conference` and `prism:aggregationType` and sum the sizes of `children-id`
    { $group: {
        _id: {
            conference: '$conference',
            aggregationType: '$prism:aggregationType'
        },
        ids: { $sum: '$children-id' }
    }},
    // 3. group by `conference`, pushing one {type: count} pair per aggregation type
    { $group: {
        _id: '$_id.conference',
        aggregationTypes: {
            $push: {
                $cond: [
                    { $eq: ['$_id.aggregationType', 'Conference Proceeding'] },
                    { 'Conference Proceeding': '$ids' },
                    { 'Journal': '$ids' }
                ]
            }
        }
    }}
]);
As we discussed in chat, using $lookup in the aggregation pipeline unfortunately requires MongoDB 3.2, which is not an option here, since the R driver can only use MongoDB 2.6 and the source documents live in more than one collection.
The code I wrote in the EDIT section is also the final result I came up with (slightly modified):
db.articles.aggregate([
    { $match: {
        conference: { $ne: null }
    }},
    { $unwind: '$children-id' },
    { $group: {
        _id: '$conference',
        'cited-by': { $push: '$children-id' }
    }}
]);

db.articles.find(
    { 'dc:identifier': { $in: [ 'SCOPUS_ID:84943302953', 'SCOPUS_ID:84927603733' ] } },
    { 'prism:aggregationType': 1 }
);
The result will look like this for each conference:
{
    "_id" : "Annual Conference on Privacy, Security and Trust",
    "cited-by" : [
        "SCOPUS_ID:84942789431",
        "SCOPUS_ID:84928151617",
        "SCOPUS_ID:84939229259",
        "SCOPUS_ID:84946407175",
        "SCOPUS_ID:84933039513",
        "SCOPUS_ID:84942789431",
        "SCOPUS_ID:84942607254",
        "SCOPUS_ID:84948165954",
        "SCOPUS_ID:84926379258",
        "SCOPUS_ID:84946771354",
        "SCOPUS_ID:84944223683",
        "SCOPUS_ID:84942789431",
        "SCOPUS_ID:84939169499",
        "SCOPUS_ID:84947104346",
        "SCOPUS_ID:84948764343",
        "SCOPUS_ID:84938075139",
        "SCOPUS_ID:84946196118",
        "SCOPUS_ID:84930820238",
        "SCOPUS_ID:84947785321",
        "SCOPUS_ID:84933496680",
        "SCOPUS_ID:84942789431"
    ]
}
I iterate through all the documents I get (around 250) and then use the cited-by array inside $in. I use an index on dc:identifier, so it works instantly.
$lookup could be an alternative to get this done within the aggregation pipeline, but the packages in R do not support MongoDB versions above 2.6.
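For readers not bound to 2.6, a sketch of what that $lookup alternative might look like on MongoDB 3.2+, assuming citing and cited documents live in the same articles collection:
db.articles.aggregate([
    { $match: { conference: { $ne: null } } },
    { $unwind: '$children-id' },
    //join each citing ID back to its full document via dc:identifier
    { $lookup: {
        from: 'articles',
        localField: 'children-id',
        foreignField: 'dc:identifier',
        as: 'citing'
    }},
    { $unwind: '$citing' },
    //count citing documents per conference and aggregation type
    { $group: {
        _id: {
            conference: '$conference',
            aggregationType: '$citing.prism:aggregationType'
        },
        count: { $sum: 1 }
    }}
]);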
Thank you for your time anyway :)

How to filter on more than one record mongodb embedded documents

This is my model:
order: {
    _id: 88565,
    activity: [
        { _id: 57235, content: "foo" },
        { _id: 57236, content: "bar" }
    ]
}
This is my query:
db.order.find({
    "$and": [
        {
            "activity.content": "bar"
        },
        {
            "activity._id": 57235
        }
    ]
});
This query selects the order with _id 88565 even though the two conditions are satisfied by two different embedded activities.
I would expect this query to return nothing.
I know that I can use $elemMatch to filter embedded documents with more precision, but this behaviour seems very confusing.
Is there a way to obtain a proper filtering where an AND clause has a single embedded document scope?
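For reference, the $elemMatch form mentioned above scopes all conditions to a single array element, so it returns nothing for this document (a sketch):
//matches only if one and the same activity element has both _id 57235 and content "bar"
db.order.find({
    activity: { $elemMatch: { _id: 57235, content: "bar" } }
});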

MongoDB - sort by subdocument match

Say I have a users collection in MongoDB. A typical user document contains a name field and an array of subdocuments representing the user's characteristics, something like this:
{
    "name": "Joey",
    "characteristics": [
        {
            "name": "shy",
            "score": 0.8
        },
        {
            "name": "funny",
            "score": 0.6
        },
        {
            "name": "loving",
            "score": 0.01
        }
    ]
}
How can I find the top X funniest users, sorted by how funny they are?
The only way I've found so far is to use the aggregation framework, with a query similar to this:
db.users.aggregate([
    { $project: { "_id": 1, "name": 1, "characteristics": 1, "_characteristics": '$characteristics' } },
    { $unwind: "$_characteristics" },
    { $match: { "_characteristics.name": "funny" } },
    { $sort: { "_characteristics.score": -1 } },
    { $limit: 10 }
]);
This seems to be exactly what I want, except that, according to MongoDB's documentation on using indexes in pipelines, once I call $project or $unwind in an aggregation pipeline, I can no longer use indexes to match or sort the collection, which makes this solution somewhat unfeasible for a very large collection.
I think you are halfway there. I would do:
db.users.aggregate([
    { $match: { 'characteristics.name': 'funny' } },
    { $unwind: '$characteristics' },
    { $match: { 'characteristics.name': 'funny' } },
    { $project: { _id: 0, name: 1, 'characteristics.score': 1 } },
    //sort descending so the funniest users come first
    { $sort: { 'characteristics.score': -1 } },
    { $limit: 10 }
])
I added a match stage to get rid of users without the funny attribute (which can easily be indexed).
Unwind and match again to keep only the relevant part of the data.
Keep only the necessary fields with project.
Sort by the score, descending.
And limit the results.
That way you can use an index for the first match, as sketched below.
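A minimal sketch of the index that first $match stage can use (a multikey index, since characteristics is an array):
//multikey index on the embedded name field
db.users.createIndex({ 'characteristics.name': 1 })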
The way I see it, if the characteristics you are interested in are not too many, IMO it would be better to have your structure as:
{
    "name": "Joey",
    "shy": 0.8,
    "funny": 0.6,
    "loving": 0.01
}
That way you can use an index (sparse or not) to make your life easier!
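With that flattened layout, the index and the top-X query might look like this (a sketch; funny is the assumed top-level score field from the example above):
//descending index on the score makes the sort + limit cheap
db.users.createIndex({ funny: -1 })

//top 10 funniest users
db.users.find({ funny: { $exists: true } }, { name: 1, funny: 1 })
    .sort({ funny: -1 })
    .limit(10)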

Matching for latest documents for a unique set of fields before aggregating

Assuming I have the following document structures:
> db.logs.find()
{
    '_id': ObjectId("50ad8d451d41c8fc58000003"),
    'name': 'Sample Log 1',
    'uploaded_at': ISODate("2013-03-14T01:00:00+01:00"),
    'case_id': '50ad8d451d41c8fc58000099',
    'tag_doc': {
        'group_x': ['TAG-1', 'TAG-2'],
        'group_y': ['XYZ']
    }
},
{
    '_id': ObjectId("50ad8d451d41c8fc58000004"),
    'name': 'Sample Log 2',
    'uploaded_at': ISODate("2013-03-15T01:00:00+01:00"),
    'case_id': '50ad8d451d41c8fc58000099',
    'tag_doc': {
        'group_x': ['TAG-1'],
        'group_y': ['XYZ']
    }
}
> db.cases.findOne()
{
    '_id': ObjectId("50ad8d451d41c8fc58000099"),
    'name': 'Sample Case 1'
}
Is there a way to perform a $match in the aggregation framework that will retrieve only the latest Log for each unique combination of case_id and group_x? I am sure this can be done with multiple $group pipelines, but as much as possible I want to immediately limit the number of documents that pass through the pipeline via the $match operator. I am thinking of something like the $max operator, except used in $match.
Any help is very much appreciated.
Edit:
So far, I have come up with the following:
db.logs.aggregate(
    { $match: { ... } }, // some match filters here
    { $project: { tag: '$tag_doc.group_x', case: '$case_id', latest: '$uploaded_at' } },
    { $unwind: '$tag' },
    { $group: { _id: { tag: '$tag', case: '$case' }, latest: { $max: '$latest' } } },
    { $group: { _id: '$_id.tag', total: { $sum: 1 } } }
)
As I mentioned, what I want can be done with multiple $group pipelines, but this proves costly when handling a large number of documents. That is why I wanted to limit the documents as early as possible.
Edit:
I still haven't come up with a good solution, so I am wondering if the document structure itself is not optimized for my use case. Do I have to update the fields to support what I want to achieve? Suggestions very much appreciated.
Edit:
I am actually looking for an implementation in MongoDB similar to the one expected in How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL?, except that it involves two distinct field values. Also, the $match operation is crucial because it makes the resulting set dynamic, with filters ranging from matching tags to a range of dates.
Edit:
Due to the complexity of my use case I tried to use a simple analogy, but this proved to be confusing. Above is now the simplified form of the actual use case. Sorry for the confusion I created.
I have done something similar. But it's not possible with $match, only with one $group pipeline. The trick is to use a compound index with the correct sort order:
{ user_id: 1, address: "xyz", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" },
{ user_id: 1, address: "xyz2", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }
If I want to group on user_id and address, and I want the message with the latest date, we need to create an index like this:
{ user_id:1, address:1, date_sent:-1 }
Then you are able to perform the aggregate without a sort stage, which is much faster and will work on shards with replicas. If you don't have an index with the correct sort order you can add a sort pipeline stage, but then you can't use it with shards, because everything is transferred to mongos and the grouping is done there (you will also run into memory limit problems).
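For reference, creating that index in the shell might look like this (a sketch, using the collection name from the example below):
//compound index whose order lets $first pick the message with the latest date per group
db.user_messages.createIndex({ user_id: 1, address: 1, date_sent: -1 })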
db.user_messages.aggregate(
    { $match: { user_id: 1 } },
    { $group: {
        _id: "$address",
        count: { $sum: 1 },
        date_sent: { $max: "$date_sent" },
        message: { $first: "$message" }
    }}
);
It's not documented that it should work like this, but it does. We use it on a production system.
I'd use another collection to 'create' the search results on the fly - as new posts are posted - by upserting a document in this new collection every time a new blog post is posted.
Every new combination of author/tags is added as a new document in this collection, whereas a new post with an existing combination just updates an existing document with the content (or object ID reference) of the new blog post.
Example:
db.searchResult.update(
    { 'author_id': '50ad8d451d41c8fc58000099', 'tag_doc.tags': ["TAG-1", "TAG-2"] },
    { $set: { 'Referenceid': ObjectId("5152bc79e8bf3bc79a5a1dd8") } }, // or embed your blog post here
    { upsert: true }
)
Hmmm, there is no good way of doing this optimally such that you only need to pick out the latest of each author; instead you will need to pick out all documents, sorted, and then group on author:
db.posts.aggregate([
    { $sort: { created_at: -1 } },
    { $group: { _id: '$author_id', tags: { $first: '$tag_doc.tags' } } },
    { $unwind: '$tags' },
    { $group: { _id: { author: '$_id', tag: '$tags' } } }
]);
As you said, this is not optimal; however, it is all I have come up with.
If I am honest, if you need to perform this query often, it might actually be better to pre-aggregate another collection that already contains the information you need, in the form of:
{
    _id: {},
    author: {},
    tag: 'something',
    created_at: ISODate(),
    post_id: {}
}
And each time you create a new post, you seek out all documents in this unique collection which fulfill an $in query of what you need, and then update/upsert created_at and post_id in that collection. This would be more optimal.
Here you go:
db.logs.aggregate(
    { "$sort": { "uploaded_at": -1 } },
    { "$match": { ... } },
    { "$unwind": "$tag_doc.group_x" },
    { "$group": {
        "_id": { "case": "$case_id", "tag": "$tag_doc.group_x" },
        "latest": { "$first": "$uploaded_at" },
        "Name": { "$first": "$Name" },
        "tag_doc": { "$first": "$tag_doc" }
    }}
);
You want to avoid $max when you can $sort and take $first, especially if you have an index on uploaded_at, which allows you to avoid any in-memory sorts and reduces the pipeline processing costs significantly. Obviously, if you have other "data" fields, you would add them along with (or instead of) "Name" and "tag_doc".
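For completeness, the index that makes that leading $sort cheap might be created like this (a sketch):
//a descending index on uploaded_at lets the initial $sort use an index scan
db.logs.createIndex({ uploaded_at: -1 })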