MongoDB full text search and the aggregation pipeline

I am currently using the full-text search capabilities of MongoDB to count the number of documents per hour that contain a certain keyword.
This is really interesting when run across a large collection where each document is a Tweet. For example, for the keyword "thanks", Nov 29 (Thanksgiving) stands out.
My current approach works (it generated the plot above), but it is not going to scale. At the moment I manually count the number of tweets in each hour by iterating over the documents returned by the search. The search result will eventually hit the MongoDB document size limit; it works for now only because I have just 3.5 million tweets, but I plan on collecting a lot more.
from collections import Counter

# Run the (pre-2.6) text command and tally matches per hour on the client.
data = db.command('text', collection,
                  search=query,
                  project={'hour_bucket': 1, '_id': 0},
                  limit=-1)

hours = Counter()
for d in data['results']:
    hours[d['obj']['hour_bucket']] += 1
My question is: can text search be used inside the aggregation pipeline? This would fix all of my problems. However, the only comment I have seen about this is the following: https://jira.mongodb.org/browse/SERVER-9063
Does anyone know what the status of this work is?

Somewhat coincidentally, support for text search in the aggregation framework has recently been committed with a tagged fixVersion for the upcoming MongoDB 2.5.5 development/unstable release (see SERVER-11675).
Assuming all goes well in QA/testing, this feature will be included in the 2.6 production release.
There should be some further information included in the draft 2.6 release notes after 2.5.5 is released, and I would encourage you to test this feature in your development environment.
FYI, you can find or subscribe to release announcements via the mongodb-announce discussion group.

Manual: http://docs.mongodb.org/manual/tutorial/text-search-in-aggregation/
Example:
db.tweets.aggregate([
    { $match : { $text: { $search: "query" } } },
    { $project : { day : { $substr: ["$created_at", 0, 10] } } },
    { $group : { _id : "$day", number : { $sum : 1 } } },
    { $sort : { _id : 1 } }
])
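For the per-hour counts from the original question, the same pipeline can be run from Python. This is a minimal sketch, assuming pymongo, a text index on the tweets collection, and the hour_bucket field from the question:
from pymongo import MongoClient

db = MongoClient()['mydb']   # hypothetical database name

# $text is only allowed in the first $match stage of the pipeline.
pipeline = [
    {'$match': {'$text': {'$search': 'thanks'}}},
    {'$group': {'_id': '$hour_bucket', 'count': {'$sum': 1}}},
    {'$sort': {'_id': 1}},
]
hours = {doc['_id']: doc['count'] for doc in db.tweets.aggregate(pipeline)}
This pushes the counting to the server, so the client never materializes the full result set that was hitting the document limit.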

Related

MongoDB aggregation $lookup to a field that is an indexed array

I am trying a fairly complex aggregate command on two collections involving a $lookup pipeline. This normally works just fine in a simple aggregation as long as an index is set on the foreignField.
But my $lookup is more complex, because the indexed field is not just a normal Int64 field but actually an array of Int64. When doing a simple find(), it is easy to verify using explain() that the index is being used. But explaining the aggregate pipeline does not show whether the index is being used in the $lookup sub-pipeline. All my timing tests seem to indicate that the index is not being used. The MongoDB version is 3.6.2, and db compatibility is set to 3.6.
As I said earlier, I am not using a simple foreignField lookup but the 3.6-specific pipeline + $match + $expr...
Could using a pipeline be a showstopper for the index? Does anyone have deep experience with the new $lookup pipeline syntax and/or indexes on an array field?
Examples
Either of the following works fine and, if explained, shows that the index on followers is being used.
db.col1.find({followers: {$eq : 823778}})
db.col1.find({followers: {$in : [823778]}})
But the following one does not seem to make use of the index on followers [there are more steps in the pipeline, stripped for readability].
db.col2.aggregate([
    { $match: { field: "123" } },
    { $lookup: {
        from: "col1",
        let: { follower: "$follower" },
        pipeline: [{
            $match: {
                $expr: {
                    $or: [
                        { $eq : ["$follower", "$$follower"] },
                        { $in : ["$$follower", "$followers"] }
                    ]
                }
            }
        }],
        as: "followers_all"
    }}
])
This is a missing feature which is going to be part of the 3.8 release.
Currently, only equality matches in the $lookup sub-pipeline are optimised to use indexes.
Refer to the JIRA ticket fixed in 3.7.1 (a development version).
Also, this may be relevant for non-multikey indexes.
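Until then, one possible workaround is to fall back to the classic localField/foreignField form of $lookup for the array-membership branch, since an equality lookup against an array-valued foreignField can use the multikey index. A sketch in pymongo, assuming the collection names from the question; note this covers only the $in branch, so the $eq branch on follower would need a second lookup or a merge step:
from pymongo import MongoClient

db = MongoClient()['mydb']   # hypothetical database name

# Classic lookup: equality of the local 'follower' value against the
# indexed 'followers' array in col1, which can use the multikey index.
pipeline = [
    {'$match': {'field': '123'}},
    {'$lookup': {
        'from': 'col1',
        'localField': 'follower',
        'foreignField': 'followers',
        'as': 'followers_all',
    }},
]
results = list(db.col2.aggregate(pipeline))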

Elasticsearch and subsequent Mongodb queries

I am implementing search functionality using Elasticsearch.
I receive "username" set returned by Elasticsearch after which I need to query a collection in MongoDB for latest comment of each user in the "username" set.
Question: Lets say I receive ~100 usernames everytime I query Elasticsearch what would be the fastest way to query MongoDB to get the latest comment of each user. Is querying MongoDB 100 times in a for loop using .findOne() the only option?
(Note - Because latest comment of a user changes very often, I dont want to store it in Elasticsearch as that will trigger retrieve-change-reindex process for the entire document far too frequently)
This answer assumes the following schema for your comments collection in MongoDB:
{
    "_id" : ObjectId("5788b71180036a1613ac0e34"),
    "username": "abc",
    "comment": "Best"
}
Assuming usernames is the list of users you get from Elasticsearch, you can perform the following aggregate:
a = [
    {'$match': {'username': {'$in': usernames}}},
    {'$sort': {'_id': -1}},
    {'$group': {
        '_id': '$username',
        'latestcomment': {'$first': '$comment'}
    }}
]
db.comments.aggregate(a)
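To make the result easy to consume in your application code, you can turn the cursor into a username-to-latest-comment mapping:
latest = {doc['_id']: doc['latestcomment']
          for doc in db.comments.aggregate(a)}
Sorting on _id works here because ObjectIds embed their creation timestamp, so descending _id order approximates newest-first.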
You can try this:
db.foo.find().sort({_id:1}).limit(100);
The 1 will sort ascending (old to new) and -1 will sort descending (new to old).

Group Aggregation performance in MongoDB

I have a large amount of data captured by my APIs, like this:
{
    "_id" : ObjectId("57446a89e5b49e297031fab8"),
    "applicationVersion" : "X.X.XXX.X",
    "createdDate" : ISODate("2016-05-16T23:00:00.007Z"),
    "identifier" : "v2/events/messages",
    "durationInMilliseconds" : NumberLong(14)
}
I want to group the whole collection by the identifier, so I use the aggregation framework:
$group : {
    _id : {
        identifier : "$identifier"
    },
    count : {
        $sum : 1
    }
}
I have an index on identifier.
This is a simple count; I may also want to work out average API response times and things like that, but the speed is putting me off.
On 7 million documents the aggregation takes around 10 seconds. If I do the equivalent GROUP BY in SQL on MSSQL, it takes less than a second.
Is there a way I can optimize this type of aggregation, or do I need to think about this differently, e.g.:
- changing how I collect the data
- using a different tool?
MongoDB doesn't use indexes in the aggregation framework except for $match and $sort when they are used as the first stage of the pipeline. This is a limitation, and we can hope for improvement in the future.
See Pipeline Operators and Indexes in MongoDB
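If the pipeline itself can't be made faster, one common alternative is to change how the data is collected and maintain pre-aggregated counters at write time, so the read becomes a cheap find() instead of a full-collection $group. A sketch, assuming pymongo and a hypothetical api_stats collection:
from pymongo import MongoClient

db = MongoClient()['mydb']   # hypothetical database name

def record_call(identifier, duration_ms):
    # Upsert one summary document per identifier and increment its
    # counters; averages can later be derived as totalMs / count.
    db.api_stats.update_one(
        {'_id': identifier},
        {'$inc': {'count': 1, 'totalMs': duration_ms}},
        upsert=True,
    )

record_call('v2/events/messages', 14)
This trades a small extra write per API call for constant-time reads, which is often the right trade when the same grouping is queried repeatedly.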

Aggregation: Project dotted field doesn't seem to work

I have a database containing this document:
{"_id":{"$id":"xxx"},"duration":{"sec":137,"usec":0},"name":"test"}
If I call db.collection.aggregate with this pipeline:
{$project:{_id: 0, name: 1, duration: 1, seconds: "$duration.sec"}}
I get this result:
{"result":[{"duration":{"sec":137,"usec":0},"name":"test"}],"ok":1}
Why does the result not have a 'seconds' field? Have I used the wrong projection syntax?
I'm not entirely sure which version of MongoDB the server is running. I'm using the 1.3.1 PHP driver with PHP 5.4.3, but the server may be older than that, perhaps by about half a year.
According to the MongoDB documentation on $project:
You may also use $project to rename fields. Consider the following example:
db.article.aggregate(
    { $project : {
        title : 1,
        page_views : "$pageViews",
        bar : "$other.foo"
    }}
);
This operation renames the pageViews field to page_views, and renames the foo field in the other sub-document as the top-level field bar.
That example seems to match up pretty well with what you are trying to do.
I know 10gen officially released the aggregation framework with MongoDB v2.2. Check out the current production release, which I believe is 2.2.3. If you are running on a prior development version, there could be something odd going on with aggregation.
As Bryce said, I'm currently using MongoDB 2.6 through the shell, and the $project pipeline works for renaming nested fields as you are doing:
db.article.aggregate({$project:{'_id': 0, 'name': 1, 'duration': 1, 'seconds': '$duration.sec'}})
I've not yet tried it through the Python or PHP drivers, but my previous pipelines with the latest pymongo worked very well.
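For completeness, here is the same projection through pymongo; a minimal sketch, assuming a local mongod and hypothetical database/collection names:
from pymongo import MongoClient

coll = MongoClient()['mydb']['collection']   # hypothetical names

# Promote duration.sec to a top-level 'seconds' field.
pipeline = [
    {'$project': {'_id': 0, 'name': 1, 'duration': 1,
                  'seconds': '$duration.sec'}}
]
for doc in coll.aggregate(pipeline):
    print(doc)
# expected: {'name': 'test', 'duration': {'sec': 137, 'usec': 0}, 'seconds': 137}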

Sorting hybrid bucketed schema in MongoDB

Our application allows users to create posts and comments. Data is growing fast, and we have already reviewed MongoDB scaling strategies. We like the approach presented in http://www.10gen.com/presentations/mongosf2011/schemascale , which uses a hybrid schema between embedded and non-embedded documents, bucketing comments so that they are saved in groups of 100 or 200 comments per document:
{
    "_id" : '/post/2323423/1--1',
    "comments" : [
        {
            "author" : "peter",
            "text" : "comment!",
            "when" : "June 24 2012",
            "votes" : 43
        },
        {
            "author" : "joe",
            "text" : "hi!",
            "when" : "June 25 2012",
            "votes" : 102
        },
        ...
    ]
}
By bucketing comments, fewer disk reads are necessary to display thousands of comments, while at the same time documents are kept small, so writes are fast. It is perfect for paginating comments sorted by date.
We are very interested in this approach, but our application requires comments to be sorted by votes, with subcomments.
Currently we use a non-embedded approach with a separate collection for comments. It allows us to retrieve data sorted by any field, and subcommenting is easy (by reference), but performance is becoming an issue. We would like to use bucketing, but sorting by votes does not seem to fit in a bucket.
Sorting by date is trivial: just go to the next bucket as the user clicks 'next page', querying one document. But how do we manage this if we want to sort by votes? We'd have to retrieve all buckets and then sort the comments, which is obviously inefficient...
Any ideas about a proper schema design to accomplish this?
You should be able to sort like this:
db.collection.find({},{_id:0}).sort({'comments.votes':1})
Just note that there is a bug where you can only sort ascending (hence the 1 above, rather than the -1 you would want for descending votes).
See this bug ticket.
Have you tried an aggregation query?
db.commentbuckets.aggregate([
    { $match: { discussion_id: <discussion_id> } },
    { $unwind: "$comments" },
    { $sort: { "comments.votes": -1 } }
]);
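The same idea from Python, with $skip/$limit added for vote-sorted pagination across buckets; a sketch, assuming pymongo, the bucket schema above, and a hypothetical discussion_id field on each bucket document:
from pymongo import MongoClient

db = MongoClient()['mydb']   # hypothetical database name

def top_comments(discussion_id, page=0, per_page=20):
    # Unwind every bucket of this discussion into individual comments,
    # sort them all by votes, then slice out the requested page.
    pipeline = [
        {'$match': {'discussion_id': discussion_id}},
        {'$unwind': '$comments'},
        {'$sort': {'comments.votes': -1}},
        {'$skip': page * per_page},
        {'$limit': per_page},
    ]
    return [doc['comments'] for doc in db.commentbuckets.aggregate(pipeline)]
Note that this still touches every bucket of the discussion on the server, so it mitigates rather than eliminates the cost the question describes; it does avoid shipping all buckets to the client.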