Find a set of documents in collection A, based on an array from collection B - mongodb

The following data needs to be stored in MongoDB:
A collection of persons (approximately 100-2000) together with their relevant attributes.
Another collection of queues (approximately 5-50).
Information about the relationship between persons and queues. Each person can stand in line in several queues, and each queue can hold several persons. The order of the persons waiting in a queue is important.
Currently this is what I have in mind:
Persons:
{
  _id: ObjectId("507c35dd8fada716c89d0001"),
  first_name: 'john',
  email: 'john.doe@doe.com',
  id_number: 8101011234,
  ...
},
Queues:
{
  _id: ObjectId("507c35dd8fada716c89d0011"),
  title: 'A title for this queue',
  people_waiting: [
    ObjectId("507c35dd8fada716c89d0001"),
    ObjectId("507c35dd8fada716c89d0002"),
    ObjectId("507c35dd8fada716c89d0003"),
    ...
  ]
},
In a web page, I want to list (in order) all persons standing in a certain queue. I'm thinking that I first need to query the 'people_waiting' array from the 'Queues' collection, and then loop through this array and query each item from the 'Persons' collection.
But that seems like a lot of queries to generate this list, and I wonder if there is a smarter way to write/combine queries than the way described above.

You can only query one collection at a time in MongoDB, so it does take two queries. But you can use $in instead of looping through the array and querying each person individually.
In the shell:
queue = db.Queues.findOne({ _id: idOfQueue });
peopleWaiting = db.Persons.find({ _id: { $in: queue.people_waiting } }).toArray();
But peopleWaiting will not be sorted by the order of the ids in the queue and there's no support for doing that in a MongoDB query. So you'd have to reorder peopleWaiting in your code to match the order in queue.people_waiting.
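The reordering step can be done with a simple lookup table. A sketch in plain JavaScript (the function and variable names are mine, not from the answer):

```javascript
// Rebuild the queue order in application code: index the fetched documents
// by _id, then walk the queue's ordered id array.
function orderByIds(docs, orderedIds) {
  var byId = {};
  docs.forEach(function (doc) { byId[String(doc._id)] = doc; });
  return orderedIds.map(function (id) { return byId[String(id)]; });
}
```

With the shell variables above, `orderByIds(peopleWaiting, queue.people_waiting)` would return the persons in queue order.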

Related

Optimizing mongo queries - _id or traverse whole collection

I'm using MongoDB for a project and need to know which would be the better implementation for queries.
Consider I have to search for 10 documents out of a total 1000 documents based on a condition (not id).
Would it be better to query using document _id's (after storing the required id's in another collection beforehand by checking for the condition whenever insertion is done)
OR
Would it be better to traverse all the documents and get the required documents using the condition?
The main aim here is to split documents into different categories and display the documents belonging to a particular category. So should I store the ids of the documents belonging to each category, or search for documents in that category by traversing all the documents?
I have heard that MongoDB uses hashed indexing (which makes me feel option 1 would be faster), but I couldn't find anything about that. A short description of document storage and querying would also be appreciated.
The optimal way to query by category (take restaurants and cuisine types as an example) would be to store what the restaurant serves in an array of strings or objects, and index that field.
For example:
{
  name: "International House",
  cuisine: [
    { name: "Chinese", subtype: "Kowloon" },
    { name: "Japanese", subtype: "Yakitori" },
    { name: "American", subtype: "TexMex" }
  ]
}
Then create an index on { "cuisine.name": 1 }.
When you need to find all restaurants that serve Chinese food, the query:
db.collection.find({ "cuisine.name": "Chinese" })
will use that index, and only scan the documents that match.

Consolidating collections for a time-line type view

Given a Meteor application that has multiple collections that need to be displayed together in a paged Facebook-style timeline view, I'm trying to decide on the best way to handle the publication of this data.
The requirements are as follows:
Documents from different collections may be intermingled in the timeline view.
The items should be sorted by a common field (the date, for example)
There should be a paged-display limit with a "Load More..." button
To solve this problem I can see two possible approaches...
Approach 1 - Overpublish
Currently I have different collections for each type of data. This poses a problem for efficient publishing of the information that I need. For example, if the current display limit is 100 then I need to publish 100 elements of each collection in order to be sure of displaying the latest 100 elements on the screen.
An example may make this clearer. Assume that the timeline display shows results from collections A, B, C and D. Potentially only one of those collections may have any data, so to be sure that I have enough data to display 100 items I'll need to fetch 100 items from each collection. In that case, however, I could be fetching and sending 400 items instead!
That's really not good at all.
Then, on the client side, I need to handle merging these collections so that I show the documents in order, which probably isn't a trivial task.
Approach 2 - Combine all the collections
The second approach that occurs to me is to have one enormous server-side collection of generic objects. That is, instead of having collections A, B, C, and D, I'd instead have a master collection M with a type field that describes the type of data held by the document.
This would allow me to trivially retrieve the latest documents without over-publishing.
However I'm not yet sure what the full repercussions of this approach would be, especially with packages such as aldeed:autoform and aldeed:simple-schema.
My questions are:
Does anyone here have any experience with these two approaches? If so, what other issues should I be aware of?
Can anyone here suggest an alternative approach?
I'd use the second approach, but do not put everything in there...
What I mean is that, for your timeline, you need events, so you'd create an events collection that stores the basic information for each event (date, owner_id, etc.). You'd also add the type of event and an id to match a document in another collection. That keeps your events small enough to publish everything that's needed, and you can grab more details when there is a need.
You could then either just publish your events, or publish the cursors of the other collections at the same time, using the _ids so as not to over-publish. That events collection will also become very handy for matching documents, e.g. if the user wants to see what in his timeline is related to user X or city Y.
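As a sketch, a document in such an events collection might look like the following (all field names here are illustrative, not prescribed by the answer):

```javascript
// One small document per timeline event; the heavy details stay in their
// own collections and are fetched by (type, detail_id) only when needed.
var exampleEvent = {
  _id: "evt-1",                    // hypothetical id
  date: new Date("2015-06-01"),    // the common sort field for the timeline
  owner_id: "user-42",
  type: "A",                       // which collection holds the full document
  detail_id: "a-17"                // _id of the matching document in collection A
};
```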
I hope it helps you out.
I finally came up with a completely different approach.
I've created a server publication that returns the list of items ids and types to be displayed. The client can then fetch these from the relevant collections.
This allows me to maintain separate collections for each type, thus avoiding issues related to trying to maintain a Master collection type. Our data-model integrity is preserved.
At the same time I don't have to over-publish the data to the client. The extra workload on the server to calculate the ID list is minimal, and in my opinion the benefits outweigh the disadvantages of the other two approaches by quite a long way.
The basic publication looks like this (in CoffeeScript):
Meteor.publish 'timeline', (options, limit) ->
  check options, Object
  check limit, Match.Optional Number

  sub = this
  limit = Math.min limit ? 10, 200

  # We use peerlibrary:reactive-publish to enable Meteor reactivity on the server
  @ids = {}
  tracker = Tracker.autorun =>
    # Run a find operation on the collections that can be displayed in the
    # timeline, and add the ids to an array
    collections = ['A', 'B']
    items = []
    for collectionName in collections
      collection = Mongo.Collection.get collectionName
      collection.find({}, { fields: { updatedOn: 1 }, limit: limit, sort: { updatedOn: -1 } }).forEach (item) ->
        item.collection = collectionName
        items.push item

    # Sort the array (newest first) and crop it to the required length
    items = items.sort (a, b) -> new Date(b.updatedOn) - new Date(a.updatedOn)
    items = items[0...limit]

    newIds = {}
    # Add/remove the ids from the 'timeline' collection
    for doc in items
      id = doc._id
      newIds[id] = true
      # Add this id to the publication if we didn't have it before
      if not @ids[id]?
        @ids[id] = moment doc.updatedOn
        sub.added 'timeline', id, { collection: doc.collection, docId: id, updatedOn: doc.updatedOn }
      # If the update time has changed then it needs republishing
      else if not moment(doc.updatedOn).isSame @ids[id]
        @ids[id] = moment doc.updatedOn
        sub.changed 'timeline', id, { collection: doc.collection, docId: id, updatedOn: doc.updatedOn }

    # Check for items that are no longer in the result
    for id of @ids
      if not newIds[id]?
        sub.removed 'timeline', id
        delete @ids[id]

  sub.onStop ->
    tracker.stop()

  sub.ready()
Note that I'm using peerlibrary:reactive-publish for the server-side autorun.
The queries fetch just the latest ids from each collection, place them into a single array, sort it by date, and crop the array to the current limit.
The resulting ids are then added to the timeline collection, which provides for a reactive solution on the client.
On the client it's simply a matter of subscribing to this collection, then creating the individual item subscriptions themselves. Something like this:
Template.timelinePage.onCreated ->
  @autorun =>
    @limit = parseInt(Router.current().params['limit']) || 10
    sub = @subscribe 'timeline', {}, @limit
    if sub.ready()
      items = Timeline.find().fetch()
      As = _.pluck _.where(items, { collection: 'a' }), 'docId'
      @aSub = @subscribe 'a', { _id: { $in: As } }
      Bs = _.pluck _.where(items, { collection: 'b' }), 'docId'
      @bSub = @subscribe 'b', { _id: { $in: Bs } }
Finally, the template can iterate over the timeline collection and display the appropriate item based on its type.

How can I implement an ordered array with mongodb without race-conditions?

I'm new to mongodb, maybe this is a trivial question. I have two mongodb collections: user and post. A user can create and follow multiple posts, and posts are listed sorted by last modification date. There may be a very large number of users following a specific post, so I don't want to keep the list of followers in each post document. On the other hand, one user will probably not follow more than a few thousand posts, so I decided to keep the list of followed posts' objectids in each user document.
In order to be able to quickly list the 50 most recently modified posts for a given user, I chose to keep the last_updated_at field along with the post objectid.
The post document is fairly basic:
{
  "_id": ObjectId("5163deebe4d809d55d27e847"),
  "title": "All about music",
  "comments": [...],
  ...
}
The user document looks like this:
{
  "_id": ObjectId("5163deebe4d809d55d27e846"),
  "posts": [{
    "post": ObjectId("5163deebe4d809d55d27e847"),
    "last_updated_at": ISODate("2013-04-09T11:27:07.184Z")
  }, {
    "post": ObjectId("5163deebe4d809d55d27e848"),
    "last_updated_at": ISODate("2013-04-09T11:27:07.187Z")
  }],
  ...
}
When a user creates or follows a post, I can simply $push the post's ObjectId and last_updated_at to the end of the posts list in the user's document. When a post is modified (for example when a comment is added), I update the last_updated_at field for that post in all the followers' user documents. That's pretty heavy, but I don't know how to avoid it.
When I want to get the list of 50 most recently updated posts for a user, I unfortunately need to get the whole list of followed posts, then sort by last_updated_at in memory, then keep only the first 50 posts.
So I tried to change the implementation to reorder the list when a post is modified: I $push it to the end of the list and $pull it from wherever it was. Since this is a two-step procedure, there's a race condition where I might end up with the same post twice in the list. Is there no better way to maintain a sorted array in MongoDB?
Data model adjustment
Since you may have frequent updates to the latest posts for a given user, you probably want to avoid the overhead of rewriting data unnecessarily to maintain a sorted array.
A better approach to consider would be to flatten the data model and use a separate collection instead of an ordered array:
create a separate collection with the updated post stream: (userID, postID, lastUpdated)
when a post is updated, you can then do a simple update() with the multi:true and upsert:true options and $set the last_updated_at to the new value.
to retrieve the last 50 updated posts for a given userID you can do a normal find() with sort and limit options.
to automatically clean up the "old" documents you could even set a TTL expiry for this collection so the updates are removed from the activity stream after a certain number of days
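A sketch of those steps in the mongo shell (the collection name "user_posts" and its field names are illustrative, not from the answer):

```javascript
// Flattened model: one document per (user, post) pair,
// e.g. { user: ..., post: ..., last_updated_at: ... }

// When a post changes, touch every follower's entry in a single statement:
db.user_posts.update(
  { post: ObjectId("5163deebe4d809d55d27e847") },
  { $set: { last_updated_at: new Date() } },
  { multi: true, upsert: true }
)

// The 50 most recently updated posts for one user; an index on
// { user: 1, last_updated_at: -1 } keeps this query cheap:
db.user_posts.find({ user: someUserId }).sort({ last_updated_at: -1 }).limit(50)
```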
Pushing to fixed-size & sorted arrays in MongoDB 2.4
If you do want to maintain ordered arrays, MongoDB 2.4 added two helpful features related to this use case:
Ability to push to fixed-sized arrays
Ability to push to arrays sorted by embedded document fields
So you can achieve your outcome of pushing to a fixed-sized array of 50 items sorted by last updated date descending:
db.user.update(
  // Criteria
  { _id: ObjectId("5163deebe4d809d55d27e846") },
  // Update
  { $push: {
      posts: {
        // Push one or more updates onto the posts array
        $each: [
          {
            "post": ObjectId("5163deebe4d809d55d27e847"),
            "last_updated_at": ISODate()
          }
        ],
        // Slice to a max of 50 items
        $slice: -50,
        // Sorted by last_updated_at desc
        $sort: { 'last_updated_at': -1 }
      }
  }}
)
The $push will update the list in sorted order, with the $slice trimming the list to 50 items. Since the posts aren't unique you'll still need to $pull the original from the list first, e.g.:
db.user.update(
  // Criteria
  { _id: ObjectId("5163deebe4d809d55d27e846") },
  // Update
  {
    $pull: {
      posts: { post: ObjectId("5163deebe4d809d55d27e847") }
    }
  }
)
A benefit of this approach is that array manipulation is being done on the server, but as with sorting the array in your application you may still be updating the document more than is required.

MongoDB - Query embedded documents

I have a collection named Events. Each Event document has a collection of Participants as embedded documents.
Now my question is: is there a way to query an Event and get all Participants that match a condition, e.g. Age > 18?
When you query a collection in MongoDB, by default it returns the entire document which matches the query. You could slice it and retrieve a single subdocument if you want.
If all you want is the Participants who are older than 18, it would probably be best to do one of two things:
Store them in a subdocument inside the event document called "Over18" or something. Insert them into that subdocument (and possibly the other one too, if you want), and then when you query the collection you can instruct the database to return only the "Over18" subdocument. The downside is that you store your participants in two different subdocuments and you have to figure out their age before inserting. This may or may not be feasible depending on your application. If you need to check against arbitrary ages (i.e. sometimes it's 18 but sometimes it's 21 or 25, etc.) then this will not work.
Query the collection, retrieve the Participants subdocument, and then filter it in your application code. Despite what some people may believe, this isn't terrible, because you don't want your database to be doing too much work all the time. Offloading the computation to your application can actually benefit your database, because it can then spend more time querying and less time filtering. That leads to better scalability in the long run.
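The application-side filtering from option 2 is essentially a one-liner in most languages. For example, in JavaScript (assuming the event document has already been fetched and `participants` is its embedded array; the function name is mine):

```javascript
// Keep only the participants older than the given age from a fetched event.
function adultsOnly(event, minAge) {
  return event.participants.filter(function (p) { return p.age > minAge; });
}
```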
Short answer: no. I tried to do the same a couple of months back, but mongoDB does not support it (at least in version <= 1.8). The same question has been asked in their Google Group for sure. You can either store the participants as a separate collection or get the whole documents and then filter them on the client. Far from ideal, I know. I'm still trying to figure out the best way around this limitation.
For future reference: This will be possible in MongoDB 2.2 using the new aggregation framework, by aggregating like this:
db.events.aggregate(
  { $unwind: '$participants' },
  { $match: { 'participants.age': { $gte: 18 } } },
  { $project: { participants: 1 } }
)
This will return a list of n documents, where n is the number of matching participants, and where each entry looks like this (note that the "participants" field now holds a single embedded document instead of an array):
{
  _id: objectIdOfTheEvent,
  participants: { firstName: 'only one', lastName: 'participant' }
}
It could probably even be flattened on the server to return a plain list of participants. See the official documentation for more information.

Full Join/Intersection in couchdb

I have some documents which have 2 sets of attributes: tag and lieu. Here is an example of what they look like:
{
  title: "doc1",
  tag: ["mountain", "sunny", "forest"],
  lieu: ["france", "luxembourg"]
},
{
  title: "doc2",
  tag: ["sunny", "lake"],
  lieu: ["france", "germany"]
},
{
  title: "doc3",
  tag: ["sunny"],
  lieu: ["belgium", "luxembourg", "france"]
}
How can I map/reduce and query my DB to be able to retrieve only the intersection of documents that match these criteria:
lieu: ["france", "luxembourg"]
tag: ["sunny"]
Returns: doc1 and doc3
I cannot figure out any format my map/reduce could return that would allow a single query. What I am doing now: emit every lieu/tag value as a key with the related document ids as values, then reduce so that every key holds an array of doc ids. From my app I query this view, do the intersection of the documents on the app side (keeping only the docs that have all three keys: luxembourg, france and sunny), and then re-query CouchDB with these doc ids to retrieve the actual docs. I feel that's not the right/best way to do it.
I am using lists to do the intersection job, and it works quite well. But I still need to make another request to get the documents using the document ids. Any idea what I could do differently to retrieve the documents directly?
Thank you!
This is going to be awkward. The basic idea is that you have to build a view where the map function emits every possible combination of tags and countries as the key, and there's no reduce function. This way, looking for ["france","luxembourg"] would return all documents that emitted that key (and therefore are in the intersection), because views without a reduce function return the emitting document for every entry. This way, you only have to do one request.
This causes a lot of emits to happen, but you can lower that number by sorting the tags both when emitting and when searching (automatically turn ["luxembourg","france"] into ["france","luxembourg"]), and by taking advantage of the ability of CouchDB to query prefixes (this means that emitting ["belgium","france","luxembourg"] will let you match searches for ["belgium"] and ["belgium","france"]).
In your example above, for the countries, you would only emit:
// doc 1
emit(["luxembourg"],null);
emit(["france","luxembourg"],null);
// doc 2
emit(["germany"],null);
emit(["france","germany"],null);
// doc 3
emit(["luxembourg"],null);
emit(["belgium","luxembourg"],null);
emit(["france","luxembourg"],null);
emit(["belgium","france","luxembourg"],null);
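The emit pattern above can be generated mechanically: sort the values, then emit every subset that contains the last (largest) element; any subset that doesn't is a prefix of one that does, so CouchDB's prefix queries still find it. A sketch in plain JavaScript (the function name is mine):

```javascript
// Return the view keys to emit for one attribute array: all subsets of the
// sorted values that include the final element. Subsets without it are
// prefixes of emitted keys, so prefix queries cover them.
function emitKeys(values) {
  var sorted = values.slice().sort();
  var last = sorted[sorted.length - 1];
  var rest = sorted.slice(0, -1);
  var keys = [];
  // Enumerate subsets of `rest` via a bitmask, appending `last` to each.
  for (var mask = 0; mask < (1 << rest.length); mask++) {
    var subset = rest.filter(function (_, i) { return mask & (1 << i); });
    subset.push(last);
    keys.push(subset);
  }
  return keys;
}
```

For `["luxembourg", "france"]` this yields `["luxembourg"]` and `["france", "luxembourg"]`, matching the doc 1 emits above.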
Anyway, for complex queries like this one, consider looking into a CouchDB-Lucene combination.