Say I have two documents
/*1*/
{
"id":1,
"name":"natty",
"subject_enrolled":"english"
}
/*2*/
{
"id":2,
"name":"natty",
"subject_enrolled":"science"
}
Ideally, this should have been a single document, with subject_enrolled being an array holding both subjects. But for some reason, I kept my data flat like this.
Now, I want to write a query which will retrieve all students who have enrolled for both "english" and "science".
I tried the below query:
db.students.find({"subject_enrolled":{"$in":["science", "english"]}})
But that is wrong, because any student who registered only for science will also be in the result. I cannot use "$all", as science and english are in two different documents.
Is there a way to achieve this easily and effectively?
All I could think of now is to use an aggregate.
db.students.aggregate([{"$group":{"_id":"$name", "subject_enrolled":{"$addToSet":"$subject_enrolled"}}}, {"$match":{"subject_enrolled":{"$all":["english", "science"]}}}])
This satisfies my condition perfectly.
But I am worried about performance. Say I have 10,000 documents: in the worst case every document belongs to a different student and the query parameter is a single value, so I will be aggregating for nothing!
Mongo experts, please share your views on my situation.
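One way to limit the cost is to put a $match in front of the $group, so that only documents for the requested subjects are grouped at all; a leading $match can also use an index on subject_enrolled if one exists. A minimal sketch follows, assuming (as the question does) that name identifies a student. The third sample document ("sam") and the helper studentsEnrolledInAll are made up for illustration; the helper only mirrors the pipeline in memory so the logic can be checked without a server.

```javascript
// Subjects every matching student must have (taken from the question).
const wanted = ["english", "science"];

// Pipeline as it could be passed to db.students.aggregate(pipeline).
// The leading $match is the only change to the pipeline in the question.
const pipeline = [
  { $match: { subject_enrolled: { $in: wanted } } },
  { $group: { _id: "$name", subject_enrolled: { $addToSet: "$subject_enrolled" } } },
  { $match: { subject_enrolled: { $all: wanted } } }
];

// Hypothetical in-memory equivalent of the three stages.
function studentsEnrolledInAll(docs, subjects) {
  const byName = new Map();
  for (const doc of docs) {
    if (!subjects.includes(doc.subject_enrolled)) continue; // stage 1: $match with $in
    if (!byName.has(doc.name)) byName.set(doc.name, new Set());
    byName.get(doc.name).add(doc.subject_enrolled);         // stage 2: $group with $addToSet
  }
  return [...byName]
    .filter(([, set]) => subjects.every((s) => set.has(s))) // stage 3: $match with $all
    .map(([name]) => name);
}

// Sample data: the two documents from the question plus one invented student.
const sample = [
  { id: 1, name: "natty", subject_enrolled: "english" },
  { id: 2, name: "natty", subject_enrolled: "science" },
  { id: 3, name: "sam", subject_enrolled: "science" }
];

console.log(studentsEnrolledInAll(sample, wanted)); // prints [ 'natty' ]
```

With the first stage in place, students enrolled in neither subject never reach the $group, which is where most of the work happens.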
I've read a lot of documentation and examples here on Stack Overflow, but I'm not really sure about my conclusions, which is why I'm asking for help.
Imagine we have a collection Films and a collection Users, and we want to know which users have seen a film and which films a user has seen.
One way to design this in MongoDB is:
User:
{
"name":"User1",
"films":[filmId1, filmId2, filmId3, filmId4] //ObjectIds from Films
}
Film:
{
"name": "The incredible MongoDb Developer",
"watched_by": [userId1, userId2, userId3] //ObjectsIds from User
}
OK, this may work if the number of users and films is low. But if we expect, for example, one film to have 800k viewers, the ObjectIds alone will take about 800k * 12 bytes ~ 9.6 MB, which is getting close to the 16 MB maximum for a BSON document.
In this case, is there an approach other than the typical relational way of creating an intermediate collection for the relations?
Also, I don't know whether reading and parsing a JSON document of about 10 MB would perform better than the classic relational way.
Thank you
For films, if you include the viewers, you might eventually hit the 16MB size limit of BSON documents, as you correctly stated.
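To put a number on that: the 12 bytes per ObjectId are only part of the cost, because BSON stores an array as a document whose keys are the indexes "0", "1", and so on. The sketch below estimates the encoded size of such an array from the BSON layout (4-byte length, then per element one type byte, the key as a NUL-terminated string and the 12-byte value, then a 1-byte terminator). The function name objectIdArrayBytes is made up for this example, and the rest of the document (name, _id) is ignored.

```javascript
// Estimated BSON size in bytes of an array holding n ObjectIds.
// Layout: int32 length + elements + 0x00 terminator, where each element is
// 1 type byte + decimal index as a C string (digits + NUL) + 12-byte ObjectId.
function objectIdArrayBytes(n) {
  let bytes = 4 + 1; // length prefix and trailing terminator
  for (let i = 0; i < n; i++) {
    bytes += 1 + String(i).length + 1 + 12;
  }
  return bytes;
}

const MAX_BSON_BYTES = 16 * 1024 * 1024; // 16 MB document size limit

const size = objectIdArrayBytes(800000);
console.log(size);                  // prints 15888895
console.log(size < MAX_BSON_BYTES); // prints true
```

So by this estimate, 800k viewers already take up roughly 95% of the limit rather than a bit more than half of it.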
Putting the films a user has seen into an array is a viable approach, depending on your use cases. However, especially if you want relations with attributes (say, date and place of viewing), updates and statistical analysis become less performant (you would need to $unwind your docs first, subsequent $match stages become more costly, and so on).
If your relations have or may have attributes, I'd go with what you describe as the classic relational way, since in my experience it answers your most likely use cases as well as embedding does and allows for higher performance:
Given a collection with a structure like
{
_id: someObjectId,
date: ISODate("2016-05-05T03:42:00Z"),
movie: "nameOfMovie",
user: "username"
}
You have everything at hand to answer the following sample questions easily:
For a given user, which movies have they seen in the last 3 months, in descending order of date?
db.views.aggregate([
{$match:{user:userName, date:{$gte:threeMonthAgo}}},
{$sort:{date:-1}},
{$group:{_id:"$user",viewed:{$push:{movie:"$movie",date:"$date"}}}}
])
or, if you are ok with an iterator, even easier with:
db.views.find({user:userName, date:{$gte:threeMonthAgo}}).sort({date:-1})
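Both queries above assume a variable threeMonthAgo that is never defined. A minimal sketch of how it could be computed in the shell or in Node; the helper monthsAgo and the placeholder user name are assumptions made for this example.

```javascript
// Returns a new Date that lies `months` calendar months before `from` (in UTC).
// Caveat: Date rolls over when the target month is shorter, so May 31 minus
// 3 months ("Feb 31") ends up in early March.
function monthsAgo(months, from = new Date()) {
  const d = new Date(from.getTime()); // copy, so `from` is not modified
  d.setUTCMonth(d.getUTCMonth() - months);
  return d;
}

const threeMonthAgo = monthsAgo(3);

// Filter document as it could be passed to find() or used inside $match.
const userName = "someUser"; // placeholder value
const filter = { user: userName, date: { $gte: threeMonthAgo } };
console.log(filter);
```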
For a given movie, how many users have seen it on May 30th this year?
db.views.aggregate([
{$match:{
movie:movieName,
date:{
$gte:ISODate("2016-05-30T00:00:00"),
$lt:ISODate("2016-05-31T00:00:00")}
}},
{$group:{
_id: "$movie",
views: {$sum:1}
}}
])
The reason why I use an aggregation here instead of a .count() on the result is SERVER-3645 (counts on sharded collections can be inaccurate).
For a given movie, show all users who have seen it.
db.views.find({movie:movieName},{_id:0,user:1})
There is a thing to note: Since we used the usernames and movie names, respectively, we do not need a JOIN (or something similar), which should give us good performance. Plus we do not have to do rather costly update operations when adding entries. Instead of an update, we simply insert the data.
I'm looking for a good solution in MongoDB for this problem:
There are some categories, and every category has X items.
But some items can be in many categories!
I was looking for something like a symbolic link on Unix systems, but I could not find it.
What I thought would be a good idea is:
"Category1/item1" is the object and "category2/item44232" is only a reference to "item1", so when I change "item1", "item44232" changes too.
I looked into the MongoDB data models documentation, but there is no real solution for this.
Thank you for your response!
In RDBMSs, you use a join table to represent many-to-many relationships; in MongoDB, you use array keys. For example, each product contains an array of category IDs, and both products and categories get their own collections. If you have two simple category documents
{ _id: ObjectId("4d6574baa6b804ea563c132a"),
title: "Epiphytes"
}
{ _id: ObjectId("4d6574baa6b804ea563c459d"),
title: "Greenhouse flowers"
}
then a product belonging to both categories will look like this:
{ _id: ObjectId("4d6574baa6b804ea563ca982"),
name: "Dragon Orchid",
category_ids: [ ObjectId("4d6574baa6b804ea563c132a"),
ObjectId("4d6574baa6b804ea563c459d") ]
}
for more: http://docs.mongodb.org/manual/reference/database-references/
Try looking at the problem inside-out: instead of having items inside categories, have the items list the categories they belong to.
You'll be able to easily find all items that belong to a category (or even multiple categories), and there is no duplication nor any need to keep many instances of the same item updated.
This can be very fast and efficient, especially if you index the list of categories. Check out multikey indexes.
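As a rough sketch of what that looks like: the filters below are the kind of documents you would pass to find(), and the helper is an in-memory stand-in so the matching rule can be tried without a server. The collection name items, the sample data and the string ids (instead of ObjectIds) are simplifications invented for this example.

```javascript
// Hypothetical sample data: each item lists the ids of its categories.
const items = [
  { _id: "item1", name: "Dragon Orchid", category_ids: ["epiphytes", "greenhouse"] },
  { _id: "item2", name: "Fern", category_ids: ["greenhouse"] }
];

// Filters as they could be passed to db.items.find(...).
// Equality on an array field matches if any element equals the value.
const inOneCategory = { category_ids: "greenhouse" };
// $all requires every listed value to be present in the array.
const inBothCategories = { category_ids: { $all: ["epiphytes", "greenhouse"] } };

// A multikey index on the array field could be created with:
//   db.items.createIndex({ category_ids: 1 })

// In-memory equivalent of the $all filter (also covers the single-category case).
function itemsInAllCategories(allItems, categoryIds) {
  return allItems.filter((item) =>
    categoryIds.every((id) => item.category_ids.includes(id))
  );
}

console.log(itemsInAllCategories(items, ["greenhouse"]).map((i) => i.name));
// prints [ 'Dragon Orchid', 'Fern' ]
console.log(itemsInAllCategories(items, ["epiphytes", "greenhouse"]).map((i) => i.name));
// prints [ 'Dragon Orchid' ]
```

Because the item exists only once, changing it changes what every category "sees", which is the symbolic-link behaviour asked for.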
I am experimenting with the best way to model MongoDB documents.
I am modelling a school.
A Student has many subjects.
Student{
subjects:[ {name:'',
level:'',
short_name:''
},
{...},
{...}]
}
I decided to denormalise and embed subjects into students for performance.
There are rare cases where a subject needs to be queried and updated.
subjects.all
subject1.short_name = 'something new'
I know I will have to iterate through every student to update every subject record.
However, what's the best way to return all unique subjects?
Can you do a unique search of student.subjects names, for example?
Or is it better to have another collection, which is:
Subjects{
name:'',
level:'',
short_name:''
}
I would still keep the denormalised Student.subjects; the separate collection would simply be there for querying all the subjects on offer.
An update would then have to change this collection plus every embedded Student.subjects entry?
Any suggestions/recommendations?
However, what's the best way to return all unique subjects?
This is a shortfall of your schema. You traded the ability to do this kind of thing easily for speed benefits in the operations you do more often.
Currently the only real way is to either use the distinct() command ( http://docs.mongodb.org/manual/reference/method/db.collection.distinct/ ):
db.students.distinct('subjects.name');
or the aggregation framework:
db.students.aggregate([
{$unwind:'$subjects'},
{$group:{_id:'$subjects.name'}}
])
Like so.
As for schema recommendation, if you intend to make this kind of query often then I would factor out subjects into a separate collection.
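For reference, both commands above boil down to "walk every student's subjects array and collect each name once". An in-memory sketch of that logic, with made-up sample data and a hypothetical helper name, which also shows why the whole collection has to be touched:

```javascript
// Hypothetical sample data following the Student schema from the question.
const students = [
  {
    name: "student1",
    subjects: [
      { name: "Maths", level: "A", short_name: "MA" },
      { name: "Physics", level: "A", short_name: "PH" }
    ]
  },
  { name: "student2", subjects: [{ name: "Maths", level: "A", short_name: "MA" }] },
  { name: "student3" } // no subjects embedded yet
];

// Collects every distinct subjects.name, in first-seen order.
function distinctSubjectNames(docs) {
  const names = new Set();
  for (const student of docs) {
    for (const subject of student.subjects || []) {
      names.add(subject.name);
    }
  }
  return [...names];
}

console.log(distinctSubjectNames(students)); // prints [ 'Maths', 'Physics' ]
```

With a separate subjects collection, the same list is a plain find() over a much smaller set of documents.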
I have a MongoDB object, let's call it place, which contains geo information. Look at the example:
{
"_id": "234235425e3g33424",
"geo": {
"lon": 12.23456,
"lat": 34.23322
},
"some_field": "value"
}
A list of features is associated with every place:
{
"_id": "2334sgfgsr435d",
"place_id": "234235425e3g33424",
"feature_field" : "some_value"
}
As you can see, features are linked to places through the place_id field. Now I would like to find the list of features connected with the nearest places. I would also like to add search conditions on place.some_field and feature.feature_field. And, importantly, I would like to limit the results.
Currently I am using this approach:
1. I query places with a condition on geo and some_field.
2. I query features with a condition on feature_field and place_id (limited to the places found in step 1).
3. I limit the results in my application code.
My question is: is there a better approach to this task? I cannot use MongoDB's limit() on the places query, because I might end up with too few results after the second query. I cannot use limit() on the second query either, because its results come back in no particular order and I would like to sort them by distance.
I know I could put the data into one document, but I presume the list of features will be long, and I could exceed the BSON size limit.
Running out of 16 MB for just the features seems unlikely, but it's possible. I don't think you realize how much 16 MB is, so do the maths before assuming anything!
In any case, with MongoDB you cannot do a query with fields from two collections. A query always deals with one specific collection only. I have done something very similar to what you have here, though, which I've described in an article: http://derickrethans.nl/indexing-free-tags.html. Have a look at that for some more inspiration.
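If you stay with two collections, the merge step from the question (limit in application code while keeping the distance order) can be sketched like this. Everything here runs in memory: in a real setup step 1 would be a geo query such as $near, which already returns places nearest-first, and step 2 a find() on features with $in on place_id. The function names, the sample data and the planar distance are assumptions for illustration only.

```javascript
// Squared planar distance; good enough to order nearby points in a sketch.
function squaredDistance(a, b) {
  const dLon = a.lon - b.lon;
  const dLat = a.lat - b.lat;
  return dLon * dLon + dLat * dLat;
}

function nearestFeatures(places, features, origin, limit, placeOk, featureOk) {
  // Step 1: matching places, nearest first.
  const sortedPlaces = places
    .filter(placeOk)
    .sort((p, q) => squaredDistance(p.geo, origin) - squaredDistance(q.geo, origin));

  // Step 2: matching features of those places, grouped by place_id.
  const placeIds = new Set(sortedPlaces.map((p) => p._id));
  const featuresByPlace = new Map();
  for (const f of features) {
    if (!placeIds.has(f.place_id) || !featureOk(f)) continue;
    if (!featuresByPlace.has(f.place_id)) featuresByPlace.set(f.place_id, []);
    featuresByPlace.get(f.place_id).push(f);
  }

  // Step 3: walk the places in distance order and stop once the limit is reached.
  const result = [];
  for (const place of sortedPlaces) {
    for (const f of featuresByPlace.get(place._id) || []) {
      if (result.length >= limit) return result;
      result.push(f);
    }
  }
  return result;
}

// Hypothetical sample data using the field names from the question.
const places = [
  { _id: "far", geo: { lon: 5, lat: 5 }, some_field: "value" },
  { _id: "near", geo: { lon: 1, lat: 1 }, some_field: "value" },
  { _id: "skip", geo: { lon: 0, lat: 0 }, some_field: "other" }
];
const features = [
  { _id: "f1", place_id: "far", feature_field: "some_value" },
  { _id: "f2", place_id: "near", feature_field: "some_value" },
  { _id: "f3", place_id: "near", feature_field: "some_value" },
  { _id: "f4", place_id: "skip", feature_field: "some_value" }
];

const found = nearestFeatures(
  places, features, { lon: 0, lat: 0 }, 2,
  (p) => p.some_field === "value",
  (f) => f.feature_field === "some_value"
);
console.log(found.map((f) => f._id)); // prints [ 'f2', 'f3' ]
```

To avoid fetching every place up front, the same loop could be fed from the places cursor in batches, stopping as soon as the limit is reached.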
I have a collection named Events. Each Event document has a collection of Participants as embedded documents.
Now my question: is there a way to query an Event and get only the Participants with, for example, Age > 18?
When you query a collection in MongoDB, by default it returns the entire document which matches the query. You could slice it and retrieve a single subdocument if you want.
If all you want is the Participants who are older than 18, it would probably be best to do one of two things:
Store them in a subdocument inside of the event document called "Over18" or something. Insert them into that document (and possibly the other if you want), and then when you query the collection, you can instruct the database to return only the "Over18" subdocument. The downside to this is that you store your participants in two different subdocuments and you will have to figure out their age before inserting. This may or may not be feasible depending on your application. If you need to be able to check on arbitrary ages (i.e. sometimes it's 18 but sometimes it's 21 or 25, etc.), then this will not work.
Query the collection and retrieve the Participants subdocument, and then filter it in your application code. Despite what some people may believe, this isn't terrible, because you don't want your database to be doing too much work all the time. Offloading the computations to your application could actually benefit your database, because it can now spend more time querying and less time filtering. It leads to better scalability in the long run.
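The second option is only a few lines of application code. A minimal sketch, assuming the embedded array is called participants and each entry has a numeric age field (both names are guesses, as are the sample event and the helper name):

```javascript
// Hypothetical event document as it might come back from something like
//   db.events.findOne({ _id: eventId }, { participants: 1 })
const event = {
  _id: "event1",
  participants: [
    { firstName: "Ann", age: 17 },
    { firstName: "Bob", age: 18 },
    { firstName: "Cid", age: 30 }
  ]
};

// Keeps only the participants strictly older than minAge.
function participantsOlderThan(eventDoc, minAge) {
  return (eventDoc.participants || []).filter((p) => p.age > minAge);
}

console.log(participantsOlderThan(event, 18).map((p) => p.firstName)); // prints [ 'Cid' ]
```

Since the age threshold is just an argument, this also covers the arbitrary-age case that the first option cannot handle.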
Short answer: no. I tried to do the same a couple of months back, but MongoDB does not support it (at least in versions <= 1.8). The same question has surely been asked in their Google Group. You can either store the participants as a separate collection or get the whole documents and then filter them on the client. Far from ideal, I know. I'm still trying to figure out the best way around this limitation.
For future reference: This will be possible in MongoDB 2.2 using the new aggregation framework, by aggregating like this:
db.events.aggregate([
{ $unwind: '$participants' },
{ $match: {'participants.age': {$gte: 18}}},
{ $project: {participants: 1}}
])
This will return a list of n documents, where n is the number of participants aged 18 or over, and each entry looks like this (note that the "participants" field now holds a single entry instead of an array):
{
_id: objectIdOfTheEvent,
participants: { firstName: 'only one', lastName: 'participant'}
}
It could probably even be flattened on the server to return a list of participants. See the official documentation for more information.