I have a collection of Parents that contain EmbeddedThings, and each EmbeddedThing contains a reference to the User that created it.
UserCollection: [
    {
        _id: ObjectId(…),
        name: '…'
    },
    …
]
ParentCollection: [
    {
        _id: ObjectId(…),
        EmbeddedThings: [
            {
                _id: 1,
                userId: ObjectId(…)
            },
            {
                _id: 2,
                userId: ObjectId(…)
            }
        ]
    },
    …
]
I soon realized that I need to get all EmbeddedThings for a given user, which I managed to accomplish using map/reduce:
"results": [
    {
        "_id": 1,
        "value": [ `EmbeddedThing`, `EmbeddedThing`, … ]
    },
    {
        "_id": 2,
        "value": [ `EmbeddedThing`, `EmbeddedThing`, … ]
    },
    …
]
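A map/reduce along these lines produces that shape. The collection and field names below are assumptions taken from the structure above; in the real shell call the map function would use `this` and the global `emit`, so here the same logic is driven by a tiny in-memory harness that can run anywhere:

```javascript
// Sketch only: names (Parent, EmbeddedThings, userId) are assumed from the
// example documents above, not taken from a real schema.

// map: emit each embedded thing keyed by the user that created it
function mapThing(doc, emit) {
  doc.EmbeddedThings.forEach(function (thing) {
    emit(thing.userId, { things: [thing] });
  });
}

// reduce: concatenate the per-emit arrays for one user
function reduceThings(userId, values) {
  var all = [];
  values.forEach(function (v) { all = all.concat(v.things); });
  return { things: all };
}

// tiny in-memory driver standing in for db.Parent.mapReduce(map, reduce, ...)
function runMapReduce(docs) {
  var grouped = {};
  docs.forEach(function (doc) {
    mapThing(doc, function (key, value) {
      (grouped[key] = grouped[key] || []).push(value);
    });
  });
  return Object.keys(grouped).map(function (key) {
    return { _id: key, value: reduceThings(key, grouped[key]) };
  });
}
```

In the shell, the map function would be written as `function () { this.EmbeddedThings.forEach(function (t) { emit(t.userId, { things: [t] }); }); }` and passed to `mapReduce` together with the reduce function.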
Is this where I should really just normalize EmbeddedThing into its own collection, or should I still keep map/reduce to accomplish this? Some other design perhaps?
If it helps, this is for users to see their list of EmbeddedThings across all Parents, as opposed to some reporting/aggregation task (which made me realize I might be doing this wrong).
Thanks!
"To embed or not to embed: that is the question" :)
My rules are:
embed if an embedded object makes sense only in the context of its parent object. For example, an OrderItem without an Order doesn't make sense.
embed if dictated by performance requirements. It's very cheap to read a full document tree (as opposed to having to make several queries and join them programmatically).
You should look at your access patterns. If you load a ParentThing several thousand times per second, but load a User's list once a week, then map/reduce is probably a good choice. The user query will be slow, but that might be acceptable for your application.
Yet another approach is to denormalize even more: when you add an embedded thing, add it to both the parent thing and the user.
Pros: queries are fast.
Cons: complicated code, double the amount of writes, and potential loss of sync (you update/delete in one place but forget to do it in the other).
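A sketch of that double write (collection and field names are illustrative; in the shell it would be two separate updates, shown in the comments), emulated here with plain objects so the bookkeeping is visible:

```javascript
// Hypothetical denormalized write: every new embedded thing is pushed both
// into its parent and into the creating user's own copy. In the shell this
// would be something like:
//   db.parents.update({ _id: parentId }, { $push: { EmbeddedThings: thing } })
//   db.users.update({ _id: thing.userId }, { $push: { embeddedThings: thing } })
function addEmbeddedThing(parent, usersById, thing) {
  parent.EmbeddedThings.push(thing);           // write #1: the parent
  var user = usersById[thing.userId];
  user.embeddedThings = user.embeddedThings || [];
  user.embeddedThings.push(thing);             // write #2: the denormalized copy
}
```

The cons above live exactly in that second push: forget it in one code path (or fail between the two writes) and the copies drift apart.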
Related
I'm new to NoSQL, so my thinking process might be wrong, but I am trying to figure out how to filter through a possibly infinitely nested object (a comment, replies to the comment, replies to those replies, and so on). I am using MongoDB, but this probably applies to other NoSQL databases too.
This is the structure I wanted to use:
Post
{
    "name": "name",
    "comments": [
        {
            "id": "someid",
            "author": "author",
            "replies": [
                {
                    "id": "someid",
                    "author": "author",
                    "replies": [
                        {
                            ...
                        }
                    ]
                },
                {
                    "id": "someid",
                    "author": "author",
                    "replies": null
                }
            ]
        }
    ]
}
As you can see, replies can be infinitely nested. (Well, unless I set a limit, which doesn't sound that stupid.)
But now if a user wants to edit or delete a comment, I have to find it, and I can't see any better way than looping through all of them, which would be very slow with a lot of comments.
I was thinking of creating an ID for each comment that would help in finding it (somewhat inspired by a hashmap, but not exactly). It could include the depth (how deeply nested the comment is), so I would only filter through comments at least that deep; but that would only improve performance slightly, and only in specific cases. In the worst case I would still have to loop through all of them. The ID could also encode the indexes of comments and replies, but that is limited, since an ID can't be infinite while replies can.
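For illustration, one common shape of that idea is a "materialized path" (the field names here are my own assumptions): store every comment as a flat document carrying the chain of its ancestors' ids, so that editing or deleting one comment is a direct indexed lookup, and fetching a whole subtree is a prefix match instead of a tree walk:

```javascript
// Sketch: flat comments with materialized paths. In MongoDB each entry would
// be its own document, e.g. { _id: "c2", postId: "p1", path: "c1/", ... },
// and a subtree query would be a prefix regex on an indexed "path" field:
//   db.comments.find({ path: /^c1\// })
var comments = [
  { _id: "c1", path: "",       author: "a" },
  { _id: "c2", path: "c1/",    author: "b" },  // reply to c1
  { _id: "c3", path: "c1/c2/", author: "c" }   // reply to c2
];

// editing/deleting one comment: direct id lookup, no tree walk
function findById(all, id) {
  return all.filter(function (c) { return c._id === id; })[0];
}

// fetching a whole subtree: path-prefix filter
function subtree(all, id) {
  var root = findById(all, id);
  var prefix = root.path + root._id + "/";
  return all.filter(function (c) { return c.path.indexOf(prefix) === 0; });
}
```

The trade-off is that nesting is no longer stored physically; the tree is rebuilt on read from the path strings.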
I couldn't find any MongoDB query for that.
Is there any solution / algorithm to do it more efficiently?
Currently I am working on a mobile app. Basically, people can post their photos and their followers can like them, as on Instagram. I use MongoDB as the database. As on Instagram, there may be a lot of likes for a single photo, so a separate indexed document per "like" seems unreasonable because it would waste a lot of memory. However, I'd like a user to be able to add a like quickly. So my question is: how should I model the "like"? Basically the data model is very similar to Instagram's, but using MongoDB.
No matter how you structure your overall document, there are basically two things you need: a property for a "count" and a "list" of those who have already posted their "like", in order to ensure no duplicates are submitted. Here's a basic structure:
{
    "_id": ObjectId("54bb201aa3a0f26f885be2a3"),
    "photo": "imagename.png",
    "likeCount": 0,
    "likes": []
}
Whatever the case, there is a unique "_id" for your "photo post" and whatever information you want, and then the other fields as mentioned. The "likes" property here is an array, which will hold the unique "_id" values of the "user" objects in your system. Every "user" has their own unique identifier somewhere, whether in local storage, from OpenID, or elsewhere. I'll stick with ObjectId for the example.
When someone submits a "like" to a post, you want to issue the following update statement:
db.photos.update(
    {
        "_id": ObjectId("54bb201aa3a0f26f885be2a3"),
        "likes": { "$ne": ObjectId("54bb2244a3a0f26f885be2a4") }
    },
    {
        "$inc": { "likeCount": 1 },
        "$push": { "likes": ObjectId("54bb2244a3a0f26f885be2a4") }
    }
)
Now the $inc operation there will increase the value of "likeCount" by the number specified, so increase by 1. The $push operation adds the unique identifier for the user to the array in the document for future reference.
The important thing here is keeping a record of the users who voted, and what is happening in the "query" part of the statement: apart from selecting the document to update by its own unique "_id", the other key point is checking the "likes" array to make sure the current voting user is not in there already.
The same is true for the reverse case or "removing" the "like":
db.photos.update(
    {
        "_id": ObjectId("54bb201aa3a0f26f885be2a3"),
        "likes": ObjectId("54bb2244a3a0f26f885be2a4")
    },
    {
        "$inc": { "likeCount": -1 },
        "$pull": { "likes": ObjectId("54bb2244a3a0f26f885be2a4") }
    }
)
Here the important thing is the query conditions, which ensure that no document is touched unless all conditions are met: the count does not increase if the user has already voted, and does not decrease if their vote was not actually present at the time of the update.
Of course, it is not practical to read back an array with a couple of hundred entries elsewhere in your application. But MongoDB has a very standard way to handle that as well:
db.photos.find(
    {
        "_id": ObjectId("54bb201aa3a0f26f885be2a3")
    },
    {
        "photo": 1,
        "likeCount": 1,
        "likes": {
            "$elemMatch": { "$eq": ObjectId("54bb2244a3a0f26f885be2a4") }
        }
    }
)
This usage of $elemMatch in the projection returns the current user only if they are present, or an empty array if they are not. This lets the rest of your application logic know whether the current user has already placed a vote.
That is the basic technique and may work for you as is, but you should be aware that embedded arrays should not grow indefinitely, and that there is a hard 16MB limit on BSON documents. So the concept is sound, but it cannot be used on its own if you are expecting thousands of "like votes" on your content. There is a concept known as "bucketing", discussed in some detail in this example of Hybrid Schema design, which offers one solution for storing a high volume of "likes". You can use it along with the basic concepts here as a way to do this at volume.
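To give a flavour of that bucketing idea (the bucket size and field names below are my own illustration, not taken from the linked article): likes are spread across a series of fixed-size "bucket" documents per photo, and a new like goes into the current bucket until it fills up. A plain-JavaScript emulation of that bookkeeping:

```javascript
// Illustrative sketch: each bucket would be its own MongoDB document, e.g.
//   { photoId: ..., seq: 0, count: 2, likes: [userA, userB] }
var BUCKET_SIZE = 2; // tiny for illustration; hundreds in practice

function addLike(buckets, photoId, userId) {
  // refuse duplicates across all buckets for this photo
  var dup = buckets.some(function (b) {
    return b.photoId === photoId && b.likes.indexOf(userId) !== -1;
  });
  if (dup) return false;
  // find the current (unfilled) bucket, or start a new one
  var current = buckets.filter(function (b) {
    return b.photoId === photoId && b.count < BUCKET_SIZE;
  })[0];
  if (!current) {
    current = { photoId: photoId, seq: buckets.length, count: 0, likes: [] };
    buckets.push(current);
  }
  current.likes.push(userId);
  current.count += 1;
  return true;
}
```

Each individual bucket stays small enough to update cheaply and well under the 16MB limit, while the total number of likes per photo is unbounded.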
I am currently building an app with Meteor and MongoDB. I have a 3-level document structure with arrays nested in arrays:
{
    _id: "shtZFiTeHrPKyJ8vR",
    description: "Some title",
    categories: [{
        id: "shtZFiTeHrPKyJ8vR",
        name: "Foo",
        options: [{
            id: "shtZFiTeHrPKyJ8vR",
            name: "bar",
            likes: ["abc", "bce"]
        }]
    }]
}
Now, the document could be manipulated at any level. Means:
description could be changed
categories can be added / removed / renamed
options can be added / removed / renamed
users can like options, so likes must be added or removed
Points 1 and 2 are quite easy. It is also relatively easy to add or remove an option:
MyCollection.update({ _id: id, "categories.id": categoryId }, {
    $push: {
        "categories.$.options": {
            id: Random.id(),
            name: optionName
        }
    }
});
But manipulating the options themselves requires doing it on JavaScript objects: I first need to fetch the document, iterate over the options, and then write them back. At least that's what I am doing right now, and I don't like that approach.
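For reference, that read-modify-write fallback might look something like this (a sketch with assumed names, not the actual code); after mutating in JavaScript, the whole `categories` array is written back with a single `$set`:

```javascript
// Sketch of the read-modify-write fallback (field names assumed from the
// document above). After mutating in JS you would write it back with e.g.
//   MyCollection.update({ _id: doc._id }, { $set: { categories: doc.categories } })
function renameOption(doc, categoryId, optionId, newName) {
  doc.categories.forEach(function (cat) {
    if (cat.id !== categoryId) return;
    cat.options.forEach(function (opt) {
      if (opt.id === optionId) opt.name = newName;
    });
  });
  return doc;
}
```

The downside is exactly what the question describes: the whole array travels client-side and back, and concurrent writers can overwrite each other between the read and the write.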
What I was thinking about is splitting the collection, at least putting the likes into their own collection referencing the origin document.
Or is there another way? I don't really like both of my possible solutions.
For this kind of query one would normally use the Mongo positional operator. However, from the docs:
Nested Arrays
The positional $ operator cannot be used for queries
which traverse more than one array, such as queries that traverse
arrays nested within other arrays, because the replacement for the $
placeholder is a single value
Thus the only way to do what you want natively is by using specific indexes:
db.test.update({},{$pull:{"categories.0.options.0.likes":"abc"}})
Unfortunately, Mongo does not provide an easy way to get the index of a matched nested document.
I would normally say that once your queries become this difficult, it's probably a good idea to revisit the way you store your data. Also, with that many arrays being pushed to, Mongo will probably be relocating a lot of documents; this is definitely something you want to minimize.
So at this point you will need to separate your data out into different documents and even collections.
Your first documents would look like this:
{
    _id: "shtZFiTeHrPKyJ8vR",
    description: "Some title",
    categories: [{
        id: "shtZFiTeHrPKyJ8vR",
        name: "Foo",
        options: ["shtZFiTeHrPKyJ8vR"]
    }]
}
This way you can easily add/remove options as you mentioned in your question. You would then need a second collection with documents that represent each option.
{
    _id: "shtZFiTeHrPKyJ8vR",
    name: "bar",
    likes: ["abc", "bce"]
}
You can learn more about references here. This is similar to what you mentioned in your comment. The benefit of this is that you are already reducing the potential amount of relocation. Depending on how you use your data you may even be reducing network usage.
Now doing updates on the likes is easy.
MyCollection.update({ _id: id }, {
    $push: { likes: "value" }
});
This does, however, require two queries to the db. On the flip side, you do a lot less work on the client side and use a lot less bandwidth.
Another question you need to ask yourself is whether that depth of nesting is really needed. There might be an easier way to achieve your goal that doesn't become so complicated.
Why does MongoDB not support queries of properties of embedded documents that are stored using hashes?
For example say you have a collection called "invoices" which was created like this:
db.invoices.insert([
    {
        productsBySku: {
            12432: {
                price: 49.99,
                qty_in_stock: 4
            },
            54352: {
                price: 29.99,
                qty_in_stock: 5
            }
        }
    },
    {
        productsBySku: {
            42432: {
                price: 69.99,
                qty_in_stock: 0
            },
            53352: {
                price: 19.99,
                qty_in_stock: 5
            }
        }
    }
]);
With such a structure, MongoDB queries with $elemMatch, dot syntax, or the positional operator ($) fail to access any of the properties of each productsBySku member.
For example you can't do any of these:
db.invoices.find({"productsBySku.qty_in_stock":0});
db.invoices.find({"productsBySku.$.qty_in_stock":0});
db.invoices.find({"productsBySku.qty_in_stock":{$elemMatch:{$eq:0}}});
db.invoices.find({"productsBySku.$.qty_in_stock":{$elemMatch:{$eq:0}}});
To find out-of-stock products therefore you have to resort to using a $where query like:
db.invoices.find({
    $where: function () {
        for (var i in this.productsBySku)
            if (!this.productsBySku[i].qty_in_stock)
                return true;
        return false;
    }
});
On a technical level... why did they design MongoDB with this severe limitation on queries? Surely there must be some technical reason for this seemingly major flaw. Is this inability to treat a list of objects as an array, ignoring the keys, just a limitation of JavaScript as a language? Or was it the result of an architectural decision within MongoDB?
Just curious.
As a rule of thumb: Usually, these problems aren't technical ones, but problems with data modeling. I have yet to find a use case where it makes sense to have keys hold semantic value.
If you had something like
'products': [
    { sku: 12432, price: 49.99, qty_in_stock: 4 },
    { sku: 54352, price: 29.99, qty_in_stock: 5 }
]
It would make a lot more sense.
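With the array form, the queries that fail on keyed objects become ordinary ones (the query shapes in the comments are sketches of the standard operators); the plain-JavaScript function below spells out the same match the dot-notation query performs:

```javascript
// With products as an array, out-of-stock items are found with plain dot
// notation or $elemMatch, e.g.:
//   db.invoices.find({ "products.qty_in_stock": 0 })
//   db.invoices.find({ products: { $elemMatch: { qty_in_stock: 0 } } })
// The same predicate, written out in plain JavaScript:
function hasOutOfStock(invoice) {
  return invoice.products.some(function (p) {
    return p.qty_in_stock === 0;
  });
}
```

No `$where` scan is needed, and both queries can use an index on `products.qty_in_stock`.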
But: you are modelling invoices. An invoice should – for many reasons – reflect a status at a certain point in time; the ever-changing stock rarely belongs on an invoice. So here is how I would model the data for items and invoices:
{
    '_id': '12432',
    'name': 'SuperFoo',
    'description': "Without SuperFoo, you can't bar or baz!",
    'current_price': 49.99
}
Same with the other items.
Now, the invoice would look quite simple:
{
    _id: "Invoice2",
    customerId: "987654",
    date: ISODate("2014-07-07T12:42:00Z"),
    delivery_address: "Foo Blvd 42, Appt 42, 424242 Bar, BAZ",
    items: [
        { id: '12432', qty: 2, price: 49.99 },
        { id: '54352', qty: 1, price: 29.99 }
    ]
}
Now the invoice would hold things that may only be valid at a given point in time (prices and delivery address may change) and both your stock and the invoices are queried easily:
// How many items of 12432 are in stock?
db.products.find({_id:'12432'},{qty_in_stock:1})
// How many items of 12432 were sold during July and what was the average price?
db.invoices.aggregate([
    { $unwind: "$items" },
    {
        $match: {
            "items.id": "12432",
            "date": {
                $gt: ISODate("2014-07-01T00:00:00Z"),
                $lt: ISODate("2014-08-01T00:00:00Z")
            }
        }
    },
    {
        $group: {
            _id: "$items.id",
            count: { $sum: "$items.qty" },
            avg: { $avg: "$items.price" }
        }
    }
])
// How many items of each product did I sell yesterday?
db.invoices.aggregate([
    { $match: { date: { $gte: ISODate("2014-11-16T00:00:00Z"), $lt: ISODate("2014-11-17T00:00:00Z") } } },
    { $unwind: "$items" },
    { $group: { _id: "$items.id", count: { $sum: "$items.qty" } } }
])
Combined with the query for how many items of each product you have in stock, you can find out whether you have to order something (you have to do that calculation in your code; there is no easy way to do it in MongoDB).
You see, with a "small" change, you get a lot of questions answered.
And that's basically how it works. With relational data, you model your data so that the entities are reflected properly and then you ask
How do I get my answers out of this data?
In NoSQL in general and especially with MongoDB you first ask
Which questions do I need to get answered?
and model your data accordingly. A subtle, but important difference.
To be honest, I am not sure; you would have to ask MongoDB Inc. (10gen) themselves. I will attempt to explain some of my reasoning.
I have searched on Google a little and nothing seems to appear: https://www.google.co.uk/search?q=mognodb+jira+support+querying+objects&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a&channel=fflb&gfe_rd=cr&ei=as9pVOW3OMyq8wfhtYCgCw#rls=org.mozilla:en-GB:official&channel=fflb&q=mongodb+jira+querying+objects
It is easy to see how using object properties as keys could be advantageous: for example, remove queries would not have to search every object and its properties within the array, but could instead just find the single object property in the parent object and unset it. Essentially it would be the difference between:
[
    { id: 1, d: 3, e: 54 },
    { id: 3, t: 6, b: 56 }
]
and:
{
    1: { d: 3, e: 54 },
    3: { t: 6, b: 56 }
}
with the latter, obviously, being a lot quicker for deleting the entry with id 3.
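Concretely, the contrast being drawn might look like this (collection and field names are illustrative), with the two shell shapes in the comments and plain-JavaScript equivalents below:

```javascript
// In the shell the contrast would be roughly:
//   db.coll.update({}, { $pull:  { things: { id: 3 } } })  // array: scan elements
//   db.coll.update({}, { $unset: { "things.3": "" } })     // object: direct key removal
// Plain-JS equivalents of the two shapes:
function pullById(arr, id) {
  // array form: every element must be examined
  return arr.filter(function (el) { return el.id !== id; });
}
function unsetKey(obj, key) {
  // object form: direct key lookup, no scan
  delete obj[key];
  return obj;
}
```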
Not only that, but all the array operations MongoDB provides, from $elemMatch to $unwind, could work with objects as well. I mean, how is unwinding:
[
{id:5, d:4}
]
much different to unwinding:
{
5: {d:4}
}
?
So, to be honest, I cannot answer your question. There is no public defence of the decision that I can find on Google, and no extensive discussion from what I can see.
In fact, I went as far as searching for this a couple of times, including: https://www.google.co.uk/search?q=array+operations+that+do+not+work+on+objects&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a&channel=fflb&gfe_rd=cr&ei=DtNpVLrwDsPo7AaH4oCoDw and I found results going as far as underscore.js, which actually makes its array functions work on all objects as well.
The only real reason I can think of is standardisation: instead of catering to every minority use case for how subdocuments might work, they cater to a single one, turned majority by their choice.
It is one of the points about MongoDB that still confuses me, since there are many cases in my own programming where using objects instead of arrays seems advantageous for speed and power.