MongoDB - Manipulating multi-level arrays in a document - mongodb

I am currently building an app with Meteor and MongoDB. I have a three-level document structure with arrays nested inside arrays:
{
_id: "shtZFiTeHrPKyJ8vR",
description: "Some title",
categories: [{
id: "shtZFiTeHrPKyJ8vR",
name: "Foo",
options: [{
id: "shtZFiTeHrPKyJ8vR",
name: "bar",
likes: ["abc", "bce"]
}]
}]
}
Now, the document can be modified at any level, meaning:
1. description can be changed
2. categories can be added / removed / renamed
3. options can be added / removed / renamed
4. users can like options, so likes must be added or removed
1 and 2 are quite easy. It is also relatively easy to add or remove an option:
MyCollection.update({ _id: id, "categories.id": categoryId }, {
$push: {
"categories.$.options": {
id: Random.id(),
name: optionName
}
}
});
But manipulating anything inside the options array has to be done on JavaScript objects. That means I first need to find my document, iterate over the options and then write them back.
At least that's what I am doing right now. But I don't like that approach.
What I was thinking about is splitting the collection, at least to put the likes into their own collection referencing the original document.
Or is there another way? I don't really like either of my possible solutions.

For this kind of query one would normally use the Mongo positional operator. However, the docs say:
Nested Arrays
The positional $ operator cannot be used for queries
which traverse more than one array, such as queries that traverse
arrays nested within other arrays, because the replacement for the $
placeholder is a single value
Thus the only way to natively do what you want is by using specific indexes.
db.test.update({},{$pull:{"categories.0.options.0.likes":"abc"}})
Unfortunately Mongo does not offer an easy way to get the index of a matching nested document.
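As a rough sketch of the "specific indexes" route: if the document has already been fetched, the indexes can be looked up in plain JavaScript and turned into a dotted path. The helper name likesPath and the sample ids below are made up for illustration, and the snippet only builds the modifier, it does not talk to a database.

```javascript
// Hypothetical helper: find a category and an option by id in an already
// fetched document and return the dotted path to that option's likes array.
function likesPath(doc, categoryId, optionId) {
  const c = doc.categories.findIndex((cat) => cat.id === categoryId);
  if (c === -1) return null;
  const o = doc.categories[c].options.findIndex((opt) => opt.id === optionId);
  if (o === -1) return null;
  return `categories.${c}.options.${o}.likes`;
}

// Sample document in the shape used in the question (ids are invented).
const doc = {
  _id: "doc1",
  categories: [
    { id: "c1", name: "Foo", options: [{ id: "o1", name: "bar", likes: ["abc", "bce"] }] },
    {
      id: "c2",
      name: "Baz",
      options: [
        { id: "o2", name: "qux", likes: [] },
        { id: "o3", name: "quux", likes: ["abc"] },
      ],
    },
  ],
};

const path = likesPath(doc, "c2", "o3");
// Modifier that could then be passed to an update call, e.g.
// MyCollection.update({ _id: doc._id }, modifier)
const modifier = { $pull: { [path]: "abc" } };
console.log(path, JSON.stringify(modifier));
```

Keep in mind that the indexes can go stale if someone else adds or removes a category or option between the read and the update, which is one more reason to reconsider the schema.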
I would normally say that once your queries become that difficult it's probably a good idea to revisit the way you store data. Also with that many arrays to which you will be pushing data, Mongo will probably be relocating a lot of documents. This is definitely something that you want to minimize.
So at this point you will need to separate your data out into different documents and even collections.
Your first documents would look like this:
{
_id: "shtZFiTeHrPKyJ8vR",
description: "Some title",
categories: [{
id: "shtZFiTeHrPKyJ8vR",
name: "Foo",
options: ["shtZFiTeHrPKyJ8vR"]
}]
}
This way you can easily add/remove options as you mentioned in your question. You would then need a second collection with documents that represent each option.
{
_id: "shtZFiTeHrPKyJ8vR",
name: "bar",
likes: ["abc", "bce"]
}
You can learn more about references here. This is similar to what you mentioned in your comment. The benefit of this is that you are already reducing the potential amount of relocation. Depending on how you use your data you may even be reducing network usage.
Now doing updates on the likes is easy.
MyCollection.update({ _id: id}, {
$push: {likes: "value"}
});
This does, however, require you to make two queries to the db. Although on the flip side you do a lot less on the client side and a lot less bandwidth is used.
Another question to ask yourself is whether that depth of nesting is really needed. There might be an easier way to achieve your goal that doesn't become so complicated.

Related

Good DB-design to reference different collections in MongoDB

I regularly face the same problem of how to reference several different collections in the same property in MongoDB (or any other NoSQL database). Usually I use Meteor.js for my projects.
Let's take an example for a notes collection that includes some tagIds:
{
_id: "XXXXXXXXXXXXXXXXXXXXXXXX",
message: "This is an important message",
dateTime: "2018-03-01T00:00:00.000Z",
tagIds: [
"123456789012345678901234",
"abcdefabcdefabcdefabcdef"
]
}
So a certain id referenced in tagIds might either be a person, a product or even another note.
Of course the most obvious solution for this, in my opinion, is to save the type as well:
...
tagIds: [
{
type: "note",
id: "123456789012345678901234",
},
{
type: "person",
id: "abcdefabcdefabcdefabcdef",
}
]
...
Another solution I'm also thinking about is to use several fields for each collection, but I'm not sure if this has any other benefits (apart from the clear separation):
...
tagIdsNotes: ["123456789012345678901234"],
tagIdsPersons: ["abcdefabcdefabcdefabcdef"],
...
But somehow both solutions feel strange to me as they need a lot of extra information (it would be nice to have this information implicit) and so I wanted to ask, if this is the way to go, or if you know any other solution for this?
If you use Meteor Methods to pull this data, you have a chance to run some code, get from DB, run some mappings, pull again from DB etc and return a result. However, if you use pub/sub, things are different, you need to keep it really simple and light.
So, first question: method or pub/sub?
Your question is really more like: should I embed and how much to embed, or should I not embed and build relations (only keep an id of a tag in the message object) and later use aggregations or should I denormalize (duplicate data): http://highscalability.com/building-scalable-databases-denormalization-nosql-movement-and-digg
All these are ok in Mongo depending on your case: https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-3
The way I do this is to keep a tags Collection indexed by messageId and eventually date (for sorting). When you have a message, you get all tags by querying the Tags Collection rather than mapping over your tags in your message object and send 3 different queries to 3 different Collections (person, product, note).
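For comparison, here is roughly what the per-type lookup looks like if you keep typed references on the message instead. This is only a sketch: groupIdsByType is a made-up helper, the ids are placeholders, and the selectors are built but not sent to any collection.

```javascript
// Hypothetical helper: group typed references ({ type, id }) by type so that
// each target collection only needs to be queried once.
function groupIdsByType(tagRefs) {
  const grouped = {};
  for (const { type, id } of tagRefs) {
    if (!grouped[type]) grouped[type] = [];
    grouped[type].push(id);
  }
  return grouped;
}

const tagIds = [
  { type: "note", id: "123456789012345678901234" },
  { type: "person", id: "abcdefabcdefabcdefabcdef" },
  { type: "note", id: "aaaaaaaaaaaaaaaaaaaaaaaa" },
];

const grouped = groupIdsByType(tagIds);
// One selector per type; each would go to its own collection, e.g.
// Notes.find(selectors.note) and Persons.find(selectors.person)
const selectors = {};
for (const [type, ids] of Object.entries(grouped)) {
  selectors[type] = { _id: { $in: ids } };
}
console.log(JSON.stringify(selectors, null, 2));
```

That is still one query per referenced type, which is what a single Tags Collection indexed by messageId avoids.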
If you embed your tags data in the message object, let's say in your UX you want to show there are 3 tags and on click you get those 3 tags. You can basically pull those tags when you pulled the message (and might not need that data) or pull the tags on an action such as click. So, you might want to consider what data you need in your view and only pull that. You could keep an Integer as number of tags on the message object and save the tags in either a tags Collection or embed in your message object.
Following the principles of NoSQL it is ok and advisable to save some data multiple times in different collections to make your queries super fast.
So in a Tags Collection you could also save things related to your original objects. Let's say:
// Tags
{
...
messageId: 'xxx',
createdAt: Date,
person: {
firstName: 'John',
lastName: 'Smith',
userId: 'yyyy',
...etc
}
},
{
...
messageId: 'xxy',
createdAt: Date,
product: {
name: 'product_name',
productId: 'yyzz',
...etc
},
}

Which is a better MongoDB schema design?

In general, which is the better schema and why? I run into this same problem over and over again, and opinions online seem mixed on which is better.
The first schema has a single document for a location ID with a nested menu:
{
locationID: "xyz",
menu: [{item: "a", price: 1.0}, {item: "b", price: 2.0}...]
}
The second schema has multiple documents for a given location ID
{
locationID: "xyz",
item: "a",
price: 1.0
},
{
locationID: "xyz",
item: "b",
price: 2.0
}
The first schema seems like it's faster and doesn't duplicate the location ID so uses slightly less memory. The second schema seems slower since it has to gather the documents (perhaps it's indexed alphabetically though?), but it's so much easier to modify.
Is there a "firm" answer or guideline on this?
For the actual data you showed above, I would opt for the first design. The first design allows all menu items for a single location to be stored in, and retrieved from, a single Mongo document. As you pointed out, this would probably be faster than when using the second more normalized design.
As to when you might use the second version, consider the case where the menu item metadata is relatively large. For example, let's say that you wanted to store a 10KB image for each menu item. MongoDB has a limit of 16MB as the maximum size for a single BSON document. At 10KB per item that limit is reached at roughly 1,600 menu items, and sooner with larger images, so for locations with very large menus you might not be able to fit all menu items and their metadata into a single document. In such a case, the first option might be out and you would be forced to use the second option (or something else).
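A quick back-of-the-envelope check of that limit, as a sketch only: it assumes binary units (16MB = 16 * 1024 * 1024 bytes, 10KB = 10 * 1024 bytes) and ignores the bytes taken by field names, prices and BSON overhead, so the real ceiling is somewhat lower. The function name is made up.

```javascript
const MAX_BSON_BYTES = 16 * 1024 * 1024; // MongoDB's 16MB document size limit
const IMAGE_BYTES = 10 * 1024; // assumed 10KB image per menu item

// Upper bound on how many embedded items of a given size fit in one document,
// ignoring all other fields and BSON overhead.
function maxEmbeddedItems(perItemBytes, limitBytes = MAX_BSON_BYTES) {
  return Math.floor(limitBytes / perItemBytes);
}

const maxItems = maxEmbeddedItems(IMAGE_BYTES);
console.log(maxItems); // 1638
```

With 100KB images the same calculation gives only 163 items, so how soon the embedded design breaks depends almost entirely on the per-item payload.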

Mongodb: store all related data in one collection or abstract pieces of data from each other?

Schema:
articles: [
{
_id: uid,
owner: userId,
title: string,
text: text,
}
],
comments_1: [
{
// single comment
articleId: uid,
text: text,
user: {
name: string,
id: uid
}
}
],
comments_2: [
{
// all comments at once
articleId: uid,
comments: [
{
_id: commentId,
text: text,
user: {
name: string,
id: uid
}
}
],
}
],
I'm a bit confused by the MongoDB recommendations:
Say I need to retrieve information for an article page. I'll need to do 2 requests: first to find the article by id, and second to find its comments. If I included comments (comments_2) as a property of each article, I'd need only one query to get all the data I need, and if I needed to list, say, the titles of 20 articles, I'd perform a query specifying the properties to be retrieved, right?
Should I store comments and articles in different collections?
If comments are in a different store, should I store them the comments_1 way or the comments_2 way?
I'll avoid deep explanations, because I think the schema explains my point clearly. Briefly, I don't get whether it's better to store everything in one place and then specify the properties I want to retrieve while querying, or to split pieces of data into different collections.
In a relational database, this would be achieved with a JOIN. Apparently there is a NoSQL equivalent in MongoDB, called $lookup, available since version 3.2.
This allows you to keep comments and articles in separate collections, but still retrieve the list of comments for an article with a single query.
Stack Overflow Source
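A minimal sketch of what such a $lookup pipeline could look like for the articles / comments_1 layout from the question. The collection name "comments" and the function name are assumptions; the snippet only builds the pipeline, which you would then hand to aggregate().

```javascript
// Builds an aggregation pipeline that fetches one article together with its
// comments. Assumes comments live in a collection named "comments" and carry
// an articleId field, as in the comments_1 layout.
function articleWithCommentsPipeline(articleId) {
  return [
    { $match: { _id: articleId } },
    {
      $lookup: {
        from: "comments", // collection to join with (assumed name)
        localField: "_id", // field on the article
        foreignField: "articleId", // field on each comment
        as: "comments", // output array field on the article
      },
    },
  ];
}

const pipeline = articleWithCommentsPipeline("article-1");
// In the mongo shell this would be used as: db.articles.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```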
It's a typical trade-off you have to make. Both approaches have their own pros and cons and you have to choose what fits best for your use case. Couple of inputs:
Single table:
fast load single article, since you load all data in one query
no issues with loading titles of 20 articles (you can query only a subset of fields using projection)
Multiple table:
much easier to do perpendicular queries (e.g comments made by specific user, etc)
I would go with version 1, since it's simpler and version 2 won't give you any advantage
Well, MongoDB models are usually meant to hold data and relationships together, since MongoDB doesn't provide JOINs ($lookup is the nearest thing to a join and is costly, so it is best avoided).
That's why in DB modeling there is a huge emphasis on denormalization, since storing data together has two benefits:
You don't have to join collections, and you can get the data in a single query.
Since MongoDB updates a single document atomically, you can update comments and article in one go without worrying about transactions and rollback.
So almost certainly you would want to put comments inside the articles collection. It would look something like:
articles: [
{
_id: uid,
owner: userId,
title: string,
text: text,
comments: [
{
_id: commentId,
text: text,
user: {
name: string,
id: uid
}
}
]
}
]
Before we agree on it, let us look at the drawbacks of the above approach.
There is a limit of 16MB per document, which is huge, but if the text of your article is large and it also has a large number of comments, it may exceed 16MB.
Everywhere you fetch an article for other purposes you may have to exclude the comments field, otherwise the result would be heavy and slow.
If you have to aggregate on comments in one way or another, you might again run into memory limit issues.
These are serious problems that we cannot ignore, so let us consider keeping comments in a different collection and see what we lose.
First of all, comments and articles, though linked, are different entities, so you might never need to update them together.
Secondly, you would have to load comments separately, which makes sense in the normal use case; most applications work that way, so that too is not an issue.
So in my opinion the clear winner is having two separate collections:
articles: [
{
_id: uid,
owner: userId,
title: string,
text: text,
}
],
comments: [
{
// single comment
articleId: uid,
text: text,
user: {
name: string,
id: uid
}
}
]
You wouldn't want to go the comments_2 way if you choose the two-collection approach, for the same reason: a single article may have a huge number of comments.

Why does MongoDB not support queries of properties of embedded documents that are stored in hashed arrays?

Why does MongoDB not support queries of properties of embedded documents that are stored using hashes?
For example say you have a collection called "invoices" which was created like this:
db.invoices.insert(
[
{
productsBySku: {
12432: {
price: 49.99,
qty_in_stock: 4
},
54352: {
price: 29.99,
qty_in_stock: 5
}
}
},
{
productsBySku: {
42432: {
price: 69.99,
qty_in_stock: 0
},
53352: {
price: 19.99,
qty_in_stock: 5
}
}
}
]
);
With such a structure, MongoDB queries with $elemMatch, dot syntax, or the positional operator ($) fail to access any of the properties of each productsBySku member.
For example you can't do any of these:
db.invoices.find({"productsBySku.qty_in_stock":0});
db.invoices.find({"productsBySku.$.qty_in_stock":0});
db.invoices.find({"productsBySku.qty_in_stock":{$elemMatch:{$eq:0}}});
db.invoices.find({"productsBySku.$.qty_in_stock":{$elemMatch:{$eq:0}}});
To find out-of-stock products therefore you have to resort to using a $where query like:
db.invoices.find({
$where: function () {
for (var i in this.productsBySku)
if (!this.productsBySku[i].qty_in_stock)
return this;
}
});
On a technical level... why did they design MongoDB with this very severe limitation on queries? Surely there must be some kind of technical reason for this seemingly major flaw. Is this inability to deal with a list of objects as an array, ignoring the keys, just a limitation of JavaScript as a language? Or was this the result of some architectural decision within MongoDB?
Just curious.
As a rule of thumb: Usually, these problems aren't technical ones, but problems with data modeling. I have yet to find a use case where it makes sense to have keys hold semantic value.
If you had something like
'products':[
{sku:12432,price:49.99,qty_in_stock:4},
{sku:54352,price:29.99,qty_in_stock:5}
]
It would make a lot more sense.
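If you already have data in the keyed layout, reshaping it is a few lines of JavaScript. This is just a sketch: toProductArray is a made-up name, and it assumes the SKU keys are numeric so they can be turned back into numbers.

```javascript
// Turns { "<sku>": { ...fields } } into [{ sku, ...fields }].
// Assumes SKU keys are numeric strings (object keys are always strings).
function toProductArray(productsBySku) {
  return Object.entries(productsBySku).map(([sku, fields]) => ({
    sku: Number(sku),
    ...fields,
  }));
}

const invoice = {
  productsBySku: {
    42432: { price: 69.99, qty_in_stock: 0 },
    53352: { price: 19.99, qty_in_stock: 5 },
  },
};

const products = toProductArray(invoice.productsBySku);
// The same check MongoDB can do on the array form with
// db.invoices.find({ "products.qty_in_stock": 0 })
const hasOutOfStock = products.some((p) => p.qty_in_stock === 0);
console.log(JSON.stringify(products), hasOutOfStock);
```

Once the products are an array of embedded documents, plain dot notation reaches into every element and no $where is needed.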
But: you are modelling invoices. An invoice should – for many reasons – reflect a status at a certain point in time. The ever changing stock rarely belongs to an invoice. So here is how I would model the data for items and invoices
{
'_id':'12432',
'name':'SuperFoo',
'description':"Without SuperFoo, you can't bar or baz!",
'current_price':49.99
}
Same with the other items.
Now, the invoice would look quite simple:
{ _id:"Invoice2",
customerId:"987654",
date:ISODate("2014-07-07T12:42:00Z"),
delivery_address:"Foo Blvd 42, Appt 42, 424242 Bar, BAZ",
items:
[{id:'12432', qty: 2, price: 49.99},
{id:'54352', qty: 1, price: 29.99}
]
}
Now the invoice would hold things that may only be valid at a given point in time (prices and delivery address may change) and both your stock and the invoices are queried easily:
// How many items of 12432 are in stock?
db.products.find({_id:'12432'},{qty_in_stock:1})
// How many items of 12432 were sold during July and what was the average price?
db.invoices.aggregate([
{$unwind:"$items"},
{
$match:{
"items.id":"12432",
"date":{
$gt:ISODate("2014-07-01T00:00:00Z"),
$lt:ISODate("2014-08-01T00:00:00Z")
}
}
},
{$group : { _id:"$items.id", count: { $sum:"$items.qty" }, avg:{$avg:"$items.price"} } }
])
// How many items of each product did I sell yesterday?
db.invoices.aggregate([
{$match:{ date:{$gte:ISODate("2014-11-16T00:00:00Z"),$lt:ISODate("2014-11-17T00:00:00Z")}}},
{$unwind:"$items"},
{$group: { _id:"$items.id",count:{$sum:"$items.qty"}}}
])
Combined with the query on how many items of each product you have in stock, you can find out whether you have to order something (you have to do that calculation in your code, there is no easy way to do this in MongoDB).
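The calculation in code could be as small as the sketch below. The reorder rule (flag a product when the stock left is lower than what sold yesterday) is only an assumed example, as are the function name and the sample numbers; the input shapes follow the aggregation result and a products query returning qty_in_stock.

```javascript
// soldYesterday: [{ _id: productId, count }] as produced by the $group stage.
// stockLevels:   [{ _id: productId, qty_in_stock }] from the products collection.
// Assumed rule: reorder when remaining stock is lower than yesterday's sales.
function productsToReorder(soldYesterday, stockLevels) {
  const stockById = new Map(stockLevels.map((p) => [p._id, p.qty_in_stock]));
  return soldYesterday
    .filter(({ _id, count }) => (stockById.get(_id) ?? 0) < count)
    .map(({ _id }) => _id);
}

const soldYesterday = [
  { _id: "12432", count: 5 },
  { _id: "54352", count: 1 },
];
const stockLevels = [
  { _id: "12432", qty_in_stock: 4 },
  { _id: "54352", qty_in_stock: 5 },
];

const toReorder = productsToReorder(soldYesterday, stockLevels);
console.log(toReorder); // [ '12432' ]
```

A product with no stock record is treated as having zero in stock, so it is always flagged.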
You see, with a "small" change, you get a lot of questions answered.
And that's basically how it works. With relational data, you model your data so that the entities are reflected properly and then you ask
How do I get my answers out of this data?
In NoSQL in general and especially with MongoDB you first ask
Which questions do I need to get answered?
and model your data accordingly. A subtle, but important difference.
If I am honest I am not sure, you would have to ask MongoDB Inc. (10gen) themselves. I will attempt to explain some of my reasoning.
I have searched on Google a little and nothing seems to appear: https://www.google.co.uk/search?q=mognodb+jira+support+querying+objects&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a&channel=fflb&gfe_rd=cr&ei=as9pVOW3OMyq8wfhtYCgCw#rls=org.mozilla:en-GB:official&channel=fflb&q=mongodb+jira+querying+objects
It is quick to see how using object properties as keys could be advantageous. For example, remove queries would not have to search every object and its properties within the array, but could instead just find the single property in the parent object and unset it. Essentially it would be the difference between:
[
{id:1, d:3, e:54},
{id:3, t:6, b:56}
]
and:
{
1: {d:3, e:54},
3: {t:6, b:56}
}
with the latter, obviously, being a lot quicker to delete an id of 3.
Not only that, but all array operations that MongoDB introduces, from $elemMatch to $unwind, would work with objects as well. I mean, how is unwinding:
[
{id:5, d:4}
]
much different to unwinding:
{
5: {d:4}
}
?
So, if I am honest, I cannot answer your question. There is no defense on Google as to their decision and there is no extensive talk from what I can find.
In fact I searched for this a couple of times, including: https://www.google.co.uk/search?q=array+operations+that+do+not+work+on+objects&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a&channel=fflb&gfe_rd=cr&ei=DtNpVLrwDsPo7AaH4oCoDw and I found results that went as far as underscore.js, which actually applies its array functions to all objects as well.
The only real reason, I can think of, is standardisation. Instead of catering to all minority groups etc on how subdocuments may work they just cater to a single minority turned majority by their choice.
It is one of the points about MongoDB which does confuse me even now, since there are many times within my own programming where it seems advantageous for speed and power to actually use objects instead of arrays.

In MongoDB, when to use a simple subdocument, when an array with 2-field elements?

Background
I am storing table rows as MongoDb documents, with each column having a name. Let's say table has these columns of interest: Identifier, Person, Date, Count. The MongoDb document also has some extra fields separate from the table data, represented by timestamp. Columns are not fixed (which is why I use schema-free database to store them in the first place).
There will be a need to do various complex, but so far unspecified, queries. I am not very concerned about performance, though query performance may conceivably become a bottleneck. Once inserted, documents will not be modified (a new document with the same Identifier will be created instead), and insertions are not very frequent (let's say, 1000 new MongoDB documents per day). So the amount of data will grow steadily over time.
Example
The straight-forward approach is having a collection of MongoDb documents like:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: {
Identifier: "AB002",
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
}
Now I have seen an alternative approach (for example in accepted answer of this question), using array with two fields per object:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: [
{ field: "Identifier", value: "AB002" },
{ field: "Person", value: "John001" },
{ field: "Date", value: ISODate("2013-11-16T21:26:17Z") },
{ field: "Count", value: 1 }
]
}
Questions
Does the 2nd approach make any sense at all?
If yes, then how do I choose which to use? In particular, are there specific kinds of queries which are easy/cheap with one approach and hard/costly with the other? Any "rules of thumb" on which way to go, or pro-con lists for both? Real-life examples of one approach being inconvenient would be especially valuable.
In your specific example the First version is a lot more appropriate and simple. You have to think in terms of how you would query your document.
It is a lot simpler to query your database like this: db.collection.find({"data.Identifier": "AB002"})
Although I'm not 100% sure why you even need the inner document. Why can't you structure your document like this:
{
_id: "AB002",
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
Pros of first example:
Simple to query
Enforces unique keys, but your data won't have two columns with the same name anyway
I would assume MongoDB would generate better query plans because the structure is much simpler (haven't tested)
Pros of second example:
Allows multiple entries with the same key/field, but I don't feel that is useful in your case
A single index on the array can be used for all of its entries regardless of their field name
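To make the comparison concrete, here is a sketch that converts between the two layouts and shows the query shape each one needs. The helper names are made up, and the query objects are only built, not run; the array form uses $elemMatch so that field and value are matched on the same element.

```javascript
// Subdocument -> array of { field, value } pairs.
function toFieldValueArray(data) {
  return Object.entries(data).map(([field, value]) => ({ field, value }));
}

// Array of { field, value } pairs -> subdocument.
// If a field name occurs twice, the last entry wins.
function toSubdocument(pairs) {
  return Object.fromEntries(pairs.map(({ field, value }) => [field, value]));
}

const data = { Identifier: "AB002", Person: "John002", Count: 1 };
const pairs = toFieldValueArray(data);

// Query for the first layout: plain dot notation.
const subdocumentQuery = { "data.Identifier": "AB002" };
// Query for the second layout: both conditions must hold for one array element.
const arrayQuery = { data: { $elemMatch: { field: "Identifier", value: "AB002" } } };

console.log(JSON.stringify(pairs));
console.log(JSON.stringify(subdocumentQuery), JSON.stringify(arrayQuery));
```

Every condition on the array layout needs its own $elemMatch, which is where the "simple to query" advantage of the first layout comes from.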
I don't think that the situation in the other example and yours are the same. In the other example, they're creating a list of items with one of two answers, which is more appropriately stored in an array, and the goal is to return a list of subdocuments that match the criteria. In your example, you're really just describing an object, since the fields all hold different types of information, and you won't need to retrieve searchable bits of the subdocuments.