How to deal with Many-to-Many relations in MongoDB when Embedding is not the answer? - mongodb

Here's the deal. Let's suppose we have the following data schema in MongoDB:
items: a collection with large documents that hold some data (it's absolutely irrelevant what it actually is).
item_groups: a collection with documents that contain a list of items._id called item_groups.items plus some extra data.
So, these two are tied together with a Many-to-Many relationship. But there's one tricky thing: for a certain reason I cannot store items within item groups, so -- just as the title says -- embedding is not the answer.
The query I'm really worried about is intended to find some particular groups that contain some particular items (i.e. I've got a set of criteria for each collection). In fact it also has to say how much items within each found group fitted the criteria (no items means group is not found).
The only viable solution I came up with this far is to use a Map/Reduce approach with a dummy reduce function:
function map () {
// imagine that item_criteria came from the scope.
// it's a mongodb query object.
item_criteria._id = {$in: this.items};
var group_size = db.items.count(item_criteria);
// this group holds no relevant items, skip it
if (group_size == 0) return;
var key = this._id.str;
var value = {size: group_size, ...};
emit(key, value);
}
function reduce (key, values) {
// since the map function emits each group just once,
// values will always be a list with length=1
return values[0];
}
db.runCommand({
mapreduce: item_groups,
map: map,
reduce: reduce,
query: item_groups_criteria,
scope: {item_criteria: item_criteria},
});
The problem line is:
item_criteria._id = {$in: this.items};
What if this.items.length == 5000 or even more? My RDBMS background cries out loud:
SELECT ... FROM ... WHERE whatever_id IN (over 9000 comma-separated IDs)
is definitely not a good way to go.
Thank you sooo much for your time, guys!
I hope the best answer will be something like "you're stupid, stop thinking in RDBMS style, use $its_a_kind_of_magicSphere from the latest release of MongoDB" :)

I think you are struggling with the separation of domain/object modeling from database schema modeling. I too struggled with this when trying out MongoDb.
For the sake of semantics and clarity, I'm going to substitute Groups with the word Categories
Essentially your theoretical model is a "many to many" relationship in that each Item can belong Categories, and each Category can then possess many Items.
This is best handled in your domain object modeling, not in DB schema, especially when implementing a document database (NoSQL). In your MongoDb schema you "fake" a "many to many" relationship, by using a combination of top-level document models, and embedding.
Embedding is hard to swallow for folks coming from SQL persistence back-ends, but it is an essential part of the answer. The trick is deciding whether or not it is shallow or deep, one-way or two-way, etc.
Top Level Document Models
Because your Category documents contain some data of their own and are heavily referenced by a vast number of Items, I agree with you that fully embedding them inside each Item is unwise.
Instead, treat both Item and Category objects as top-level documents. Ensure that your MongoDb schema allots a table for each one so that each document has its own ObjectId.
The next step is to decide where and how much to embed... there is no right answer as it all depends on how you use it and what your scaling ambitions are...
Embedding Decisions
1. Items
At minimum, your Item objects should have a collection property for its categories. At the very least this collection should contain the ObjectId for each Category.
My suggestion would be to add to this collection, the data you use when interacting with the Item most often...
For example, if I want to list a bunch of items on my web page in a grid, and show the names of the categories they are part of. It is obvious that I don't need to know everything about the Category, but if I only have the ObjectId embedded, a second query would be necessary to get any detail about it at all.
Instead what would make most sense is to embed the Category's Name property in the collection along with the ObjectId, so that pulling back an Item can now display its category names without another query.
The biggest thing to remember is that the key/value objects embedded in your Item that "represent" a Category do not have to match the real Category document model... It is not OOP or relational database modeling.
2. Categories
In reverse you might choose to leave embedding one-way, and not have any Item info in your Category documents... or you might choose to add a collection for Item data much like above (ObjectId, or ObjectId + Name)...
In this direction, I would personally lean toward having nothing embedded... more than likely if I want Item information for my category, i want lots of it, more than just a name... and deep-embedding a top-level document (Item) makes no sense. I would simply resign myself to querying the database for an Items collection where each one possesed the ObjectId of my Category in its collection of Categories.
Phew... confusing for sure. The point is, you will have some data duplication and you will have to tweak your models to your usage for best performance. The good news is that that is what MongoDb and other document databases are good at...

Why don't use the opposite design ?
You are storing items and item_groups. If your first idea to store items in item_group entries then maybe the opposite is not a bad idea :-)
Let me explain:
in each item you store the groups it belongs to. (You are in NOSql, data duplication is ok!)
for example, let's say you store in item entries a list called groups and your items look like :
{ _id : ....
, name : ....
, groups : [ ObjectId(...), ObjectId(...),ObjectId(...)]
}
Then the idea of map reduce takes a lot of power :
map = function() {
this.groups.forEach( function(groupKey) {
emit(groupKey, new Array(this))
}
}
reduce = function(key,values) {
return Array.concat(values);
}
db.runCommand({
mapreduce : items,
map : map,
reduce : reduce,
query : {_id : {$in : [...,....,.....] }}//put here you item ids
})
You can add some parameters (finalize for instance to modify the output of the map reduce) but this might help you.
Of course you need to have another collection where you store the details of item_groups if you need to have it but in some case (if this informations about item_groups doe not exist, or don't change, or you don't care that you don't have the most updated version of it) you don't need them at all !
Does that give you a hint about a solution to your problem ?

Related

How to design product category and products models in mongodb?

I am new to mongo db database design,
I am currently designing a restaurant products system, and my design is similar to a simple eCommerce database design where each productcategory has products so in a relational database system this will be a one (productcategory) to many products.
I have done some research I do understand that in document databses.
Denationalization is acceptable and it results in faster database reads.
therefore in a nosql document based database I could design my models this way
//product
{
name:'xxxx',
price:xxxxx,
productcategory:
{
productcategoryName:'xxxxx'
}
}
my question is this, instead of embedding the category inside of each product, why dont we embed the products inside productcategory, then we can have all the products once we query the category, resulting in this model.
//ProductCategory
{
name:'categoryName',
//array or products
products:[
{
name:'xxxx',
price:xxxxx
},
{
name:'xxxx',
price:xxxxx
}
]
}
I have researched on this issue on this page http://www.slideshare.net/VishwasBhagath/product-catalog-using-mongodb and here http://www.stackoverflow.com/questions/20090643/product-category-management-in-mongodb-and-mysql both examples I have found use the first model I described (i.e they embed the productCategory inside product rather than embed array of products inside productCategory), I do not understand why, please explain. thanks
For the DB design, you have to consider the cardinality (one-to-few, one-to-many & one-to-gazillions) and also your data access patterns (your frequent queries, updates) while designing your DB schema to make sure that you get optimum performance for your operations.
From your scenario, it looks like each Product has a category, however it also looks like you need to query to find out Products for each category.
So, in this case, you could do with something like :
Product = { _id = "productId1", name : "SomeProduct", price : "10", category : ObjectId("111") }
ProductCategory = { _id = ObjectId("111"), name : "productCat1", products : ["productId1", productId2", productId3"]}
As i said about data access patterns, if you always read the Category-name and "Category-name" is something which is very infrequently updated then you can go for denormalizing with this two-way referencing by adding the product-category-name in the product:
Product = { _id = "productId1", name : "SomeProduct", price : "10", category : { ObjectId("111"), name:"productCat1" }
So, with embedding the documents, specific queries would be faster if no joins would be required, however other queries in which you need to access embedded details as stand-alone entities would be difficult.
This is a link from MongoDB which explains DB design for one-to-many scenario like you have with very nice examples in which you would realize that there are much more ways of doing it and many other things to think about.
http://blog.mongodb.org/post/88473035333/6-rules-of-thumb-for-mongodb-schema-design-part-3 (also has links for parts 1 & 2)
This link also describes the pros & cons for each scenario enabling you to reach a narrow down on a DB schema design.
It all depends on what queries you have in mind.
Suppose a product can only belong to one category, and one category applies to many products. Then, if you expect to retrieve the category together with the product, it makes sense to store it directly:
// products
{_id:"aabbcc", name:"foo", category:"bar"}
and if you expect to query all the products in a given category then it makes sense to create a separate collection
// categories
{_id:"bar", products=["aabbcc"]}
Remember that you cannot atomically update both the products and categories database (MongoDB is eventually consistent), but you can occasionally run a batch job which will make sure all categories are up-to-date.
I would recommend to think in terms of what kind of information will I often need, as opposed to how to normalize/denormalize this data, and make your collections reflect what you actually want.

MongoDB schema design for unbounded growing table

I'm practicing on MongoDB through a small personal project,
in which, may encounter a need to store some intermediate data abstracted as a unbounded growing table. Both rows and columns would grow boundlessly.
The usage of this abstract table is that I want to be able to
know the corresponding column for each entry in a row
know the corresponding row for each entry in a column
Or, in other word, know the index of each table entry
Hence there comes up two choices to model the table:
Make two collections:
one holds each row as a document which embeds a growing structure as row entries to have reference to the corresponding columns;
and similarly, another collection holds each column as a document embedding a growing structure to reference to the corresponding rows.
Make a single separate collection that holds each table entry as a document. Hence each document size is fixed.
The first model has problem with document growth (In fact, in my application, the table grows a bit askew, and only one collection would encounter document growth issue). The second model seems fine to me. Is there some pitfall or some other issue that should be aware of? And what is the common practice to deal with such problem?
UPDATE: explain things in more detail
I am trying to do automatic summarization of an ongoing conversation. The input is a corpus of sentences, and terms are extracted from each sentences. For example, English terms are stemmed, and sentences in CJK languages are segmented. Hence obtained a term-sentence matrix. Then one of the method needs to compute (sparse) SVD of such term-sentence matrix.
The sentences and extracted terms would be stored into the database. But the term-sentence matrix would grow unbounded.
(Or one can think of the problem of storing a mapping between tweets and hashtags)
There were two choices of draft schema that comes up to my mind:
choice one (hold two-way linkages between sentences and terms)
{ // sentence collection doc
"_id" : // generated by timestamp
, "text" : //
, "contained_terms" : [
// an array of "_id"s in term collection
]
}
{ // term collection doc
"_id" : // use term name
, "in_sentences" : [
// an array of "_id"s in sentence collection
]
}
choice two (make linkages into a separate collection)
{ // linkage collection doc (as matrix entries)
"_id" : // generated by timestamp
, "term" : // an "_id" in term collection
, "in_sentence" : // an "_id" in sentence collection
}
{ // sentence collection doc
"_id" : // generated by time stamp
, "text" : //
}
{ // term collection doc
"_id" : // use term name
}
The choice one encounters document growth problem because "in sentences" array of a term collection doc is very likely to grow beyond limit when sentences come in nonstop.
The choice two extract the linkage between terms and sentences into a separate collection, hence avoids the document growth. Although querying "which sentences contain the term" costs more, but in the end, it seems I don't actually need such operation much.
Currently, I'm thinking that the choice two better suit my needs. The linkage collection seems conform to the input of sparse SVD. To speed up computation, very high frequency terms can be filtered out if the term frequency field is added to each term collection docs (or in a separate collection when there are more than one conversations). This filtering seems fine in the case of automatic summarization.
But still wonder
Is there some issues or pitfalls that should be aware of?
What is the common practice for similar situation?
My understanding of mongodb is that you need to design your schema around your queries. So how you save your data is highly dependent on what data will you be querying. So even for the same set of data, your schema can vary depending on the actual use case. Additionally, data redundancy is quite common in NoSql database design. In case you are going to need some data again and again, there is no point in saving it in a separate collection. You can duplicate it in 2 collections, and that's a fair enough cost for faster querying. Memory is cheap, processing isn't! Additionally, pre-aggregation helps in case of mongo for huge data sets. Your queries will work fine for decent number of documents, but once you go into the realm of millions of records, you may face problems with a certain class of queries like counts, aggregation, etc. Pre-aggregation helps in keeping things real time, though it may have a higher write/insertion overhead. Always avoid a full table scan, whenever you can.
Above are some broad level concepts that I find relevant to your question. I'll try and explain it in your context with some examples (as I am not sure what data you are eventually going to need, or the queries you will do).
Let's say you are going to need terms per sentence frequently, to highlight them. In that case the recommended schema will be:
{ "_id" : // sentence id - you will query on this
, "text" : // sentence text
, "terms" : ["term1", "term2", "term3"]
}
So for each new sentence, you extract all the terms and save it (not the id) along with the sentence. The advantage here being that you will not need to query for the term separately. You can get all the terms for a given sentence in a single query. Additionally, the document size doesn't grow, and hence no document relocation.
Let's say you also want to have a unique list of terms and some per term meta data. You can have a separate terms collection which has a list of all the unique terms:
{ "_id": ,
, "term": //term
, "meaning":
, "metadata""
, "count": 1
}
You can have a unique index on term. Each time you extract terms from a sentence, you look up for it in this collection, and in case you don't find it, you insert it. Now let's say you also want to maintain a count of term appearance. So each time you find a term in a sentence and do a lookup in terms collection, you can increment (atomic) the count as well - pre-aggregation. If you add an index on count, you can get the top 100 terms, etc. easily on the fly.
Now let's say you want to query/count all the sentences with a given term. You can add an index on terms array and directly look up for all the sentences with a given term:
Sentence.where(:term => "term1").count //mongoid query
Again, you are achieving this with a single query, as opposed to getting a term id first in your case, and then the sentences.
Other than this it's always advisable to ensure that your working set and indexes fit into RAM for best performance.
So again, there are no right and wrong answers for schema design and it definitely depends on the queries you will be doing. I would also advise you to unlearn some of your relational DB concepts when trying to design for NoSQL databases. I learned it the hard way =) Hope some of this helps you in coming up with an efficient schema for your use case.
If you are trying to model a matrix with the whole collection representing the matrix, I think the go-to model should be to have each entry (row i, column j) as a document. If you put in a field like "index" : { "row" : i, "column" : j} and appropriate indices then it's easy and fast to do fun things like
get the entry at (i, j)
get row i
get column j
The matrix is represented sparsely so if row i only has 10 columns with values, row i is just 10 documents. If the rows/columns really do grow unboundedly to very large sizes then modeling a document as a row or column or something of "1 dimension" could hit the hard 16MB BSON document size limit.
I'm thinking the biggest drawback could be large index sizes given that every entry is its own document.

mongodb - one document in many collections

I'am looking for a good solution on MongoDB for this problem:
There are some Category's and every Category has X items.
But some items can be in in "many" Category's!
I was looking for something like a symbolic link on Unix systems but I could't not find it.
What i thought is a good idea is:
"Category1/item1" is the Object and "category2/item44232" is only a reference to "item1" so when i change "item1" it also changes "item44232".
I looked into the MongoDB Data models documentation but there is no real solution for this.
Thank you for your response !
In RDBMSs, you use a join table to represent one-to-many relationships; in MongoDB, you use array keys. For example each product contains an array of category IDs, and both products and categories get their own collections. If you have two simple category documents
{ _id: ObjectId("4d6574baa6b804ea563c132a"),
title: "Epiphytes"
}
{ _id: ObjectId("4d6574baa6b804ea563c459d"),
title: "Greenhouse flowers"
}
then a product belonging to both categories will look like this:
{ _id: ObjectId("4d6574baa6b804ea563ca982"),
name: "Dragon Orchid",
category_ids: [ ObjectId("4d6574baa6b804ea563c132a"),
ObjectId("4d6574baa6b804ea563c459d") ]
}
for more: http://docs.mongodb.org/manual/reference/database-references/
Try looking at the problem inside-out: Instead of having items inside categories, have the items list the categories they belong into.
You'll be able to easily find all items that belong to a category (or even multiple categories), and there is no duplication nor any need to keep many instances of the same item updated.
This can be very fast and efficient, especially if you index the list of categories. Check out multikey indexes.

How to annotate the result documents with a count of child documents?

I have a collection of categories, each category document containing a link to its parent (except the root categories). Pretty simple so far.
I want to list the categories, and add a subcategory_count field to every document with the count of direct descendants.
How should I go about doing this? Could Map/Reduce be of use?
There are no "calculated columns" in MongoDB, so you can't select data and count subdocuments at the same time.
This is also the reason why most people store array length along with the array.
{friends_list: [1, 3, 234, 555],
friends_count: 4}
This helps for easier retrieval, filtering, sorting, etc. But it requires a little bit more of manual work.
So, you are basically limited to these options:
Store everything in one document.
Store subcategory count in the category.
Count subcategories on the client-side.
find() to get all of them, count the number of subcategories on each level and then update().
But it seems like your domain object should be doing this for you, so you end up with one category object that can contain categories (which could also contain categories...), and hence one mongodb document (you're listing all of them anyway, so it makes sense to retrive the whole thing in one query).

MongoDB: Speed of field ("inside record") search in comporation with speed of search in "global scope"

My question may be not very good formulated because I haven't worked with MongoDB yet, so I'd want to know one thing.
I have an object (record/document/anything else) in my database - in global scope.
And have a really huge array of other objects in this object.
So, what about speed of search in global scope vs search "inside" object? Is it possible to index all "inner" records?
Thanks beforehand.
So, like this
users: {
..
user_maria:
{
age: "18",
best_comments :
{
goodnight:"23rr",
sleeptired:"dsf3"
..
}
}
user_ben:
{
age: "18",
best_comments :
{
one:"23rr",
two:"dsf3"
..
}
}
So, how can I make it fast to find user_maria->best_comments->goodnight (index context of collections "best_comment") ?
First of all, your example schema is very questionable. If you want to embed comments (which is a big if), you'd want to store them in an array for appropriate indexing. Also, post your schema in JSON format so we don't have to parse the whole name/value thing :
db.users {
name:"maria",
age: 18,
best_comments: [
{
title: "goodnight",
comment: "23rr"
},
{
title: "sleeptired",
comment: "dsf3"
}
]
}
With that schema in mind you can put an index on name and best_comments.title for example like so :
db.users.ensureIndex({name:1, 'best_comments.title:1})
Then, when you want the query you mentioned, simply do
db.users.find({name:"maria", 'best_comments.title':"first"})
And the database will hit the index and will return this document very fast.
Now, all that said. Your schema is very questionable. You mention you want to query specific comments but that requires either comments being in a seperate collection or you filtering the comments array app-side. Additionally having huge, ever growing embedded arrays in documents can become a problem. Documents have a 16mb limit and if document increase in size all the time mongo will have to continuously move them on disk.
My advice :
Put comments in a seperate collection
Either do document per comment or make comment bucket documents (say,
100 comments per document)
Read up on Mongo/NoSQL schema design. You always query for root documents so if you end up needing a small part of a large embedded structure you need to reexamine your schema or you'll be pumping huge documents over the connection and require app-side filtering.
I'm not sure I understand your question but it sounds like you have one record with many attributes.
record = {'attr1':1, 'attr2':2, etc.}
You can create an index on any single attribute or any combination of attributes. Also, you can create any number of indices on a single collection (MongoDB collection == MySQL table), whether or not each record in the collection has the attributes being indexed on.
edit: I don't know what you mean by 'global scope' within MongoDB. To insert any data, you must define a database and collection to insert that data into.
Database 'Example':
Collection 'table1':
records: {a:1,b:1,c:1}
{a:1,b:2,d:1}
{a:1,c:1,d:1}
indices:
ensureIndex({a:ascending, d:ascending}) <- this will index on a, then by d; the fact that record 1 doesn't have an attribute 'd' doesn't matter, and this will increase query performance
edit 2:
Well first of all, in your table here, you are assigning multiple values to the attribute "name" and "value". MongoDB will ignore/overwrite the original instantiations of them, so only the final ones will be included in the collection.
I think you need to reconsider your schema here. You're trying to use it as a series of key value pairs, and it is not specifically suited for this (if you really want key value pairs, check out Redis).
Check out: http://www.jonathanhui.com/mongodb-query