I'm trying to design a schema for Products, Suppliers and Manufacturers:
A product can have many suppliers but only 1 manufacturer.
A supplier can have many products and many manufacturers.
A manufacturer can have many suppliers and many products.
I've reviewed this page where 10gen indicates "To avoid mutable, growing arrays, store the publisher reference inside the book document". In my example I consider the product-->manufacturer relationship to be equivalent to that of book-->publisher. I would therefore do this:
{
_id: "widgets",
manufacturer_name: "Widgets Inc.",
founded: 1980,
location: "CA"
}
{
_id: 123456789,
product_name: "Steel Widget",
color: "Steel",
manufacturer_id: "widgets"
}
{
_id: 223456789,
product_name: "White Widget",
color: "White",
manufacturer_id: "widgets"
}
What is the best way to handle the SUPPLIER (with relationships to many products and many manufacturers) such that I "avoid mutable, growing arrays"??
Note: This is one way to model it. Data modeling has a lot to do with the use cases and the according questions you want to ask. Your use case might need a different model.
I'd probably model it like this
manufacturer
{
_id:"ACME",
name: "ACME Corporation"
…
}
products
{
_id:ObjectId(...),
manufacturer: "ACME",
name: "SuperFoo",
description: "Without SuperFoo, you can't bar or baz!",
…
}
Now comes the problem. Since potentially, if we embed all the products in a supplier document or vice versa, we could break the 16MB size limit easily, I'd use a different approach:
Supplier:
{
_id:ObjectId(...),
"name": "FooMart",
"location: { city:"Cologne",state:"MN",country:"US",geo:[44.770833,-93.783056]}
}
productSuppliers:
{
_id:ObjectId(...),
product:ObjectId(...),
supplier:ObjectId(...)
}
This way, each product can have gazillions of suppliers. Now here come the drawbacks.
For finding a supplier for a certain product, you have to do a two step query. First, you have to find all supplier IDs for a given product:
db.productSuppliers.find({product:<some ObjectId>},{_id:0,supplier:1})
Now, let's say we want to find all suppliers of SuperFoo near Cologne,MN at a max distance of 10 miles:
db.suppliers.find({
_id:{$in:[<ObjectIds of suppliers from the first query>]},
"location.geo": { $near :
{
$geometry: { type: "Point", coordinates: [ 44.770833, -93.783056] },
$maxDistance: 16093
}
}
})
You need to do quite some smart indexing to make these queries efficient. Which indices you need depends on your use case. The problem with indices is that they are only efficient when kept in RAM. So you really should be careful when creating your indices.
Again: The way you model data heavily depends on your use cases. In case you only have a few products per manufacturer or only a few suppliers per product or if each supplier only supplies products by a certain manufacturer, the model may look quite different.
Please have a close look at MongoDB's data modeling documentation. Most problems with MongoDB stem from wrong data modeling for the respective use case.
Related
Basically we have four tables that are joined: Malls, Stores, Brands, and Categories.
Malls can have many Stores.
Each Store is linked to a Brand.
Each Brand has many Categories.
e.g. Mall A has 2 "McDonald's Cafes", each belonging to the Brand "McDonald's". "McDonald's" Brand has "Fast Food" and "Breakfast" as Categories.
We're trying to figure out how to display all the Categories that exist within Mall A.
e.g. Mall A has "Fast Food" and "Breakfast" categories, based on the "McDonald's" stores.
Ideally this would be a query so that the Categories are updated automatically.
We've tried querying for all Stores within a Mall, and then finding the Categories via the Store-Brand join, then reducing duplicates. But some Malls have more than 700 Stores, so this process is quite expensive in terms of querying and processing the data.
I managed to figure it out! Using Sequelize, my query was as follows:
postgres.BrandsCategories.findAndCountAll({
limit,
offset,
include: [
{
model: postgres.Brands,
as: "brands",
include: [
{
model: postgres.Stores,
as: "stores",
where: {
mallId: parent.dataValues.id
}
}
]
}
]
})
I am designing some data model for mango db, I have some requirement similar to below json.
Single_Collection.
{
"collegeid": 1234,
"Name": "aaaa",
"otherinfo": 1,
"studnet":[
{
"stdid": 1,
"name": "n1"
},
{
"stdid": 2,
"name": "n2"
}
]
}
Two Collections.
College Info
{
"collegeid": 1234,
"Name": "aaaa",
"otherinfo": 1
}
Student Info collection
{
"collegeid": 1234,
"stdid": 1,
"name": "n1"
}
{
"collegeid": 1234,
"stdid": 2,
"name": "n2"
}
Which is the better way interms of reading performance(Having single collection or separating it out), I have more read like given student ID find out the college ID .
Student ID list will be very big.
Also I perform more student insertion operations
IMO,each model design has its own Pros & Cons, what we say "better way" is depending on your use cases(How you query the data? Do you need all the data at the beginning? Do you need paging? etc...)
Let's start from your requirements.
Your requirements
Given a college ID, find out the students in this college.
Given student ID, find out his college ID.
Relation between objects
Obviously college & student is 1:m mapping, because a lot of students in one college but each student can stay in one college only.
I will show you some different model designs and also provide the Pros & Cons for each model.
Approach 1 - Embed students into college
This is the design you mentioned as a single collection.
{
"collegeid":1234,
"Name":"aaaa",
"otherinfo":1,
"studnet":[
{
"stdid":1,
"name":"n1"
},
{
"stdid":2,
"name":"n2"
}
]
}
Pros:
Very natural model for human to read & front-end to display.
Good performance when loading the college and all the students in it. Because the data stored in the engine is in continuous. The engine needs fewer I/O to do that.
Cons:
If you have huge number of students in a college, the size of the document will be very big. It will be inefficient if you add/remove/update student frequently.
There is not a quick way to achieve requirement(2). Since we only maintain the mapping from college -> students, you have to walk through all the documents to find out which college contains the specified studentID.
Approach 2 - Student has reference to college
This is the design you mentioned as a Two Collections. It is similar to the RDBMS tables, student model owns a reference key point to its college.
College collection:
{
"collegeid":1234,
"Name":"aaaa",
"otherinfo":1
}
Student collection:
{
"collegeid":1234,
"stdid":1,
"name":"n1"
}
{
"collegeid":1234,
"stdid":2,
"name":"n2"
}
Pros:
Can achieve requirement(1) and (2). Remember to add index on "collegeid" and "stdid" field.
Every document can be maintained in small size, it is easy for the engine to store the data.
Cons:
College and students are separated. It will be slower than Approach 1 if loading a college and all its students(Need two queries).
You need to merge college and students together by yourself before displaying in UI.
Approach 3 - Duplicated data in college and students
This approach looks like we mix up approach 1 and approach 2. We have two collections: college will have its students embeded in itself, and also a separated student collection. So, the student data is duplicated in both collections.
College collection:
{
"collegeid":1234,
"Name":"aaaa",
"otherinfo":1,
"studnet":[ // duplicated here!
{
"stdid":1,
"name":"n1"
},
{
"stdid":2,
"name":"n2"
}
]
}
Student collection:
{
"collegeid":1234,
"stdid":1,
"name":"n1"
}
{
"collegeid":1234,
"stdid":2,
"name":"n2"
}
Pros:
You have all Pros from Approach 1 & Approach 2.
Cons:
The document in college collection will grow to very big.
You have to keep the data from college collection and student collection Synchronous by yourself.
Approach 4 - Duplicated data in college(Only studentID) and students
This is a variant from Approach 3.
We assume that your use case is:
User can search for a college.
User click one college in search result.
A new UI show all student IDs to user(maybe in a grid or list).
User click one student ID.
System loads the full data of the specified student and show to user in another UI.
In a short word, user does not need the full data of all students at the beginning, he just need students' basic info(e.g. student ID). If user accepts such scenario, you can have below model:
College collection:
{
"collegeid":1234,
"Name":"aaaa",
"otherinfo":1,
"studnetIds":[1, 2] // only student IDs are duplicated
}
Student collection:
{
"collegeid":1234,
"stdid":1,
"name":"n1"
}
{
"collegeid":1234,
"stdid":2,
"name":"n2"
}
College only has the studnet IDs in it. This is the difference compared to Approach 3.
Pros:
Can achieve requirement(1) and (2).
You don't need to worry about the college document grows to huge size. Since it only owns the student IDs.
If user accepts above scenario, this will be a better desgin on the balance of performance/complex/data size.
Cons:
Suitable to specified use case, if the requirement is changed in the future, will break the scenario and this model will be not good.
Summary
You should be very clear to your use cases.
Based on use cases, compare the approachs to see whether you can accept the Pros & Cons.
Load testing is important!
I come from SQL relation world, and have a question that involves mongodb schema design.
The reality that i need to represent contains:
Users, and monthly reports (with multiple daily reports).
I want to figure out if in mongodb, is better to embedded reports objects into the Users collection, or to have 2 separate collection referenced by id.
Embedded solution:
User:{
name:
surname:
monthlyReports: [{
month: "January 2014"
dailyReport: [{
day: 1
singleReport: [
{ report1}, {report2}, ...
]
}, {
day: 2
singleReport: [
{ report1}, {report2}, ...
]
}
]
},
{
/*
february 2014
Day 1
Day 2 ...
*/
} ...
]
}
Referenced solution:
Users:{
name:
surname:
monthlyReports: [
id_reportMonth1, id_reportMonth2, ...
]
}
MonthlyReport: {
id:
month:
dailyReport: [{
day: 1
singleReport: [
{ report1 }, { report2 } ...
]
},
{
} ....
]
}
For single user, i need to retrive single daily report, monthly report, and total report.
I think that in embedded solution, querying is more simple, but create large object in a long period.
Another possibility:
Create 3 referenced collection:
User, monthlyReport, dailyReport.
What is the better way to do it?
Someone has suggested?
Mongo wrote an excellent 3 parts blog post about it: http://blog.mongodb.org/post/87200945828/6-rules-of-thumb-for-mongodb-schema-design-part-1
Bottom line: it depends on the use.
You need to consider two factors (ref: mongodb blog post) :
Will the entities on the “N” side of the One-to-N ever need to stand alone?
What is the cardinality of the relationship: is it one-to-few; one-to-many; or one-to-squillions?
Based on these factors, you can pick one of the three basic One-to-N schema designs:
Embed the N side if the cardinality is one-to-few and there is no
need to access the embedded object outside the context of the parent
object
Use an array of references to the N-side objects if the cardinality
is one-to-many or if the N-side objects should stand alone for any
reasons
Use a reference to the One-side in the N-side objects if the
cardinality is one-to-squillions
Hope this helps.
There are several collections, i.e. Country, Province, City, Univ.
Just like in the real world, every country has several provinces, and every province has several cities, and every city has several universities.
How can I know whether a university is in the given country?For example, country0 may have some universities, what are their _ids?
Documents in those collections are showed below:
{
_id:"country0",
provinces:[
{
$ref:"Province",
$id:"province0"
},
...
]
}
{
_id:"province0",
belongs:{$ref:"Country", $id:"country0"},
cities:[
{
$ref:"City",
$id:"city0"
}
...
]
}
{
_id:"city0",
belongsTo:{$ref:"Province",$id:"province0"},
univs:[
{
$ref:"Univ",
$id:"univ0"
}
...
]
}
{
_id:"univ0",
address:{$ref:"City", $id:"city0"}
}
If there are only two collections, I know fetch() may be useful.
Also, python drivers may be useful, but I can't know their performance well, because I can't use db.system.profile in a .py file.
MongoDB doesn't do joins. N queries are required to get information from N collections. In this situation, to get the _id's of universities in a given country in an array one could do the following (in the mongo shell):
> var country = db.countries.findOne({ "_id": "country0" });
> var province_ids = [];
> country.provinces.forEach(function(province) { province_ids.push(province["$id"]); });
> var provinces = db.provinces.find({ "_id": { "$in": province_ids });
> var city_ids = [];
> provinces.forEach(function(province) { province.cities.forEach(function(city) { city_ids.push(city["$id"]); }); });
> var cities = db.cities.find({ "_id": { "$in": city_ids } });
> univ_ids = [];
> cities.forEach(function(city) { city.univs.forEach(function(univ) { univ_ids.push(univ["$id"]); }); });
It's also possible to accomplish the same thing using the belongsTo field, using similar steps. This is cumbersome and it seems like there should be a better way. There is! Normalize the data. Countries have provinces, which have cities, which have universities, but the relationships are fixed and not of huge cardinality. For doing queries like "what universities are in a given country?" I would suggest storing province documents entirely within countries and university documents entirely within city documents. You could store cities inside of province documents, or inside country documents directly, but a province or country could have hundreds or thousands of cities and this might be too much information for one document (16MB limit per document in MongoDB). Having provinces in countries and universities in cities reduces the number of queries necessary to two.
Another option is to store more information in each child document. Essentially you have a forest (a collection of trees): countries are parents of provinces which are parents of cities which are parents of universities. The belongsTo field is a parent reference. You could store a reference to all ancestors instead of just the parent. Then finding all universities in a given country is one query on the universities collection.
> db.universities.findOne();
{
_id: "univ0",
city: "city0",
province: "province0",
country: "country0"
}
> db.universities.find({ "country": "country0" });
The schema design that is best for you depends on the types of queries your application will need to answer and their relative frequencies and importance. I can't determine that from your question so I can't firmly recommend one schema over another.
As to your mini-question about performance and the db.system.profile collection, note that db.system.profile is a collection. You can query it from a .py file using a driver.
I am having some difficulty structuring my data so that I can benefit from reactivity in Meteor. Mainly nesting arrays of objects makes queries tricky.
The three main views I am trying to project this data onto are
waiter: shows order for one table, each persons meal (items nested, essentially what I have below)
kitchen manager: columns of orders by table (only needs table, note, and the items)
cook: columns of items, by category where started=true (only need item info)
Currently I have a meteor collection of order objects like this:
Order {
table: "1",
waiter: "neil",
note: "note from kitchen",
meals: [
{
seat: "1",
items: [ {n: "potato", category: "fryer", started: false },
{n: "water", category: "drink" }
]
},
{
seat: "2",
items: [ {n: "water", category: "drink" } ]
},
]
}
Is there any way to query inside the nested array and apply some projection, or do I need to look at an entirely different data model?
Assuming you're building this app for one restaurant, there shouldn't be many active orders at any given time—presumably you don't have thousands of tables. Even if you want to keep the orders in the database after they're served, you could add a field active to separate out the current ones.
Then your query is simple: activeOrders = Orders.find({active: true}).fetch(). The fetch returns an array, which you could loop through several times for each of your views, using nested if and for loops as necessary to dig down into the child objects. See also Underscore's _.pluck. You don't need to get everything right with some complicated Mongo query, and in fact your app will run faster if you query once and process the results however many times you need to.