Better mongodb data model for nested information - mongodb

I am designing some data model for mango db, I have some requirement similar to below json.
Single_Collection.
{
"collegeid": 1234,
"Name": "aaaa",
"otherinfo": 1,
"studnet":[
{
"stdid": 1,
"name": "n1"
},
{
"stdid": 2,
"name": "n2"
}
]
}
Two Collections.
College Info
{
"collegeid": 1234,
"Name": "aaaa",
"otherinfo": 1
}
Student Info collection
{
"collegeid": 1234,
"stdid": 1,
"name": "n1"
}
{
"collegeid": 1234,
"stdid": 2,
"name": "n2"
}
Which is the better way interms of reading performance(Having single collection or separating it out), I have more read like given student ID find out the college ID .
Student ID list will be very big.
Also I perform more student insertion operations

IMO,each model design has its own Pros & Cons, what we say "better way" is depending on your use cases(How you query the data? Do you need all the data at the beginning? Do you need paging? etc...)
Let's start from your requirements.
Your requirements
Given a college ID, find out the students in this college.
Given student ID, find out his college ID.
Relation between objects
Obviously college & student is 1:m mapping, because a lot of students in one college but each student can stay in one college only.
I will show you some different model designs and also provide the Pros & Cons for each model.
Approach 1 - Embed students into college
This is the design you mentioned as a single collection.
{
"collegeid":1234,
"Name":"aaaa",
"otherinfo":1,
"studnet":[
{
"stdid":1,
"name":"n1"
},
{
"stdid":2,
"name":"n2"
}
]
}
Pros:
Very natural model for human to read & front-end to display.
Good performance when loading the college and all the students in it. Because the data stored in the engine is in continuous. The engine needs fewer I/O to do that.
Cons:
If you have huge number of students in a college, the size of the document will be very big. It will be inefficient if you add/remove/update student frequently.
There is not a quick way to achieve requirement(2). Since we only maintain the mapping from college -> students, you have to walk through all the documents to find out which college contains the specified studentID.
Approach 2 - Student has reference to college
This is the design you mentioned as a Two Collections. It is similar to the RDBMS tables, student model owns a reference key point to its college.
College collection:
{
"collegeid":1234,
"Name":"aaaa",
"otherinfo":1
}
Student collection:
{
"collegeid":1234,
"stdid":1,
"name":"n1"
}
{
"collegeid":1234,
"stdid":2,
"name":"n2"
}
Pros:
Can achieve requirement(1) and (2). Remember to add index on "collegeid" and "stdid" field.
Every document can be maintained in small size, it is easy for the engine to store the data.
Cons:
College and students are separated. It will be slower than Approach 1 if loading a college and all its students(Need two queries).
You need to merge college and students together by yourself before displaying in UI.
Approach 3 - Duplicated data in college and students
This approach looks like we mix up approach 1 and approach 2. We have two collections: college will have its students embeded in itself, and also a separated student collection. So, the student data is duplicated in both collections.
College collection:
{
"collegeid":1234,
"Name":"aaaa",
"otherinfo":1,
"studnet":[ // duplicated here!
{
"stdid":1,
"name":"n1"
},
{
"stdid":2,
"name":"n2"
}
]
}
Student collection:
{
"collegeid":1234,
"stdid":1,
"name":"n1"
}
{
"collegeid":1234,
"stdid":2,
"name":"n2"
}
Pros:
You have all Pros from Approach 1 & Approach 2.
Cons:
The document in college collection will grow to very big.
You have to keep the data from college collection and student collection Synchronous by yourself.
Approach 4 - Duplicated data in college(Only studentID) and students
This is a variant from Approach 3.
We assume that your use case is:
User can search for a college.
User click one college in search result.
A new UI show all student IDs to user(maybe in a grid or list).
User click one student ID.
System loads the full data of the specified student and show to user in another UI.
In a short word, user does not need the full data of all students at the beginning, he just need students' basic info(e.g. student ID). If user accepts such scenario, you can have below model:
College collection:
{
"collegeid":1234,
"Name":"aaaa",
"otherinfo":1,
"studnetIds":[1, 2] // only student IDs are duplicated
}
Student collection:
{
"collegeid":1234,
"stdid":1,
"name":"n1"
}
{
"collegeid":1234,
"stdid":2,
"name":"n2"
}
College only has the studnet IDs in it. This is the difference compared to Approach 3.
Pros:
Can achieve requirement(1) and (2).
You don't need to worry about the college document grows to huge size. Since it only owns the student IDs.
If user accepts above scenario, this will be a better desgin on the balance of performance/complex/data size.
Cons:
Suitable to specified use case, if the requirement is changed in the future, will break the scenario and this model will be not good.
Summary
You should be very clear to your use cases.
Based on use cases, compare the approachs to see whether you can accept the Pros & Cons.
Load testing is important!

Related

How do you store user data - in one or multiple collections?

I'm trying to store a user details (name, password, etc. ) in mongodb and I also wanted to store data on the books they've read (author, date, title), so this would be like an array of objects. Would it be better to store this in 2 different collections in the same database like this...
collection 1:
{
_id: ObjectID('SOMEid'),
firstName:"x",
lastName:"y",
email:"x#z.com",
username:"xxx",
password:"xyz"
}
collection 2:
{
_id: ObjectID('SOMEOTHERid'),
spend: [objects containing each book they've read],
userID: 'SOMEid'
}
and then reference the user in the collection containing the book data.
or in 1 collection like this...
{
_id: ObjectID('SOMEid'),
firstName:"x",
lastName:"y",
email:"x#z.com",
username:"xxx",
password:"xyz",
books: []
}
or is there another way to do it that's better. thanks
It's good to save it in 2 different collections. You can get all the user meta sata from collection for login page etc. And other collection holds user actions like which all books user has read.
Are you thinking of using other DBs? Or Mongo Db is final?

How do I get unique values from nested Many-To-Many joins

Basically we have four tables that are joined: Malls, Stores, Brands, and Categories.
Malls can have many Stores.
Each Store is linked to a Brand.
Each Brand has many Categories.
e.g. Mall A has 2 "McDonald's Cafes", each belonging to the Brand "McDonald's". "McDonald's" Brand has "Fast Food" and "Breakfast" as Categories.
We're trying to figure out how to display all the Categories that exist within Mall A.
e.g. Mall A has "Fast Food" and "Breakfast" categories, based on the "McDonald's" stores.
Ideally this would be a query so that the Categories are updated automatically.
We've tried querying for all Stores within a Mall, and then finding the Categories via the Store-Brand join, then reducing duplicates. But some Malls have more than 700 Stores, so this process is quite expensive in terms of querying and processing the data.
I managed to figure it out! Using Sequelize, my query was as follows:
postgres.BrandsCategories.findAndCountAll({
limit,
offset,
include: [
{
model: postgres.Brands,
as: "brands",
include: [
{
model: postgres.Stores,
as: "stores",
where: {
mallId: parent.dataValues.id
}
}
]
}
]
})

MongoDB , how to choose the right design?

I have to create a collection of document and I have a doubt about the right design.
Each document is an "identity"; each Identity has a list of "partner Data"; each partner data are defined by an ID and a set of Data.
One approach can be (1):
{
_id: ...
partners: [
{
id: partner1,
data: {
}
},
{
id: partner2,
data: {
}
},
]
}
Another approach can be (2)
{
_id: ...
partners: {
partner1: {
data: {
}
},
partner2: {
data: {
}
},
]
}
I prefer the first one, but considering that I could have million of these identities, which could be the most performed schema?
A typical query can be: "how many identities have partner with ID N".
With the second example, a query can be:
db.identities.find({partner.partnerName: {$exists:true}})
With first approach, how can I get this count?
The second solution is more easy to handle Server Side; each document will have a list where each KEY is the partner ID, so instead of scan all document, I can simply get partner data by key...
What do you think about these solutions? I prefer the first one but the second I think that is more "usable"...
Thanks
I prefer the first one, but considering that I could have million of these identities, wich could be the most performed schema?
If you going to have millions of identities, then both approaches
are not really scalable .
Each document in mongo has a size limit (16MB) (read about it here)
In case you are going to have really lot's of identities ,
the scalable approach would be to create a different collection,
only for the relations and partnership data.
Now , I also want you to consider how you treat "partnership",
if I'm a user and I got you on my partners list , will you see me as a partner on your list ?
In case we both see each other as partners , then mongo-db may not be the best solution. graph db's are more appropriate for dealing with relations of this type.
All solutions within mongo for two-ways relations will be built on double updates (Your id on my partner's list , My id on your partner's list).
(In SQL you could add an extra condition for joining but not in mongo) ,
so you don't need to save twice partnerships . (me and you , you and me)
just you and me.
Do you see where is going this way ?
If you need to go only in one way,
Then just create a second collection , "partnerships" ,
{
_id: should be uniqe,
user_id: 'your_id',
partner_id: 'his_id'
data: {} or just flatten the fields into the root object.
}
Please notice that you create a row for each partnership !
Then you could use $lookup in order to query for a user with all of his
partners .
something like:
db.getCollection('partners').aggregate([
{
$lookup: {
from: 'parterships',
localField: '_id',
foreignField: 'user_id',
as: 'partners'
}
},
{
$project: {
name: 1,
partners: 1,
num_partners: { $size: "$partners" }
}
}
])
Read more about the aggregation stages here.
In case you are not going to have lot's of partnership's Then please continue with
your first approach which is good .
The second approach will make most queries to this collection pretty weird and you will always have to write code in order to query this table .
It won't be "straight forward" mongo queries .

Filtering and sorting by a specified child in Firebase

I am building a game app using Firebase and Swift, and store my data like this.
players: {
"category": {
"playerId": {
name: "Joe",
score: 3
}
}
}
There are two queries I need to make based on the score:
Get the top 5 players from a single category.
Get the overall top 100 players of the game.
I have no problem getting the data for the first query, using
ref.child("players").child(category).queryOrdered(byChild: "score").queryLimited(toLast:5)
But I am having trouble figuring out the second query. How can I do this efficiently?
You'll either need to add an additional data structure to allow the second query (which is quite common in NoSQL databases) or change your current structure to allow both queries.
The latter could be accomplished by turning the data into a single flat list, with some additional synthetic properties. E.g.
players: {
"playerId": {
category: "category1"
name: "Joe",
score: 3,
category_score: "category1_3"
}
}
With the new additional properties category and category_score, you can get the top scorers for a category with:
ref.child("players")
.queryOrdered(byChild: "category_score")
.queryStarting(atValue:"category1_")
.queryEnding(atValue("category1_~")
.queryLimited(toLast:5)
And the top 100 scores overall with:
ref.child("players")
.queryOrdered(byChild: "score")
.queryLimited(toLast:100)
For more on this, see my answer here: Firebase Database queries can only order/filter on a single property. In many cases it is possible to combine the values you want to filter on into a single (synthetic) property. For an example of this and other approaches, see my answer here: http://stackoverflow.com/questions/26700924/query-based-on-multiple-where-clauses-in-firebase.

MongoDB many-to-many relationship logic

I'm trying to design a schema for Products, Suppliers and Manufacturers:
A product can have many suppliers but only 1 manufacturer.
A supplier can have many products and many manufacturers.
A manufacturer can have many suppliers and many products.
I've reviewed this page where 10gen indicates "To avoid mutable, growing arrays, store the publisher reference inside the book document". In my example I consider the product-->manufacturer relationship to be equivalent to that of book-->publisher. I would therefore do this:
{
_id: "widgets",
manufacturer_name: "Widgets Inc.",
founded: 1980,
location: "CA"
}
{
_id: 123456789,
product_name: "Steel Widget",
color: "Steel",
manufacturer_id: "widgets"
}
{
_id: 223456789,
product_name: "White Widget",
color: "White",
manufacturer_id: "widgets"
}
What is the best way to handle the SUPPLIER (with relationships to many products and many manufacturers) such that I "avoid mutable, growing arrays"??
Note: This is one way to model it. Data modeling has a lot to do with the use cases and the according questions you want to ask. Your use case might need a different model.
I'd probably model it like this
manufacturer
{
_id:"ACME",
name: "ACME Corporation"
…
}
products
{
_id:ObjectId(...),
manufacturer: "ACME",
name: "SuperFoo",
description: "Without SuperFoo, you can't bar or baz!",
…
}
Now comes the problem. Since potentially, if we embed all the products in a supplier document or vice versa, we could break the 16MB size limit easily, I'd use a different approach:
Supplier:
{
_id:ObjectId(...),
"name": "FooMart",
"location: { city:"Cologne",state:"MN",country:"US",geo:[44.770833,-93.783056]}
}
productSuppliers:
{
_id:ObjectId(...),
product:ObjectId(...),
supplier:ObjectId(...)
}
This way, each product can have gazillions of suppliers. Now here come the drawbacks.
For finding a supplier for a certain product, you have to do a two step query. First, you have to find all supplier IDs for a given product:
db.productSuppliers.find({product:<some ObjectId>},{_id:0,supplier:1})
Now, let's say we want to find all suppliers of SuperFoo near Cologne,MN at a max distance of 10 miles:
db.suppliers.find({
_id:{$in:[<ObjectIds of suppliers from the first query>]},
"location.geo": { $near :
{
$geometry: { type: "Point", coordinates: [ 44.770833, -93.783056] },
$maxDistance: 16093
}
}
})
You need to do quite some smart indexing to make these queries efficient. Which indices you need depends on your use case. The problem with indices is that they are only efficient when kept in RAM. So you really should be careful when creating your indices.
Again: The way you model data heavily depends on your use cases. In case you only have a few products per manufacturer or only a few suppliers per product or if each supplier only supplies products by a certain manufacturer, the model may look quite different.
Please have a close look at MongoDB's data modeling documentation. Most problems with MongoDB stem from wrong data modeling for the respective use case.