Document-database implementation of one-to-many relationship - MongoDB

Designing Data-Intensive Applications by Martin Kleppmann says that a one-to-many relationship is implemented in a document database as a tree hierarchy. For example, in the JSON below a user can hold multiple positions, which are represented as a list of job titles and organizations. But the organization name is then duplicated in every profile that mentions it, so the data can't be normalized as it would be in an RDBMS. The book further says that a many-to-one relationship can't be represented in a document DB because it doesn't support joins. But, contrary to the book's claim, it seems it can't fully support one-to-many either: if the organization is stored as a separate entity, the document DB can no longer represent the user as one self-contained document like the one below. So if we say that a document database supports one-to-many relationships, does that mean it can only do so by storing duplicate information?
{
"user_id": 251,
"first_name": "Bill",
"last_name": "Gates",
"summary": "Co-chair of the Bill & Melinda Gates... Active blogger.",
"region_id": "us:91",
"industry_id": 131,
"photo_url": "/p/7/000/253/05b/308dd6e.jpg",
"positions": [
{"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
{"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
],
"education": [
{"school_name": "Harvard University", "start": 1973, "end": 1975},
{"school_name": "Lakeside School, Seattle", "start": null, "end": null}
],
"contact_info": {
"blog": "http://thegatesnotes.com",
"twitter": "http://twitter.com/BillGates"
}
}
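One way to read the trade-off, as a minimal sketch: the positions list stays embedded (the one-to-many part), while each organization is referenced by ID and resolved in application code. The `organizations` mapping, the `org:1`-style IDs, and the `resolve_positions` helper are assumptions made for illustration, and plain dictionaries stand in for collections.

```python
# Hypothetical in-memory stand-ins for two collections; all IDs are made up.
organizations = {
    "org:1": {"name": "Bill & Melinda Gates Foundation"},
    "org:2": {"name": "Microsoft"},
}

user = {
    "user_id": 251,
    "first_name": "Bill",
    "last_name": "Gates",
    # The one-to-many part stays embedded, but organizations are stored once.
    "positions": [
        {"job_title": "Co-chair", "organization_id": "org:1"},
        {"job_title": "Co-founder, Chairman", "organization_id": "org:2"},
    ],
}


def resolve_positions(user_doc, orgs):
    """Return a copy of the profile with each organization_id replaced by its name."""
    resolved = dict(user_doc)
    resolved["positions"] = [
        {
            "job_title": position["job_title"],
            "organization": orgs[position["organization_id"]]["name"],
        }
        for position in user_doc["positions"]
    ]
    return resolved


profile = resolve_positions(user, organizations)
print(profile["positions"])
```

The cost of avoiding duplication is the extra lookup: the stored document is no longer self-contained, which is exactly the tension the question describes.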

Related

Which is the best design for a MongoDB database model?

I feel like the MVP of my current database needs some design changes. The number of users is growing quite fast and some requests perform poorly. I also want to get rid of all the DBRefs we used.
Our current model can be summarized as follows:
A company can have multiple employees (thousands)
A company can have multiple teams (hundreds)
An employee can be part of a team
A company can have multiple devices (thousands)
An employee is assigned to multiple devices
Our application displays the following on different pages:
The company data
The users
The devices
The teams
I guess I have different options, but I'm not familiar enough with MongoDB to make the best decision.
Option 1
Do not embed, and use lists of ids for one-to-many relationships.
// Company document
{
"companyName": "ACME",
"users": [ObjectId(user1), ObjectId(user2)],
"teams": [ObjectId(team1), ObjectId(team2)],
"devices": [ObjectId(device1), ObjectId(device2)]
}
// User Document
{
"userName": "Foo",
"devices": [ObjectId(device2)]
}
// Team Document
{
"teamName": "Foo",
"users": [ObjectId(user1)]
}
// Device Document
{
"deviceName": "Foo"
}
Option 2
Embed data and duplicate information.
// User Document
{
"companyName": "ACME",
"userName": "Foo",
"team": {
"teamName": "Foo"
},
"device": {
"deviceName": "Foo"
}
}
// Team Document
{
"teamName": "Foo",
"companyName": "ACME",
"users": [
{
"userName": "Foo"
}
]
}
// Device Document
{
"deviceName": "Foo",
"companyName": "ACME",
"user": {
"userName": "Foo"
}
}
Option 3
Do not embed and use id for one to one relationship.
// Company document
{
"companyName": "ACME"
}
// User Document
{
"userName": "Foo",
"company": ObjectId(company),
"team": ObjectId(team1)
}
// Team Document
{
"teamName": "Foo",
"company": ObjectId(company)
}
// Device Document
{
"deviceName": "Foo",
"company": ObjectId(company),
"user": ObjectId(user1)
}
MongoDB recommends embedding data as much as possible, but I don't think it is possible to embed all the data in the company document. A company can have thousands of devices or users, and I believe the document could grow too big.
I'm switching from SQL to NoSQL and I haven't quite figured it out by myself yet!
Thanks!
MongoDB gives you the ability to handle unstructured data.
Every database can contain collections, which in turn contain documents.
Moreover, MongoDB has no joins in ordinary queries (the closest equivalent is the $lookup aggregation stage, available since version 3.2). So storing the information in one company model is a better choice, because you won't need a join in that scenario.
One more thing: you don't need to embed all the models. For example, you can get both the users and the devices from the company collection, so why embed users and devices as well?
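For comparison with the advice above, here is a minimal sketch of Option 3 from the question, where each child document carries the id of its parent. Plain lists stand in for collections, and the string ids, the extra sample values, and the `by_field` helper are hypothetical. Grouping by a field in memory mimics what a single filtered query on that field (ideally backed by an index) would return for each page.

```python
from collections import defaultdict

company_id = "company1"  # hypothetical ids standing in for ObjectId values

users = [
    {"_id": "user1", "userName": "Foo", "company": company_id, "team": "team1"},
    {"_id": "user2", "userName": "Bar", "company": company_id, "team": None},
]
devices = [
    {"_id": "device1", "deviceName": "Laptop", "company": company_id, "user": "user1"},
    {"_id": "device2", "deviceName": "Phone", "company": company_id, "user": "user1"},
]


def by_field(docs, field):
    """Group documents by the value of one reference field."""
    index = defaultdict(list)
    for doc in docs:
        index[doc[field]].append(doc)
    return index


users_by_company = by_field(users, "company")
devices_by_user = by_field(devices, "user")

print([u["userName"] for u in users_by_company[company_id]])
print([d["deviceName"] for d in devices_by_user["user1"]])
```

With this layout the company document never grows, however many users or devices are added.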

Generate a JSON schema from an existing MongoDB collection

I have a MongoDB collection that contains a lot of documents. They are all roughly in the same format, though some of them are missing some properties while others are missing other properties. So for example:
[
{
"_id": "SKU14221",
"title": "Some Product",
"description": "Product Description",
"salesPrice": 19.99,
"specialPrice": 17.99,
"marketPrice": 22.99,
"purchasePrice": 12,
"currency": "USD",
"color": "red"
},
{
"_id": "SKU14222",
"title": "Another Product",
"description": "Product Description",
"salesPrice": 29.99,
"currency": "USD",
"size": "40"
}
]
I would like to automatically generate a schema from the collection. Ideally it would note which properties are present in all the documents and mark those as required. Detecting unique fields would also be nice, though it is not really necessary. In any event, I would be modifying the schema after it is automatically generated.
I've noticed that there are tools that can do this for JSON. But short of downloading the entire collection as JSON, is it possible to do this using the MongoDB console or a CLI tool directly from the collection?
You could try this tool out. It appears to do exactly what you want.
Extract (and visualize) schema from Mongo database, including foreign
keys. Output is simple json file or html with dagre/d3.js diagram
(depending on command line options).
https://www.npmjs.com/package/extract-mongo-schema
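To illustrate the kind of inference the question asks for (this is not how the linked tool works internally), here is a minimal sketch that marks a field as required when it appears in every sampled document. The `infer_schema` function and the type mapping are assumptions; in practice the documents would come from a cursor over the collection rather than a hard-coded list.

```python
from collections import Counter

# Sample documents shaped like the ones in the question (fields abridged).
documents = [
    {"_id": "SKU14221", "title": "Some Product", "salesPrice": 19.99,
     "currency": "USD", "color": "red"},
    {"_id": "SKU14222", "title": "Another Product", "salesPrice": 29.99,
     "currency": "USD", "size": "40"},
]

# Map Python types to JSON Schema type names; anything else falls back to "object".
JSON_TYPES = {str: "string", int: "integer", float: "number",
              bool: "boolean", list: "array", type(None): "null"}


def infer_schema(docs):
    """Build a flat JSON-Schema-like dict; fields seen in every doc are required."""
    seen = Counter()
    types = {}
    for doc in docs:
        for key, value in doc.items():
            seen[key] += 1
            types.setdefault(key, set()).add(JSON_TYPES.get(type(value), "object"))
    properties = {
        key: {"type": sorted(kinds) if len(kinds) > 1 else next(iter(kinds))}
        for key, kinds in types.items()
    }
    required = sorted(key for key, count in seen.items() if count == len(docs))
    return {"type": "object", "properties": properties, "required": required}


schema = infer_schema(documents)
print(schema["required"])
```

The sketch only looks at top-level fields and does not detect uniqueness; nested documents would need a recursive version.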

MongoDB - how to properly model relations

Let's assume we have the following collections:
Users
{
"id": MongoId,
"username": "jsloth",
"first_name": "John",
"last_name": "Sloth",
"display_name": "John Sloth"
}
Places
{
"id": MongoId,
"name": "Conference Room",
"description": "Some longer description of this place"
}
Meetings
{
"id": MongoId,
"name": "Very important meeting",
"place": <?>,
"timestamp": "1506493396",
"created_by": <?>
}
Later on, we want to return (e.g. from REST webservice) list of upcoming events like this:
[
{
"id": MongoId(Meetings),
"name": "Very important meeting",
"created_by": {
"id": MongoId(Users),
"display_name": "John Sloth"
},
"place": {
"id": MongoId(Places),
"name": "Conference Room"
}
},
...
]
It's important to return the basic information that needs to be displayed on the main page of the web UI (so no additional calls are needed to render the table). That's why each entry contains the display_name of the user who created it and the name of the place. I think that's a pretty common scenario.
Now my question is: how should I store this information in the db (the question mark values in the Meeting document)? I see 2 options:
1) Store references to other collections:
place: MongoId(Places)
(+) data is always consistent
(-) additional calls to db have to be made in order to construct the response
2) Denormalize data:
"place": {
"id": MongoId(Places),
"name": "Conference room"
}
(+) no need for additional calls (response can be constructed based on one document)
(-) data must be updated each time related documents are modified
What is the proper way of dealing with such scenario?
If I use option 1), how should I query other documents? Asking about each related document separately seems like overkill. How about getting the last 20 meetings, collecting the ids of the related documents, and then performing a query like db.users.find({_id: { $in: <id list> }})?
If I go for option 2), how should I keep the data in sync?
Thanks in advance for any advice!
You can keep the DB model you already have and still do only a single query, since MongoDB introduced the $lookup aggregation stage in version 3.2. It is similar to a join in an RDBMS.
$lookup
Performs a left outer join to an unsharded collection in the same database to filter in documents from the “joined” collection for processing. The $lookup stage does an equality match between a field from the input documents with a field from the documents of the “joined” collection.
So instead of storing a reference to other collections, just store the document ID.
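As a rough sketch of what such an aggregation could look like for the meetings example, the function below only builds the pipeline as plain Python data. The collection names `users` and `places`, the use of `_id` as the key, and the sort on `timestamp` are assumptions based on the question; with a driver such as PyMongo the list would be passed to the meetings collection's `aggregate` method.

```python
def upcoming_meetings_pipeline(limit=20):
    """Build an aggregation pipeline joining each meeting to its creator and place."""
    return [
        {"$sort": {"timestamp": 1}},
        {"$limit": limit},
        # Replace the stored user id with the matching user document(s).
        {"$lookup": {"from": "users", "localField": "created_by",
                     "foreignField": "_id", "as": "created_by"}},
        # Replace the stored place id with the matching place document(s).
        {"$lookup": {"from": "places", "localField": "place",
                     "foreignField": "_id", "as": "place"}},
        # $lookup always yields an array; $unwind turns the single match into an
        # object (and drops meetings whose reference matches nothing).
        {"$unwind": "$created_by"},
        {"$unwind": "$place"},
        # Keep only the fields the main page needs.
        {"$project": {"name": 1,
                      "created_by._id": 1, "created_by.display_name": 1,
                      "place._id": 1, "place.name": 1}},
    ]


pipeline = upcoming_meetings_pipeline()
for stage in pipeline:
    print(stage)
```

This keeps option 1 from the question (references, always consistent) while still building the response in one round trip.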

MongoDB schema design: embedded or referenced

I have a question about designing my application schema. My application handles objects and statuses for these objects.
Objects and statuses are quite small, and in terms of volume the application handles a few million objects, each of which can have hundreds to tens of thousands of statuses.
The main write operation of the application is adding a status (multiple times per day per object).
Read operations are varied (list of objects, object detail, many aggregation framework queries to get stats about objects and statuses).
I tried both approaches. Both work pretty well for a couple of hundred objects with 1,000-10,000 statuses each, but I can't figure out which is the more scalable.
Embedded objects :
Objects:
[
{
"id": "0000000001",
"name": "Resource 1",
"description": "Resource 1",
"owner": "John Doe",
"created_at": "2000-01-01T00:00:00.000Z",
"updated_at": "2000-01-01T00:00:00.000Z",
"status": [
{
"id": "0000000001",
"position": [0, 0],
"comment": "comment 1",
"owner": "John Doe",
"created_at": "2000-01-01T00:00:00.000Z",
"updated_at": "2000-01-01T00:00:00.000Z"
},
{
"id": "0000000002",
"position": [0, 0],
"comment": "comment 2",
"owner": "John Doe",
"created_at": "2000-01-01T00:00:00.000Z",
"updated_at": "2000-01-01T00:00:00.000Z"
}
]
}
]
Referenced objects :
Objects:
[
{
"id": "0000000001",
"name": "Resource 1",
"description": "Resource 1",
"owner": "John Doe",
"created_at": "2000-01-01T00:00:00.000Z",
"updated_at": "2000-01-01T00:00:00.000Z"
]
Status
[
{
"id": "0000000001",
"object_id": "0000000001",
"position": [0, 0],
"comment": "comment 1",
"owner": "John Doe",
"created_at": "2000-01-01T00:00:00.000Z",
"updated_at": "2000-01-01T00:00:00.000Z"
},
{
"id": "0000000002",
"object_id": "0000000001",
"position": [0, 0],
"comment": "comment 2",
"owner": "John Doe",
"created_at": "2000-01-01T00:00:00.000Z",
"updated_at": "2000-01-01T00:00:00.000Z"
}
]
Best regards,
Mickael
The embedded model is preferable if you want maximum performance and a single document stays smaller than 16 megabytes (the maximum BSON document size).
The referenced model is preferable when you need maximum scalability or have complex relationships.
So, if you are sure that you can stay within the document size limit, you can use the embedded model. But if you want guaranteed scalability, use the referenced model.
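To get a feel for how close the embedded model comes to the size limit, here is a back-of-the-envelope sketch. It uses the compact JSON length of one status entry as a stand-in for its BSON size, which is only an approximation; the `parent_overhead` allowance and the helper names are assumptions.

```python
import json

MAX_BSON_BYTES = 16 * 1024 * 1024  # MongoDB's 16 MB document size limit

# One status entry shaped like the example in the question.
status = {
    "id": "0000000001",
    "position": [0, 0],
    "comment": "comment 1",
    "owner": "John Doe",
    "created_at": "2000-01-01T00:00:00.000Z",
    "updated_at": "2000-01-01T00:00:00.000Z",
}


def approx_bytes(doc):
    """Rough size proxy: length of the compact JSON encoding, not the real BSON size."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


def max_embedded_statuses(status_doc, parent_overhead=1024):
    """Estimate how many statuses fit in one parent document under the limit."""
    return (MAX_BSON_BYTES - parent_overhead) // approx_bytes(status_doc)


for count in (1_000, 10_000, 100_000):
    total = count * approx_bytes(status)
    print(count, "statuses ~", total, "bytes; fits:", total < MAX_BSON_BYTES)

print("estimated maximum:", max_embedded_statuses(status))
```

Real comments are likely to be longer than the sample values, so the actual bound would be lower; an array that keeps growing also makes every rewrite of the parent document more expensive, which is a separate argument for the referenced model here.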
Here are some quotes from the MongoDB documentation:
In general, use embedded data models when:
you have “contains” relationships between entities
you have one-to-many relationships between entities. In these relationships the “many” or child documents always appear with or are
viewed in the context of the “one” or parent documents.
Reference to Model One-to-Many Relationships with Embedded Documents
In general, embedding provides better performance for read operations,
as well as the ability to request and retrieve related data in a
single database operation. Embedded data models make it possible to
update related data in a single atomic write operation.
In general, use normalized data models:
when embedding would result in duplication of data but would not provide sufficient read performance advantages to outweigh the
implications of the duplication.
to represent more complex many-to-many relationships.
to model large hierarchical data sets.
References provides more flexibility than embedding. However,
client-side applications must issue follow-up queries to resolve the
references. In other words, normalized data models can require more
roundtrips to the server.
Reference to Model One-to-Many Relationships with Document References

MongoDB: How to represent a schema diagram in a thesis?

I am currently writing a thesis and need to display the schema of my MongoDB in a diagram. I have found no resources about diagrams for document-based databases.
There are Entity Relationship Diagrams (ERD) for relational databases. What options do I have for MongoDB? I've noticed that a lot of blogs just display the raw JSON as their "diagram" but this isn't feasible in my thesis.
Here is a sample of one of my JSON structures:
//MultiChoiceQuestion
{
"title": "How are you?",
"valid_answers" : [
{
"_id" : ObjectID(xxxx),
"title": "Great",
"isCorrect": true
},
{
"_id" : ObjectID(yyyy),
"title": "OK",
"isCorrect": false
},
{
"_id" : ObjectID(zzzz),
"title": "Bad",
"isCorrect": false
}
],
"user_responses" : [
{
"user": ObjectID(aaaa),
"answer": ObjectID(xxxx)
},
{
"user": ObjectID(bbbb),
"answer": ObjectID(xxxx)
},
{
"user": ObjectID(cccc),
"answer": ObjectID(yyyy)
}
]
}
//User
{
"_id": ObjectID(aaaa),
"name": "Person A"
}
//User
{
"_id": ObjectID(bbbb),
"name": "Person B"
}
//User
{
"_id": ObjectID(cccc),
"name": "Person C"
}
Could this be a possible diagram:
We found class diagrams to actually be one of the best ways to represent a MongoDB schema design.
They can capture most of the items that a document will have, such as arrays, embedded objects and even references.
General guidelines we use to map these concepts onto UML:
Embed = Composition aggregation
Reference = Association class
If you're unfamiliar with the UML terminology, then this is a decent intro.
UML intro from IBM site
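As an illustration of the mapping above, the following sketch emits a class diagram in PlantUML text form for the collections in the question. PlantUML is an assumption here (the answer does not name a tool), and the hand-written `classes`, `embeds`, and `references` structures as well as the `UserResponse` and `ValidAnswer` class names are hypothetical.

```python
# Hand-written description of the documents from the question.
classes = {
    "MultiChoiceQuestion": ["title : String"],
    "ValidAnswer": ["_id : ObjectId", "title : String", "isCorrect : Boolean"],
    "UserResponse": [],
    "User": ["_id : ObjectId", "name : String"],
}
embeds = [("MultiChoiceQuestion", "ValidAnswer"),
          ("MultiChoiceQuestion", "UserResponse")]
references = [("UserResponse", "User"), ("UserResponse", "ValidAnswer")]


def to_plantuml(classes, embeds, references):
    """Render the model as PlantUML class-diagram source text."""
    lines = ["@startuml"]
    for name, fields in classes.items():
        lines.append(f"class {name} {{")
        lines.extend(f"  {field}" for field in fields)
        lines.append("}")
    # Embedded sub-documents become composition (filled diamond on the parent).
    lines.extend(f'{parent} *-- "0..*" {child}' for parent, child in embeds)
    # ObjectId references become plain directed associations.
    lines.extend(f"{source} --> {target}" for source, target in references)
    lines.append("@enduml")
    return "\n".join(lines)


diagram = to_plantuml(classes, embeds, references)
print(diagram)
```

The resulting text can be rendered to a vector image, which may suit a thesis better than raw JSON.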
There is a tool for diagramming MongoDB called DbSchema. It discovers the schema by scanning data from the database. I would also suggest trying two of its features:
virtual relations, which allow exploring data from different collections at the same time. A kind of JOIN between different collections.
HTML documentation, which we use in presentations as well: the comments appear on mouse-over (diagrams are saved as vector images).