Structuring MongoDB schema - mongodb

I'm playing around with MongoDB and was wondering what best practices are for how a SQL-ish schema may correspond to MongoDB. Here are the tables/data I have so far:
user
id
email
name
answer
user_id (FK user.id)
tag
upvotes
repo
id
owner
name
description
stars
repo_tag
repo_id (FK to repo.id)
tag
is_language
percentage
repo_contrib
repo_id (FK to repo.id)
user_id (FK to user.id)
lines_of_code
The structure goes something like this:
user
answer (left outer)
repo_contrib (left outer)
repo
repo_tag
Note: All users will have at least one answer or one repo, but do not necessarily have both.
How might I put this into a mongo schema? Would this be one 'collection' ? Or would this be two collections: one for user, and one for repo; or more?
My queries will be something like: "Grab all users with an Answer with tag [Python] with more than 2 upvotes, or a repo with the [Python] tag with more than two stars."

Let me divide this into a couple of steps:
STEP 1 - MONGODB and MONGOOSE
MongoDB is a document-based database. Each record in a collection is a document, and every document should be self-contained (it should contain all the information you need inside it).
Since MongoDB is a non-relational database, you cannot create relations between collections, but you can store a reference to a document in one collection as a property of a document in another collection. To help you manage all of this, there is a great package called Mongoose, which lets you create a Model for each Collection. After you define
your Models, Mongoose makes it easy to query the database.
STEP 2 - DEFINING MODELS
As we said, documents should be self-contained, so they should hold all the information you need inside them. We can consider 3 approaches based on your example:
APPROACH 1:
Create one collection for each table that you have in your relational database. This is the best practice when your documents hold a lot of data, because it scales well.
APPROACH 2:
Create 3 Collections: USERS, ANSWERS and REPOS. Because repo_contrib does not hold a lot of data, you can store all of a user's contributions inside the USERS document. That way, when you fetch a User document, you will have everything you need in one place. The same goes for repo_tag: we can store all of a repo's tags inside the REPOS document.
APPROACH 3:
Create 2 Collections: USERS and REPOS. The same as APPROACH 2, but you also embed all of a user's answers in the USERS document.
RECOMMENDATION:
I would go with APPROACH 2 in this case, since repo_contrib and repo_tag do not hold much data and can easily be embedded in USERS and REPOS documents with no problem. This approach also makes querying the database a lot easier. The reason I didn't choose APPROACH 3 is that, in theory, a user can have thousands or tens of thousands of answers, and that would not scale well.
STEP 3 - IMPLEMENTATION
NOTE: MongoDB automatically assigns an _id to each document, so you don't have to define an id property when implementing Models.
Tables from your relational database example can be mapped to collections like this (This implementation is for APPROACH 2):
USERS Collection:
const mongoose = require('mongoose');
const Schema = mongoose.Schema;
var schema = new Schema({
email: { type: String, required: true, unique: true },
name: { type: String, required: true, unique: false },
contributions: [{
repo_id: { type: mongoose.Schema.Types.ObjectId, ref: 'REPOS' },
lines_of_code: { type: Number }
}]
});
const Users = mongoose.model('USERS', schema);
module.exports = Users;
ANSWERS Collection:
const mongoose = require('mongoose');
const Schema = mongoose.Schema;
var schema = new Schema({
user_id: { type: mongoose.Schema.Types.ObjectId, ref: 'USERS', required: true },
tag: { type: String, required: true, unique: false },
upvotes:{ type: Number, default: 0, unique: false }
});
const Answers = mongoose.model('ANSWERS', schema);
module.exports = Answers;
REPOS Collection:
const mongoose = require('mongoose');
const Schema = mongoose.Schema;
var schema = new Schema({
owner: { type: mongoose.Schema.Types.ObjectId, ref: 'USERS', required: true },
name: { type: String, required: true, unique: false },
description: { type: String, required: false, unique: false },
stars:{ type: Number, default: 0, unique: false },
tags: [{
name: { type: String, required: true, unique: false },
is_language: {type: Boolean, required: true, unique: false},
percentage:{ type: Number, default: 0, unique: false }
}]
});
const Repos = mongoose.model('REPOS', schema);
module.exports = Repos ;
STEP 4 - POPULATION AND DATABASE QUERIES
One of Mongoose's best features is population. If you store a reference to a document from one collection as a property of a document in another collection, then when you ask it to populate a query, Mongoose will replace the references with the actual documents.
Example 1:
Let us take the first query you suggested as an example: Find all users with an Answer with tag [Python] with more than 2 upvotes. Since we stored user_id in the ANSWERS collection as a reference to a document in the USERS collection, we can just query the ANSWERS collection, and when returning the final result Mongoose will go to the USERS collection and replace the references with the actual User documents. The query that performs this looks like this:
const ANSWERS = require('../models/answers');
ANSWERS.find({
"tag": "Python",
"upvotes": {
"$gt": 2
}
}).populate('user_id');
Example 2:
The second query you suggested is: Find all repos with the [Python] tag and more than two stars. Since we store all of a repo's tags in one array, we just need to check that the array contains an item whose name field equals "Python" and that the repo's stars field is greater than 2. The query that performs this looks like this:
const REPOS = require('../models/repos');
REPOS.find({
"tags.name": "Python",
"stars": {
"$gt": 2
}
})
Here is also the working example: https://mongoplayground.net/p/rgBtVVDgPzG

Designing a database model is very complex most of the time, and I guess you are looking for best practices from a reputable source. I think this is the point missing from the other answers, even if @NenadMilosavljevic got close to it.
Brief introduction on NoSQL modeling
You are probably used to modeling SQL databases; NoSQL modeling is totally different. These are some of the differences:
SQL modeling is "data-oriented", in the sense that the model is designed to be used and shared by many applications. Data is normalized, generally using normal forms, to avoid duplication and to make future changes easier, with the lowest downtime possible. You start from requirements analysis, then the conceptual design, and finally the physical design.
NoSQL modeling is "application-oriented": the model should be built from the requirements of a single application, in order to reach the maximum level of optimization. To optimize your application, you need to start from the app itself and from the operations it needs: the so-called workload. After that come the conceptual and physical design, of course.
I want to focus on the workload a little more because it is very important. Since you come from a SQL-based application, you can describe the workload starting from various scenarios, production logs and statistics. For each query you need, these parameters are essential:
Size of data requested
Frequency of the query
The complexity of the operations involved
Returning to the original question: "My queries will be something like..." is not enough for me to help you build a NoSQL model. There are many solutions to your problem but, unless you provide a lot more information about the queries you need to run, they are all equally correct.
@NenadMilosavljevic gave you multiple approaches, but I can't say whether the second one is the correct one, for the reasons I gave above. For example, he suggests keeping the user and the user's contributions together so that a single query retrieves them, instead of doing JOINs or something even more expensive.
This is certainly clever, but suppose (it is probably not your case) that you have to update the user contributions very often; in that case, keeping them in a separate collection might be better.
What I am saying is that too many assumptions are missing and the solution we give you could be good but not optimal.
Honestly, it is not clear to me whether you need a trivial conversion from a SQL model to a NoSQL model, or whether you are trying to apply NoSQL principles. I have no idea about the size of your database, but if performance is not a problem, just go with the solution you find most appropriate. Researching how to better model your data would be a waste of time.
Instead, if you really need to design a NoSQL database, not a SQL-like NoSQL database, then my advice is to follow this course. You could finish it in less than 5 hours, and many lessons are unnecessary in your situation, but it is worth a look. No one here has talked about patterns or how to deal with one-to-zillions relationships, for instance. It is very important to know they exist, unless you like redesigning your database when it is too late.

Here is my suggestion: 3 collections, user, repo and answer. Below are the schemas for reference.
user collection
id: String
email: String
name: String
repo collection
id: String
owner: String
name: String
description: String
tag: [String] // Array of string
contributors: [String] // Array of user ids
I suggest having another collection named answer. This is because a user could provide a lot of answers, so keeping them in another collection makes them easier to query compared to having them as subdocuments of the user collection.
answer collection
answer_id
user_id
tag
upvotes
I hope it is helpful.
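With answers in their own collection, the question's first query ("users with a [Python] answer with more than 2 upvotes") becomes a two-step lookup: filter the answer collection, then fetch the matching users. A minimal in-memory sketch of that logic (the sample data and names are invented for illustration):

```javascript
// Stand-ins for the user and answer collections.
const users = [
  { _id: 1, email: "a@example.com", name: "Alice" },
  { _id: 2, email: "b@example.com", name: "Bob" },
];
const answers = [
  { user_id: 1, tag: "Python", upvotes: 5 },
  { user_id: 2, tag: "Python", upvotes: 1 },
  { user_id: 2, tag: "Go", upvotes: 9 },
];

// Step 1: the equivalent of find({ tag: "Python", upvotes: { $gt: 2 } }).
const matching = answers.filter(a => a.tag === "Python" && a.upvotes > 2);

// Step 2: fetch the distinct users those answers reference
// (what a second find() with $in, or a $lookup, would do against MongoDB).
const userIds = [...new Set(matching.map(a => a.user_id))];
const result = users.filter(u => userIds.includes(u._id));

console.log(result.map(u => u.name)); // ["Alice"]
```

Against MongoDB the two steps would be a find() on answer followed by a find() on user with $in, or a single $lookup aggregation.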

Mongo Schema design 101: https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1.
If in SQL you think about data in object oriented way - your models represent some business entities
and you build functions around them, in Mongo you should think functional way - what data you have as
input and what you need as output. In other words your schema should be based on queries you need
to run, not on data you have.
It is a bit more tricky as there is no best way - any schema will be better for some queries than for others.
You will need to choose which queries should be prioritized. To make it even more interesting, you will need
to anticipate which queries you may have in the future.
And of course it's all about data size. If it fits into single server you have luxury of aggregation lookups to "join" collections.
Otherwise sharding will significantly restrict your choices.
On the other hand embedding should be used with care - document size cannot exceed 16MB and modification
of embedded documents is not that straightforward.
The last but not least thing to consider are indexes. Your schema should allow efficient indexes for your queries. Here you will need to consider not only data size but also its quality - selectivity/cardinality
In light of the foregoing, the best schema to "grab all users with an Answer with tag [Python] with more than 2 upvotes or a repo with the [Python] tag with more than two stars" would be 2 collections:
user:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"id": {
"bsonType": "objectId"
},
"email": {
"bsonType": "string"
},
"name": {
"bsonType": "string"
},
"answers": {
"bsonType": "array",
"items": [
{
"bsonType": "object",
"properties": {
"tag": {
"bsonType": "string"
},
"upvotes": {
"bsonType": "int"
}
},
"required": [
"tag",
"upvotes"
]
}
]
}
},
"required": [
"id",
"email",
"name",
"answers"
]
}
repo:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"id": {
"bsonType": "objectId"
},
"owner": {
"bsonType": "string"
},
"name": {
"bsonType": "string"
},
"description": {
"bsonType": "string"
},
"stars": {
"bsonType": "int"
},
"tags": {
"bsonType": "array",
"items": [
{
"bsonType": "object",
"properties": {
"tag": {
"bsonType": "string"
},
"is_language": {
"bsonType": "bool"
},
"percentage": {
"bsonType": "double"
}
},
"required": [
"tag",
"is_language",
"percentage"
]
}
]
},
"contributors": {
"bsonType": "array",
"items": [
{
"bsonType": "object",
"properties": {
"user_id": {
"bsonType": "objectId"
},
"lines_of_code": {
"bsonType": "int"
}
},
"required": [
"user_id",
"lines_of_code"
]
}
]
}
},
"required": [
"id",
"owner",
"name",
"description",
"stars"
]
}
with the queries:
db.user.find({answers: {$elemMatch:{tag:"Python", upvotes:{$gt:2}}}})
db.repo.find({"tags.tag":"Python", stars:{$gt:2}})
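A caveat worth noting here: $elemMatch requires a single array element to satisfy every condition at once, whereas plain dot notation on two array fields lets different elements satisfy different conditions. A small sketch of the difference (the sample documents are invented):

```javascript
// Only Alice has a single answer that is BOTH "Python" AND more than 2 upvotes.
const users = [
  { name: "Alice", answers: [{ tag: "Python", upvotes: 5 }] },
  { name: "Bob",   answers: [{ tag: "Python", upvotes: 1 }, { tag: "Go", upvotes: 9 }] },
];

// $elemMatch semantics: one element must satisfy every condition.
const elemMatch = users.filter(u =>
  u.answers.some(a => a.tag === "Python" && a.upvotes > 2));

// Plain dot notation ({"answers.tag": "Python", "answers.upvotes": {$gt: 2}})
// lets DIFFERENT elements satisfy different conditions, so Bob matches too.
const dotNotation = users.filter(u =>
  u.answers.some(a => a.tag === "Python") &&
  u.answers.some(a => a.upvotes > 2));

console.log(elemMatch.map(u => u.name));   // ["Alice"]
console.log(dotNotation.map(u => u.name)); // ["Alice", "Bob"]
```

That is why the user query above uses $elemMatch, while the repo query can safely combine "tags.tag" with the top-level stars field.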
In a comment you mentioned something like "get all the repos for a given user". I assume that is about contributors; otherwise you don't need this array at all.
The query will be:
db.repo.find({"contributors.user_id": ObjectId("12313212313232")})

Related

Dating app Schema pattern nosql, is array extraction viable?

I'm trying to build a dating app, and for my backend I'm using a NoSQL database. In the users collection, some relations occur between documents of the same collection. For example, user A can like or dislike another user, or may not have made a choice yet. A simple schema for this scenario is the following:
database = {
"users": {
"UserA": {
"_id": "jhas-d01j-ka23-909a",
"name": "userA",
"geo": {
"lat": "",
"log": "",
"perimeter": ""
},
"session": {
"lat": "",
"log": ""
},
"users_accepted": [
"j2jl-564s-po8a-oej2",
"soo2-ap23-d003-dkk2"
],
"users_rejected": [
"jdhs-54sd-sdio-iuiu",
"mbb0-12md-fl23-sdm2",
],
},
"UserB": {...},
"UserC": {...},
"UserD": {...},
"UserE": {...},
"UserF": {...},
"UserG": {...},
},
}
Here userA keeps a reference to every user it has seen and made a decision on, stored either in "users_accepted" or "users_rejected". If userC hasn't been seen (either liked or disliked) by userA, then clearly it won't appear in either array. However, these arrays are unbounded and may exceed the maximum size a document can hold. One approach may be to extract both arrays and create the following schema:
database = {
"users": {
"UserA": {
"_id": "jhas-d01j-ka23-909a",
"name": "userA",
"geo": {
"lat": "",
"log": "",
"perimeter": ""
},
"session": {
"lat": "",
"log": ""
},
},
"UserB": {...},
"UserC": {...},
"UserD": {...},
"UserE": {...},
"UserF": {...},
"UserG": {...},
},
"likes": {
"id_27-82" : {
"user_give_like" : "userB",
"user_receive_like" : "userA"
},
"id_27-83" : {
"user_give_like" : "userA",
"user_receive_like" : "userC"
},
},
"dislikes": {
"id_23-82" : {
"user_give_dislike" : "userA",
"user_receive_dislike" : "userD"
},
"id_23-83" : {
"user_give_dislike" : "userA",
"user_receive_dislike" : "userE"
},
}
}
I need 4 basic queries
Get the users that have liked UserA (Show who is interested in userA)
Get the users that UserA has liked
Get the users that UserA has disliked
Get the matches that UserA has
Query 1 is fairly simple: query the likes collection and get the users where "user_receive_like" is "userA".
Queries 2 and 3 are used to get the users that userA has not yet seen: get the users that appear in neither query 2 nor query 3.
Finally, query 4 may use another collection
"matches": {
"match_id_1": {
"user_1": "referece_user1",
"user_2": "referece_user2"
},
"match_id_2": {
"user_1": "referece_user3",
"user_2": "referece_user4"
}
}
Is this approach viable and efficient?
You are right to notice that these arrays are unbounded and pose a serious scalability problem for your application. If you were assigning 2-3 user roles to a user, the first approach would be totally fine, but that is not your case. The official MongoDB documentation suggests that you should not use unbounded arrays: https://www.mongodb.com/docs/atlas/schema-suggestions/avoid-unbounded-arrays/
Your second approach is the superior implementation choice for you, because:
you can build indexes of the form (user_give_like, user_receive_like), which will keep your queries fast even when you have 1M+ documents
you can store additional metadata (like timestamps) on the likes collection without affecting the design of the user collection
the query for "matches" will be much simpler to write with this approach: https://mongoplayground.net/p/sFRvUniHKn8
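For reference, the "matches" query boils down to finding reciprocal pairs in the likes collection. A rough in-memory sketch of that logic (field names follow the schema above; the data is made up):

```javascript
// Stand-in for the likes collection: A and B like each other; A->C is one-sided.
const likes = [
  { user_give_like: "userB", user_receive_like: "userA" },
  { user_give_like: "userA", user_receive_like: "userB" },
  { user_give_like: "userA", user_receive_like: "userC" },
];

// A match exists when the reverse like is also present.
const liked = new Set(likes.map(l => `${l.user_give_like}->${l.user_receive_like}`));
const matches = likes
  .filter(l => liked.has(`${l.user_receive_like}->${l.user_give_like}`))
  // keep each pair once by ordering the two names
  .filter(l => l.user_give_like < l.user_receive_like)
  .map(l => [l.user_give_like, l.user_receive_like]);

console.log(matches); // [["userA", "userB"]]
```

In MongoDB this is typically a self-$lookup on the likes collection, as in the playground link above.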
More about NoSQL data modelling:
https://www.mongodb.com/docs/manual/data-modeling/ and
https://www.mongodb.com/docs/manual/tutorial/model-referenced-one-to-many-relationships-between-documents/
To answer your question, let me state a few more assumptions about the domain and then try to answer it.
Assumptions:
The system should support scale to 100 million users
A single user might like or dislike ~100k users in its lifetime
Also some theory about NoSQL: if our queries hit all the shards of a collection, then the maximum scale of the system is bounded by the scale of a single shard
Now, with these assumptions, consider the performance of each query you asked about:
Get the users that have liked UserA (show who is interested in userA)
Assuming we shard on the user_give_like field, then filtering on user_receive_like will query all shards, which is not the right thing for scalability
Get the users that UserA has liked
This will work fine, as the collection is sharded on user_give_like
Get the users that UserA has disliked
This will work fine, as the collection is sharded on user_give_dislike
Get the matches that UserA has
In this case, if we do a join between existing users and all users that UserA has liked and disliked, this creates a parallel query on all shards and is not scalable when UserA's like or dislike count is huge
To conclude, this doesn't look like a reasonable approach to me.

MongoDB Reuse _id

Say I have a simple schema for Users, which gets an automatically generated _id:
{
_id: ObjectId("9dfhdf9fdhd90dfdhdaf"),
name: "Joe Shmoe"
}
And then I have a schema for Groups where I can add Users into a members array.
{
name: "Joe's Group",
members: [{
_id: ObjectId("58fdaffdhfd9fdsahfdsfa"),
name: "Joe Shmoe"
}]
}
Objects within an array get new autogenerated IDs, but I'd like to keep that _id field consistent and reuse it so that the member in the Group has the same attributes as they do in the Users collection.
My obvious solution is to create an independent id field in the members' User object that references their _id in the Users collection, but that seems cluttered having two separate id fields for each member.
My question is, is what I'm attempting bad practice? What's the correct way to add existing objects into a collection from another collection?
I think what you're referring to is manual or DB references in data modeling:
original_id = ObjectId()
db.places.insert({
"_id": original_id,
"name": "Broadway Center",
"url": "bc.example.net"
})
db.people.insert({
"name": "Erin",
"places_id": original_id,
"url": "bc.example.net/Erin"
})
Check the documentation here
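Reading the data back then takes two queries: fetch the person, then follow places_id by hand (this is what makes manual references cheap to write but costs an extra round trip to read). A minimal in-memory sketch of that two-step read, with stand-ins for the two collections:

```javascript
// Stand-ins for the places and people collections from the example above.
const original_id = "507f1f77bcf86cd799439011"; // would be an ObjectId in MongoDB
const places = [
  { _id: original_id, name: "Broadway Center", url: "bc.example.net" },
];
const people = [
  { name: "Erin", places_id: original_id, url: "bc.example.net/Erin" },
];

// Step 1: fetch the person; Step 2: follow places_id manually.
const erin = people.find(p => p.name === "Erin");
const place = places.find(pl => pl._id === erin.places_id);

console.log(place.name); // "Broadway Center"
```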

MongoDB aggregate search with multiple fields

I am trying to build an API for searching jobs.
Frontend input: a single-field keyword with a string
API response: return a list of jobs that match any of the following fields
skills
location
company
Schemas
1.Job schema
title: {
type: String,
required: true,
},
location: {
type: mongoose.Schema.Types.ObjectId,
ref: 'location',
},
skills:[{
type: mongoose.Schema.Types.ObjectId,
ref: 'Skill'
}],
company: {
type: mongoose.Schema.Types.ObjectId,
ref: 'company',
},
As you can see, skills, location and company are mapped in other collections, and the frontend gives no separation in the keyword, so I am not sure how to write an effective search query.
Right now the approach is:
Find the skill_id based on the skill name and fetch all jobs that have the desired skill
Follow the same for location and company
But I am not sure this is the right approach; can somebody advise a proper way of doing this?
Many strategies can be applied in your case (yours, AdamExchange's, aggregation with $lookup stages...) depending on the size of the collections, indexes, etc.
But I think you really have to look at indexes and index intersection strategies to really optimize your query.
I would:
First create 3 single-field indexes on skill.name / location.name / company.name, so you can find the ids in your different collections using an index.
Create single-field indexes on the job collection: location, skill, company.
Then you can simply run your queries like this (assuming MyKeyword is the value of your frontend field) [pseudo-code; I don't know which language you use]:
skillId = db.skill.find({name:MyKeyword });
locationId = db.location.find({name:MyKeyword });
companyId = db.company.find({name:MyKeyword });
db.job.find({
$or: [
{
skill: {
$eq: skillId
}
},
{
location: {
$eq: locationId
}
},
{
company: {
$eq: companyId
}
}
]
})
So you can take benefit of indexes to query 'secondary collections' and of indexes intersection for each case of your $or condition for main collection.
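The overall flow can be sketched end to end with in-memory stand-ins for the four collections (field names follow the schemas in the question; the sample data is invented):

```javascript
// Stand-ins for the skill, location, company and job collections.
const skills    = [{ _id: "s1", name: "node" }];
const locations = [{ _id: "l1", name: "Berlin" }];
const companies = [{ _id: "c1", name: "Acme" }];
const jobs = [
  { title: "Backend dev",  skills: ["s1"], location: "l2", company: "c2" },
  { title: "Frontend dev", skills: ["s2"], location: "l1", company: "c3" },
  { title: "Designer",     skills: ["s3"], location: "l3", company: "c2" },
];

const MyKeyword = "node";

// Step 1: resolve the keyword to ids in the secondary collections.
const skillId    = skills.find(s => s.name === MyKeyword)?._id;
const locationId = locations.find(l => l.name === MyKeyword)?._id;
const companyId  = companies.find(c => c.name === MyKeyword)?._id;

// Step 2: the $or over the three referenced ids
// (skills is an array in the job schema, so membership is checked).
const hits = jobs.filter(j =>
  (skillId    && j.skills.includes(skillId)) ||
  (locationId && j.location === locationId) ||
  (companyId  && j.company === companyId));

console.log(hits.map(j => j.title)); // ["Backend dev"]
```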

Multi-language attributes in MongoDB

I'm trying to design a schema paradigm in MongoDB which would support multilingual values for variable attributes in documents.
For example, I would have a product catalog where each product may require storing its name, title or any other attribute in various languages.
This same paradigm should probably hold for other locale-specific properties, such as price/currency variations
I've been considering a key-value approach where key is the language code and value is the corresponding value:
{
sku: "1011",
name: { "en": "cheese", "de": "Käse", "es": "queso", etc... },
price: { "usd": 30.95, "eur": 20, "aud": 40, etc... }
}
The problem is I believe this would deny me of using indices on multilingual fields.
Eventually, I'd like a generic, yet intuitive, index-able design.
Any suggestion would be appreciated, thanks.
Wholesale recommendations on your schema design may be a bit too broad a topic for discussion here. I can, however, suggest that you consider putting the elements you are showing into an Array of sub-documents, rather than a singular sub-document with a field for each item.
{
sku: "1011",
name: [{ "en": "cheese" }, {"de": "Käse"}, {"es": "queso"}, etc... ],
price: [{ "usd": 30.95 }, { "eur": 20 }, { "aud": 40 }, etc... ]
}
The main reason for this is consideration of the access paths to your elements, which should make things easier to query. I went through this in some detail here, which may be worth reading.
It could also be a possibility to expand on this for something like your name field:
name: [
{ "lang": "en", "value": "cheese" },
{ "lang": "de", "value": "Käse" },
{ "lang": "es", "value": "queso" },
etc...
]
All of this depends on your indexing and access requirements; it really comes down to what exactly your application needs, and the beauty of MongoDB is that it lets you structure your documents accordingly.
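For example, with the lang/value layout every language lives under the same two paths (name.lang and name.value), so one query shape covers all languages, which is what makes a single multikey index useful. A sketch of that access pattern on in-memory data (the Mongo equivalent is shown in a comment; the sample documents are invented):

```javascript
// Stand-in for a products collection using the lang/value layout.
const products = [
  { sku: "1011", name: [
      { lang: "en", value: "cheese" },
      { lang: "de", value: "Käse" },
      { lang: "es", value: "queso" },
  ]},
  { sku: "1012", name: [{ lang: "en", value: "bread" }] },
];

// Equivalent of: db.products.find({ name: { $elemMatch: { lang: "de", value: "Käse" } } })
const byGerman = products.filter(p =>
  p.name.some(n => n.lang === "de" && n.value === "Käse"));

// Projecting one language out of a matched document.
const english = byGerman[0].name.find(n => n.lang === "en").value;

console.log(byGerman[0].sku, english); // 1011 cheese
```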
P.S. For anything where you store money values, I suggest you do some reading, starting maybe with this post:
MongoDB - What about Decimal type of value?

How to store option data in mongo?

I have a site that I'm using Mongo on. So far everything is going well. I've got several fields that are static option data, for example a field for animal breeds and another field for animal registrars.
Breeds
Arabian
Quarter Horse
Saddlebred
Registrars
AQHA
American Arabians
There are maybe 5 or 6 different collections like this that range from 5-15 elements.
What is the best way to put these in Mongo? Right now, I've got a separate collection for each group. That is a breeds collection, a registrars collection etc.
Is that the best way, or would it make more sense to have a single static data collection with a "type" field specifying the option type?
Or something else completely different?
Since this data is static, it's better to just embed it in your documents; this way you don't have to do manual joins.
Also store it in a separate collection (one or several, it doesn't matter; choose whatever is easier) to facilitate presentation (rendering combo-boxes, etc.).
I believe creating multiple collections has size implications (MongoDB preallocates each new database file on disk at twice the size of the previous one: db.0 = 64MB, db.1 = 128MB, and so on).
Here's what I can think of:
1. Storing as single collection
The benefits here are:
You only need one call to Mongo to fetch, and if you can cache the call you quickly have the data.
You avoid duplication: create a single schema that deals with all your options. You can just nest suboptions if there are any.
Of course, you also avoid duplication in statics/methods to modify options.
I have something similar in a project I'm working on: categories and subcategories all stored in one collection. In all the places where I need to store my 'options' (station categories in my case), I simply use the _id. Here's a JSON/BSON dump as an example:
{
"status": {
"code": 200
},
"response": {
"categories": [
{
"cat": "Taxi",
"_id": "50b92b585cf34cbc0f000004",
"subcat": []
},
{
"cat": "Bus",
"_id": "50b92b585cf34cbc0f000005",
"subcat": [
{
"cat": "Bus Rapid Transit",
"_id": "50b92b585cf34cbc0f00000b"
},
{
"cat": "Express Bus Service",
"_id": "50b92b585cf34cbc0f00000a"
},
{
"cat": "Public Transport Bus",
"_id": "50b92b585cf34cbc0f000009"
},
{
"cat": "Tour Bus",
"_id": "50b92b585cf34cbc0f000008"
},
{
"cat": "Shuttle Bus",
"_id": "50b92b585cf34cbc0f000007"
},
{
"cat": "Intercity Bus",
"_id": "50b92b585cf34cbc0f000006"
}
]
},
{
"cat": "Rail",
"_id": "50b92b585cf34cbc0f00000c",
"subcat": [
{
"cat": "Intercity Train",
"_id": "50b92b585cf34cbc0f000012"
},
{
"cat": "Rapid Transit/Subway",
"_id": "50b92b585cf34cbc0f000011"
},
{
"cat": "High-speed Rail",
"_id": "50b92b585cf34cbc0f000010"
},
{
"cat": "Express Train Service",
"_id": "50b92b585cf34cbc0f00000f"
},
{
"cat": "Passenger Train",
"_id": "50b92b585cf34cbc0f00000e"
},
{
"cat": "Tram",
"_id": "50b92b585cf34cbc0f00000d"
}
]
}
]
}
}
I have a call to my API that gets me that document (app.ly/api/v1/stationcategories). I find this much easier to code with.
In your case you could have something like:
{
"option": "Breeds",
"_id": "xxx",
"suboption": [
{
"option": "Arabian",
"_id": "xxx"
},
{
"option": "Quarter Horse",
"_id": "xxx"
},
{
"option": "Saddlebred",
"_id": "xxx"
}
]
},
{
"option": "Registrars",
"_id": "xxx",
"suboption": [
{
"option": "AQHA",
"_id": "xxx"
},
{
"option": "American Arabians",
"_id": "xxx"
}
]
}
Whenever you need them, either loop through them, or pull specific options from your collection.
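"Pull specific options" can then be a single findOne on the option name. A sketch of that against the shape above (the ids are shortened for readability):

```javascript
// Stand-in for the options collection, mirroring the documents above.
const optionDocs = [
  { option: "Breeds", _id: "b0", suboption: [
      { option: "Arabian", _id: "b1" },
      { option: "Quarter Horse", _id: "b2" },
      { option: "Saddlebred", _id: "b3" },
  ]},
  { option: "Registrars", _id: "r0", suboption: [
      { option: "AQHA", _id: "r1" },
      { option: "American Arabians", _id: "r2" },
  ]},
];

// Equivalent of: db.options.findOne({ option: "Breeds" })
const breeds = optionDocs.find(d => d.option === "Breeds");
const names = breeds.suboption.map(s => s.option);

console.log(names); // ["Arabian", "Quarter Horse", "Saddlebred"]
```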
2. Storing as a static JSON document
This, as @Sergio mentioned, is a viable and simpler approach. You can then either have separate docs for separate options, or put them all in one document.
You do lose some flexibility here, because you can't reference options by id (which I prefer, since changing an option's name then doesn't affect all your other data).
It is prone to typos (though if you know what you're doing this shouldn't be a problem).
For Node.js users: this might leave you with a require('../../../options.json') headache, similar to PHP.
The reader will note that I'm being negative about this approach; it works, but it is rather inflexible.
Though we're discouraged from using joins unnecessarily in MongoDB, referencing by ObjectId is sometimes useful and extensible.
For example, if your website becomes popular in one region of the world, say people from Poland start accounting for 50% of your site visits, and you decide to add Polish translations, you would need to go back to all your documents and add Polish names (where they exist) to your options. With approach 1, it's as easy as adding a Polish name to each option and then plucking the Polish name from your options collection at runtime.
I could only think of 2 options other than storing each option set as its own collection.
UPDATE: If someone has positives or negatives for either approach, please add them. My bias might be unhelpful to some people, as there are benefits to storing static JSON files.
MongoDB is schemaless and does not support JOINs, so you have to move away from RDBMS thinking and normalization, given that this is a purely different kind of database.
Here are a few rules you can apply as guidelines while designing. Of course, you always have the choice of keeping data in a separate collection when needed.
Static Master/Reference Data:
You should always embed it in your documents wherever required. Since the data is not going to change, it is not at all a bad idea to keep it in the same collection. If the data is too large, group it and store it in a separate collection instead of creating multiple collections for this master data.
NOTE: When embedding collections as sub-documents, always make sure you never exceed the 16MB limit. That is the limit (at this point) for each document in MongoDB.
Dynamic Master/Reference Data
Try to keep it in a separate collection, as this master data tends to change often.
Always remember: there is no join support, so store it in a way that you can access it easily without querying the database too many times.
So there is no single suggested best way; it always depends on your needs, and the design can go either way.