I have a document that looks like this:
{
  "_id": "chatID",
  "presence": {
    "userID1": 1647627240464,
    "userID2": 1647227540464
  }
}
For each userID, I need to query the chats where that user is present, ordered by the user's timestamp in the presence map.
I am aware that this is probably not the best way to do it. Before, I had one document per user, which meant duplicating the chatIDs, but it is a pain to update them all because each one would look like:
{
  "_id": "userID1chatID",
  "at": 1647627240464,
  "ids": "30EYwO01_Nyq7dMqe_O3vfL3AH",
  "members": ["userID1", "userID2", "userID3"],
  "owner": "userID1",
  "present": true,
  "uid": "chatID",
  "url": "databaseURL"
}
This would allow me to find the chats where userID1 has present: true and order by at DESCENDING.
The problem is that I need to update the at attribute in every document (one per user) for the same chat room.
How can I run the same query while keeping a single document with presence as a map?
Problem: the index would be on a variable key (userID1, userID2, etc.), like presence.userID1, which does not seem convenient when userID1 can be removed from the presence map if the user leaves the chat.
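To make the goal concrete, here is a plain JavaScript sketch of the result I am after, run over in-memory documents rather than the database. The chat IDs (chatA, chatB, chatC) and the function name chatsForUser are made up for illustration only.

```javascript
// Hypothetical in-memory chat documents shaped like the single-document layout.
const chats = [
  { _id: "chatA", presence: { userID1: 1647627240464, userID2: 1647227540464 } },
  { _id: "chatB", presence: { userID2: 1647999999999 } },
  { _id: "chatC", presence: { userID1: 1647700000000 } },
];

// Keep the chats whose presence map has an entry for userId,
// then sort by that user's timestamp, newest first.
function chatsForUser(docs, userId) {
  return docs
    .filter((doc) => doc.presence[userId] !== undefined)
    .sort((a, b) => b.presence[userId] - a.presence[userId]);
}

console.log(chatsForUser(chats, "userID1").map((c) => c._id));
```

The open question is how to get the same filter and ordering from an indexed database query instead of application code.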
Please let me know if this is unclear, thanks in advance.
I'm playing around with MongoDB and was wondering what best practices are for how a SQL-ish schema may correspond to MongoDB. Here are the tables/data I have so far:
user
  id
  email
  name
answer
  user_id (FK user.id)
  tag
  upvotes
repo
  id
  owner
  name
  description
  stars
repo_tag
  repo_id (FK to repo.id)
  tag
  is_language
  percentage
repo_contrib
  repo_id (FK to repo.id)
  user_id (FK to user.id)
  lines_of_code
The structure goes something like this:
user
  answer (left outer)
  repo_contrib (left outer)
    repo
      repo_tag
Note: every user will have at least one answer or one repo, but not necessarily both.
How might I put this into a mongo schema? Would this be one 'collection'? Or would this be two collections, one for user and one for repo, or more?
My queries will be something like: "Grab all users with an Answer with tag [Python] with more than 2 upvotes or a repo with the [Python] tag with more than two stars."
Let me divide this into a couple of steps:
STEP 1 - MONGODB and MONGOOSE
MongoDB is a document based database. Each record in a collection is a document, and every document should be self-contained (it should contain all information that you need inside it).
Since MongoDB is a non-relational database, you cannot create relations between collections, but you can store a reference to a document from one collection as a property of a document in another collection. To help you manage all of this, there is a great package called Mongoose, which lets you create a Model for each Collection. After you define Models, Mongoose lets you easily query the database.
STEP 2 - DEFINING MODELS
As we said, documents should be self-contained, so they should have all information that you need inside them. We can have 3 approaches based on your example:
APPROACH 1:
Create one collection for each table that you have in your relational database. This is the best practice when you have documents with a lot of data, because it is scalable.
APPROACH 2:
Create 3 Collections - USERS, ANSWERS and REPOS. Because repo_contrib does not have a lot of data, you can store all user's contributions in a USERS document. That way, when you fetch a User document, you will have everything that you need in one place. The same goes for repo_tag - we can store all repo's tags in a REPOS document.
APPROACH 3:
Create 2 Collections - USERS and REPOS. The same as APPROACH 2, but you can also add all user's answers to the USERS document.
RECOMMENDATION:
I would go with APPROACH 2 in this case, since repo_contrib and repo_tag do not hold much data and can easily be stored in USERS and REPOS documents. This approach also makes querying the database a lot easier. I didn't choose APPROACH 3 because a user can theoretically have thousands or tens of thousands of answers, and embedding them would not scale well.
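To make APPROACH 2 concrete, here is a minimal sketch of what one document in each of the three collections could look like, written as plain JavaScript objects. All ids and values are made up for illustration, and plain strings stand in for the ObjectId values MongoDB would generate.

```javascript
// Hypothetical sample documents for APPROACH 2; every id and value is invented.
const user = {
  _id: "u1",
  email: "jane@example.com",
  name: "Jane",
  // repo_contrib rows are embedded in the USERS document
  contributions: [{ repo_id: "r1", lines_of_code: 1200 }],
};

const repo = {
  _id: "r1",
  owner: "u1",
  name: "demo",
  description: "Example repository",
  stars: 5,
  // repo_tag rows are embedded in the REPOS document
  tags: [{ name: "Python", is_language: true, percentage: 80 }],
};

// Answers stay in their own collection and point back to the user
const answer = { _id: "a1", user_id: "u1", tag: "Python", upvotes: 3 };

// Reading one user document is enough to work with all of that user's contributions
const totalLines = user.contributions.reduce((sum, c) => sum + c.lines_of_code, 0);
console.log(totalLines);
```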
STEP 3 - IMPLEMENTATION
NOTE: MongoDB will automatically assign _id to each document, so you don't have to define id property when implementing Models.
Tables from your relational database example can be mapped to collections like this (This implementation is for APPROACH 2):
USERS Collection:
const mongoose = require('mongoose');
const Schema = mongoose.Schema;

const schema = new Schema({
  email: { type: String, required: true, unique: true },
  name: { type: String, required: true, unique: false },
  // repo_contrib rows embedded in the user document
  contributions: [{
    repo_id: { type: mongoose.Schema.Types.ObjectId, ref: 'REPOS' },
    // plain number: Mongoose has no "Numeric" type, and "ref" only applies to references
    lines_of_code: { type: Number }
  }]
});

const Users = mongoose.model('USERS', schema);
module.exports = Users;
ANSWERS Collection:
const mongoose = require('mongoose');
const Schema = mongoose.Schema;

const schema = new Schema({
  user_id: { type: mongoose.Schema.Types.ObjectId, ref: 'USERS', required: true },
  tag: { type: String, required: true, unique: false },
  upvotes: { type: Number, default: 0, unique: false }
});

const Answers = mongoose.model('ANSWERS', schema);
module.exports = Answers;
REPOS Collection:
const mongoose = require('mongoose');
const Schema = mongoose.Schema;

const schema = new Schema({
  owner: { type: mongoose.Schema.Types.ObjectId, ref: 'USERS', required: true },
  name: { type: String, required: true, unique: false },
  description: { type: String, required: false, unique: false },
  stars: { type: Number, default: 0, unique: false },
  // repo_tag rows embedded in the repo document
  tags: [{
    name: { type: String, required: true, unique: false },
    is_language: { type: Boolean, required: true, unique: false },
    percentage: { type: Number, default: 0, unique: false }
  }]
});

const Repos = mongoose.model('REPOS', schema);
module.exports = Repos;
STEP 4 - POPULATION AND DATABASE QUERIES
One of the best features of Mongoose is called population. If you store a reference to a document from one collection as a property of a document in another collection, Mongoose can replace those references with the actual documents when you query the database.
Example 1:
Let us first take as an example the first query that you suggested: Find all users with an Answer with tag [Python] with more than 2 upvotes. Since we stored user_id in ANSWERS Collection as a reference to the document from the USERS collection, that means that we can just query the ANSWERS Collection, and when returning the final result Mongoose will go to USERS collection and replace the references with the actual User documents. The database query that will perform this looks like this:
const ANSWERS = require('../models/answers');

ANSWERS.find({
  "tag": "Python",
  "upvotes": {
    "$gt": 2
  }
}).populate('user_id');
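To show conceptually what the filter plus population produce, here is a minimal in-memory sketch in plain JavaScript. It does not use Mongoose (which performs population with an additional query against the referenced collection), and the sample users and answers are made up.

```javascript
// Hypothetical in-memory stand-ins for the USERS and ANSWERS collections.
const users = [
  { _id: "u1", name: "Jane" },
  { _id: "u2", name: "Omar" },
];
const answers = [
  { user_id: "u1", tag: "Python", upvotes: 5 },
  { user_id: "u2", tag: "Python", upvotes: 1 },
  { user_id: "u2", tag: "Go", upvotes: 9 },
];

// Filter step: the same conditions as the find() filter
const matching = answers.filter((a) => a.tag === "Python" && a.upvotes > 2);

// Population step: replace each user_id reference with the referenced user document
const usersById = new Map(users.map((u) => [u._id, u]));
const populated = matching.map((a) => ({ ...a, user_id: usersById.get(a.user_id) }));

console.log(populated);
```

After population, user_id no longer holds an id but the whole user document, so the user's name and email are available directly on each result.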
Example 2:
The second query that you suggested is: Find all repos with the [Python] tag with more than two stars. Since we are storing all of a repo's tags in one array, we just need to check that the array contains an item whose name field equals Python and that the repo's stars field is greater than 2. The database query that will perform this looks like this:
const REPOS = require('../models/repos');

REPOS.find({
  "tags.name": "Python",
  "stars": {
    "$gt": 2
  }
})
Here is also the working example: https://mongoplayground.net/p/rgBtVVDgPzG
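The two examples return answers and repos, while the original question asks for users. A minimal sketch, assuming the APPROACH 2 layout and made-up in-memory data, of one possible way to combine both results into a single list of user ids. With Mongoose, the three filters below would be database queries; here they are plain array operations.

```javascript
// Hypothetical in-memory data following APPROACH 2; all ids are invented.
const users = [
  { _id: "u1", contributions: [{ repo_id: "r1", lines_of_code: 10 }] },
  { _id: "u2", contributions: [] },
  { _id: "u3", contributions: [{ repo_id: "r2", lines_of_code: 7 }] },
];
const answers = [
  { user_id: "u2", tag: "Python", upvotes: 4 },
  { user_id: "u3", tag: "Python", upvotes: 1 },
];
const repos = [
  { _id: "r1", stars: 9, tags: [{ name: "Python" }] },
  { _id: "r2", stars: 1, tags: [{ name: "Python" }] },
];

// Users with a Python answer that has more than 2 upvotes
const fromAnswers = answers
  .filter((a) => a.tag === "Python" && a.upvotes > 2)
  .map((a) => a.user_id);

// Repos tagged Python with more than 2 stars
const repoIds = new Set(
  repos
    .filter((r) => r.stars > 2 && r.tags.some((t) => t.name === "Python"))
    .map((r) => r._id)
);

// Users who contributed to one of those repos
const fromRepos = users
  .filter((u) => u.contributions.some((c) => repoIds.has(c.repo_id)))
  .map((u) => u._id);

// Union of both sets, without duplicates
const matchingUserIds = [...new Set([...fromAnswers, ...fromRepos])].sort();
console.log(matchingUserIds);
```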
Designing a database model is usually complex, and I guess you are looking for best practices from a reputable source. I think this is the point the other answers miss, even if @NenadMilosavljevic got close to it.
Brief introduction on NoSQL modeling
You are probably used to modeling SQL databases; NoSQL modeling is totally different. These are some of the differences:
SQL modeling:
This type of modeling is "data-oriented" in the sense that it is designed to be used and shared by many applications. Data is normalized, generally using normal forms, to avoid duplication and to make future changes easier and with the lowest downtime possible.
You start from requirements analysis, then the conceptual design, and in the end the physical design.
NoSQL modeling:
NoSQL modeling is "application-oriented" because it should be built from the requirements of a single application, in order to reach the maximum level of optimization.
If you want to optimize your application, you need to start from the app itself and from the operations needed: this is the so-called workload. After that come the conceptual and physical design, of course.
I want to focus on the workload a little more because it is very important. Since you come from a SQL-based application, you can describe the workload starting from various scenarios, production logs and statistics. For each query you need, these parameters are essential:
Size of data requested
Frequency of the query
The complexity of the operations involved
Returning to the original question: "My queries will be something like..." is not enough for me to help you build a NoSQL model. There are many solutions to your problem but, unless you provide a lot more information on the queries you need to run, they are all equally valid.
@NenadMilosavljevic gave you multiple approaches, but I can't say whether the second one is the right one, for the reasons given above. For example, he suggests keeping a user and the user's contributions together so that a single query retrieves them, instead of doing JOINs or something even more expensive.
This is certainly clever, but suppose (and it is probably not your case) that you have to update the user contributions very often: then keeping them in a separate collection might be better.
What I am saying is that too many assumptions are missing, and the solution we give you could be good but not optimal.
Honestly, it is not clear to me whether you need a trivial conversion from a SQL model to a NoSQL model, or whether you are trying to apply NoSQL principles. I have no idea about the size of your database, but if performance is not a problem, just go with the solution you find most appropriate. Researching how to better model your data would be a waste of time.
Instead, if you really need to design a NoSQL database, not a SQL-like NoSQL database, then my advice is to follow this course. You could finish it in less than 5 hours, and many lessons are unnecessary in your situation, but it is worth taking a look. No one here has talked about patterns or how to deal with one-to-zillion relationships, for instance. It is very important to know they exist, unless you want to redesign your database when it is too late.
My suggestion would be 3 collections: user, repo and answer. Below are the schemas for reference.
user collection
  id: String
  email: String
  name: String
repo collection
  id: String
  owner: String
  name: String
  description: String
  tag: [String] // Array of strings
  contributors: [String] // Array of user ids
I suggest having another collection named answer, because a user could provide a lot of answers. Keeping them in a separate collection will be easier to query compared to having them as subdocuments of the user collection.
answer collection
  answer_id
  user_id
  tag
  upvotes
I hope it is helpful.
Mongo Schema design 101: https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1.
If in SQL you think about data in an object-oriented way (your models represent business entities and you build functions around them), in Mongo you should think in a functional way: what data you have as input and what you need as output. In other words, your schema should be based on the queries you need to run, not on the data you have.
It is a bit more tricky, as there is no single best way: any schema will be better for some queries than for others. You will need to choose which queries to prioritize. To make it even more interesting, you will need to anticipate which queries you may have in the future.
And of course it's all about data size. If it fits on a single server, you have the luxury of aggregation lookups to "join" collections. Otherwise, sharding will significantly restrict your choices.
On the other hand, embedding should be used with care: document size cannot exceed 16MB, and modifying embedded documents is not that straightforward.
Last but not least are indexes. Your schema should allow efficient indexes for your queries. Here you will need to consider not only data size but also its quality (selectivity/cardinality).
In light of the foregoing, the best schema to "grab all users with an Answer with tag [Python] with more than 2 upvotes or a repo with the [Python] tag with more than two stars" will be 2 collections:
user:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "id": {
      "bsonType": "objectId"
    },
    "email": {
      "bsonType": "string"
    },
    "name": {
      "bsonType": "string"
    },
    "answers": {
      "bsonType": "array",
      "items": [
        {
          "bsonType": "object",
          "properties": {
            "tag": {
              "bsonType": "string"
            },
            "upvotes": {
              "bsonType": "int"
            }
          },
          "required": [
            "tag",
            "upvotes"
          ]
        }
      ]
    }
  },
  "required": [
    "id",
    "email",
    "name",
    "answers"
  ]
}
repo:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "id": {
      "bsonType": "objectId"
    },
    "owner": {
      "bsonType": "string"
    },
    "name": {
      "bsonType": "string"
    },
    "description": {
      "bsonType": "string"
    },
    "stars": {
      "bsonType": "int"
    },
    "tags": {
      "bsonType": "array",
      "items": [
        {
          "bsonType": "object",
          "properties": {
            "tag": {
              "bsonType": "string"
            },
            "is_language": {
              "bsonType": "bool"
            },
            "percentage": {
              "bsonType": "double"
            }
          },
          "required": [
            "tag",
            "is_language",
            "percentage"
          ]
        }
      ]
    },
    "contributors": {
      "bsonType": "array",
      "items": [
        {
          "bsonType": "object",
          "properties": {
            "user_id": {
              "bsonType": "objectId"
            },
            "lines_of_code": {
              "bsonType": "int"
            }
          },
          "required": [
            "user_id",
            "lines_of_code"
          ]
        }
      ]
    }
  },
  "required": [
    "id",
    "owner",
    "name",
    "description",
    "stars"
  ]
}
with the queries:
db.user.find({answers: {$elemMatch:{tag:"Python", upvotes:{$gt:2}}}})
db.repo.find({"tags.tag":"Python", stars:{$gt:2}})
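The first query uses $elemMatch because both conditions must hold for the same answer. A minimal in-memory sketch in plain JavaScript of the difference between $elemMatch and two separate dot-notation conditions; the sample user is made up, and the array methods only imitate the matching rules, they are not the MongoDB API.

```javascript
// Hypothetical user: the Python answer has few upvotes, the Go answer has many.
const user = {
  name: "Jane",
  answers: [
    { tag: "Python", upvotes: 1 },
    { tag: "Go", upvotes: 10 },
  ],
};

// Like {answers: {$elemMatch: {tag: "Python", upvotes: {$gt: 2}}}}:
// one single array element has to satisfy both conditions.
const elemMatch = user.answers.some((a) => a.tag === "Python" && a.upvotes > 2);

// Like {"answers.tag": "Python", "answers.upvotes": {$gt: 2}}:
// each condition may be satisfied by a different array element.
const dotNotation =
  user.answers.some((a) => a.tag === "Python") &&
  user.answers.some((a) => a.upvotes > 2);

console.log(elemMatch, dotNotation);
```

The repo query does not need $elemMatch because only one condition (the tag) applies to the array; stars is a top-level field.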
In the comments you mentioned something like "to get all the repos for a given user". I assume this is about contributors; otherwise you don't need this array at all.
The query will be:
db.repo.find({"contributors.user_id": ObjectId("12313212313232")})
NoSQL newbie here..
I have Employee documents and every Employee has a name and has one to many tags. Here is a possible representation of an employee object in JSON format:
{
  "name": "John Doe",
  "tags": ["blue", "red", "green"]
}
I want to be able to query Employee instances in Cosmos DB by their tags. For example, I want to find an Employee where tags contains 'green'. An Employee will not have too many tags, maybe up to 10 or 15 at most.
What is the best way to model the document structure for this use case? The Cosmos DB documentation here suggests a structure akin to the following, for a reason I do not understand:
{
  "name": "John Doe",
  "tags": [
    {
      "name": "blue"
    },
    {
      "name": "red"
    }
  ]
}
Is there any reason to split a String array into child JSON objects like this?
How to model documents depends entirely on your requirements; there is no strict rule for that.
For your doc structure, I ran some tests on my side. These are all my test docs, 4 in total:
I can use the query below to find all employees that have the "green" tag:
SELECT c.name,c.tags FROM c where ARRAY_CONTAINS(c.tags, "green")
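For comparison, here is a minimal in-memory sketch in plain JavaScript of the same membership check against both document shapes from the question. It only imitates the lookup; the helper names hasTagFlat and hasTagNested are made up and are not part of Cosmos DB.

```javascript
// The two shapes from the question: a string array and an array of objects.
const flat = { name: "John Doe", tags: ["blue", "red", "green"] };
const nested = { name: "John Doe", tags: [{ name: "blue" }, { name: "red" }] };

// String array: a direct membership test on the values
const hasTagFlat = (doc, tag) => doc.tags.includes(tag);

// Object array: the test has to look inside each element
const hasTagNested = (doc, tag) => doc.tags.some((t) => t.name === tag);

console.log(hasTagFlat(flat, "green"), hasTagNested(nested, "green"));
```

With a plain string array the check stays a simple membership test, which is all this use case needs; the object form mainly pays off when each tag has to carry extra properties.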
I am storing my contacts in MongoDB like this, but the main drawback of this schema is that I cannot store 40k-50k contacts in one document because of the 16MB limit.
I want to change my schema now. Can anyone please suggest the best way to redesign it?
Here is my sample document:
{
  "_id": ObjectId("5c53653451154c6da4623a77"),
  "contacts": [{
    name: "",
    email: "",
    group: [ObjectId("5c53653451154c6da4623a79")]
  }],
  "groups": [{
    _id: ObjectId("5c53653451154c6da4623a79"),
    group_name: "test"
  }]
}
According to your document sample, contacts belong to a group.
In that scenario, there are different ways to end up with a better schema:
1- Document embedding:
You will have an array of contacts inside each group document.
collection groups:
{
  "_id": ObjectId("5c53653451154c6da4623a79"),
  "group_name": "test",
  "contacts": [
    {
      "name": "something",
      "email": "something"
    },
    {
      "name": "something else",
      "email": "something else"
    }
  ]
}
2- Document referencing:
You will have two collections - contacts and groups - and store a group reference inside each contact.
collection contacts:
{
  "_id": ObjectId("5c53653451154c6da4623a77"),
  "name": "something",
  "email": "something",
  "groups": ["5c53653451154c6da4623a79"]
},
{
  "_id": ObjectId("5c536s7df9sd7f987d9s7d98"),
  "name": "something else",
  "email": "something else",
  "groups": ["5c53653451154c6da4623a79"]
}
collection groups:
{
  "_id": ObjectId("5c53653451154c6da4623a79"),
  "group_name": "test"
}
Why are we referencing the group inside the contact and not the other way around? Because we will probably have more contacts than groups. This way we have smaller documents with smaller "reference arrays".
The path you follow depends a lot on how many contacts you have per group. If this number is small, I would take the Document Embedding approach, for the sake of simplicity and ease of access. If you have a large number of contacts per group, I would use Document Referencing, to keep documents small.
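As an illustration of the Document Referencing option, here is a minimal sketch in plain JavaScript that splits one document in the original layout into separate contact and group documents. The function name splitContacts, the sample data and the shortened string ids are all made up; a real migration would read from and write to the database.

```javascript
// Hypothetical document in the original single-document layout.
const original = {
  _id: "owner1",
  contacts: [
    { name: "Ann", email: "ann@example.com", group: ["g1"] },
    { name: "Bob", email: "bob@example.com", group: ["g1"] },
  ],
  groups: [{ _id: "g1", group_name: "test" }],
};

// Produce one document per contact (each keeping its group references)
// and one document per group.
function splitContacts(doc) {
  const contacts = doc.contacts.map((c) => ({
    name: c.name,
    email: c.email,
    groups: [...c.group],
  }));
  const groups = doc.groups.map((g) => ({ _id: g._id, group_name: g.group_name }));
  return { contacts, groups };
}

const { contacts, groups } = splitContacts(original);

// Finding the contacts of a group becomes a filter on the contacts collection
const inTestGroup = contacts.filter((c) => c.groups.includes("g1")).map((c) => c.name);
console.log(inTestGroup);
```

Each contact is now its own small document, so the number of contacts is no longer bounded by the size limit of a single document.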
I'm using MongoDB for my senior project, which allows donors to sponsor children for an organization similar to World Vision, and I'm wondering about the feasibility of reusing the ObjectID datatype as a unique id within a JSON array.
example donor document:
{
  _id: ObjectID(x), // this document's _id
  name: "Josh",
  last_name: "Richard",
  address: "2 Happy Lane",
  city: "New York",
  state: "New York",
  credit_card: 999999999999999,
  cvv: 999,
  exp: "3/18",
  transactions: [
    {
      _id: ObjectID(x), // this is what I'm asking about
      child_ids: [
        ObjectID(x), // this is the _id of a doc in another collection
        ObjectID(x) // this is also the _id of a doc in another collection
      ]
    },
    {
      _id: ObjectID(x),
      child_ids: [
        ObjectID(x) // this is the _id of a doc in another collection
      ]
    }
  ]
}
The transactions._id is what I'm looking for feedback on. I need it to be unique so that I can differentiate orders from each other, but I'd rather not create a new collection for storing these as documents, and instead keep it all in the donor doc. Any thoughts?
EDIT: What I'm not curious about is the technical uniqueness of an ObjectID. What I'm looking to find out is whether it's realistic to "repurpose" the ObjectID as a unique key to use later in application logic. I need some way to uniquely identify objects in an array across all instances of that type of array in every document, i.e., every transaction needs to have its own unique identifier, and those transactions are stored in arrays in many different documents within the same collection.
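To illustrate the kind of application logic I have in mind, here is a plain JavaScript sketch over in-memory donor documents. The helper newId is only a stand-in that produces 24 hex characters, the same length as an ObjectID string but not the same internal structure; in the real application the ids would be ObjectIDs. The function name findTransaction and the sample donors are made up.

```javascript
const crypto = require("crypto");

// Stand-in for an ObjectID: 12 random bytes rendered as 24 hex characters.
// Assumption: the real application would generate ObjectIDs instead.
function newId() {
  return crypto.randomBytes(12).toString("hex");
}

// Hypothetical donor documents, each with embedded transactions
const donors = [
  { name: "Josh", transactions: [{ _id: newId(), child_ids: [newId(), newId()] }] },
  { name: "Ana", transactions: [{ _id: newId(), child_ids: [newId()] }] },
];

// Locate a transaction by its _id across the arrays of all donor documents
function findTransaction(docs, id) {
  for (const donor of docs) {
    const tx = donor.transactions.find((t) => t._id === id);
    if (tx) return { donor: donor.name, tx };
  }
  return null;
}

const wanted = donors[1].transactions[0]._id;
console.log(findTransaction(donors, wanted).donor);
```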