How to handle large data sets in MongoDB - mongodb

I need help in deciding which schema type is more appropriate for my mongodb collection.
Let's say I want to store a list of things a person has. There will be a relatively small number of people, but one person can have very many things. Assume people are counted in the hundreds, but the things a person owns in the hundreds of thousands.
I can think of two options:
Option 1:
[{
id: 1,
name: "Tom",
things: [
{
name: 'red tie',
weight: 0.3,
value: 5
},
{
name: 'carpet',
weight: 15,
value: 700
} //... and 300'000 other things
]
},
{
id: 2,
name: "Rob",
things: [
{
name: 'can of olives',
weight: 0.4,
value: 2
},
{
name: 'Porsche',
weight: 1500,
value: 40000
}// and 170'000 other things
]
} // and 214 other people
]
Option 2:
[
{
name: 'red tie',
weight: 0.3,
value: 5,
owner: {
name: 'Tom',
id: 1
}
},
{
name: 'carpet',
weight: 15,
value: 700,
owner: {
name: 'Tom',
id: 1
}
},
{
name: 'can of olives',
weight: 0.4,
value: 2,
owner: {
name: 'Rob',
id: 2
}
},
{
name: 'Porsche',
weight: 1500,
value: 40000,
owner: {
name: 'Rob',
id: 2
}
}// and 20'000'000 other things
];
1. I will only ask for things from one owner in a single request and never ask for things from multiple owners.
2. I will need pagination for the returned list of things, so...
3. ...things will need to be sorted by one of the parameters.
From what I understand, the first point suggests it would be much more efficient to use Option 1 (querying only a few hundred documents instead of millions), but points 2 and 3 are handled much more easily with Option 2 (the limit, skip and sort methods instead of a $slice projection and the Aggregation Framework).
Can anybody tell me which way would be more suitable? Or maybe I've got something wrong and there's an even better solution?

1. I will only ask for things from one owner in a single request and never ask for things from multiple owners.
2. I will need pagination for the returned list of things, so...
3. ...things will need to be sorted by one of the parameters.
Your requirements 2 and 3 would be fulfilled much better by creating a collection where each item is an individual document. With an array, you would have to use the aggregation framework to $unwind that array, which can become quite slow. Your first requirement can easily be optimized for by creating an index on the owner.name or owner.id field of said collection, depending on which you use for querying.
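As a rough illustration of the one-document-per-thing layout, here is a minimal sketch in plain JavaScript that mimics the filter, sort, skip and limit steps on an in-memory array. The helper name pageOfThings, the collection name things and the compound index in the comment are assumptions made for the example, not something from the question.

```javascript
// In-memory stand-in for a "things" collection using the Option 2 layout.
const things = [
  { name: "red tie", weight: 0.3, value: 5, owner: { name: "Tom", id: 1 } },
  { name: "carpet", weight: 15, value: 700, owner: { name: "Tom", id: 1 } },
  { name: "can of olives", weight: 0.4, value: 2, owner: { name: "Rob", id: 2 } },
  { name: "Porsche", weight: 1500, value: 40000, owner: { name: "Rob", id: 2 } },
];

// Hypothetical helper that mirrors what the shell query
//   db.things.find({ "owner.id": ownerId }).sort({ [sortField]: 1 })
//            .skip(page * pageSize).limit(pageSize)
// would return. An index such as
//   db.things.createIndex({ "owner.id": 1, value: 1 })
// (an assumed choice) lets the server filter and sort from the index.
// The comparator below only handles numeric sort fields.
function pageOfThings(docs, ownerId, sortField, page, pageSize) {
  return docs
    .filter((doc) => doc.owner.id === ownerId) // one owner per request
    .sort((a, b) => a[sortField] - b[sortField]) // ascending order
    .slice(page * pageSize, (page + 1) * pageSize); // skip + limit
}

const firstPage = pageOfThings(things, 1, "value", 0, 1);
const secondPage = pageOfThings(things, 1, "value", 1, 1);
console.log(firstPage.map((t) => t.name), secondPage.map((t) => t.name));
```

Each page only ever touches documents of one owner, which is why the index on the owner field matters more than the total size of the collection.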
Also, MongoDB does not handle growing documents very well. To discourage users from creating indefinitely growing documents, MongoDB has a 16 MB per-document limit. When each of your items is a few hundred bytes, hundreds of thousands of array entries would exceed that limit.
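To make the size argument concrete, here is a back-of-the-envelope check. The figure of 200 bytes per item is an assumption standing in for "a few hundred bytes", and the estimate ignores BSON overhead, so the real document would be larger still.

```javascript
// MongoDB's maximum BSON document size: 16 MiB.
const MAX_BSON_BYTES = 16 * 1024 * 1024; // 16777216

// Rough lower bound for an embedded array: entries * bytes per entry.
// Ignores BSON overhead for array keys and the rest of the document.
function estimateArrayBytes(entryCount, bytesPerEntry) {
  return entryCount * bytesPerEntry;
}

// Assumed figures: 300,000 things (Tom in Option 1) at 200 bytes each.
const estimate = estimateArrayBytes(300000, 200);
console.log(estimate, estimate > MAX_BSON_BYTES); // 60000000 true
```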

Related

Advice on structuring data for scalability with large number of nested objects

Looking for advice on how best to structure data in MongoDB, particularly for scalability - worried about having an array of potentially thousands of objects within each user object.
I am building a language learning app with a built in flashcard system. I want users to 'unlock' new vocabulary for each level, which automatically gets added to their flashcards, so when you unlock level 4, all the vocabulary attached to level 4 gets added to your flashcards.
For the flashcards themselves, I want a changeable 'due date', so that you get prompted to do certain cards at a certain date - if you're familiar with spaced repetition, that's the plan. So when you get a card, you can say how well you know it and, for example, if you know it well you won't get it for another week, but if you get it wrong you'll get it again the next day.
I'm using MongoDB for the backend, but am a little unsure about how best to structure my data. Currently, I have two objects: one for the cards, and one for the users.
The cards object looks like this, so there's a nested object for each flashcard, with a unique ID, the level the word appears in, and then the word in both languages.
const CardsList = [
{
id: 1,
level: 1,
gd: "sgìth",
en: "tired",
},
{
id: 2,
level: 2,
gd: "ceist",
en: "question",
},
];
Then each user has an object like the below, with various user data, and a nested array of objects for the cards - with the id of every card they've unlocked, and the date at which that card is next due.
{
id: 1,
name: "gordon",
level: 2,
cards: [
{ id: 1, date: "07/12/2021" },
{ id: 2, date: "09/12/2021" },
],
},
{
id: 2,
name: "mike",
level: 1,
cards: [
{ id: 1, date: "08/12/2021" },
{ id: 2, date: "07/12/2021" },
],
},
This works fine, but I'm a bit concerned about the scalability of it.
The plan is to have about two or three thousand words in total, so if I had, say, fifty users complete the app, that would mean fifty user objects, each with as many as three thousand objects in that nested cards array.
Is that going to be a problem? Would it be a problem if I had a thousand (or more) users, instead of 50? Is there a more sensible way of structuring the data that I'm not spotting?
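For reference, the unlocking and due-date behaviour described above could be sketched roughly like this against the embedded cards array. The helpers unlockLevel and dueCards are hypothetical, and the dates use ISO "YYYY-MM-DD" strings (an assumption, not the format in the sample data) so that they compare correctly as plain strings.

```javascript
const cardsList = [
  { id: 1, level: 1, gd: "sgìth", en: "tired" },
  { id: 2, level: 2, gd: "ceist", en: "question" },
];

// Hypothetical helper: returns a copy of the user with every card of `level`
// added to the flashcards, due on `dueDate`. Cards already owned are kept as-is.
function unlockLevel(user, cards, level, dueDate) {
  const owned = new Set(user.cards.map((c) => c.id));
  const unlocked = cards
    .filter((card) => card.level === level && !owned.has(card.id))
    .map((card) => ({ id: card.id, date: dueDate }));
  return { ...user, level, cards: [...user.cards, ...unlocked] };
}

// Hypothetical helper: ids of the cards due on or before `today`.
function dueCards(user, today) {
  return user.cards.filter((c) => c.date <= today).map((c) => c.id);
}

const mike = { id: 2, name: "mike", level: 1, cards: [{ id: 1, date: "2021-12-08" }] };
const mikeLevel2 = unlockLevel(mike, cardsList, 2, "2021-12-09");
console.log(mikeLevel2.cards, dueCards(mikeLevel2, "2021-12-08"));
```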

Atomic consistency

My lecturer in the database course I'm taking said an advantage of NoSQL databases is that they "support atomic consistency of a single aggregate". I have no idea what this means, can someone please explain it to me?
It means that by using aggregates you can keep a failed transaction from leaving inconsistent data in your database.
In Domain-Driven Design, an aggregate is a collection of related objects that are treated as a unit.
For example, let's say you have a restaurant and you want to save the orders of each customer.
You could save your data with two aggregates like below:
var customerIdGenerated = newGuid();
var customer = { id: customerIdGenerated , name: 'Mateus Forgiarini'};
var orders = {
id: 1,
customerId: customerIdGenerated ,
orderedFoods: [{
name: 'Sushi',
price: 50
},
{
name: 'Tacos',
price: 12
}]
};
Or you could treat orders and customers as a single aggregate:
var customerIdGenerated = newGuid();
var customerAndOrders = {
customerId: customerIdGenerated ,
name: 'Mateus Forgiarini',
orderId: 1,
orderedFoods: [{
name: 'Sushi',
price: 50
},
{
name: 'Tacos',
price: 12
}]
};
By setting your orders and customer as a single aggregate you avoid a transaction error. In the NoSQL world, a transaction error can occur when you have to write related data to many nodes (a node is where you store your data; NoSQL databases that run on clusters can have many nodes).
So if you treat orders and customers as two aggregates, an error can occur while you are saving the customer while your orders are still saved, leaving you with inconsistent data: orders with no customer.
However, with a single aggregate you avoid that: if an error occurs you won't have inconsistent data, because the related data is saved together.
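The difference can be shown with a toy in-memory store. This is only a sketch of the idea, not how any particular database works: makeStore and its failOn parameter are made up to simulate a write that fails.

```javascript
// Toy in-memory "database": each write stores one whole document, or throws.
// `failOn` is a made-up knob used only to simulate a failed write.
function makeStore(failOn) {
  const docs = [];
  return {
    docs,
    write(doc) {
      if (doc.kind === failOn) throw new Error("write failed: " + doc.kind);
      docs.push(doc);
    },
  };
}

// Two aggregates: two independent writes. If the customer write fails,
// the order written before it is left without a customer.
const split = makeStore("customer");
try {
  split.write({ kind: "order", id: 1, customerId: "c1", orderedFoods: ["Sushi", "Tacos"] });
  split.write({ kind: "customer", id: "c1", name: "Mateus Forgiarini" });
} catch (err) {
  // The first write is not rolled back.
}

// One aggregate: a single write, so it is stored completely or not at all.
const single = makeStore("customerAndOrders");
try {
  single.write({ kind: "customerAndOrders", customerId: "c1", name: "Mateus Forgiarini", orderId: 1 });
} catch (err) {
  // Nothing was stored.
}

console.log(split.docs.length, single.docs.length); // 1 0
```

The split store ends up holding an order with no customer, while the single-aggregate store holds either everything or nothing.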

With meteor.js and mongo, please show me the best way to organize a categories collection

I'm coming from the SQL world, so naturally mongo / noSQL has been an adventure.
I'm building a page to add/edit categories, that "posts" will later be assigned to.
What I've basically created is this:
{
_id: "asdf234ljsf",
title: "CategoryOne",
sortorder: 1,
active: true,
children: [
{
title: "ChildOne",
sortorder: 1,
active: true
},
{
title: "ChildTwo",
sortorder: 2,
active: true
}
]
}
So later, when creating a "post", I would assign that post to one or more parent categories, and optionally to one or more child categories within the selected parents. If visitors to the site click on a parent category, it shows all posts within that parent category; if they select a child category, it shows only posts within that child category.
The logic is obvious and simple, but in SQL I would have created tables like this:
table_Category ( CategoryID, Title, Sort, Active )
table_Category_Children ( ChildID, ParentID, Title, Sort, Active )
I've been reading the Discover Meteor book and it mentions that Meteor gives us many tools that work a lot better when operating at the collection level, as well as how the DDP operates at the top level of a document, meaning if something small changed down in a sub collection or array, potentially unneeded data will be sent back to all connected/subscribed clients.
So, this makes me think I should be organizing the categories like this:
Collection for parent categories
{
_id: "someid",
title: "CategoryOne",
sortorder: 1,
active: true
},
{
_id: "someid",
title: "CategoryTwo",
sortorder: 1,
active: true
}
Collection for Child Categories
{
_id: "someid",
parent: "idofparent",
title: "ChildOne",
sortorder: 1,
active: true
},
{
_id: "someid",
parent: "idofparent",
title: "ChildTwo",
sortorder: 1,
active: true
}
Or, perhaps it's better like this:
Collection for parent categories
{
_id: "someid",
title: "CategoryOne",
sortorder: 1,
active: true,
children: [ { id: "childid" }, ... ]
}
I think understanding a best practice/method for Meteor and Mongo in this scenario will help me greatly across the board.
In conclusion: I have an admin page where I add/edit these categories. When clients create a post, they'll select the parent and child categories suitable for it, so I want to make sure I organize this properly from the beginning. Changing my thinking from a traditional RDBMS to NoSQL is a big jump.
Thank you!
MongoDB stores all data in documents. This is a fundamental difference from relational databases such as SQL databases.
Imagine you have 100 parent categories and 1000 child categories: once you update a parent category, it will affect the "idofparent" of every linked child category, in a reactive way. In short, it's not sustainable.
Try to think of a way to avoid the equivalent of a SQL JOIN in MongoDB.
Restructure your data, perhaps similar to this:
One big collection for all categories:
{
_id: id,
title: title,
sortorder: 1,
active: 1,
class: "parent > child" // make this as a field
...
}
// class can be "parent1", "parent2", "parent1 > child1" ... you get the idea
so each stored document is completely self-contained.
Or, if you absolutely need a JOIN-style relational data structure, I don't think MongoDB is the right choice for you.
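A minimal sketch of how the single-collection layout with a class path field could be read, using an in-memory array in place of the collection. The _id values and the helpers childrenOf and topLevel are made up for the example; in MongoDB the prefix test would roughly correspond to an anchored regular expression such as { class: /^CategoryOne > / }.

```javascript
// One flat list of categories; `class` holds the "parent > child" path.
const categories = [
  { _id: "a", title: "CategoryOne", sortorder: 1, active: 1, class: "CategoryOne" },
  { _id: "b", title: "ChildTwo", sortorder: 2, active: 1, class: "CategoryOne > ChildTwo" },
  { _id: "c", title: "ChildOne", sortorder: 1, active: 1, class: "CategoryOne > ChildOne" },
  { _id: "d", title: "CategoryTwo", sortorder: 2, active: 1, class: "CategoryTwo" },
];

// Hypothetical helper: all categories below `parentClass`, ordered by sortorder.
function childrenOf(docs, parentClass) {
  const prefix = parentClass + " > ";
  return docs
    .filter((doc) => doc.class.startsWith(prefix))
    .sort((a, b) => a.sortorder - b.sortorder);
}

// Hypothetical helper: categories whose path has no parent part.
function topLevel(docs) {
  return docs
    .filter((doc) => !doc.class.includes(" > "))
    .sort((a, b) => a.sortorder - b.sortorder);
}

console.log(topLevel(categories).map((c) => c.title));
console.log(childrenOf(categories, "CategoryOne").map((c) => c.title));
```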

Best way to structure relationships in a no-SQL database?

I'm using MongoDB. I know that MongoDB isn't relational but information sometimes is. So what's the most efficient way to reference these kinds of relationships to lessen database load and maximize query speed?
Example:
* Tinder-style "matches" *
There are many users in a Users collection. They get matched to each other.
So I'm thinking:
Document 1:
{
_id: "d3fg45wr4f343",
firstName: "Bob",
lastName: "Lee",
matches: [
"ferh823u9WURF",
"8Y283DUFH3FI2",
"KJSDH298U2F8",
"shdfy2988U2Ywf"
]
}
Document 2:
{
_id: "ferh823u9WURF",
firstName: "Cindy",
lastName: "Doe",
matches: [
"d3fg45wr4f343"
]
}
Would this work OK if there were, say, 10,000 users and you were on Bob's profile page and you wanted to display the firstName of all of his matches?
Any alternative structures that would work better?
* Online Forum *
I suppose you could have the following collections:
Users
Topics
Users Collection:
{
_id: "d3fg45wr4f343",
userName: "aircon",
avatar: "234232.jpg"
}
{
_id: "23qdf3a3fq3fq3",
userName: "spider",
avatar: "986754.jpg"
}
Topics Collection Version 1
One example document in the Topics Collection:
{
title: "A spider just popped out of the AC",
dateTimeSubmitted: 201408201200,
category: 5,
posts: [
{
message: "I'm going to use a gun.",
dateTimeSubmitted: 201408201200,
author: "d3fg45wr4f343"
},
{
message: "I don't think this would work.",
dateTimeSubmitted: 201408201201,
author: "23qdf3a3fq3fq3"
},
{
message: "It will totally work.",
dateTimeSubmitted: 201408201202,
author: "d3fg45wr4f343"
},
{
message: "ur dumb",
dateTimeSubmitted: 201408201203,
author: "23qdf3a3fq3fq3"
}
]
}
Topics Collection Version 2
One example document in the Topics Collection. The author's avatar and userName are now embedded in the document. I know that:
This is not DRY.
If the author changes their avatar and userName, these changes would need to be updated in the Topics Collection and in all of the post documents that are in it.
BUT it saves the system from querying for all the avatars and userNames via the authors ID every single time this thread is viewed on the client.
{
title: "A spider just popped out of the AC",
dateTimeSubmitted: 201408201200,
category: 5,
posts: [
{
message: "I'm going to use a gun.",
dateTimeSubmitted: 201408201200,
author: "d3fg45wr4f343",
userName: "aircon",
avatar: "234232.jpg"
},
{
message: "I don't think this would work.",
dateTimeSubmitted: 201408201201,
author: "23qdf3a3fq3fq3",
userName: "spider",
avatar: "986754.jpg"
},
{
message: "It will totally work.",
dateTimeSubmitted: 201408201202,
author: "d3fg45wr4f343",
userName: "aircon",
avatar: "234232.jpg"
},
{
message: "ur dumb",
dateTimeSubmitted: 201408201203,
author: "23qdf3a3fq3fq3",
userName: "spider",
avatar: "986754.jpg"
}
]
}
So yeah, I'm not sure which is best...
If the data is really many-to-many, i.e. one user can have many matches and can be matched by many, as in your first example, it is usually best to go with relations.
The main arguments against relations stem from MongoDB not being a relational database, so there are no such things as foreign key constraints or join statements.
The trade-off you have to consider in those many-to-many cases (many being much more than two) is whether to enforce the key constraints yourself or to manage the possible data inconsistencies across the multiple documents (your last example). In most of those cases the relational approach is much more practical than the embedding approach.
Exceptions could be read-often, write-seldom cases. For a (very constructed) example: if the matches in your first example were recalculated once a day or so by wiping all previous matches and calculating a list of new ones, the data inconsistencies you would introduce could be acceptable, and the read time you save by embedding the first names of the matches could be an advantage.
But usually for many-to-many relations it is best to use a relational approach and make use of the array query features such as {_id: {$in: matches}}.
But in the end it all comes down to how many inconsistencies you can live with and how fast you really need to access the data (is it OK for some topics to show the old avatar for a few days if that saves half a second of page load time?).
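To show what the $in lookup amounts to, here is a small in-memory sketch of the referencing approach. The helpers findByIds and matchNames are hypothetical, the second and third users (including the name Dana and their _id values) are assumptions added so the lookup has something to return, and the collection name users in the comment is assumed as well.

```javascript
// In-memory stand-in for the Users collection; matches hold referenced _ids.
const users = [
  { _id: "d3fg45wr4f343", firstName: "Bob", matches: ["ferh823u9WURF", "8Y283DUFH3FI2"] },
  { _id: "ferh823u9WURF", firstName: "Cindy", matches: ["d3fg45wr4f343"] },
  { _id: "8Y283DUFH3FI2", firstName: "Dana", matches: ["d3fg45wr4f343"] },
];

// Plain JavaScript equivalent of db.users.find({ _id: { $in: ids } }).
function findByIds(docs, ids) {
  const wanted = new Set(ids);
  return docs.filter((doc) => wanted.has(doc._id));
}

// First names of all matches of one user: one lookup for the user,
// then one $in-style lookup for every referenced match.
function matchNames(docs, userId) {
  const user = docs.find((doc) => doc._id === userId);
  return user ? findByIds(docs, user.matches).map((m) => m.firstName) : [];
}

console.log(matchNames(users, "d3fg45wr4f343"));
```

So displaying Bob's profile costs two queries regardless of how many users exist, as long as _id is indexed (which it always is in MongoDB).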
Edit
The schema design series on the mongodb blog might be a good read for you: part1, part2 and part3

Is there a MongoDB maximum bson size work around?

The document I am working on is extremely large. It collects user input from an extremely long survey (like survey monkey) and stores the answers in a mongodb database.
I am unsurprisingly getting the following error
Error: Document exceeds maximal allowed bson size of 16777216 bytes
If I cannot change the fields in my document is there anything I can do? Is there some way to compress down the document, by removing white space or something like that?
Edit
Here is the structure of the document
Schema({
id : { type: Number, required: true },
created: { type: Date, default: Date.now },
last_modified: { type: Date, default: Date.now },
data : { type: Schema.Types.Mixed, required: true }
});
An example of the data field:
{
id: 65,
question: {
test: "some questions",
answers: [2,5,6]
}
// there could be thousands of these question objects
}
One thing you can do is build your own MongoDB :-). MongoDB is open source, and the document size limit is a rather arbitrary one meant to encourage better schema design. You can just modify this line and build it for yourself. Be careful with this.
The most straightforward idea is to store each small question in a separate document with a field that references its parent.
Another idea is to limit the number of elements in the parent. Let's say your limit is N elements; then the parent looks like this:
{
_id : ObjectId(),
id : { type: Number, required: true },
created: { type: Date, default: Date.now }, // you can store it only for the first element
last_modified: { type: Date, default: Date.now }, // the same here
data : [{
id: 65,
question: {
test: "some questions",
answers: [2,5,6]
}
}, ... up to N of such things {}
]
}
This way, by adjusting N, you can make sure each document stays within the 16 MB BSON limit. To read the whole survey, you can select
db.coll.find({id: the Id you need}) and then combine the whole survey at the application level. Also do not forget to create an index on id (ensureIndex, or createIndex in newer MongoDB versions).
Try different things, do a benchmark on your data and see what works for you.
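The split-and-recombine idea could look roughly like this at the application level. The helpers splitSurvey and combineSurvey and the part field (used to restore the original order) are assumptions added for the sketch; the answer itself does not specify them.

```javascript
// Hypothetical helper: split one large survey into documents that each
// hold at most `maxPerDoc` question objects and share the same survey id.
function splitSurvey(surveyId, questions, maxPerDoc) {
  const docs = [];
  for (let i = 0; i < questions.length; i += maxPerDoc) {
    docs.push({ id: surveyId, part: docs.length, data: questions.slice(i, i + maxPerDoc) });
  }
  return docs;
}

// Hypothetical helper: rebuild the full question list from the documents
// returned by a query such as db.coll.find({ id: surveyId }).
function combineSurvey(docs) {
  return [...docs].sort((a, b) => a.part - b.part).flatMap((doc) => doc.data);
}

// Five sample question objects in the shape shown in the question.
const questions = Array.from({ length: 5 }, (_, i) => ({
  id: 65 + i,
  question: { test: "some questions", answers: [2, 5, 6] },
}));

const parts = splitSurvey(1, questions, 2);
const restored = combineSurvey(parts);
console.log(parts.length, restored.map((q) => q.id));
```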
You should be using GridFS. It allows you to store large files in chunks. Here's the link: http://docs.mongodb.org/manual/reference/gridfs/