MongoDB schema design

I am using MongoDB as my backend. I have data for movies, music, books and more, which I am storing in one single collection. The compulsory fields for every BSON document are "_id", "name", and "category". The rest of the fields depend on the category to which the entry belongs.
For example, a movie record is stored like this:
{
    "_id": <some_id>,
    "name": <movie_name>,
    "category": "movie",
    "director": <director_name>,
    "actors": <list_of_actors>,
    "genre": <list_of_genre>
}
For music, I have:
{
    "_id": <some_id>,
    "name": <music_name>,
    "category": "music",
    "record_label": <label_name>,
    "length": <length>,
    "lyrics": <lyrics>
}
Now I have 12 different categories for which only _id, name and category are common fields; the rest of the fields are all different for each category. Is my decision to store all data in one single collection fine, or should I make a different collection per category?

A single collection is best if you're searching across categories. Having the single collection might slow performance on inserts, but if you don't have a high write need, that shouldn't matter.

MongoDB allows you to store any field structure in a document, even if every document is different, so that isn't a concern. Because those 3 fields are consistent across documents, you can use them in an index and to drive your queries. This is a good example of where a schemaless database helps, because you can store everything in a single collection.
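For example, a minimal sketch (the collection name "media" is an assumption, not from the question): a compound index on the two fields common to every document supports lookups within a category.
db.media.createIndex({ category: 1, name: 1 })
// Queries that filter on category (and optionally name) can use this index
db.media.find({ category: "movie", name: <movie_name> })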
There is no performance hit for using a single collection in this way. Indeed, there is actually a benefit, because you can shard the collection as a scaling strategy later. Sharding is done at the collection level, so you could shard on the _id field to have documents evenly distributed, use your category field so that certain categories live on certain shards, or even use a combination.
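As a rough sketch of that last option (the database and collection names "mydb" and "media" are assumptions), sharding on category plus _id could look like:
// Requires a sharded cluster; the compound shard key groups by category while still spreading documents
sh.enableSharding("mydb")
sh.shardCollection("mydb.media", { category: 1, _id: 1 })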
One thing to be aware of is future query requirements. If you do need to index the other, category-specific fields, you can use sparse indexes, which means documents without the indexed field won't appear in the index and so won't take up any space in it; a handy optimisation.
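For instance (again assuming the hypothetical "media" collection), a sparse index on a movie-only field would contain only the movie documents:
// Documents lacking a "director" field are simply left out of this index
db.media.createIndex({ director: 1 }, { sparse: true })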
You should also be aware of document growth if you make updates that enlarge documents; that does have a major performance impact.

Related

How to improve terrible MongoDB query performance when aggregating with arrays

I have a data schema consisting of many updates (hundreds of thousands+ per entity) that are assigned to entities. I'm representing this with a single top-level document for each of the entities and an array of updates under each of them. The schema for those top-level documents looks like this:
{
    "entity_id": "uuid",
    "updates": [
        { "timestamp": Date(...), "value": 10 },
        { "timestamp": Date(...), "value": 11 }
    ]
}
I'm trying to create a query that returns the number of entities that have received an update within the past n hours. All updates in the updates array are guaranteed to be sorted by virtue of the manner in which they're updated by my application. I've created the following aggregation to do this:
db.getCollection('updates').aggregate([
    { "$project": { last_update: { "$arrayElemAt": ["$updates", -1] } } },
    { "$replaceRoot": { newRoot: "$last_update" } },
    { "$match": { timestamp: { "$gte": new Date(...) } } },
    { "$count": "count" }
])
For some reason that I don't understand, the query I just pasted takes an absurd amount of time to complete. It exhausts the 15-second timeout on the client I use, as a matter of fact.
From a time complexity point of view, this query looks incredibly cheap (which is part of the reason I designed the schema the way I did). It looks to be linear with respect to the total number of top-level documents in the collection, which are then filtered down, and there are fewer than 10,000 of them.
The confusing part is that it doesn't seem to be the $project step which is expensive. If I run that one alone, the query completes in under 2 seconds. However, just adding the $match step makes it time out and shows large amounts of CPU and IO usage on the server the database is running on. My best guess is that it's doing some operations on the full update array for some reason, which makes no sense since the first step explicitly limits it to only the last element.
Is there any way I can improve the performance of this aggregation? Does having all of the updates in a single array like this somehow cause Mongo to not be able to create optimal queries even if the array access patterns are efficient themselves?
Would it be better to do what I was doing previously and store each update as a top-level document tagged with the id of its parent entity? This is what I was doing previously, but performance was quite bad and I figured I'd try this schema instead in an effort to improve it. So far, the experience has been the opposite of what I was expecting/hoping for.
Use indexing; it will enhance the performance of your query.
https://docs.mongodb.com/manual/indexes/
Use MongoDB Compass to check which fields are used most in your queries, then index them one by one to improve performance.
After that, fetch only the fields you need at the end, using a projection in the aggregation.
I hope this solves your issue, but I would suggest going for indexing first. It's a huge plus when fetching from large amounts of data.
You need to support your query with an index and simplify it as much as possible.
You're querying against the timestamp field of the first element of the updates field, so add an index for that:
db.updates.createIndex({'updates.0.timestamp': 1})
You're just looking for a count, so get that directly:
db.updates.count({'updates.0.timestamp': {$gte: new Date(...)}})

Single Collection vs Multiple collections with same fields in mongodb

I have a collection in MongoDB named Order. It has many fields, some of which are mentioned below:
Order
{
    "id": "1",
    "name": "Hello1",
    "orderType": "Type1",
    "date": "2016-09-23T15:07:38.000Z"
    ...... // 11 more fields
}
In the above collection, I have mentioned only 4 fields, but in the actual collection I have around 15 fields. An order can be of 2 types, Type1 or Type2, as indicated by the orderType field.
Now, a Type2 order has all 15 fields but also an additional 5-7 fields. I currently have a single Order collection, and I was wondering whether I should create 2 different collections, one for each type of order, or keep this one collection and just add the additional fields to it. I have already written most of the logic considering only 1 collection. Will it be worth the effort making it 2 different collections? If I keep a single collection, am I losing anything in terms of performance?
Keeping data in a single collection is usually better in terms of performance, since you can get the required data mostly in a single query.
From the question this mostly sounds like a one-to-one relationship, so embedded documents would be beneficial. Secondly, you say yourself that you have already written most of the logic considering only one collection, so you should go forward with one collection only, unless you have one-to-many relationships.
I would recommend you go through Data Model Design and understand when each model is better.
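A minimal sketch of the single-collection approach (the index suggestion is an addition, not from the original answer): the extra Type2 fields simply exist only on Type2 documents, and queries filter on orderType.
db.Order.createIndex({ orderType: 1 })
// Fetch only Type2 orders using that index
db.Order.find({ orderType: "Type2" })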
An embedded document would be best, as you have a one-to-one relationship. For more info please check here.

Update embedded document in Mongodb: Performance issue?

I am new to MongoDB and I heard that MongoDB is good for massive amounts of read and write operations.
Embedded documents are one of the features that make that possible, but I am not sure whether they are also a cause of performance issues.
Book document example:
{
    "_id": 1,
    "Authors": [
        {
            "Email": "email",
            "Name": "name"
        }
    ],
    "Title": "title",
    ...
}
If there are thousands of books by one author, and his email needs to be updated, I need to write some query which can
search through all book documents and pick out the thousands with this author
update the author's email field across these book documents
These operations do not seem efficient. But this type of update is ubiquitous, so I believe the developers have considered this. Where did I get it wrong?
Your current embedded schema design has its merits, one of them being data locality. Since MongoDB stores data contiguously on disk, putting all the data you need in one document ensures that the spinning disks will take less time to seek to a particular location on the disk.
If your application frequently accesses books information along with the Authors data then you'll almost certainly want to go the embedded route. The other advantage with embedded documents is the atomicity and isolation in writing data.
To illustrate this, say you want all books by one author to have his email field updated; this can be done with one single operation (atomic at the document level), which is not a performance issue with MongoDB:
db.books.updateMany(
    { "Authors.Name": "foo" },
    { "$set": { "Authors.$.Email": "new@email.com" } }
);
or with earlier MongoDB versions:
db.books.update(
    { "Authors.Name": "foo" },
    { "$set": { "Authors.$.Email": "new@email.com" } },
    { "multi": true }
)
In the above, you use the positional $ operator, which facilitates updates to arrays that contain embedded documents by identifying an element in the array to update without explicitly specifying its position. It is used with dot notation, as in the $set above.
For more details on data modelling in MongoDB, please read the docs Data Modeling Introduction, especially Model One-to-Many Relationships with Embedded Documents.
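Note that the positional $ operator updates only the first matching array element in each document. If the same author could appear more than once in a book's Authors array, a sketch using arrayFilters (MongoDB 3.6+; an addition here, not part of the original answer) would update every matching element:
db.books.updateMany(
    { "Authors.Name": "foo" },
    { "$set": { "Authors.$[a].Email": "new@email.com" } },
    // "a" is an arbitrary identifier bound by the filter below
    { arrayFilters: [ { "a.Name": "foo" } ] }
)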
The other design option which you can consider is referencing documents where you follow a normalized schema. For example:
// db.books schema
{
    "_id": 3,
    "authors": [1, 2, 3], // <-- array of references to the authors collection
    "title": "foo"
}
// db.authors schema
/* 1 */
{
    "_id": 1,
    "name": "foo",
    "surname": "bar",
    "address": "xxx",
    "email": "foo@mail.com"
}
/* 2 */
{
    "_id": 2,
    "name": "abc",
    "surname": "def",
    "address": "xyz",
    "email": "abc@mail.com"
}
/* 3 */
{
    "_id": 3,
    "name": "alice",
    "surname": "bob",
    "address": "xyz",
    "email": "alice@mail.com"
}
The above normalized schema using the document-reference approach also has an advantage when you have one-to-many relationships with very unpredictable arity. If you have hundreds or thousands of author documents per given book entity, embedding has serious drawbacks as far as space constraints are concerned, because the larger the document, the more RAM it uses, and MongoDB documents have a hard size limit of 16MB.
For querying a normalized schema, you can consider using the aggregation framework's $lookup operator which performs a left outer join to the authors collection in the same database to filter in documents from the books collection for processing.
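A minimal sketch of such a join against the referenced schema above (the output field name "authorDocs" is an assumption):
db.books.aggregate([
    { $match: { _id: 3 } },
    // Left outer join each book's author ids against the authors collection
    { $lookup: {
        from: "authors",
        localField: "authors",
        foreignField: "_id",
        as: "authorDocs"
    } }
])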
That said, I believe your current schema is a better approach than creating a separate collection of authors, since separate collections require more work, i.e. finding a book plus its authors is two queries, whereas with the embedded schema above reads are easy and fast (a single seek). There are no big differences for inserts and updates. So, separate collections are good if you need to select individual documents, need more control over querying, or have huge documents. Embedded documents are good when you want the entire document, the document with a $slice of the embedded authors, or with no authors at all.
The general rule of thumb is that if your application's query pattern is well known and data tends to be accessed in only one way, an embedded approach works well. If your application queries data in many ways, or you are unable to anticipate the query patterns, a more normalized document-referencing model will be appropriate.
Ref:
MongoDB Applied Design Patterns: Practical Use Cases with the Leading NoSQL Database By Rick Copeland
I think you basically have the wrong schema design. MongoDB allows you to structure your data hierarchically, but that isn't an excuse for structuring it inefficiently. If it's likely you'll need to update thousands of documents across entire collections on a regular basis, then it's worth asking whether you have the right schema design.
There are lots of articles covering schema design and the comparison with relational DB structures. For example:
http://blog.mongodb.org/post/87200945828/6-rules-of-thumb-for-mongodb-schema-design-part-1

Which mongo document schema/structure is correct?

I have two document formats and I can't decide which is the Mongo way of doing things. Are the two examples equivalent? The idea is to search by userId and have userId be indexed. It seems to me the performance will be equal for either schema.
Multiple bookmarks as separate documents in a collection:
{
    userId: 123,
    bookmarkName: "google",
    bookmarkUrl: "www.google.com"
},
{
    userId: 123,
    bookmarkName: "yahoo",
    bookmarkUrl: "www.yahoo.com"
},
{
    userId: 456,
    bookmarkName: "google",
    bookmarkUrl: "www.google.com"
}
Multiple bookmarks within one document per user:
{
    userId: 123,
    bookmarks: [
        {
            bookmarkName: "google",
            bookmarkUrl: "www.google.com"
        },
        {
            bookmarkName: "yahoo",
            bookmarkUrl: "www.yahoo.com"
        }
    ]
},
{
    userId: 456,
    bookmarks: [
        {
            bookmarkName: "google",
            bookmarkUrl: "www.google.com"
        }
    ]
}
The problem with the second option is that it causes growing documents. Growing documents are bad for write performance, because the database will have to constantly move them around the database files.
To improve write performance, MongoDB writes each document as one contiguous block in the database files, with only a little padding after each document. When a document is changed and the change makes it grow beyond its current padding, the document has to be deleted and rewritten at the end of the current file. This is a quite slow operation.
Also, MongoDB has a hardcoded limit of 16MB per document (mostly to discourage growing documents). In your illustrated use-case this might not be a problem, but I assume that this is just a simplified example and your actual data will have a lot more fields per bookmark entry. When you store a lot of meta-data with each entry, that 16MB limit could become a problem.
So I would recommend you pick the first option.
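A minimal sketch of that first option (the collection name "bookmarks" is an assumption): with an index on userId, fetching one user's bookmarks stays a cheap indexed read no matter how many bookmarks they accumulate.
db.bookmarks.createIndex({ userId: 1 })
db.bookmarks.find({ userId: 123 })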
I would go with option 2, multiple bookmarks within one document per user, because this schema takes advantage of MongoDB's rich documents, also known as "denormalized" models.
Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations.
There are two tools that allow applications to represent these
relationships: references and embedded documents.
When designing data models, always consider the application usage of
the data (i.e. queries, updates, and processing of the data) as well
as the inherent structure of the data itself.
The Second type of structure represents an Embedded type.
Generally Embedded type structure should be chosen when our application needs:
a) better performance for read operations.
b) the ability to request and retrieve
related data in a single database operation.
c) Data Consistency, to update related data in a single atomic write operation.
In MongoDB, operations are atomic at the document level. No single
write operation can change more than one document. Operations that
modify more than a single document in a collection still operate on
one document at a time. Ensure that your application stores all fields
with atomic dependency requirements in the same document. If the
application can tolerate non-atomic updates for two pieces of data,
you can store these data in separate documents. A data model that
embeds related data in a single document facilitates these kinds of
atomic operations.
d) to issue fewer queries and updates to complete common operations.
When not to choose:
Embedding related data in documents may lead to situations where
documents grow after creation. Document growth can impact write
performance and lead to data fragmentation. (limit of 16MB per
document)
Now let's compare the structures from a developer's perspective:
Say I want to see all the bookmarks of a particular user:
The first type would require an aggregation to be applied on all the documents.
The minimum set of stages required to get the aggregated result is $match and $group (with the $push operator):
db.collection.aggregate([
    { $match: { "userId": 123 } },
    { $group: { "_id": "$userId", "bookmarkNames": { $push: "$bookmarkName" }, "bookmarkUrls": { $push: "$bookmarkUrl" } } }
])
or a find() which returns multiple documents to be iterated.
Whereas the embedded type would allow us to fetch it with a single match in a find query:
db.collection.find({"userId":123});
This just indicates the added overhead from the developer's point of view. We can view the first type as an unwound form of the embedded document.
The first type, multiple bookmarks as separate documents in a collection, is normally used for something like logging, where the entries are huge and have a TTL (time to live), or where the collection is capped. With a TTL index documents are automatically deleted after a particular period of time, and a capped collection discards its oldest documents once it reaches its size limit.
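For example (a sketch, not part of the original answer; the collection name "logs" and field "createdAt" are assumptions), a TTL index expires such documents automatically:
// Documents are removed roughly 30 days (2592000 seconds) after their createdAt timestamp
db.logs.createIndex({ createdAt: 1 }, { expireAfterSeconds: 2592000 })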
Bottom line: if your document size would not grow beyond 16 MB at any particular time, opt for the embedded type. It saves development effort as well.
See Also: MongoDB relationships: embed or reference?

The way MongoDB is optimized for aggregation and querying

I'm using MongoDB in my project for statistical and data-analysis purposes. My goal is to design the data for the best performance and scalability.
Let's assume I have several shops and a list of unique products per shop, and I need to query some data about the products and calculate some basic statistics (only for a certain shop).
Which way is better from the performance point of view: to have a Shop document with the list of products inside, and then query only within this document?
Or would it be better to have a separate collection with all products for all shops in it, and then build queries against that collection?
Maybe the real question is: can MongoDB query through the body of one document as efficiently as it queries through many documents?
UPD 1:
For now let's assume that the products themselves are quite small (Id, Price, Name, Count) and that their number is limited (so I know for sure there won't be more than 1000 products per shop).
UPD 2:
Also let's assume that I don't want to read the database for view purposes, just for statistics (how much is sold, which products are most interesting, what groups, and so on).
As with all these questions, one of the main deciding factors is data size and growth.
Will your data per shop exceed 16 MB? Judging by how many items a shop can have and how much data can be attributed to just a single item, I would say yes, very quickly.
What I mean is imagine how many fields you have for a product:
Product id
Description
Price
Options
Currency
blurb
SKU
Barcode (or whatever)
Some of these fields will be quite big. For example, the description of the product could be massive.
However, if on the off chance this is a very simple application and you are looking at a product that can be fully contained in a single data row, and shops which will never have more than 5-8,000 items, then you could do better with subdocuments of this sort:
{
    _id: ObjectId(),
    shop_name: 'toys r us',
    items: [
        { p_id: ObjectId(), price: '1000000', currency: 'GBP', description: 'fkf' }
    ]
}
Subdocuments do not come without their price though. Imagine you have a document that only has one subdocument, in 10 days has 100 and in 20, 1000.
The fragmentation caused by constantly growing documents can be quite significant. For one, this lowers your performance. And not only does performance become a problem, fixing fragmentation is not a nice job, and working around it later in the application's logic is even harder.
To understand more about how MongoDB actually works inside you can view this presentation: http://www.10gen.com/presentations/storage-engine-internals
As for querying on a subdocument, it does require a little extra work on MongoDB's end, but it is still quite cheap (cheaper than multiple round trips), provided you set it up right.
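Part of setting it up right is indexing into the array; a minimal sketch, using the collection and field names from the example document above:
// A multikey index on a field inside the items array keeps subdocument lookups cheap
db.shops.createIndex({ "items.p_id": 1 })
db.shops.find({ "items.p_id": ObjectId(...) })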
Personally based on the information I have given above I would go for two collections but I don't know the true extent of your scenario...
Edit
UPD 1: For now let's assume that the products themselves are quite small (Id, Price, Name, Count) and that their number is limited (so I know for sure there won't be more than 1000 products per shop).
Okay so your documents are small, probably a couple of bytes each. In this case you might be able to use subdocuments here with power of 2 sizes allocation to remedy some of that fragmentation: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes
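A sketch of enabling that allocation strategy for the collection (the collection name "shops" follows the example above; this collMod flag applies to the older MMAPv1 storage engine and later became the default):
db.runCommand({ collMod: "shops", usePowerOf2Sizes: true })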
This could make for a performant setup. Going from 1 to 1000 subdocuments can still cause fragmentation, however those fragments should be filled by smaller "new" shop documents as they come into existence.
UPD 2: Also let's assume that I don't want to read the database for view purposes, just for statistics (how much is sold, which products are most interesting, what groups, and so on).
So, using subdocuments, you could easily get the total amount sold per shop like this:
db.shops.aggregate([
    // Match shop id 1
    { $match: { _id: 1 } },
    // Unwind the products for that shop
    { $unwind: '$products' },
    // Group back up by shop id and total the amount sold
    { $group: { _id: '$_id', total_sold: { $sum: '$products.sold' } } }
])
Using the new aggregation framework (since version 2.1): http://docs.mongodb.org/manual/applications/aggregation/
So subdocuments can be just as easy to query on as two separate collections.