MongoDB - MANY documents or MANY collections

I am currently working on a service that stores products owned by multiple stores, and I am trying to figure out the best way to structure the DB. The only problem is that the products come from different domains (clothes, toys, electronics) and are sold by different sellers.
The first idea that came to mind was having a different collection for each seller, but I see this as a headache, having to manage different DB connections.
The idea of storing all products in the same document seems bad to me, because they are from different domains.
The only idea that I believe could work in this case is like this:
Let's say we have 3 stores: Store One, Store Two, Store Three. I would create a document for each store like this: products_storeone, products_storetwo, products_storethree and access them based on an identifier each store has. Now, each store will have multiple documents to store different things like products_identifier, users_identifier, orders_identifier.
Do you consider this is a good idea?
Please tell me your opinion on what could be the best way of achieving a structure for storing items for each store independently, without mixing them.
By my calculation, there will be at most 50 documents for each store. With, say, 1,000 stores, I don't see that as too big a number to handle. Do you think 50,000 documents are too many? Would that impact performance?
Any tips in order to achieve high performance for queries are more than welcome.
Thanks!

You can make a single "products" collection in which each document is a single product. Each product will have "store" and "category" attributes, which you can index and query on efficiently.
Hope this can be of help.
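A minimal in-memory sketch of that layout, using plain Python lists and dicts as stand-ins for a collection and its index. The product names, the store identifiers and the find_products helper are made up for illustration. In MongoDB itself the counterpart of the dict below would be a compound index on the two fields.

```python
from collections import defaultdict

# Hypothetical product documents; all values are made up for illustration.
products = [
    {"_id": 1, "store": "storeone", "category": "toys", "name": "Robot"},
    {"_id": 2, "store": "storeone", "category": "clothes", "name": "Scarf"},
    {"_id": 3, "store": "storetwo", "category": "toys", "name": "Kite"},
]

# Stand-in for a compound index on (store, category): key -> documents.
index = defaultdict(list)
for doc in products:
    index[(doc["store"], doc["category"])].append(doc)


def find_products(store, category):
    """Return the products of one store in one category via the index."""
    return index.get((store, category), [])


storeone_toys = find_products("storeone", "toys")
print([p["name"] for p in storeone_toys])
```

Every store's products stay separable through the "store" field, without one collection per store.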

Related

Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a separate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it depends".
So my question is: In this specific use case which option would be most optimal, multiple collections vs single collection + sharding, in terms of performance and speed ?
Any help would be appreciated.
As you mentioned, you need to play with your data and use case. That will give you a better picture.
Some decisions are required, as listed below.
Decide the number of documents you will have in the near future. If you will have 1m documents in a year, then test with at least 3m documents.
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on the requirements
If you have many writes and many indices, then a single monolithic collection will be slower, as multiple indices need to be updated on each write.
As you have a different set of fields per category, you could try multiple collections.
There is $unionWith to combine data from multiple collections. But do check the performance, as it depends entirely on the decisions above. Note this open issue also.
If you decide to go with a monolithic collection, defer the sharding. Implement it only once you find that queries have become slow.
If you have many writes on the same document, those writes will be executed sequentially. That will slow down your reads as well.
Think about reclaiming disk space when a lot of data is cleared from the collections. Multiple collections do well here.
The point that pushes me toward a monolithic collection is your requirement to query all the categories at the same time. You may also need to add more categories later, and combining many collections into a single response would not perform well.
As you don't really have a join use case like in an RDBMS, you can go with a single monolithic collection from a modelling point of view. I doubt you would even have a join key.
If any of my points are incorrect, please let me know.
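To make the trade-off concrete, here is a rough in-memory sketch of a cross-category search under both layouts. Plain Python lists stand in for collections, and the product data and the search_single and search_many helpers are hypothetical. It says nothing about real performance, which depends on the decisions listed above.

```python
# Hypothetical data: the same four products stored two ways.
single_collection = [
    {"name": "Red kite", "category": "toys"},
    {"name": "Red scarf", "category": "clothes"},
    {"name": "Blue kite", "category": "toys"},
    {"name": "Radio", "category": "electronics"},
]

# One list per category, standing in for one collection per category.
per_category = {}
for doc in single_collection:
    per_category.setdefault(doc["category"], []).append(doc)


def search_single(term):
    """Cross-category search: one query against one collection."""
    return [d["name"] for d in single_collection if term in d["name"].lower()]


def search_many(term):
    """Cross-category search: one query per collection, merged in code."""
    hits = []
    for docs in per_category.values():
        hits.extend(d["name"] for d in docs if term in d["name"].lower())
    return hits


print(sorted(search_single("kite")))
print(sorted(search_many("kite")))
```

Both return the same products; the difference is that the second layout needs one query per category (or a union such as $unionWith) every time a search spans categories.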
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL but some data is definitely a better fit to that model than others.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei's 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read it through the correct lens. The message you should take from this article is that MongoDB is not a good fit for every type of data.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only ok if you have few updates or few documents, but not both.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls), although the $lookup aggregation stage can perform a left outer join db-side
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

creating schema vs adding an additional field?

I want to store featured products like staff picks, featured products of each category in my system that will hold at most 10 documents. My priority is read performance over write performance but I also want to have an efficient storage system and I have three ways to do it in my mind:
Create a boolean field such as is_bestseller, is_staffpick in Products schema and query for it.
I think this is the simplest way to do it but I think it would require an additional query to check if the at most 10 limit has been reached.
Create a FeaturedProducts schema that holds references of product ids.
This is useful in the sense that if I want to add some additional info such as a featured product within the featured products then I could simply add a field in this schema. It would also be easy to check the at most 10 documents limit. I think this makes it more scalable but at the cost of performance?
Create a FeaturedProducts schema that will hold all the needed data.
I think performance wise this would be the best but I'm not sure if this is an efficient way to store data. Basically, I would just duplicate the data of a product and store it. Obviously, if I have to update product details then I have to update it in two places now but the read-to-write ratio heavily favors reading so I am willing to do this even if it's gonna require more logic regarding updating and deleting products. Also it would be easy to set at most 10 documents limit.
I tried to look for some examples regarding featured products but couldn't find anything useful. I am not sure what the best practice is here and which way to go about so any kind of help is appreciated.
The rule of thumb when modeling your data in MongoDB is:
Data that is accessed together should be stored together.
With that in mind, I consider the Extended Reference Pattern a great option for your use case. Here is an example from the MongoDB Blog.
Consider an e-commerce application where you have a user collection, an order collection and others. Users and orders have a 1-N relation, and embedding all of the information about a customer in each order just to avoid the JOIN operation results in a lot of duplicated information.
Instead of duplicating all of the information on the customer, we only copy the fields we access frequently.
This schema will have high read performance, because all the information needed will be stored in a single document, at the cost of some duplicated data. That is not entirely bad, considering that the copy can serve as historical data.
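A small sketch of the pattern, using plain Python dicts in place of collections. The customer fields, the build_order helper and the choice of which fields count as "frequently accessed" are assumptions made for illustration, not taken from the blog post.

```python
# Hypothetical customer collection, keyed by _id.
customers = {
    7: {
        "_id": 7,
        "name": "Ada Example",
        "city": "Springfield",
        "email": "ada@example.com",
        "loyalty_points": 120,
    }
}


def build_order(order_id, customer_id, items):
    """Create an order that copies only the customer fields read with it."""
    customer = customers[customer_id]
    return {
        "_id": order_id,
        "items": items,
        "customer": {
            "_id": customer["_id"],    # reference kept for the rare full lookup
            "name": customer["name"],  # copied: assumed to be shown with every order
            "city": customer["city"],  # copied: assumed to be needed for shipping
        },
    }


order = build_order(1001, 7, ["kite"])
print(order["customer"])
```

If the customer is later renamed, the order keeps the name it was placed under, which is the "historical data" effect mentioned above.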
Useful information:
Patterns
Design Anti-pattern
A potential solution is to use an index here so that you can maximize your query performance. You would create an additional boolean flag (as you indicated in your first solution), then index that field and query it with a cursor that limits the number of returned values.
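As a rough in-memory sketch of the flag approach: the list below stands in for the products collection, and the mark_staff_pick helper, the catalogue size and the limit constant are hypothetical. The read path filters on the flag and caps the result, and the write path does the extra count check the question worries about.

```python
MAX_FEATURED = 10

# Hypothetical catalogue of 25 products, none featured yet.
products = [
    {"_id": i, "name": f"Product {i}", "is_staffpick": False}
    for i in range(25)
]


def staff_picks(limit=MAX_FEATURED):
    """Read path: filter on the flag and cap the number of results."""
    picks = [p for p in products if p["is_staffpick"]]
    return picks[:limit]


def mark_staff_pick(product_id):
    """Write path: check the limit first, then set the flag."""
    if len(staff_picks()) >= MAX_FEATURED:
        return False  # limit reached, flag not set
    for p in products:
        if p["_id"] == product_id:
            p["is_staffpick"] = True
            return True
    return False  # no such product


results = [mark_staff_pick(i) for i in range(12)]
print(results.count(True), len(staff_picks()))
```

The count check only runs on writes, so it should not affect the read-heavy path.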
For more ways to increase your query performance check out the official Mongo docs here. If you're curious as to how much more performant your queries become, you can use Mongo's explain() method to get execution statistics (more info here) and compare approaches.
Best of luck!

MongoDB beginner - to normalize or not to normalize?

I'm going to try and make this as straight-forward as I can.
Coming from MySQL and thinking in terms of tables, let's use the following example:
Let's say that we have a real-estate website and we're displaying a list of houses.
Normally, I'd use the following tables:
houses - the real estate asset at hand
owners - the owner of the house (one-to-many relationship with houses)
agencies - the real-estate broker agency (many-to-many relationship with houses)
images - many-to-one relationship with houses
reviews - many-to-one relationship with houses
I understand that MongoDB gives you the flexibility to design your web-app in different collections with unique IDs much like a relational database (normalized), and to enjoy quick selections, you can nest within a collection, related objects and data (un-normalized).
Back to our real-estate houses list, the query used to populate it is quite expensive in a normal relational DB, for each house you need to query its images, reviews, owner & agencies, each entity resides in a different table with its fields, you'd probably use joins and have multiple queries joined into one - Expensive!
Enter MongoDB - where you don't need joins, and you can store all the related data of a house in a house item on the houses collection, selection was never faster, it's a db heaven!
But what happens when you need to add/update/delete related reviews/agencies/owner/images?
This is a mystery to me. If I had to guess, each related entity exists in its own collection in addition to being embedded in the houses collection, and once one of these pieces of related data is added/updated/deleted you'd have to update it in its own collection as well as in the houses collection. Upon this update, do I need to query the other collections as well to make sure I'm updating the house record with all the updated related data?
I'm just guessing here and would really appreciate your feedback.
Thanks,
Ajar
Try this approach:
Work out which entity (or entities) are the hero(s)
With 'hero', I mean the entity(s) that the database is centered around. Let's take your example. The hero of the real-estate example is the house*.
Work out the ownerships
Go through the other entities, such as the owner, agency, images and reviews and ask yourself whether it makes sense to place their information together with the house. Would you have a cascading delete on any of the foreign keys in your relational database? If so, then that implies ownership.
Work out whether it actually matters that data is de-normalised
You will have agency (and probably owner) details spread across multiple houses. Does that matter?
Your house collection will probably look like this:
house: {
    owner,
    agency,
    images[], // recommend references to GridFS here
    reviews[] // you probably won't get too many of these for a single house
}
*Actually, it's probably the ad of the house (since houses are typically advertised on a real-estate website and that's probably what you're really interested in) so just consider that
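To illustrate the ownership idea, here is a minimal sketch of such a house document as a Python dict, with hypothetical field values and helper names. Because the reviews are owned by the house, adding, updating or deleting one touches only the house document; under this model there is no separate reviews collection to keep in sync.

```python
# Hypothetical house document; all values are made up for illustration.
house = {
    "_id": 1,
    "owner": {"name": "O. Owner"},
    "agency": {"_id": 5, "name": "Best Homes"},
    "images": ["img-ref-1", "img-ref-2"],  # references, as suggested above
    "reviews": [],
}


def add_review(doc, review_id, text):
    """Append a review to the house that owns it."""
    doc["reviews"].append({"_id": review_id, "text": text})


def update_review(doc, review_id, text):
    """Change the text of one embedded review, matched by its id."""
    for review in doc["reviews"]:
        if review["_id"] == review_id:
            review["text"] = text


def delete_review(doc, review_id):
    """Remove one embedded review, matched by its id."""
    doc["reviews"] = [r for r in doc["reviews"] if r["_id"] != review_id]


add_review(house, 1, "Nice garden")
add_review(house, 2, "A bit dark")
update_review(house, 2, "Cosy")
delete_review(house, 1)
print(house["reviews"])
```

The agency is a different case: its details are duplicated across many houses, which is where the question of whether de-normalisation matters comes in.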
Sarah Mei wrote an informative article about the kinds of issues that can arise with data integrity in NoSQL dbs. It covers the choice between duplicating data and using ids with code-based joins, and the challenges of keeping data integrity. Her take is that any NoSQL db with code-based joins will lose data integrity at some point. Imho the article's comments are as valuable as the article itself in understanding these issues and possible resolutions.
Link: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/comment-page-1/
I would just like to give a normalization refresher from MongoDB's perspective -
What are the goals of normalization?
Frees the database from modification anomalies - For MongoDB, it looks like embedding data would mostly cause this. And in fact, we should try to avoid embedding data in documents in MongoDB which possibly create these anomalies. Occasionally, we might need to duplicate data in the documents for performance reasons. However that's not the default approach. The default is to avoid it.
Should minimize re-design when extending - MongoDB is flexible enough because it allows addition of keys without re-designing all the documents
Avoid bias toward any particular access pattern - this is something we're not going to worry about when designing a schema in MongoDB. One of the ideas behind MongoDB is to tune your database to the application you're trying to write and the problem you're trying to solve.

One to many Mongo strategy with query on both collections

I've got two collections: one with ~7,600,000 documents containing information about available trips and one with ~5,000 documents containing information about hotels with region, city and country data. Each document in the trips collection has a field with the id of a certain hotel.
My problem is that I have to query both collections to get information about a certain trip: location information from the hotels collection and other information like price, number of people etc. from the trips collection.
I've read about the mapreduce strategy of merging two collections, but I think that it won't fit in my case because it'll create only 5,000 documents if I link them using the hotel id? Is it possible?
Another approach is to embed the hotel information in the trips collection, but I'm afraid of having to update hotel information in this case.
Please give me some advice and tell me which approach would be best.
You have many options. It's all about deciding where to "join" the data. The options:
Join on the front end. Maybe bring back all trips first and then use AJAX calls to lazily load the hotel information. (Assuming a web application). Point being, two calls might not be the worst thing!
Use map/reduce in Mongo to output the data as you want it. It won't work in real time, but it will give you the right results. It wouldn't be limited to 5,000 documents. You could start with the bigger trip collection and bring in what you need. It's very flexible.
Embed the hotel information. As a note, you only want to embed hotel information if it's not changing all that often. If it changes constantly, I would consider leaving things as is.
For a lot of the work I do with Mongo, I've found that two calls isn't so bad - especially when dealing with fast changing data.
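A sketch of the two-call approach, with in-memory stand-ins for the two collections. The sample data and helper names are hypothetical. The second call fetches all needed hotels at once, in the spirit of an $in query on the hotel ids, rather than issuing one call per trip.

```python
# Hypothetical stand-ins for the hotels and trips collections.
hotels = {
    1: {"_id": 1, "city": "Lisbon", "country": "Portugal"},
    2: {"_id": 2, "city": "Porto", "country": "Portugal"},
}
trips = [
    {"_id": 100, "hotel_id": 1, "price": 450, "people": 2},
    {"_id": 101, "hotel_id": 2, "price": 300, "people": 1},
    {"_id": 102, "hotel_id": 1, "price": 520, "people": 3},
]


def find_trips(max_price):
    """Call 1: query the large trips collection."""
    return [t for t in trips if t["price"] <= max_price]


def find_hotels(hotel_ids):
    """Call 2: fetch every needed hotel in one batch."""
    return {hid: hotels[hid] for hid in hotel_ids if hid in hotels}


def trips_with_location(max_price):
    """Merge the two result sets in application code."""
    found = find_trips(max_price)
    by_id = find_hotels({t["hotel_id"] for t in found})
    merged = []
    for trip in found:
        hotel = by_id.get(trip["hotel_id"], {})  # tolerate a missing hotel
        merged.append({**trip, "city": hotel.get("city")})
    return merged


for row in trips_with_location(500):
    print(row["_id"], row["city"], row["price"])
```

Hotel data stays in one place, so updating a hotel never touches the trips collection.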

Deciding whether to create new collection or put data in existing collection using Mongo DB

I have data coming in from two sources, facebook and twitter. For each source I have multiple handles (pepsi, coke, sprite) and I want to determine the best way to organize my database.
Is it better practice to...
a. make two collections, one for twitter and one for facebook, and have all three handles in both collections?
b. make one collection and put all of that information in that single collection?
Thanks for your help. Mongodb is awesome.
Depends on many factors really, but generally speaking...
If the data is somewhat similar and/or should be queried (aggregated) together then a single collection is probably the best choice.
If the data from twitter and fb should be processed in totally different ways then perhaps separate collections are a more appropriate solution.
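For the single-collection case, a minimal in-memory sketch might look like this. The post documents, the "source" and "handle" field names and the helpers are assumptions for illustration. Aggregating across both sources is one pass over one collection, and a per-source view is a simple filter.

```python
from collections import Counter

# Hypothetical posts from both sources kept in one collection.
posts = [
    {"source": "twitter", "handle": "pepsi", "text": "first"},
    {"source": "facebook", "handle": "pepsi", "text": "second"},
    {"source": "twitter", "handle": "coke", "text": "third"},
    {"source": "facebook", "handle": "sprite", "text": "fourth"},
]


def posts_per_handle():
    """Aggregate across both sources in a single pass."""
    return Counter(p["handle"] for p in posts)


def posts_from(source):
    """Per-source view: filter on the source field."""
    return [p for p in posts if p["source"] == source]


print(posts_per_handle())
print(len(posts_from("twitter")))
```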