How to design the tables/entities in NoSQL DB? - mongodb

I do my first steps in NoSQL databases, thus I would like to hear the best practices about implementing the following requirement.
Let suppose I have a messages database, which is powered by MongoDB engine. This DB contains a collection of documents, where each document has the following fields:
time stamp;
message author/source;
message content.
Now, I want to build a list of authors/sources in order to add some metadata about each source. In the case of the classical RDBMS, I would define a table tblSources where I would store the names of the message sources and all additional meta-data (or links to the relevant tables) for each author.
What is the right approach to such task in NoSQL/MongoDB world?

It really depends on how you want to use the data. NoSQL dbs are generally not designed with fast joins in mind but they are still capable of doing joins and storing foreign keys.
Your options here are really
duplicate data aka store the author metadata in every document. This might be better in the case where you are really trying to optimize lookups and use Mongo as a key value store
Join on foreign key - this is pretty similar to how you would use a RDBMS

Related

Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a seperate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it dependes".
So my question is: In this specific use case which option would be most optimal, multiple collections vs single collection + sharding, in terms of performance and speed ?
Any help would be appreciated.
As you mentioned, you need to play with your data and use-case. You will have better picture.
Some decisions required as below.
Decide the number of documents you will have in near future. If you will have 1m documents in an year, then try with at least 3m data
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on the requirements
If you have more writes with more indices, then single monolithic collection will be slower as multiple indices needs to be updated.
As you have different set of fields per category, you could try with multiple collections.
There is $unionWith to combine data from multiple collections. But do check the performance it purely depends on the above decisions. Note this open issue also.
If you decide to go with monolithic collection, defer the sharding. Implement this once you found that queries are slower.
If you have more writes on the same document, writes will be executed sequentially. It will slow down your read also.
Think of reclaiming the disk space when more data is cleared from the collections. Multiple collections do good here.
The point which forces me to suggest monolithic collections is that I'd need to query all the categories at the same time. You may need to add more categories, but combining all of them in single response would not be better in terms of performance.
As you don't really have a join use case like in RDBMS, you can go with single monolithic collection from model point of view. I doubt you could have a join key.
If any of my points are incorrect, please let me know.
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL but some data is definitely a better fit to that model than others.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read this through the correct lens. The message you should get from this article is that MongoDB is not a good fit for every data type.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only ok if you have few updates or few documents, but not both.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

MongoDB object model design with list property

I just started to use MongoDB and I'm confused to build object models with list property.
I have a User model related to Followers and Following object which are list of User IDs.
So I can think of some object model structures to represent the relation.
Embedded Document. Followers and Following are embedded to User model. In this way, a "current_user" object is generated in many web frameworks in every request, and it's an extra overhead to serialize/deserialize the Follower and Following list property since we seldom use these properties in most requests. We can exclude these properties when "current_user" is generated. However, we need to fetch full "current_user" object again before we do any updates to it.
Use Reference Property in User model. We can have Followers and Following object models themselves, not embedded, but save references to the User object.
Use Reference Property in Followers and Following models. We can save User ID in Follower and Following property for later queries.
There might be some other ways to do it, easier to use or better performance. And my question is:
What's the suggested way to design a model with some related list properties?
For folks coming from the SQL world (such as myself) one of the hardest things to learn about MongoDB is the new style of schema design. In the SQL world, everything goes into third normal form. Folks come to think that there is a single right way to design their schema, because there typically is one.
In the MongoDB world, there is no one best schema design. More accurately, in MongoDB schema design depends on how the application is going to access the data.
Here are the key questions that you need to have answered in order to design a good schema for MongoDB:
How much data do you have?
What are your most common operations? Will you be mostly inserting new data, updating existing data, or doing queries?
What are your most common queries?
What are your most common updates?
How many I/O operations do you expect per second?
Here's how these questions might play out if you are considering one-to-many object relationships.
In SQL you simply create a pair of master/detail tables with a primary key/foreign key relationship. In MongoDB, you have a number of choices: you can embed the data, you can create a linked relationship, you can duplicate and denormalize the data, or you can use a hybrid approach.
The correct approach would depend on a lot of details about the use case of your application.
Here are some good general references on MongoDB schema design.
MongoDB presentations:
http://www.10gen.com/presentations/mongosf2011/schemabasics
http://www.10gen.com/presentations/mongosv-2011/schema-design-by-example
http://www.10gen.com/presentations/mongosf2011/schemascale
http://www.10gen.com/presentations/MongoNYC-2012/Building-a-MongoDB-Power-Chat-Server
Here are a couple of books about MongoDB schema design that I think you would find useful:
http://www.manning.com/banker/ (MongoDB in Action)
http://shop.oreilly.com/product/0636920018391.do (Document Design for MongoDB)
Here are some sample schema designs:
http://docs.mongodb.org/manual/use-cases/
https://openshift.redhat.com/community/blogs/designing-mongodb-schemas-with-embedded-non-embedded-and-bucket-structures

MongoDB: Should you still provide IDs linking to other collections to or just include collections?

I'm pretty new to MongoDB and NoSQL in general. I have a collection Topics, where each topics can have many comments. Each comment will have metadata and whatnot making a Comments collection useful.
In MySQL I would use foreign keys to link to the Comments table, but in NoSQL should I just include the a Comments collection within the Topics collection or have it be in a separate collection and link by ids?
Thanks!
Matt
It depends.
It depends on how many of each of these type of objects you expect to have. Can you fit them all into a single MongoDB document for a given Topic? Probably not.
It depends on the relationships - do you have one-to-many or many-to-many relationships? If it's one-to-many and the number of related entities is small you might chose to put embed them in an IList on a document. If it's many-to-many you might chose to use a more traditional relationship or you might chose to embed both sides as ILists.
You can still model relationships in MongoDB with separate collections BUT there are no joins in the database so you have to do that in code. Loading a Topic and then loading the Comments for it might be just fine from a performance perspective.
Other tips:
With MongoDB you can index INTO arrays on documents. So don't think of an Index as just being an index on a simple field on a document (like SQL). You can use, say, a Tag collection on a Topic and index into the tags. (See http://www.mongodb.org/display/DOCS/Indexes#Indexes-Arrays)
When you retrieve or write data you can do a partial read and a partial write of any document. (see http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields)
And, finally, when you can't see how to get what you want using collections and indexes, you might be able to achieve it using map reduce. For example, to find all the tags currently in use sorted by their frequency of use you would map each Topic emitting the tags used in it, and then you would reduce that set to get the result you want. You might then store the result of that map reduce permanently and only up date it when you need to.
It's a fairly significant mind-shift from relational thinking but it's worth it if you need the scalability and flexibility a NOSQL approach brings.
Also look at the Schema Design docs (http://www.mongodb.org/display/DOCS/Schema+Design). There are also some videos/slides of several 10Gen presentations on schema design linked on the Mongo site. See http://www.mongodb.org/pages/viewpage.action?pageId=17137769 for an overview.

Is MongoDB object-oriented?

In the website of MongoDB they wrote that MonogDB is Document-oriented Database, so if the MongoDB is not an Object Oriented database, so what is it? and what are the differences between Document and Object oriented databases?
This may be a bit late in reply, but just thought it is worth pointing out, there are big differences between ODB and MongoDB.
In general, the focus of ODB is tranparent references (relations) between objects in an arbitarily complex domain model without having to use and manage code for something like a DBRef. Even if you have a couple thousand classes, you don't need to worry about managing any keys, they come for free and when you create instances of those 1000's of classes at runtime, they will automatically create the schema in the database .. even for things like a self-referencing object with collections of collections.
Also, your transactions can span these references, so you do not have to use a completely embedded model.
The concepts are those leveraged in ORM solutions like JPA, the managed persistent object life-cycle, is taken from the ODB space, but the HUGE difference is that there is no mapping AT ALL in the ODB and relations are stored as part of the database so there is no runtime JOIN to resolve relations, all relations are resolved with the same speed as a b-tree look-up. For those of you who have used Hibernate, imagine Hibernate without ANY mapping file and orders of magnitude faster becase there is no runtime JOIN behind the scenes.
Also, ODB allows queries across any relationship in your model, so you are not restricted to queries in a particular collection as you are in MongoDB. Of course, hash/b-tree/aggregate indexes are supported to so queries are very fast when they are used.
You can evolve instances of any class in an ODB at the class level and at runtime the correct class version is resolved. Quite different than the way it works in MongoDB maintaining code to decide how to deal with varied forms of blob ( or value object ) that result from evolving a schema-less database ... or writing the code to visit and change every value object because you wanted to change the schema.
As far as partioning goes, I think it is a lot easier to decide on a partitioning model for a domain model which can talk across arbitary objects, then it is to figure out the be-all, end-all embedding strategy for your collection contained documents in MongoDB. As a rediculous example, you have a Contact and an Address and a ShoppingCart and these are related in a JSON document and you decide to partition on Contact by Contact_id. There is absolutely nothing to keep you from treating those 3 classes as just objects instead of JSON documents and storing those with a partition on Contact_id just as you would with MongoDB. However, if you had another object Account and you wanted to manage those in a non-embedded way because of some aggregate billing operations done on accounts, you can have that for free ( no need to create code for a DBRef type ) in the ODB ... and you can choose to partition right along with the Contact or choose to store the Accounts in a completely separate physical node, yet it will all be connected at runtime in the application space ... just like magic.
If you want to see a really cool video on how to create an application with an ODB which shows distribution, object movement, fault tolerance, performance optimization .. see this ( if you want to skip to the cool part, jump about 21 minutes in and you will avoid the building of the application and just see the how easy it is to add distribution and fault tolerance to any existing application ):
http://www.blip.tv/file/3285543
I think doc-oriented and object-oriented databases are quite different. Fairly detailed post on this here:
http://blog.10gen.com/post/437029788/json-db-vs-odbms
Document-oriented
Documents (objects) map nicely to
programming language data types
Embedded documents and arrays reduce
need for joins
Dynamically-typed (schemaless) for
easy schema evolution
No joins and no (multi-object)
transactions for high performance and
easy scalability
(MongoDB Introduction)
In my understanding MongoDB treats every single record like a Document no matter it is 1 field or n fields. You can even have embedded Documents inside a Document. You don't have to define a schema which is very strictly controlled in other Relational DB Systems (MySQL, PorgeSQL etc.). I've used MongoDB for a while and I really like its philosophy.
Object Oriented is a database model in which information is represented in the form of objects as used in object-oriented programming (Wikipedia).
A document oriented database is a different concept to object and relational databases.
A document database may or may not contain field, whereas a relational or object database would expect missing fields to be filled with a null entry.
Imagine storing an XML or JSON string in a single field on a database table. That is similar to how a document database works. It simply allows semi-structured data to be stored in a database without having lots of null fields.

Am I missing something about Document Databases?

I've been looking at the rise of the NoSql movement and the accompanying rise in popularity of document databases like mongodb, ravendb, and others. While there are quite a few things about these that I like, I feel like I'm not understanding something important.
Let's say that you are implementing a store application, and you want to store in the database products, all of which have a single, unique category. In Relational Databases, this would be accomplished by having two tables, a product and a category table, and the product table would have a field (called perhaps "category_id") which would reference the row in the category table holding the correct category entry. This has several benefits, including non-repetition of data.
It also means that if you misspelled the category name, for example, you could update the category table and then it's fixed, since that's the only place that value exists.
In document databases, though, this is not how it works. You completely denormalize, meaning in the "products" document, you would actually have a value holding the actual category string, leading to lots of repetition of data, and errors are much more difficult to correct. Thinking about this more, doesn't it also mean that running queries like "give me all products with this category" can lead to result that do not have integrity.
Of course the way around this is to re-implement the whole "category_id" thing in the document database, but when I get to that point in my thinking, I realize I should just stay with relational databases instead of re-implementing them.
This leads me to believe I'm missing some key point about document databases that leads me down this incorrect path. So I wanted to put it to stack-overflow, what am I missing?
You completely denormalize, meaning in the "products" document, you would actually have a value holding the actual category string, leading to lots of repetition of data [...]
True, denormalizing means storing additional data. It also means less collections (tables in SQL), thus resulting in less relations between pieces of data. Each single document can contain the information that would otherwise come from multiple SQL tables.
Now, if your database is distributed across multiple servers, it's more efficient to query a single server instead of multiple servers. With the denormalized structure of document databases, it's much more likely that you only need to query a single server to get all the data you need. With a SQL database, chances are that your related data is spread across multiple servers, making queries very inefficient.
[...] and errors are much more difficult to correct.
Also true. Most NoSQL solutions don't guarantee things such as referential integrity, which are common to SQL databases. As a result, your application is responsible for maintaining relations between data. However, as the amount of relations in a document database is very small, it's not as hard as it may sound.
One of the advantages of a document database is that it is schema-less. You're completely free to define the contents of a document at all times; you're not tied to a predefined set of tables and columns as you are with a SQL database.
Real-world example
If you're building a CMS on top of a SQL database, you'll either have a separate table for each CMS content type, or a single table with generic columns in which you store all types of content. With separate tables, you'll have a lot of tables. Just think of all the join tables you'll need for things like tags and comments for each content type. With a single generic table, your application is responsible for correctly managing all of the data. Also, the raw data in your database is hard to update and quite meaningless outside of your CMS application.
With a document database, you can store each type of CMS content in a single collection, while maintaining a strongly defined structure within each document. You could also store all tags and comments within the document, making data retrieval very efficient. This efficiency and flexibility comes at a price: your application is more responsible for managing the integrity of the data. On the other hand, the price of scaling out with a document database is much less, compared to a SQL database.
Advice
As you can see, both SQL and NoSQL solutions have advantages and disadvantages. As David already pointed out, each type has its uses. I recommend you to analyze your requirements and create two data models, one for a SQL solution and one for a document database. Then choose the solution that fits best, keeping scalability in mind.
I'd say that the number one thing you're overlooking (at least based on the content of the post) is that document databases are not meant to replace relational databases. The example you give does, in fact, work really well in a relational database. It should probably stay there. Document databases are just another tool to accomplish tasks in another way, they're not suited for every task.
Document databases were made to address the problem that (looking at it the other way around), relational databases aren't the best way to solve every problem. Both designs have their use, neither is inherently better than the other.
Take a look at the Use Cases on the MongoDB website: http://www.mongodb.org/display/DOCS/Use+Cases
A document db gives a feeling of freedom when you start. You no longer have to write create table and alter table scripts. You simply embed details in the master 'records'.
But after a while you realize that you are locked in a different way. It becomes less easy to combine or aggregate the data in a way that you didn't think was needed when you stored the data. Data mining/business intelligence (searching for the unknown) becomes harder.
That means that it is also harder to check if your app has stored the data in the db in a correct way.
For instance you have two collection with each approximately 10000 'records'. Now you want to know which ids are present in 'table' A that are not present in 'table' B.
Trivial with SQL, a lot harder with MongoDB.
But I like MongoDB !!
OrientDB, for example, supports schema-less, schema-full or mixed mode. In some contexts you need constraints, validation, etc. but you would need the flexibility to add fields without touch the schema. This is a schema mixed mode.
Example:
{
'#rid': 10:3,
'#class': 'Customer',
'#ver': 3,
'name': 'Jay',
'surname': 'Miner',
'invented': [ 'Amiga' ]
}
In this example the fields "name" and "surname" are mandatories (by defining them in the schema), but the field "invented" has been created only for this document. All your app need to don't know about it but you can execute queries against it:
SELECT FROM Customer WHERE invented IS NOT NULL
It will return only the documents with the field "invented".