How one approaches for defining containers for MongoDB? - mongodb

First of all, I have extensive experience on Relational DBs but very beginner level knowledge of Document DB. I'm exploring MongoDB but my question is in general to Document DB.
AFA I know (I may be wrong), A Document DB is consisting of containers and containers contain same of different object structures. These object structures are defined such a way that filters and information can be applied in most optimum way. For ex. A is written by Authors. So object of Book will contain list of authors also. This way searching can be made faster and performance can be gained.
What is my problem?
I'm creating an application (yet haven't started as I'm confused here). It's relational DB is something like this....
The problem is I'm not able to design the Document DB structure for this requirement.
Please somebody help my in designing such database or can give me idea on "What approach one should select while designing such database?"

This comes down to answering the following questions:
What are your most common access patterns? It is helpful to think of your API methods, or top 5-10 queries to decide how to organize.
What are your transactional needs? Which of these entity types occur together in transactions and queries?
How often do they change? Should you embed or reference?
If you could include these details, we can help with more targeted suggestions.
http://azure.microsoft.com/documentation/articles/documentdb-modeling-data/ is also worth a read if you haven't already.
The main difference between DocumentDB and relational databases/MongoDB is that collections are more like shards/partitions and not tables.

Related

Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a seperate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it dependes".
So my question is: In this specific use case which option would be most optimal, multiple collections vs single collection + sharding, in terms of performance and speed ?
Any help would be appreciated.
As you mentioned, you need to play with your data and use-case. You will have better picture.
Some decisions required as below.
Decide the number of documents you will have in near future. If you will have 1m documents in an year, then try with at least 3m data
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on the requirements
If you have more writes with more indices, then single monolithic collection will be slower as multiple indices needs to be updated.
As you have different set of fields per category, you could try with multiple collections.
There is $unionWith to combine data from multiple collections. But do check the performance it purely depends on the above decisions. Note this open issue also.
If you decide to go with monolithic collection, defer the sharding. Implement this once you found that queries are slower.
If you have more writes on the same document, writes will be executed sequentially. It will slow down your read also.
Think of reclaiming the disk space when more data is cleared from the collections. Multiple collections do good here.
The point which forces me to suggest monolithic collections is that I'd need to query all the categories at the same time. You may need to add more categories, but combining all of them in single response would not be better in terms of performance.
As you don't really have a join use case like in RDBMS, you can go with single monolithic collection from model point of view. I doubt you could have a join key.
If any of my points are incorrect, please let me know.
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL but some data is definitely a better fit to that model than others.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read this through the correct lens. The message you should get from this article is that MongoDB is not a good fit for every data type.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only ok if you have few updates or few documents, but not both.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

What NoSQL database is good for complex many to many relationships?

My startup project require complex data structure, which must be ready for geolocation and fulltext search. The information in the database are added by two different types of users, who together build a complex relationship network.
For a better understanding of my questions, here is a simple diagram that shows this relationship:
My first choice was MongoDB and Elasticsearch, but I noticed the problem of multiplication of the same data. Further planning, we concluded that for certain parts of the application need ACID transactions possibilities.
What NoSQL database is good for complex many to many relationships?
What would be a good choice for us?
A graph database like Neo4j is perfect for that.
If you need features of a document database like MongoDB, but with Neo4j under the hood, try Structr: http://de.slideshare.net/AxelMorgner/neo4j-as-document-database http://structr.org
(Disclaimer: I'm the project inititator of Structr)

NoSQL vs. Relational Databases vs. Possible Hybrid

I'm hearing more about NoSQL, but have yet had someone give me a clear explanation of how it is to be used instead of relational databases.
I've read that it can't do left joins, so I was trying to figure out how you'd be able to use such a data storage. From reading: Preserve Joins by code in MongoDB it seems like a suggestion is to just make a large table, as if you already did the joins on it.
If the above statement is true, then I can see how it can be used. However I'm curious on how you'd handle repeat data. As the concept of normalizing, helps you remove the redundancy and ensure consistency in the data (e.g. Slight modifications like capitalization, white space, etc)...
Are we simply sacrificing the consistency of the data for scalable speed, or am I missing something?
Edit
I've been doing some more digging and found the answers the following questions useful for clarifying my understanding:
Why Google's BigTable referred as a NoSQL database?
How do you track record relations in NoSQL?
My understanding of consistency seems to be correct from those answers. And it looks like NoSQL is suppose to be used for specific problems types and that if you need relations that you should use a relational database.
But this raises more questions like:
It makes me wonder about real life examples of when to use NoSQL versus when not to?
By denormalizing the data, you should be able to solve all of the same problems that relational databases do... But there are rules on how to normalize data with relational databases. Are there rules that one can use to help them denormalize the data to use a NoSQL solution?
Any examples on when you might want to consider using both a NoSQL solution in parallel with a relational database?
MongoDB has the ability to have documents which include arrays of other documents. This solves many cases where you would have relations in reational databases.
When an invoice has multiple positions, you wouldn't put these positions into a separate collection. You would embed them as an array.
It makes me wonder about real life examples of when to use NoSQL versus when not to?
There are many different NoSQL databases, each one designed with different use-cases in mind. But you tagged this question as MongoDB, so I assume that you mean MongoDB in particular.
MongoDB has two main advantages over relational databases.
First, it scales well.
When the database is too slow or too big, you can easily add more servers by creating a cluster or replica-set of multiple shards. This doesn't work nearly as well with most relational databases.
Second, it allows heterogeneous data.
Imagine, for example, the product database of a computer hardware store. What properties do products have? All products have a price and a vendor. But CPUs have a clock rate, hard drives and RAM chips have a capacity (and these capacities aren't comparable), monitors have a resolution and so on. How would you design this in a relational database? You would either create a very long productID-property-value table or you would create a very wide and sparse product table with every property you can imagine, but most of them being NULL for most products. Both solutions aren't really elegant. But MongoDB can solve this much better because it allows each document in a collection to have a different set of properties.
What can't it do?
As a rather new technology, there isn't that much literature about it. The software ecosystem around it isn't that well either. The tools you can get for relational databases are often much more shiny.
There are also some use-cases MongoDB isn't well-suited for.
MongoDB doesn't do JOINs. When your data is very relational and denormalizing it would be counter-productive, it might be a poor choice for your product. But you might want to take a look at graph databases like Neo4j, which focus even more on relations than relational databases. Update 2016: MongoDB 3.2 now has rudimentary JOIN support with the $lookup aggregation stage, but it's still very limited in functionality compared to relational and graph databases.
MongoDB doesn't do transactions. At least not complex transactions. Certain actions which only affect a single document are guaranteed to be atomic, but as soon as you affect more than one document, you can't guarantee that no other query will happen in-between and find an inconsistent state.
MongoDB is bad for ad-hoc reporting. Its options for data-mining are severely limited. The rather new aggregation functions help and MapReduce can also solve some surprisingly complex problems when you learn to use it smart, but SQL has usually the better tools for things like that.
By denormalizing the data, you should be able to solve all of the same problems that relational databases do... But there are rules on how to normalize data with relational databases. Are there rules that one can use to help them denormalize the data to use a NoSQL solution?
Relational databases are around for about 40 years. Their theory is a well-researched topic in computer science. There are whole libraries of books written about the theory behind them. There is a by-the-book solution for every imaginable corner-case by now.
But NoSQL databases, on the other hand, are a rather new technology. We are still figuring out the best practices. The most frequent advise is: "Use your own head. Think about what queries are performed most often, and optimize your data schema for them."
Any examples on when you might want to consider using both a NoSQL solution in parallel with a relational database?
When possible I would advise against using two different database technologies in the same product:
Anyone who maintains and supports the product must be familiar with both technologies
Troubleshooting gets a lot harder
The sysadmins need to keep an additional database running and updated
You have an additional point of failure which can lead to downtime
I would only recommend to mix database technologies when fulfilling your requirements without it doesn't just become hard but physically impossible. Otherwise, make your pick and stay with it.

MongoDB highlights

I'm just getting started with MongoDB and the whole principle of NoSQL and I'm really enjoying the experience.
Has anyone got any suggestions of features that I should look into?
Has anyone come across any in-depth tutorials?
One thing that in particular I'm interested in is how to deal with what is annalagous to a change in schema. If a column is added to a table, all records would have this new column. Seemingly in Mongo, if a document ends up with a new property only new documents would have this property with an update being required to add this onto all other docuemts. Is there a better approach to this or am fixating on non-existent problems?
First of all, you need to realise that data-modeling is going to be totally different from a relational database. I would suggest that you follow a few schema design presentations that are available through http://www.10gen.com/presentations (unselect "featured" and select "view all" and search for "schema"). http://www.10gen.com/presentations/webinar/mongodb-schema-design-how-to-think-non-relational is a particularly good one.
There are lots of other "schema design" tutorials online as well, and I have to stress that simply converting your relational schema into MongoDB is not going to get the best out of MongoDB as it's a totally different approach.

What NoSQL database to use as replacement for MySQL?

When it comes to NoSQL, there are bewildering number of choices to select a specific NoSQL database as is clear in the NoSQL wiki.
In my application I want to replace mysql with NOSQL alternative. In my application I have user table which has one to many relation with large number of other tables. Some of these tables are in turn related to yet other tables. Also I have a user connected to another user if they are friends.
I do not have documents to store, so this eliminates document oriented NoSQL databases.
I want very high performance.
The NOSQL database should work very well with Play Framework and scala language.
It should be open source and free.
So given above, what NoSQL database I should use?
I think you may be misunderstanding the nature of "document databases". As such, I would recommend MongoDB, which is a document database, but I think you'll like it.
MongoDB stores "documents" which are basically JSON records. The cool part is it understands the internals of the documents it stores. So given a document like this:
{
"name": "Gregg",
"fave-lang": "Scala",
"fave-colors": ["red", "blue"]
}
You can query on "fave-lang" or "fave-colors". You can even index on either of those fields, even the array "fave-colors", which would necessitate a many-to-many in relational land.
Play offers a MongoDB plugin which I have not used. You can also use the Casbah driver for MongoDB, which I have used a great deal and is excellent. The Rogue query DSL for MongoDB, written by FourSquare is also worth looking at if you like MongoDB.
MongoDB is extremely fast. In addition you will save yourself the hassle of writing schemas because any record can have any fields you want, and they are still searchable and indexable. Your data model will probably look much like it does now, with a users "collection" (like a table) and other collections with records referencing a user ID as needed. But if you need to add a field to one of your collections, you can do so at any time without worrying about the older records or data migration. There is technically no schema to MongoDB records, but you do end up organizing similar records into collections.
MongoDB is one of the most fun technologies I have happened to come across in the past few years. In that one happy Saturday I decided to check it out and within 15 minutes was productive and felt like I "got it". I routinely give a demo at work where I show people how to get started with MongoDB and Scala in 15 minutes and that includes installing MongoDB. Shameless plug if you're into web services, here's my blog post on getting started with MongoDB and Scalatra using Casbah: http://janxspirit.blogspot.com/2011/01/quick-webb-app-with-scala-mongodb.html
You should at the very least go to http://try.mongodb.org
That's what got me started.
Good luck!
At this point the answer is none, I'm afraid.
You can't just convert your relational model with joins to a key-value store design and expect it to be a 1:1 mapping. From what you said it seems that you do have joins, some of them recursive, i.e. referencing another row from the same table.
You might start by denormalizing your existing relational schema to move it closer to a design you wish to achieve. Then, you could see more easily if what you are trying to do can be done in a practical way, and which technology to choose. You may even choose to continue using MySQL. Just because you can have joins doesn't mean that you have to, which makes it possible to have a non-relational design in a relational DBMS like MySQL.
Also, keep in mind - non-relational databases were designed for scalability - not performance! If you don't have thousands of users and a server farm a traditional relational database may actually work better for you.
Hmm, You want very high performance of traversal and you use the word "friends". The first thing that comes to mind is Graph Databases. They are specifically made for this exact case.
Try Neo4j http://neo4j.org/
It's is free, open source, but also has commercial support and commercial licensing, has excellent documentation and can be accessed from many languages (REST interface).
It is written in java, so you have native libraries or you can embedd it into your java/scala app.
Regarding MongoDB or Cassendra, you now (Dec. 2016, 5 years late) try longevityframework.org.
Build your domain model using standard Scala idioms such as case classes, companion objects, options, and immutable collections. Tell us about the types in your model, and we provide the persistence.
See "More Longevity Awesomeness with Macro Annotations! " from John Sullivan.
He provides an example on GitHub.
If you've looked at longevity before, you will be amazed at how easy it has become to start persisting your domain objects. And the best part is that everything persistence related is tucked away in the annotations. Your domain classes are completely free of persistence concerns, expressing your domain model perfectly, and ready for use in all portions of your application.