I took the time and see the entire Hadi Hariri presentation of CouchDB for .NET Developers that took place in OreDev conference last year.
And I keep asking myself, where should I use such way to store data?
What small, medium and large examples can be took using a noSQL model?
In what application context I would save the data in JSON, and that do not follow a pattern? In what application context the retrieving of such data would be better and faster (along the application time) comparing to the process of getting from a SQL server? Licencing price? Is that the only one?
Let me share our case: we use a NoSQL system of document type to store and search our documents in full text. This requires a full-text indexing. We also do a facet search on the entire data. That is, we produce only "hit" count for a specific search broke down to some categories that we need. You can imagine an electronic shop selling photo cameras, so the facet search here can take place in price ranges. Thus you would be able to say, to which price range what types of cameras belong.
If you think about using a NoSQL system for document search, then small dataset would be order of GB's (let's say up to 10), a medium up to 100GB and large dataset of size up to 1TB. This is based on what I have seen people use Apache SOLR (from their mail-list) and what data volume we have in our company.
There are other types of NoSQL systems and associated use / business cases, where you can utilize them in conjuction with SQL systems or alone. You can have a look on this short PP presentation I made for an introductory talk on NoSQL systems: http://www.slideshare.net/dmitrykan/nosql-apache-solr-and-apache-hadoop
Related
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
We're starting a new project and looking for an appropriate storage solution for our case.
Main requirements for the storage are as follows:
Ability to support highly flexible and connected domain
Ability to support queries like "give all children of that item and items linked to that children" in ms
Full text search
Ad hoc analytics
Solid read and write performance
Scalability (as we want to offer a Saas version of our product)
First of all we eliminated all RDBMS, since we have really flexible schema which can also be changed by the customer (add new fields etc.),
so supporting such solution in any RDBMS can become a nightmare...
And we came to NoSQL. We evaluated sevaral NoSQL storage engines and chose 3 most appropriate (as we think).
MongoDB
Pros:
Appropriate to store aggregates with flexible structure (as we have
them)
Scalability/Maturity/Support/Community
Experience with MongoDB on previous project
Drivers, cloud support
Analitycs
Price (it's free)
Cons:
No support for relationships (relly important for us as we have a lot of connected items)
Slow retrieval of connected data (all joins happen in app)
Neo4j:
Pros:
Support of conencted data in modeling, flexibility
Fast retrieval of interconnected data
Drivers, cloud support
Maturity/Support/Comminity (if we compare with other graph Dbs)
Cons:
No support for aggregate storage (we would like to have aggregates in one vertex than in several)
Scalability (as far as I know, now all data is duplicated on other servers)
Analitics ?
Write performance ? (read several blogs where customers complained on its write performance)
Price (it is not free for commercial software)
OrientDB
Pros:
It seems that OrientDB has all the features that we need (aggregates and graphdb in one solution)
Price (looks like is't free)
Cons:
Immaturity (comparing with others)
Really small company behind the technology (In particular one main contributor), so questions about support, known issues etc.
A lot of features, but do they work pretty well
So now, the main dilemma for as is between Neo4j and OrientDB (MongoDb is a third option because its lack of relationships that are really important in our case - this post explains the pitfalls). I've searched for any benchmarks/comparison of these dbs, but all all of them are old. Here is a comparison by features http://vschart.com/compare/neo4j/vs/orientdb. So now we need an advice from people who already used these dbs, what to choose. Thanks in advance.
I think there are interesting trade-offs with each of these:
MongoDB doesn't do graphs;
Neo4j's nodes are flat key-value properties;
OrientDB forces you to choose between graphs and documents (can't do both simultaneously).
So your choice is between a graph store (neo4j or orient) and a document store (mongo or orient). My sense is that MongoDB is the leading document store and Neo4j is the leading graph database which would lead me to pick one of thse. But since connectivity is important, I'd lean towards the graph database and take Neo4j.
Neo4j's scalability is proven: it's in use for graphs larger than Facebook's and by enormous companies like Walmart and EBay. So if your problem is anywhere between 0-120% of Facebook's social graph, Neo4j has you covered. Write throughput is fine with Neo4j - I get in excess of 2,000 proper ACID Transactions per second on a laptop and I can easily queue writes to multiply that out.
Everything else is pretty equal: you can choose to pay for any of these or use them freely under their open source licenses (including Neo4j if you can work with GPL/AGPL). Neo4j's paid licenses have great support (up 24x7x365, 1 hour turnaround worldwide) versus OrientDB's rather lacklustre support (4 hour turnaround in the EU daytime only), and I imagine MongoDB has good support too (though I have not checked up on it).
In short, there's a reason Neo4j is the top database for connected data: it kicks ass!
Jim
To correct some misconceptions regarding mongoDB
Relations are supported, by either linking to other documents or embedding them. Please see the Data Modeling Introduction in the mongoDB docs for details. It may be that you are forced to trade normalization against speed, though. However, there are use cases in which embedding is the better solution compared to relations. Think of orders: When embedding order items and their price, you do not need to have a price history table for each and every product ever sold.
What is not supported are JOINs. Which you can circumvent by embedding documents, as mentioned above.
MongoDB can be used for tree structures. Please see Model Tree Structures with Materialized Paths for details. This approach seems to be the most appropriate way to implement a tree structure for the mentioned use case. An alternative may be an array of ancestors, depending on your needs.
That being said, mongoDB may fail in one of the basic requirements, though this really depends on how you define it: ad hoc analysis. My suggestion would be to model the intended data structure using a document oriented approach (in opposite of putting a relational approach on a document oriented database) and prototype one of the possible analysis use cases with dummy data.
let's say that I have an ecommerce website with million of products, that have millions of pageviews a day, mostly for product details pages.
Let's say that I currently have all my data in a relational DB, the old good way.
What would be the pros and cons of keeping the data in the relational DB for doing queries, aggreating and filtering products and all that...but using flat json files for the product details?
So, having 1 file per 1 product, with all details serialized to json. These files would be placed under a high-performance cdn, geographically distributed and all that. When the user goes to
www.mysite.com/prods/00123
the server (or even the client) would load a template file for the layout, and then fill it with the data it reads from something like cdn.mysite.com/prods/00123.json
So I basically don't need to do queries in this case - I jump straight to the file named after the product id. I guess it should be very fast, and yet I would delegate the scalability / caching / geographic distribution to an external strong partner (cdn like akamai, amazon etc.) instead of building my own (expensive and hard to maintain) distributed db server?
I look forward to your suggestions / feedback...especially if it comes to real world experience :)
Thanks!
As per your requirements,
It is better to store product descriptions in a schema free database like MongoDB since your products can have very different fields with wide variation in number of attributes (and corresponding fields). Also such information is written far less often then they are read. MongoDB has collection level write locks which deter write heavy applications if you like to do consistent writes. However the reads in MongoDB are very fast because you dont have to do joins or fetch field values from a EAV schema table. Needless to say, based on your data volume sharding and replication needs to be done in a production environment.
It is better than storing in a flat file since MongoDB's read performance is very good because of memory mapped files and you get replication/sharding as well.
However, if the filesystem (or the filesystem network) provides the security, speed and accessibility provided by the database then storing the data in filesystem is not a bad idea. The traditional db vs flat-file argument does not hold true if the flat files are configured to be served in an efficient manner.
However, you should not store information like shopping cart, checkout transaction, etc in MongoDB since you dont have ACID transactions and frequent writes and updates 'with consistency' is not MongoDB's cup of tea.
Does it make sense to break up the data model of an application into different database systems? For example, the application stores all user data and relationships in a graph database (ideal for storing relationships), while storing other data in a document database, such as CouchDB or MongoDB? This would require the user graph database to reference unique ids in the document databases and vice versa.
Is this over complicating the data model and application? Or is this using the best uses of both types of database systems for scaling your application?
It definitely can make sense and depends fully on the requirements of your application. If you can use other database systems for things in which they are really good at.
Take for example full text search. Of course you can do more or less complex full text searches with a relational database like MySql. But there are systems like e.g. Lucene/Solr which are optimized for such things and can search fast in millions of documents. So you could use these systems for their special task (here: make a nifty full text search), then you return the identifiers and maybe load the relational structured data from the RDBMS.
Or CouchDB. I use couchDB in some projects as a caching systems. In combination with a relational database. Of course I need to care about consistency, but it it's definitely worth the effort. It pushed performance in the projects a lot and decreases for example load on the server from 2 to 0.2. :)
Something like this is for instance called cross-store persistence. As you mentioned you would store certain data in your relational database, social relationships in a graphdb, user-generated data (documents) in a document-db and user provided multimedia files (pictures, audio, video) in a blob-store like S3.
It is mainly about looking at the use-cases and making sure that from wherever you need it you might access the "primary" or index key of each store (back and forth). You can encapsulate the actual lookup in your domain or dao layer.
Some frameworks like the Spring Data projects provide some initial kind of cross-store persistence out of the box, mostly integrating JPA with a different NOSQL datastore. For instance Spring Data Graph allows it to store your entities in JPA and add social graphs or other highly interconnected data as a secondary concern and leverage a graphdb for the typical traversal and other graph operations (e.g. ranking, suggestions etc.)
Another term for this is polyglot persistence.
Here are two contrary positions on the question:
Pro:
"Contrary to that, I’m a big fan of polyglot persistence. This simply means using the right storage backend for each of your usecases. For example file storages, SQL, graph databases, data ware houses, in-memory databases, network caches, NoSQL. Today there are mostly two storages used, files and SQL databases. Both are not optimal for every usecase."
http://codemonkeyism.com/nosql-polyglott-persistence/
Con:
"I don’t think I need to say that I’m a proponent of polyglot persistence. And that I believe in Unix tools philosophy. But while adding more components to your system, you should realize that such a system complexity is “exploding” and so will operational costs grow too (nb: do you remember why Twitter started to into using Cassandra?) . Not to mention that the more components your system has the more attention and care must be invested figuring out critical aspects like overall system availability, latency, throughput, and consistency."
http://nosql.mypopescu.com/post/1529816758/why-redis-and-memcached-cassandra-lucene
I work in the promotional products industry. We sell pretty much anything that you can print, embroider, engrave, or any other method to customize. Popular products are pens, mugs, shirts, caps, etc. Because we have such a large variety of products, storing information about these products including all the possible product options, decoration options, and all associated extra charges gets extremely complicated. So much so, that although many have tried, no one has been able to provide industry product data in such a way that you could algorithmically turn the data into an eCommerce store without some degree of data massaging. It seems near impossible to store this information to properly in a relational database. I am curious if MongoDB, or any other NoSQL option, would allow me to model the information in such a way that makes it easier to store and manipulate our product information better than an RDBMS like MySQL. The company I work for is over 100 years old and has been using DB2 on an AS400 for many years. I'll need some good reasons to convince them to go with a non relational DB solution.
A common example product used in our industry is the Bic Clic Stic Pen which has over 20 color options each for barrel and trim colors. Even more colors to choose for silkscreen decoration. Then you can choose additional options for what type of ink to use. There are multiple options for packaging. After all that is selected, you have additional option for rush processing. All of these options may or may not have additional charges that can be based on how many pens you order or how many colors in your decoration. Pricing is usually based on quantity, so ordering 250 pens would be cost more per pen than ordering 1000. Similarly, the extra charge for getting special ink would be cheaper per pen ordered when you order 1000 than 250.
Without wanting to sound harsh, this has the ring of a silver bullet question.
You have an inherently complex business domain. It's not clear to me that a different way of storing your data will have any impact on that complexity - storing documents rather than relational data probably doesn't make it easier to price your pen at $0.02 less if the customer orders more than 250.
I'd recommend focussing on the business domain, and not worrying too much about the storage mechanism. I'm a big fan of Domain Driven Design - this sounds like a perfect case for that approach.
Using a document database won't solve your problem completely, but it probably can help.
If your documents represent the options available on a product and an order for that product, in most cases you will be accessing the document as a whole - it's nothing you can't do with SQL, but a good fit for a document database. Since the structure of the documents is flexible, it is relatively easy to define an object within the document as a complex type to define a particular option or rule without changing the database.
However, that only helps with the data - your real problem is on the UI side. The two documents together map directly to the order form, but whatever method you use to define the options/rules some of the products are going to end up with extremely complex settings pages.
Yes, MongoDB is what you need. It doesn't have a strict documents structure, so you'll be able to create set of models you need and embed them into your product page in any order and combinations you need. Actually its possible to work with this data without describing the real model fields directly, so I (for example) can use fields my Rails application doesn't know about at all.
Also MongoDB is extremely easy to set up for replication and sharding. Also it supports GridFS virtual filesystem, so you can store images for your products with documents which describe them and manipulate them as a single object easily.
You should definitely give it a try.
UPD: Anyway it would be good to keep your RDBMS for financial data and crunching numbers, like grouping reports for the sales analysys and so on. NoSQL bases aren't very good at this.
I would like to test the NoSQL world. This is just curiosity, not an absolute need (yet).
I have read a few things about the differences between SQL and NoSQL databases. I'm convinced about the potential advantages, but I'm a little worried about cases where NoSQL is not applicable. If I understand NoSQL databases essentially miss ACID properties.
Can someone give an example of some real world operation (for example an e-commerce site, or a scientific application, or...) that an ACID relational database can handle but where a NoSQL database could fail miserably, either systematically with some kind of race condition or because of a power outage, etc ?
The perfect example will be something where there can't be any workaround without modifying the database engine. Examples where a NoSQL database just performs poorly will eventually be another question, but here I would like to see when theoretically we just can't use such technology.
Maybe finding such an example is database specific. If this is the case, let's take MongoDB to represent the NoSQL world.
Edit:
to clarify this question I don't want a debate about which kind of database is better for certain cases. I want to know if this technology can be an absolute dead-end in some cases because no matter how hard we try some kind of features that a SQL database provide cannot be implemented on top of nosql stores.
Since there are many nosql stores available I can accept to pick an existing nosql store as a support but what interest me most is the minimum subset of features a store should provide to be able to implement higher level features (like can transactions be implemented with a store that don't provide X...).
This question is a bit like asking what kind of program cannot be written in an imperative/functional language. Any Turing-complete language and express every program that can be solved by a Turing Maching. The question is do you as a programmer really want to write a accounting system for a fortune 500 company in non-portable machine instructions.
In the end, NoSQL can do anything SQL based engines can, the difference is you as a programmer may be responsible for logic in something Like Redis that MySQL gives you for free. SQL databases take a very conservative view of data integrity. The NoSQL movement relaxes those standards to gain better scalability, and to make tasks that are common to Web Applications easier.
MongoDB (my current preference) makes replication and sharding (horizontal scaling) easy, inserts very fast and drops the requirement for a strict scheme. In exchange users of MongoDB must code around slower queries when an index is not present, implement transactional logic in the app (perhaps with three phase commits), and we take a hit on storage efficiency.
CouchDB has similar trade-offs but also sacrifices ad-hoc queries for the ability to work with data off-line then sync with a server.
Redis and other key value stores require the programmer to write much of the index and join logic that is built in to SQL databases. In exchange an application can leverage domain knowledge about its data to make indexes and joins more efficient then the general solution the SQL would require. Redis also require all data to fit in RAM but in exchange gives performance on par with Memcache.
In the end you really can do everything MySQL or Postgres do with nothing more then the OS file system commands (after all that is how the people that wrote these database engines did it). It all comes down to what you want the data store to do for you and what you are willing to give up in return.
Good question. First a clarification. While the field of relational stores is held together by a rather solid foundation of principles, with each vendor choosing to add value in features or pricing, the non-relational (nosql) field is far more heterogeneous.
There are document stores (MongoDB, CouchDB) which are great for content management and similar situations where you have a flat set of variable attributes that you want to build around a topic. Take site-customization. Using a document store to manage custom attributes that define the way a user wants to see his/her page is well suited to the platform. Despite their marketing hype, these stores don't tend to scale into terabytes that well. It can be done, but it's not ideal. MongoDB has a lot of features found in relational databases, such as dynamic indexes (up to 40 per collection/table). CouchDB is built to be absolutely recoverable in the event of failure.
There are key/value stores (Cassandra, HBase...) that are great for highly-distributed storage. Cassandra for low-latency, HBase for higher-latency. The trick with these is that you have to define your query needs before you start putting data in. They're not efficient for dynamic queries against any attribute. For instance, if you are building a customer event logging service, you'd want to set your key on the customer's unique attribute. From there, you could push various log structures into your store and retrieve all logs by customer key on demand. It would be far more expensive, however, to try to go through the logs looking for log events where the type was "failure" unless you decided to make that your secondary key. One other thing: The last time I looked at Cassandra, you couldn't run regexp inside the M/R query. Means that, if you wanted to look for patterns in a field, you'd have to pull all instances of that field and then run it through a regexp to find the tuples you wanted.
Graph databases are very different from the two above. Relations between items(objects, tuples, elements) are fluid. They don't scale into terabytes, but that's not what they are designed for. They are great for asking questions like "hey, how many of my users lik the color green? Of those, how many live in California?" With a relational database, you would have a static structure. With a graph database (I'm oversimplifying, of course), you have attributes and objects. You connect them as makes sense, without schema enforcement.
I wouldn't put anything critical into a non-relational store. Commerce, for instance, where you want guarantees that a transaction is complete before delivering the product. You want guaranteed integrity (or at least the best chance of guaranteed integrity). If a user loses his/her site-customization settings, no big deal. If you lose a commerce transation, big deal. There may be some who disagree.
I also wouldn't put complex structures into any of the above non-relational stores. They don't do joins well at-scale. And, that's okay because it's not the way they're supposed to work. Where you might put an identity for address_type into a customer_address table in a relational system, you would want to embed the address_type information in a customer tuple stored in a document or key/value. Data efficiency is not the domain of the document or key/value store. The point is distribution and pure speed. The sacrifice is footprint.
There are other subtypes of the family of stores labeled as "nosql" that I haven't covered here. There are a ton (122 at last count) different projects focused on non-relational solutions to data problems of various types. Riak is yet another one that I keep hearing about and can't wait to try out.
And here's the trick. The big-dollar relational vendors have been watching and chances are, they're all building or planning to build their own non-relational solutions to tie in with their products. Over the next couple years, if not sooner, we'll see the movement mature, large companies buy up the best of breed and relational vendors start offering integrated solutions, for those that haven't already.
It's an extremely exciting time to work in the field of data management. You should try a few of these out. You can download Couch or Mongo and have them up and running in minutes. HBase is a bit harder.
In any case, I hope I've informed without confusing, that I have enlightened without significant bias or error.
RDBMSes are good at joins, NoSQL engines usually aren't.
NoSQL engines is good at distributed scalability, RDBMSes usually aren't.
RDBMSes are good at data validation coinstraints, NoSQL engines usually aren't.
NoSQL engines are good at flexible and schema-less approaches, RDBMSes usually aren't.
Both approaches can solve either set of problems; the difference is in efficiency.
Probably answer to your question is that mongodb can handle any task (and sql too). But in some cases better to choose mongodb, in others sql database. About advantages and disadvantages you can read here.
Also as #Dmitry said mongodb open door for easy horizontal and vertical scaling with replication & sharding.
RDBMS enforce strong consistency while most no-sql are eventual consistent. So at a given point in time when data is read from a no-sql DB it might not represent the most up-to-date copy of that data.
A common example is a bank transaction, when a user withdraw money, node A is updated with this event, if at the same time node B is queried for this user's balance, it can return an outdated balance. This can't happen in RDBMS as the consistency attribute guarantees that data is updated before it can be read.
RDBMs are really good for quickly aggregating sums, averages, etc. from tables. e.g. SELECT SUM(x) FROM y WHERE z. It's something that is surprisingly hard to do in most NoSQL databases, if you want an answer at once. Some NoSQL stores provide map/reduce as a way of solving the same thing, but it is not real time in the same way it is in the SQL world.