What kind of projects benefit from using a NoSQL database instead of an RDBMS wrapped by an ORM?
Examples:
Stack Overflow-like sites?
Social communities?
Forums?
Your question is very general. NoSQL describes a collection of database techniques that are very different from each other. Roughly, there are:
Key-value stores (Redis, Riak)
Triplestores (AllegroGraph)
Column-family stores (Bigtable, Cassandra)
Document-oriented stores (CouchDB, MongoDB)
Graph databases (Neo4j)
A project can benefit from the use of a document database during the development phase of the project, because you won't have to design complex entity-relation diagrams or write complex join queries. I've detailed other uses of document databases in this answer.
If your application needs to handle very large amounts of data, the development phase will likely be longer when you use a specialized NoSQL solution such as Cassandra. However, when your application goes into production, it will greatly benefit from the performance and scalability of Cassandra.
Very generally speaking, if an application has the following requirements:
scale horizontally
work with data model X
perform Y operations
the application will benefit from using a NoSQL solution that is geared towards storing data model X and performing Y operations on it. If you need more specific answers regarding a certain type of NoSQL database, you'll need to update your question.
Benefits during development (e.g. easier to use than SQL, no licensing costs)?
Benefits in terms of performance (e.g. runs like hell with a million concurrent users)?
What type of NoSQL database?
Update
Key-value stores can only be queried by key in most cases. They're useful to store simple data, such as user sessions, simple profile data or precomputed values and output. Although it is possible to store more complex data in key-value pairs, it burdens the application with the responsibility of maintaining 'manual' indexes in order to perform more advanced queries.
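To make the 'manual index' point concrete, here is a minimal sketch using redis-py, assuming a local Redis server; the key names and data are invented for illustration.

```python
# A minimal key-value sketch with redis-py (assumes a Redis server on localhost).
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Simple values: a user session stored under a single key with a TTL.
r.set("session:abc123", json.dumps({"user_id": 42, "cart": []}), ex=3600)

# More complex data burdens the application with maintaining a 'manual' index:
# to find users by country, we must update a set ourselves on every write.
r.set("user:42", json.dumps({"name": "Alice", "country": "NL"}))
r.sadd("index:country:NL", 42)  # hand-rolled secondary index

# "Querying" by country means reading the manual index, then fetching each key.
user_ids = r.smembers("index:country:NL")
users = [json.loads(r.get(f"user:{uid.decode()}")) for uid in user_ids]
```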
Triplestores are for storing Resource Description Framework (RDF) data. I don't know anything about these stores beyond what Wikipedia tells me, so you'll have to do some research on that.
Column-family stores are built for storing and processing very large amounts of data. They are used by Google's search engine and Facebook's inbox search. The data is queried by MapReduce functions. Although MapReduce functions may be hard to grasp in the beginning, the concept is quite simple. Here's an analogy which (hopefully) explains the concept:
Imagine you have multiple shoe-boxes filled with receipts, and you want to calculate your total expenses. You invite some of your friends over and assign a person to each shoe-box. Each person writes down the total of each receipt in his shoe-box. This process of selecting the required data is the Map part.
When a person has written down the totals of (some of) his receipts, he can sum up these totals. This is the Reduce part and can be repeated multiple times until all receipts have been handled. In the end, all of your friends come together and sum up their total sums, giving you your total expenses. That's the final Reduce step.
The advantage of this approach is that you can have any number of shoe-boxes and you can assign any number of people to a shoe-box and still end up with the same result. Each shoe-box can be seen as a server in the database's network. Each friend can be seen as a thread on the server. With MapReduce you can have your data distributed across many servers and have each server handle part of the query, optimizing the performance of your database.
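For the curious, here is the shoe-box analogy as a toy Python script: the worker pool plays the friends and the lists play the boxes. It only illustrates the map and reduce steps, not how Cassandra or Bigtable actually execute them.

```python
# Toy illustration of the shoe-box analogy; all data is made up.
from functools import reduce
from multiprocessing import Pool

shoe_boxes = [
    [12.50, 3.20, 7.80],   # box handled by friend 1
    [99.99, 1.10],         # box handled by friend 2
    [45.00, 5.25, 0.75],   # box handled by friend 3
]

def map_box(box):
    # Map: each friend writes down the total of the receipts in their own box.
    return sum(box)

if __name__ == "__main__":
    with Pool() as pool:                      # one "friend" per worker process
        partial_sums = pool.map(map_box, shoe_boxes)
    # Reduce: everyone comes together and sums up their partial sums.
    total = reduce(lambda a, b: a + b, partial_sums, 0.0)
    print(total)  # ~175.59
```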
Document-oriented stores are explained in this question, so I won't discuss them here.
Graph databases are for storing networks of highly connected objects, like the users on a social network for example. These databases are optimized for graph operations, such as finding the shortest path between two nodes, or finding all nodes within three hops from the current node. Such operations are quite expensive on RDBMS systems or other NoSQL databases, but very cheap on graph databases.
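As a rough illustration of why such operations are natural on a graph model, here is a tiny breadth-first "nodes within N hops" sketch over an in-memory adjacency list; a graph database runs this kind of traversal natively, while an RDBMS would need repeated self-joins.

```python
# Find all nodes within N hops of a start node (toy in-memory graph).
from collections import deque

graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
    "erin": [],
    "frank": [],
}

def within_hops(start, max_hops):
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue  # stop expanding past the hop limit
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, dist + 1))
    return seen - {start}

print(within_hops("alice", 3))  # {'bob', 'carol', 'dave', 'erin', 'frank'} (set order may vary)
```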
NoSQL refers to different design approaches, not just a different query language, and each approach offers different features. For example, column-oriented databases are used for large data warehouses, which might serve OLAP workloads.
This is similar to my question; you'll find a lot of resources there.
We're starting a new project and looking for an appropriate storage solution for our case.
Main requirements for the storage are as follows:
Ability to support a highly flexible and connected domain
Ability to support queries like "give all children of that item and the items linked to those children" in milliseconds
Full text search
Ad hoc analytics
Solid read and write performance
Scalability (as we want to offer a Saas version of our product)
First of all, we eliminated all RDBMSs, since we have a really flexible schema which can also be changed by the customer (adding new fields etc.), so supporting such a schema in any RDBMS could become a nightmare...
And we came to NoSQL. We evaluated several NoSQL storage engines and chose the three most appropriate (as we think).
MongoDB
Pros:
Appropriate to store aggregates with a flexible structure (as we have them)
Scalability/Maturity/Support/Community
Experience with MongoDB on previous project
Drivers, cloud support
Analytics
Price (it's free)
Cons:
No support for relationships (really important for us, as we have a lot of connected items)
Slow retrieval of connected data (all joins happen in the app)
Neo4j:
Pros:
Support for connected data in modeling, flexibility
Fast retrieval of interconnected data
Drivers, cloud support
Maturity/Support/Community (if we compare with other graph DBs)
Cons:
No support for aggregate storage (we would like to have aggregates in one vertex rather than spread over several)
Scalability (as far as I know, all data is currently duplicated on the other servers)
Analytics?
Write performance? (I read several blogs where customers complained about its write performance)
Price (it is not free for commercial software)
OrientDB
Pros:
It seems that OrientDB has all the features we need (aggregates and a graph DB in one solution)
Price (looks like it's free)
Cons:
Immaturity (compared with the others)
A really small company behind the technology (in particular, one main contributor), so there are questions about support, known issues, etc.
A lot of features, but do they all work well?
So now, the main dilemma for us is between Neo4j and OrientDB (MongoDB is a third option because of its lack of relationships, which are really important in our case - this post explains the pitfalls). I've searched for benchmarks/comparisons of these DBs, but all of them are old. Here is a comparison by features: http://vschart.com/compare/neo4j/vs/orientdb. So now we need advice from people who have already used these DBs on what to choose. Thanks in advance.
I think there are interesting trade-offs with each of these:
MongoDB doesn't do graphs;
Neo4j's nodes are flat key-value properties;
OrientDB forces you to choose between graphs and documents (can't do both simultaneously).
So your choice is between a graph store (Neo4j or Orient) and a document store (Mongo or Orient). My sense is that MongoDB is the leading document store and Neo4j is the leading graph database, which would lead me to pick one of these. But since connectivity is important, I'd lean towards the graph database and take Neo4j.
Neo4j's scalability is proven: it's in use for graphs larger than Facebook's and by enormous companies like Walmart and eBay. So if your problem is anywhere between 0-120% of Facebook's social graph, Neo4j has you covered. Write throughput is fine with Neo4j - I get in excess of 2,000 proper ACID transactions per second on a laptop, and I can easily queue writes to multiply that out.
Everything else is pretty equal: you can choose to pay for any of these or use them freely under their open source licenses (including Neo4j, if you can work with GPL/AGPL). Neo4j's paid licenses have great support (24x7x365, 1-hour turnaround worldwide) versus OrientDB's rather lacklustre support (4-hour turnaround, EU daytime only), and I imagine MongoDB has good support too (though I have not checked up on it).
In short, there's a reason Neo4j is the top database for connected data: it kicks ass!
Jim
To correct some misconceptions regarding MongoDB:
Relations are supported, by either linking to other documents or embedding them. Please see the Data Modeling Introduction in the MongoDB docs for details. It may be that you are forced to trade normalization against speed, though. However, there are use cases in which embedding is the better solution compared to relations. Think of orders: when embedding order items and their price, you do not need to have a price history table for each and every product ever sold.
What is not supported are JOINs, which you can circumvent by embedding documents, as mentioned above.
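A small pymongo sketch of the two styles, assuming a local MongoDB server; the collection and field names are invented for illustration. Embedding freezes the order items (and their prices) inside the order, while linking stores a reference that the application resolves with a second query.

```python
# Embedding vs. linking in MongoDB (assumes a server on localhost).
from pymongo import MongoClient

db = MongoClient()["shop"]

# Embedding: order items, with the price at time of sale, live inside the
# order document, so no price-history table and no join is needed to read it.
db.orders.insert_one({
    "customer": "alice",
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.99},   # price frozen at sale time
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
})

# Linking: store a reference and resolve it with a second query in the app.
product_id = db.products.insert_one({"sku": "A-1", "name": "Widget"}).inserted_id
db.reviews.insert_one({"product_id": product_id, "stars": 5})

review = db.reviews.find_one({"stars": 5})
product = db.products.find_one({"_id": review["product_id"]})  # the app-side "join"
```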
MongoDB can be used for tree structures. Please see Model Tree Structures with Materialized Paths for details. This approach seems to be the most appropriate way to implement a tree structure for the mentioned use case. An alternative may be an array of ancestors, depending on your needs.
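Here is a minimal sketch of the materialized-paths pattern from the MongoDB docs, with made-up category names: each document stores its ancestor path as a delimited string, so an entire subtree comes back from one anchored regex query.

```python
# Materialized paths for tree structures (assumes a local MongoDB server).
import re

from pymongo import MongoClient

categories = MongoClient()["tracker"]["categories"]
categories.insert_many([
    {"_id": "Books", "path": None},
    {"_id": "Programming", "path": ",Books,"},
    {"_id": "Databases", "path": ",Books,Programming,"},
    {"_id": "NoSQL", "path": ",Books,Programming,Databases,"},
])

# "Give all children of that item and items linked to those children":
# every descendant of Programming, in a single indexable query.
subtree = categories.find({"path": re.compile(r"^,Books,Programming,")})
for node in subtree:
    print(node["_id"])  # Databases, NoSQL
```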
That being said, MongoDB may fall short on one of the basic requirements, though this really depends on how you define it: ad hoc analysis. My suggestion would be to model the intended data structure using a document-oriented approach (as opposed to imposing a relational approach on a document-oriented database) and to prototype one of the possible analysis use cases with dummy data.
I'm naturally a front-end guy, but database design and back-end dev have piqued my interest over the last few days, and I'm left very confused. I'd like to fully grasp these concepts so my head doesn't hurt anymore.
I'm used to dealing with an ORM like Active Record. When I imagine my user objects through Rails, I picture an object of a person being related to a row in the people table. OK, basic.
So, I read that non-relational databases like MongoDB aren't just cool because they're "fast with big data", but also because they apparently make development more natural with an OOP language (why?). Then I also read that most design patterns probably aren't truly relational. OK, this is where I get lost.
1) What are some fundamental examples of relational design and non-relational design?
2) Similar to the above, what are examples of structured data vs. unstructured (is this just rephrasing the above?)
So, given those things, I feel (in my immediate ignorance) that almost every type of project I've attempted to model has been relational. But maybe I'm just using semantics over technicality. For example, posts and comments: relational to each other. Add users in there. It seems most apps these days have data that is always useful to reach through other data/objects. That's relational, isn't it?
How about something less typical?
Let's say I was building a workout tracker app. I have users, exercises, workouts, routines and log_entries.
I create a routine to group workouts. A workout is a group of exercises. I record my reps and weight through log entries for my exercises. Is this relational data or non-relational? Would Mongo be great for modeling this, or awful?
I hear things like statistics come into play. How does that affect the above example? What statistics are people usually speaking of?
Let's say I added tracking other things, like a user's weight, height, body fat and so on. How does that affect things?
Thank you for taking the time to help me understand.
Edit: Can anyone explain why it may be easier to develop for one over the other? Is using something like Mongo more agile, both because it "clicks" more once you "get it" and because you don't need to run migrations?
Another thing: when using an abstraction like an ORM, does it really matter? If so, when? To me, the initial value would be the ease of querying data and modeling my objects. Whatever lets me do that more easily is what I'd be happy with. I truthfully do find myself scratching my head a lot when trying to model data.
OK, let me give it a stab...
(DISCLAIMER: I have very little practical experience with non-relational databases, so this will be a bit "relation-centric".)
Relational databases use "cookie-cutters" (tables) to produce large numbers of uniform "cookies" (rows in these tables). Non-relational databases give you greater latitude as to how to shape your "cookies", but you lose the power of set-based SQL. A little bit like static vs. dynamic typing.
1) What are some fundamental examples of relational design and non-relational design?
A tuple is an ordered list of attributes; a relation is a set of tuples of the same "shape"; a table is a physical representation of a relation. If something fits this paradigm, it is "relational".
Financial transactions tend to be relational, while (say) text documents don't. There is a lot of gray area in between.
2) Similar to the above, what are examples of structured data vs. unstructured (is this just rephrasing the above?)
I don't know. How do you define "structured"?
it seems most apps these days have data that is always useful to reach through other data/objects. That's relational, isn't it?
Sure. Relational databases don't have "pointers" per se, but foreign keys fulfill essentially the same role. You can always JOIN the data the way you see fit, although JOINs are usually done over FKs.
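If it helps, here is a tiny self-contained example using Python's built-in sqlite3 module, with an invented schema, showing a foreign key playing the "pointer" role and a JOIN following it.

```python
# A foreign key as a "pointer", dereferenced by a JOIN (stdlib only).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        user_id INTEGER REFERENCES users(id),  -- the FK "pointer"
                        title TEXT);
    INSERT INTO users VALUES (1, 'alice');
    INSERT INTO posts VALUES (10, 1, 'Hello world');
""")

# The JOIN over the FK dereferences the pointer in one set-based query.
rows = con.execute("""
    SELECT users.name, posts.title
    FROM posts JOIN users ON posts.user_id = users.id
""").fetchall()
print(rows)  # [('alice', 'Hello world')]
```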
Let's say I was building a workout tracker app. I have users, exercises, workouts, routines and log_entries.
It looks like this would fit nicely in the relational model.
Let's say I added tracking other things, like a user's weight, height, body fat and so on. How does that affect things?
You'll probably need additional columns and/or tables. This still looks relational to me.
What statistics are people usually speaking of?
Perhaps they are talking about index statistics, that help the cost-based query optimizer pick a better execution plan?
why it may be easier to develop for one over the other
Right tool for the job, remember? If you know the structure of your data in advance and it fits the relational model, use a relational database.
Also, relational databases tend to be extraordinarily good at managing huge amounts of data, accessed concurrently by many users (though this might be true for some non-relational databases as well).
Another thing, when using an abstraction like an ORM - does it really matter?
ORMs tend to make easy things easier and hard things harder. Again, it's a matter of balance. For me, being able to write finely-tuned SQL that extracts exactly the fields I need trumps the tedium of writing boilerplate CRUD, so I'm not a very big fan of ORMs. Other people might disagree.
I truthfully do find myself scratching my head a lot when trying to model data.
Well, it is a big topic. If you are willing to put the time and effort in it, start with the ERwin Methods Guide, and also take a look at: Use The Index, Luke!
I'm wondering if NoSQL is an option for this scenario:
The input is hourly stock data (SKU, amount, price, and some more specifics) from several sources. Older versions will just get dropped. So we won't get over 1 million data sets in the near future, and there won't be any business intelligence queries like in data warehouses. But there will be aggregations, at least for the minimum price of a group of articles, which has to be updated if the article with the minimum price of a group is sold out. In addition to these frequent bulk writes, there will be single decrements of the amount of an article, which can happen at any time.
The database would be part of a service which needs to give fast responses to requests via REST, so there needs to be some kind of caching. There is no need for strong consistency, but durability is required.
Further wishlist:
should scale well for growing request load
inexpensive technologies in terms of money and complexity (no Oracle cluster)
no proprietary languages (no PL/SQL)
MongoDB with its aggregation framework seems promising. Can you think of alternatives? (I am not set on NoSQL!)
I would start with Redis, and here is why:
"there needs to be some kind of caching" => and that is what Redis is best at. If for any reason you decide that you need "more", you can add "more", but still keep whatever you already developed in Redis as a cache for that "more"
One Redis is fast. Two Redises are faster. Three Redises are a unit faster than two, etc.
Learning curve is quite flat, and fun => since set theory really is fun
Increments / Decrements / Min / Max is Redis' native talk (see the sketch after this list)
Redis integration with XYZ (you mentioned a need for a REST API) is all over google and github
Redis is honest <= actually one of my favorite features of Redis
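Here is a hedged redis-py sketch of the asker's scenario, assuming a local Redis server and with invented key names: a hash per article for atomic decrements, and a sorted set per article group so the minimum price is a cheap read.

```python
# Stock tracking with atomic decrements and a per-group min-price index.
import redis

r = redis.Redis()

# Hourly bulk write: store each article as a hash, index its price per group.
r.hset("article:sku42", mapping={"amount": 100, "price": 19.99})
r.zadd("group:shoes:prices", {"sku42": 19.99, "sku43": 24.99})

# Single decrement on a sale, at any time, atomically.
remaining = r.hincrby("article:sku42", "amount", -1)
if remaining <= 0:
    # Sold out: drop it from the group index so the minimum stays correct.
    r.zrem("group:shoes:prices", "sku42")

# Minimum price of the group: the first element of the sorted set.
print(r.zrange("group:shoes:prices", 0, 0, withscores=True))
# [(b'sku42', 19.99)] while sku42 is still in stock
```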
MongoDB would work at first, as would ANY other major NoSQL DB, but why!?
I would go with Redis, and if you decide later that you need "more", I would first look at "Redis + a SQL DB (Postgres/MySQL/etc.)". It will give you the best of both worlds: "caching / speed" and "aggregation power", in case your aggregations need to go above and beyond Min/Max/Incr/Decr.
Whoever tells you PostgreSQL "is not fast enough for writing" does not know it.
Whoever tells you that MySQL "is not scalable enough" does not know it (e.g. Facebook runs on MySQL).
As I am already on a roll :) => whoever tells you MongoDB has "replica sets and sharding" does not wish you well, since replica sets and sharding only look sexy in the docs and the hype. Once you need to reshard / reorganize replica sets, you'll know the price of a wrong shard-key selection and of magic chunk movements...
Again => Redis FTW!
Well, it seems to me like MongoDB is the best choice.
It has not only aggregation features but also map/reduce query capabilities for statistics calculation. It can be scaled via replica sets and sharding, and it has atomic updates for increments (a decrement is just a negative increment).
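A brief pymongo sketch of both features, assuming a local MongoDB server, with invented collection and field names: an atomic $inc for the amount, and an aggregation pipeline computing the minimum price per group.

```python
# Atomic increments and the aggregation framework in MongoDB.
from pymongo import MongoClient

stock = MongoClient()["shop"]["stock"]
stock.insert_many([
    {"sku": "A-1", "group": "shoes", "amount": 5, "price": 19.99},
    {"sku": "A-2", "group": "shoes", "amount": 3, "price": 24.99},
])

# Atomic decrement (a decrement is just a negative increment).
stock.update_one({"sku": "A-1"}, {"$inc": {"amount": -1}})

# Aggregation framework: minimum price per group, excluding sold-out items.
pipeline = [
    {"$match": {"amount": {"$gt": 0}}},
    {"$group": {"_id": "$group", "min_price": {"$min": "$price"}}},
]
for doc in stock.aggregate(pipeline):
    print(doc)  # {'_id': 'shoes', 'min_price': 19.99}
```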
Alternatives:
CouchDB - not fast enough on reads
Redis - a key/value DB; you would need to implement the article logic at the application level
MySQL - not scalable enough
PostgreSQL - could be a good alternative if scaled using PgBouncer, but it is not fast enough on writes
What are the pros and cons of MongoDB (document-based), HBase (column-based) and Neo4j (object graph)?
I'm particularly interested to know some of the typical use cases for each one.
What are good examples of problems that graphs can solve better than the alternative?
Maybe any Slideshare or Scribd worthy presentation?
MongoDB
Scalability: Highly available and consistent, but sucks at relations and many distributed writes. Its primary benefit is storing and indexing schemaless documents. Document size is capped at 4 MB, and indexing only makes sense for limited depth. See http://www.paperplanes.de/2010/2/25/notes_on_mongodb.html
Best suited for: Tree structures with limited depth
Use Cases: Diverse Type Hierarchies, Biological Systematics, Library Catalogs
Neo4j
Scalability: Highly available but not distributed. Powerful traversal framework for high-speed traversals in the node space. Limited to graphs around several billion nodes/relationships. See http://highscalability.com/neo4j-graph-database-kicks-buttox
Best suited for: Deep graphs with unlimited depth and cyclical, weighted connections
Use Cases: Social Networks, Topological analysis, Semantic Web Data, Inferencing
HBase
Scalability: Reliable, consistent storage in the petabytes and beyond. Supports very large numbers of objects with a limited set of sparse attributes. Works in tandem with Hadoop for large data processing jobs. http://www.ibm.com/developerworks/opensource/library/os-hbase/index.html
Best suited for: Directed, acyclic graphs
Use Cases: Log analysis, Semantic Web Data, Machine Learning
I know this might seem like an odd place to point to, but Heroku has recently gone nuts with their NoSQL offerings and has an OK overview of many of the current projects. It is in no way a Slideshare presentation, but it will help you start the comparison process:
http://blog.heroku.com/archives/2010/7/20/nosql/?utm_medium=email&utm_source=EmailBlast&utm_content=619506254&utm_campaign=HerokuSeptemberNewsletter-VersionB&utm_term=NoSQLHerokuandYou
Check out this at-a-glance comparison of NoSQL DBs:
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
MongoDB:
MongoDB is a document database, unlike a relational database. Documents store semi-structured data such as JSON objects (schema-free).
Key features:
The schema can change as the application evolves
Full indexing
Load balancing & Data sharding
Data replication
Consistency & Partition tolerance (CP) in CAP theory (Consistency, Availability, Partition tolerance)
When to use:
Real time analytics
High speed logging
Semi structured data management
When not to use:
Highly transactional applications with strong ACID properties (Atomicity, Consistency, Isolation & Durability); an RDBMS is preferred in this use case
Operating on data sets involving relations - foreign keys etc.
HBASE:
HBase is an open-source, non-relational, distributed column-family database.
Key features:
It provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection)
Supports variable schema where each row is different
Can serve as the input and output for MapReduce jobs
Compression, in-memory operation, and Bloom filters on a per-column basis (a Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set); see the sketch after this list
Achieves CP in CAP
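To make that parenthetical concrete, here is a toy Bloom filter in Python (purely illustrative, not HBase's implementation): membership tests are fast and memory-efficient, false positives are possible, but false negatives are not.

```python
# A toy Bloom filter: k hash positions per item over a fixed bit array.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive k independent positions by salting a cryptographic hash.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True may be a false positive; False is always definitive.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-key-123")
print(bf.might_contain("row-key-123"))  # True
print(bf.might_contain("row-key-999"))  # almost certainly False
```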
When to use HBase:
If you're loading data by key, searching data by key (or range), serving data by key, or querying data by key
Storing data by row that doesn’t conform well to a schema (variable schema)
When not to use HBase:
For relational analytics
Full table scans
Data to be aggregated, analyzed by rows instead of columns
Neo4j:
Neo4j is a graph database using the property graph data model (data is stored as a graph: nodes and relationships with properties).
Key features:
Supports full ACID(Atomicity, Consistency, Isolation and Durability) rules
Supports indexes by using Apache Lucene
Schema free, bottom-up data model design
High scalability, achieved through compact storage and in-memory caching of graphs
When to use:
Master data management
Network and IT Operations
Real time recommendations
Fraud detection
Social networks (like Facebook)
When not to use:
Bulk queries/Scans
If your application requires Partitioning & Sharding of data
Have a look at the comparison of various NoSQL technologies in this article.
Sources:
Wiki, SlideShare, Cloudera, Tutorials Point, Neo4j
You could also evaluate a multi-model DBMS, the second generation of NoSQL products. With a multi-model DBMS you don't have all the compromises of choosing just one model; instead, you get more than one.
The first multi-model NoSQL DBMS is OrientDB.
Pretty decent article here on MongoDB and NoRM (a .NET driver for MongoDB):
http://lukencode.com/2010/07/09/getting-started-with-mongodb-and-norm/
What is the fastest and most stable non-SQL database for storing big data and processing thousands of requests during the day (it's for a traffic exchange service)? I've found Kdb+ and Berkeley DB. Are they good? Are there other options?
More details...
Each day the server processes > 100K visits. For each visit I need to read the corresponding stats from the DB, write a log entry to the DB, and update the stats in the DB, i.e. three DB operations per visit. Traffic is continuously increasing, so the DB engine should be fast. On one side, the DB will be managed by a daemon written in C, Erlang, or another low-level language. On the other side, the DB will be managed by PHP scripts.
The file system itself is faster and more stable than almost anything else. It stores big data seamlessly and efficiently. The API is very simple.
You can store and retrieve from the file system very, very efficiently.
Since your question is a little thin on "requirements" it's hard to say much more.
What about Redis?
http://code.google.com/p/redis/
I haven't tried it yet, but I did read about it, and it seems fast and stable enough for data storage.
It also provides a decent anti-single-point-of-failure solution, as far as I understand.
Berkeley DB is tried, tested and hardened, and is at the heart of many mega-high-transaction-volume systems. One example is wireless carrier infrastructure that uses huge LDAP stores (OpenWave, for example) to process more than 2 BILLION transactions per day. These systems also commonly have something like Oracle in the mix for point-in-time recovery, but they use Berkeley DB as replicated caches.
Also, BDB is not limited to key/value pairs in the simple sense of scalar values. You can store anything you want in the value, including arbitrary structures/records.
What's wrong with SQLite? Since you did explicitly state non-SQL: Berkeley DB is based on key/value pairs, which might not suffice for your needs if you wish to expand the data sets; even more so, how would you make the data sets relate to one another using key/value pairs?
On the other hand, Kdb+, judging by the FAQ on their website, is a relational database that can handle SQL via their programming language Q. Be aware that if the need to migrate arises, there could be potential hitches, such as incompatible dialects or queries that use vendor specifics; hence the potential to get locked into that database and not be able to migrate at all - something to bear in mind for later on.
You need to be careful what you decide here and look at it from a long-term perspective: future upgrades, migration to another database, how easy it would be to scale up, etc.
One obvious entry in this category is InterSystems Caché. (Well, obvious to me...) Be aware, though, it's not cheap. (But I don't think Kdb+ is either.)
MongoDB is the fastest and best NoSQL database. Have a look at this performance benchmark.