I've read some articles saying that an RDBMS such as MySQL doesn't scale well, but that a NoSQL database such as MongoDB can shard well.
I want to know which features an RDBMS provides that prevent it from sharding well.
Most RDBMS systems guarantee the so-called ACID properties. Most of these properties boil down to consistency; every modification on your data will transfer your database from one consistent state to another consistent state.
For example, if you update multiple records in a single transaction, the database will ensure that the records involved will not be modified by other queries, as long as the transaction hasn't completed. So during the transaction, multiple tables may be locked for modification. If those tables are spread across multiple shards/servers, it'll take more time to acquire the appropriate locks, update the data and release the locks.
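To make this concrete, here is a minimal single-node sketch using Python's built-in sqlite3 module (table and values invented for illustration): both updates succeed or neither does. Imagine the two rows living on different shards, and the cross-server locking described above becomes the expensive part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
    INSERT INTO accounts VALUES (1, 100), (2, 0);
""")

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # on failure, neither row changed: one consistent state to another

print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 70), (2, 30)]
```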
The CAP theorem states that a distributed (i.e. scalable) system cannot guarantee all of the following properties at the same time:
Consistency
Availability
Partition tolerance
RDBMS systems guarantee consistency. Sharding makes the system tolerant to partitioning. From the theorem it follows that the system can therefore not guarantee availability. That's why a standard RDBMS cannot scale very well: it won't be able to guarantee availability. And what good is a database if you can't access it?
NoSQL databases drop consistency in favor of availability. That's why they are better at scalability.
I'm not saying RDBMS systems cannot scale at all; it's just harder. This article outlines some of the possible sharding schemes and the problems you may encounter. Most approaches sacrifice consistency, which is one of the most important features of RDBMS systems and the very thing that prevents them from scaling.
Why NoSQL dudes and dudettes don't like joins: http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/
Queries involving multiple shards are complex (e.g. JOINs between tables in different shards)
I'm wondering if NoSQL is an option for this scenario:
The input is hourly stock data (sku, amount, price and some more specific fields) from several sources. Older versions will simply get dropped, so we won't get over 1 million data sets in the near future, and there won't be any business-intelligence queries like in data warehouses. But there will be aggregations, at least for the minimal price of a group of articles, which has to be updated when the article with the minimal price of a group is sold out. In addition to these frequent bulk writes, there will be single decrements on the amount of an article, which can happen at any time.
The database would be part of a service which needs to give fast responses to requests via REST, so there needs to be some kind of caching. There is no need for strong consistency, but durability is required.
Further wishlist:
should scale well for growing request load
inexpensive technologies in terms of money and complexity (no Oracle cluster)
no proprietary languages (no PL/SQL)
MongoDB with its aggregation framework seems promising. Can you think of alternatives? (I do not stick to NoSQL!)
I would start with Redis, and here is why:
"there needs to be some kind of caching" => and that is what Redis is best at. If for any reason you decide that you need "more", you can add "more", but still keep whatever you already developed in Redis as a cache for that "more"
One Redis is fast. Two Redises are faster. Three Redises are faster still, etc.
Learning curve is quite flat, and fun => since set theory really is fun
Increments / decrements / min / max are Redis' native talk (see the sketch after this list)
Redis integration with XYZ (you mentioned a need for a REST API) is all over google and github
Redis is honest <= actually one of my favorite features of Redis
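A minimal sketch of those last points, assuming the redis-py client and a local Redis instance (key names invented for the question's stock scenario):

```python
import redis  # assumes redis-py: pip install redis

r = redis.Redis()  # local instance on the default port

# atomic counters: increments / decrements are native commands
r.set("stock:sku42", 10)
r.decr("stock:sku42")             # one article sold
print(int(r.get("stock:sku42")))  # 9

# min price of a group of articles via a sorted set
r.zadd("prices:group:7", {"sku42": 9.99, "sku43": 12.50, "sku44": 8.75})
print(r.zrange("prices:group:7", 0, 0, withscores=True))  # cheapest article

# if the cheapest article sells out, remove it and the min updates itself
r.zrem("prices:group:7", "sku44")
```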
MongoDB would work at first, so will ANY other major NoSQL, but why!?
I would go with Redis, and if you decide later you need "more", I would first look at "Redis + SQL db (Postgres/MySQL/etc.)"; it will give you the best of both worlds => "Caching / Speed" and the "Aggregation Power", in case your aggregations need to go above and beyond Min/Max/Incr/Decr.
Whoever tells you PostgreSQL "is not fast enough for writing" does not know it.
Whoever tells you that MySQL "is not scalable enough" does not know it (e.g. Facebook runs on MySQL).
As I am already on a roll :) => whoever tells you MongoDB has "replica sets and sharding" does not wish you well, since replica sets and sharding only look sexy in the docs and the hype. Once you need to reshard / reorganize replica sets, you'll know the price of a wrong shard-key selection and magic chunk movements...
Again => Redis FTW!
Well, it seems to me that MongoDB is the best choice.
It has not only aggregation features but also map/reduce query capabilities for statistics-calculation purposes. It can be scaled via replica sets and sharding, and has atomic updates for increments (a decrement is just a negative increment).
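As a rough illustration (not part of the original answer), assuming the official pymongo driver and invented collection/field names, the atomic decrement and a min-price aggregation could look like this:

```python
from pymongo import MongoClient  # assumes pymongo: pip install pymongo

articles = MongoClient().shop.articles  # hypothetical db/collection names

# atomic decrement of an article's amount (a negative increment)
articles.update_one({"sku": "sku42"}, {"$inc": {"amount": -1}})

# aggregation framework: minimal price per article group
pipeline = [{"$group": {"_id": "$group", "minPrice": {"$min": "$price"}}}]
for doc in articles.aggregate(pipeline):
    print(doc["_id"], doc["minPrice"])
```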
Alternatives:
CouchDB - not fast enough on reading
Redis - is a key/value db; you will need to program the article logic at the application level
MySQL - is not scalable enough
PostgreSQL - could be a good alternative if scaled using pgbouncer, but is not fast enough on writing
I was looking through their website and I can't understand the problem that they are solving. What is the problem with relational DBs? How can data stored in JSON documents be any faster than data stored in an SQL database?
In a fully normalized relational DB, every insertion will often require several look-ups in other tables (and its own table) to maintain data integrity (FKs). This is generally a good thing, but takes time. It's also often the case that you need to update several rows in different tables at once, leading to even more look-ups and transactional overhead.
Querying the database will also often need to look at many different tables and merge them.
A MongoDB document, on the other hand, is a much simpler construct. Every collection is like a big un-normalized table in which all fields are optional (but still indexable), so there is very little space overhead (compared to a relational DB with the same setup).
It offers flexibility and speed at the cost of complex querying and of moving data-integrity logic from the server to the client (database client, not end-user client ;)).
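A hedged sketch of what that simpler construct looks like in practice, assuming pymongo and invented field names; the whole aggregate is written and read in a single operation, with integrity checks left to the application:

```python
from pymongo import MongoClient  # assumes pymongo; names are illustrative

posts = MongoClient().blog.posts

# one insert, no foreign-key look-ups: comments are embedded in the post
posts.insert_one({
    "title": "Why document stores can be fast",
    "author": "alice",
    "tags": ["nosql", "mongodb"],
    "comments": [
        {"user": "bob", "text": "Nice post"},
        {"user": "carol", "text": "What about joins?"},
    ],
})

# one query returns the whole aggregate; a normalized schema would need joins
print(posts.find_one({"title": "Why document stores can be fast"}))
```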
Both have their uses, but the question that has normally been "do we need something different from a relational DB?" should nowadays be "do we need something more complex than a document DB?" imo, and the vast majority of projects will not.
If you're happy with a relational database for your task, you needn't switch to MongoDB. MongoDB is supposed to make scaling out simpler than an RDBMS, and for some tasks you can benefit from its flexible schema as well. It mostly makes sense to discuss using a particular database for a concrete task.
We are looking at a document db storage solution with failover clustering, for a read/write-intensive application.
We will be having an average of 40K concurrent writes per second written to the db (with peaks that can go up to 70,000) - and may have around a similar number of reads happening.
We also need a mechanism for the db to notify about the newly written records (some kind of trigger at db level).
What will be a good option in terms of a proper choice of document db and related capacity planning?
Updated
More details on the expectation.
On average, we are expecting 40,000 (40K) inserts (new documents) per second across 3-4 databases/document collections.
The peak may go up to 120,000 (120K) inserts.
The Inserts should be readable right away - almost realtime
Along with this, we expect around 5000 updates or deletes per second
Along with this, we also expect 500-600 concurrent queries accessing data. These queries and execution plans are somewhat known, though they might have to be updated, say, once a week or so.
The system should support failover clustering on the storage side
if "20,000 concurrent writes" means inserts then I would go for CouchDB and use "_changes" api for triggers. But with 20.000 writes you would need a stable sharding aswell. Then you would better take a look at bigcouch
And if "20.000" concurrent writes consist "mostly" updates I would go for MongoDB for sure, since Its "update in place" is pretty awesome. But then you should handle triggers manually, but using another collection to update in place a general document can be a handy solution. Again be careful about sharding.
Finally, I think you cannot select a database on concurrency alone; you need to plan the API (how you would retrieve data) and then look at the options at hand.
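To illustrate the "_changes" approach mentioned above: a minimal long-polling consumer, assuming Python with the requests library and a local CouchDB (the database name is hypothetical):

```python
import requests  # assumes the requests library and CouchDB on localhost:5984

DB = "http://localhost:5984/mydb"  # hypothetical database name
since = 0

while True:
    # long-poll blocks until at least one new change arrives
    resp = requests.get(f"{DB}/_changes",
                        params={"feed": "longpoll", "since": since}).json()
    for change in resp["results"]:
        print("new/updated doc:", change["id"])
    since = resp["last_seq"]  # resume from where we left off
```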
I would recommend MongoDB. My requirements weren't nearly as high as yours, but they were reasonably close. Assuming you'll be using C#, I recommend the official MongoDB C# driver and the InsertBatch method with SafeMode turned on. It will literally write data as fast as your file system can handle. A few caveats:
MongoDB does not support triggers (at least the last time I checked).
MongoDB initially caches data to RAM before syncing to disk. If you have real-time durability needs, you might want to lower the fsync interval, which will incur a significant performance hit (see the sketch after these caveats).
The C# driver is a little wonky. I don't know if it's just me but I get odd errors whenever I try to run any long running operations with it. The C++ driver is much better and actually faster than the C# driver (or any other driver for that matter).
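A rough Python analogue of "InsertBatch with SafeMode turned on" and the durability caveat above, assuming the pymongo driver (the original answer used the C# driver); journaled, acknowledged writes trade throughput for durability:

```python
from pymongo import MongoClient, WriteConcern  # assumes pymongo

coll = MongoClient().mydb.events.with_options(
    # w=1, j=True: acknowledged, journaled writes, roughly "SafeMode on";
    # journaling costs throughput, as the caveat above describes
    write_concern=WriteConcern(w=1, j=True)
)

# batched inserts amortize network round-trips (the InsertBatch equivalent)
coll.insert_many([{"seq": i, "payload": "..."} for i in range(1000)])
```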
That being said, I'd also recommend looking into RavenDB as well. It supports everything you're looking for but for the life of me, I couldn't get it to perform anywhere close to Mongo.
The only other database that came close to MongoDB was Riak. Its default Bitcask backend is ridiculously fast as long as you have enough memory to store the keyspace but as I recall it doesn't support triggers.
Membase (and the soon-to-be-released Couchbase Server) will easily handle your needs and provide dynamic scalability (on-the-fly add or remove nodes), replication with failover. The memcached caching layer on top will easily handle 200k ops/sec, and you can linearly scale out with multiple nodes to support getting the data persisted to disk.
We've got some recent benchmarks showing extremely low latency (which roughly equates to high throughput): http://10gigabitethernet.typepad.com/network_stack/2011/09/couchbase-goes-faster-with-openonload.html
Don't know how important it is for you to have a supported Enterprise class product with engineering and QA resources behind it, but that's available too.
Edit: Forgot to mention that there is a built-in trigger interface already, and we're extending it even further to track when data hits disk (persisted) or is replicated.
Perry
We are looking at a document db storage solution with failover clustering, for a read/write-intensive application
Riak with Google's LevelDB backend [here is an awesome benchmark from Google], given enough cache and solid disks, is very fast. Depending on the structure of the document and its size (you mentioned 2KB), you would need to benchmark it of course. [Keep in mind that if you are able to shard your data (business-wise), you do not have to sustain 40K/s throughput on a single node.]
Another advantage with LevelDB is data compression => storage. If storage is not an issue, you can disable the compression, in which case LevelDB would literally fly.
Riak with secondary indices allows you to make your data structures as documented as you like => you index only those fields that you care about searching by.
Successful and painless Fail Over is Riak's second name. It really shines here.
We also need a mechanism for the db to notify about the newly written records (some kind of trigger at db level)
You can rely on pre-commit and post-commit hooks in Riak to achieve that behavior but, as with any triggers, it comes with a price => performance / maintainability.
The Inserts should be readable right away - almost realtime
Riak writes to disk (no async MongoDB surprises) => reliably readable right away. In case you need a better consistency, you can configure Riak's quorum for inserts: e.g. how many nodes should come back before the insert is treated as successful
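A sketch of the quorum idea, assuming the official (now archived) Basho Python client; bucket and key names are invented:

```python
import riak  # assumes the Basho Python client: pip install riak

client = riak.RiakClient()  # local node, default ports
bucket = client.bucket("visits")

obj = bucket.new("visit:123", data={"path": "/", "ts": 1320000000})
# w=2: the write is acknowledged only after 2 of N replicas confirm it
obj.store(w=2)

# read the write back; r=2 demands agreement from 2 replicas as well
print(bucket.get("visit:123", r=2).data)
```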
In general, if fault tolerance / concurrency / failover / scalability are important to you, I would go with data stores that are written in Erlang, since Erlang has been solving these problems successfully for many years.
What kinds of projects benefit from using a NoSQL database instead of an RDBMS wrapped by an ORM?
Examples:
Sites similar to Stack Overflow?
Social communities?
Forums?
Your question is very general. NoSQL describes a collection of database techniques that are very different from each other. Roughly, there are:
Key-value stores (Redis, Riak)
Triplestores (AllegroGraph)
Column-family stores (Bigtable, Cassandra)
Document-oriented stores (CouchDB, MongoDB)
Graph databases (Neo4j)
A project can benefit from the use of a document database during the development phase of the project, because you won't have to design complex entity-relation diagrams or write complex join queries. I've detailed other uses of document databases in this answer.
If your application needs to handle very large amounts of data, the development phase will likely be longer when you use a specialized NoSQL solution such as Cassandra. However, when your application goes into production, it will greatly benefit from the performance and scalability of Cassandra.
Very generally speaking, if an application has the following requirements:
scale horizontally
work with data model X
perform Y operations
the application will benefit from using a NoSQL solution that is geared towards storing data model X and performing Y operations on the data. If you need more specific answers regarding a certain type of NoSQL database, you'll need to update your question.
Benefits during development (e.g. easier to use than SQL, no licensing costs)?
Benefits in terms of performance (e.g. runs like hell with a million concurrent users)?
What type of NoSQL database?
Update
Key-value stores can only be queried by key in most cases. They're useful to store simple data, such as user sessions, simple profile data or precomputed values and output. Although it is possible to store more complex data in key-value pairs, it burdens the application with the responsibility of maintaining 'manual' indexes in order to perform more advanced queries.
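To illustrate that 'manual' index burden, a small sketch assuming redis-py (any key-value store works the same way); the key layout is invented for the example:

```python
import redis  # assumes redis-py

r = redis.Redis()

def save_user(user_id: str, profile: dict) -> None:
    # primary record: key -> value
    r.hset(f"user:{user_id}", mapping=profile)
    # 'manual' secondary index, maintained by the application, not the store
    r.sadd(f"users:city:{profile['city']}", user_id)

save_user("42", {"name": "alice", "city": "Ghent"})
save_user("43", {"name": "bob", "city": "Ghent"})

# the only way to 'query by city' is through the index we built ourselves
print(r.smembers("users:city:Ghent"))  # {b'42', b'43'}
```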
Triplestores are for storing Resource Description Framework (RDF) metadata. I don't know anything about these stores except for what Wikipedia tells me, so you'll have to do some research on that.
Column-family stores are built for storing and processing very large amounts of data. They are used by Google's search engine and Facebook's inbox search. The data is queried by MapReduce functions. Although MapReduce functions may be hard to grasp in the beginning, the concept is quite simple. Here's an analogy which (hopefully) explains the concept:
Imagine you have multiple shoe-boxes filled with receipts, and you want to calculate your total expenses. You invite some of your friends over and assign a person to each shoe-box. Each person writes down the total of each receipt in his shoe-box. This process of selecting the required data is the Map part.
When a person has written down the totals of (some of) his receipts, he can sum up these totals. This is the Reduce part and can be repeated multiple times until all receipts have been handled. In the end, all of your friends come together and sum up their total sums, giving you your total expenses. That's the final Reduce step.
The advantage of this approach is that you can have any number of shoe-boxes and you can assign any number of people to a shoe-box and still end up with the same result. Each shoe-box can be seen as a server in the database's network. Each friend can be seen as a thread on the server. With MapReduce you can have your data distributed across many servers and have each server handle part of the query, optimizing the performance of your database.
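The analogy in a few lines of plain Python (a single-machine toy, not a real MapReduce framework):

```python
from functools import reduce

# each inner list is one 'shoe-box' of receipt amounts, one box per server
shoe_boxes = [[12.5, 3.99, 7.0], [20.0, 1.25], [5.5, 5.5, 9.99]]

# Map: each 'friend' independently totals their own box (parallelizable)
box_totals = [sum(box) for box in shoe_boxes]

# Reduce: combine the partial sums into the final answer
total_expenses = reduce(lambda a, b: a + b, box_totals)
print(total_expenses)  # 65.73 (up to float rounding)
```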
Document-oriented stores are explained in this question, so I won't discuss them here.
Graph databases are for storing networks of highly connected objects, like the users on a social network for example. These databases are optimized for graph operations, such as finding the shortest path between two nodes, or finding all nodes within three hops from the current node. Such operations are quite expensive on RDBMS systems or other NoSQL databases, but very cheap on graph databases.
NoSQL differs in its design approaches, not only in the query language, and it can offer different features. E.g. column-oriented databases are used for large data warehouses, which might be used for OLAP.
This is similar to my question; you'll find a lot of resources there.
What is the fastest and most stable non-SQL database for storing big data and processing thousands of requests during the day (it's for a traffic-exchange service)? I've found Kdb+ and Berkeley DB. Are they good? Are there other options?
More details...
Each day the server processes > 100K visits. For each visit I need to read the corresponding stats from the DB, write a log entry to the DB and update the stats in the DB, i.e. 3 operations with the DB per visit. Traffic is continuously increasing, thus the DB engine should be fast. On one side the DB will be managed by a daemon written in C, Erlang or another low-level language; on the other side the DB will be managed by PHP scripts.
The file system itself is faster and more stable than almost anything else. It stores big data seamlessly and efficiently. The API is very simple.
You can store and retrieve from the file system very, very efficiently.
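A minimal sketch of a file-system-backed key-value store in Python (directory layout and names invented for illustration):

```python
import hashlib
import json
from pathlib import Path

ROOT = Path("datastore")  # a plain directory acting as the 'database'

def _path(key: str) -> Path:
    h = hashlib.sha1(key.encode()).hexdigest()
    return ROOT / h[:2] / h[2:]  # fan out so no directory grows too large

def put(key: str, value: dict) -> None:
    p = _path(key)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps(value))

def get(key: str) -> dict:
    return json.loads(_path(key).read_text())

put("visit:2011-10-01:42", {"referrer": "example.com", "hits": 3})
print(get("visit:2011-10-01:42"))
```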
Since your question is a little thin on "requirements" it's hard to say much more.
What about Redis?
http://code.google.com/p/redis/
I haven't tried it yet, but I did read about it, and it seems to be fast and stable enough for data storage.
It also provides a decent anti-single-point-of-failure solution, as far as I understand.
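To make that concrete, a sketch of the question's three-operations-per-visit workflow in Redis, assuming the redis-py client; key names are invented:

```python
import redis  # assumes redis-py and a local Redis instance

r = redis.Redis()

def handle_visit(site_id: str, visitor_ip: str) -> int:
    # the question's 3 DB operations per visit, batched in one round-trip
    pipe = r.pipeline()
    pipe.hgetall(f"stats:{site_id}")               # 1. read current stats
    pipe.rpush(f"log:{site_id}", visitor_ip)       # 2. write the visit log
    pipe.hincrby(f"stats:{site_id}", "visits", 1)  # 3. update stats atomically
    stats, _, new_count = pipe.execute()
    return new_count

print(handle_visit("site1", "203.0.113.7"))
```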
Berkeley DB is tried and tested and hardened, and is at the heart of many mega-high-transaction-volume systems. One example is wireless carrier infrastructure that uses huge LDAP stores (OpenWave, for example) to process more than 2 BILLION transactions per day. These systems also commonly have something like Oracle in the mix for point-in-time recovery, but they use Berkeley DB as replicated caches.
Also, BDB is not limited to key value pairs in the simple sense of scalar values. You can store anything you want in the value, including arbitrary structures/records.
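A small sketch of that 'arbitrary structure in the value' point, using Python's standard dbm module as a stand-in for Berkeley DB's key/value API (record layout invented):

```python
import dbm     # stand-in for BDB; the classic key/value API looks the same
import pickle  # serializes arbitrary structures into the value

with dbm.open("articles", "c") as db:
    record = {"sku": "sku42", "price": 9.99, "tags": ["sale", "shoes"]}
    db[b"article:sku42"] = pickle.dumps(record)  # value = whole structure

    restored = pickle.loads(db[b"article:sku42"])
    print(restored["price"])  # 9.99
```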
What's wrong with SQLite? Since you did explicitly state non-SQL: Berkeley DB is based on key/value pairs, which might not suffice for your needs if you wish to expand the datasets; even more so, how would you make those datasets relate to one another using key/value pairs?
On the other hand, Kdb+, judging by the FAQ on their website, is a relational database that can handle SQL via their programming language Q. Be aware that if the need to migrate appears, there could be potential hitches, such as incompatible dialects or queries that use vendor specifics, hence the potential to get locked into that database and not be able to migrate at all - something to bear in mind for later on.
You need to be careful what you decide here and look at it from a long-term perspective: future upgrades, migration to another database, how easy it would be to scale up, etc.
One obvious entry in this category is InterSystems Caché. (Well, obvious to me...) Be aware, though, it's not cheap. (But I don't think Kdb+ is either.)
MongoDB is the fastest and best NoSQL database. Have a look at this performance benchmark.