As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
We are looking at a document db storage solution with fail over clustering, for some read/write intensive application.
We expect an average of 40K concurrent writes per second to the db (peaking at up to 70,000), and a roughly similar number of reads.
We also need a mechanism for the db to notify about the newly written records (some kind of trigger at db level).
What will be a good option in terms of a proper choice of document db and related capacity planning?
Updated
More details on the expectation.
On average, we expect 40,000 (40K) inserts (new documents) per second across 3-4 databases/document collections.
The peak may go up to 120,000 (120K) inserts.
The inserts should be readable right away, almost in real time.
Along with this, we expect around 5,000 updates or deletes per second.
Along with this, we also expect 500-600 concurrent queries accessing data. These queries and execution plans are somewhat known, though they might have to be updated about once a week.
The system should support failover clustering on the storage side
If "20,000 concurrent writes" means inserts, then I would go for CouchDB and use the "_changes" API for triggers. But with 20,000 writes you would need stable sharding as well, so you would be better off taking a look at BigCouch.
And if the "20,000" concurrent writes consist "mostly" of updates, I would go for MongoDB for sure, since its "update in place" is pretty awesome. You would then have to handle triggers manually, but using another collection to update a general document in place can be a handy solution. Again, be careful about sharding.
Finally, I think you cannot select a database based on concurrency alone; you need to plan the API (how you would retrieve data) and then look at the options at hand.
I would recommend MongoDB. My requirements weren't nearly as high as yours, but they were reasonably close. Assuming you'll be using C#, I recommend the official MongoDB C# driver and the InsertBatch method with SafeMode turned on. It will literally write data as fast as your file system can handle. A few caveats:
MongoDB does not support triggers (at least the last time I checked).
MongoDB initially caches data in RAM before syncing to disk. If you need real-time access with durability, you might want to set the fsync interval lower. This carries a significant performance hit.
The C# driver is a little wonky. I don't know if it's just me but I get odd errors whenever I try to run any long running operations with it. The C++ driver is much better and actually faster than the C# driver (or any other driver for that matter).
That being said, I'd also recommend looking into RavenDB as well. It supports everything you're looking for but for the life of me, I couldn't get it to perform anywhere close to Mongo.
The only other database that came close to MongoDB was Riak. Its default Bitcask backend is ridiculously fast as long as you have enough memory to store the keyspace but as I recall it doesn't support triggers.
Membase (and the soon-to-be-released Couchbase Server) will easily handle your needs and provide dynamic scalability (on-the-fly add or remove nodes), replication with failover. The memcached caching layer on top will easily handle 200k ops/sec, and you can linearly scale out with multiple nodes to support getting the data persisted to disk.
We've got some recent benchmarks showing extremely low latency (which roughly equates to high throughput): http://10gigabitethernet.typepad.com/network_stack/2011/09/couchbase-goes-faster-with-openonload.html
Don't know how important it is for you to have a supported Enterprise class product with engineering and QA resources behind it, but that's available too.
Edit: Forgot to mention that there is a built-in trigger interface already, and we're extending it even further to track when data hits disk (persisted) or is replicated.
Perry
We are looking at a document db storage solution with fail over clustering, for some read/write intensive application
Riak with Google's LevelDB backend [here is an awesome benchmark from Google], given enough cache and solid disks, is very fast. Depending on the structure of the document and its size (you mentioned 2KB), you would need to benchmark it, of course. [Keep in mind, if you are able to shard your data (business-wise), you do not have to maintain 40K/s throughput on a single node.]
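As a rough back-of-the-envelope sketch of the sharding remark above: the 40K/s average and 120K/s peak insert rates come from the question and the 2KB document size from this answer, while the shard counts, the even spread of writes, and treating 2KB as 2,000 bytes are illustrative assumptions.

```python
# Back-of-the-envelope write load per shard.
# Insert rates and the 2 KB document size come from the thread;
# the shard counts are hypothetical and writes are assumed to be
# spread evenly across shards.

def per_shard_load(inserts_per_sec, doc_bytes, shards):
    """Return (inserts/sec, MB/sec) that each shard must absorb."""
    per_shard_inserts = inserts_per_sec / shards
    per_shard_mb = per_shard_inserts * doc_bytes / 1_000_000
    return per_shard_inserts, per_shard_mb

for shards in (1, 4, 8):
    avg = per_shard_load(40_000, 2_000, shards)
    peak = per_shard_load(120_000, 2_000, shards)
    print(f"{shards} shard(s): avg {avg[0]:.0f}/s ({avg[1]:.1f} MB/s), "
          f"peak {peak[0]:.0f}/s ({peak[1]:.1f} MB/s)")
```

This only covers raw document bytes; replication, indexes and compaction would add to the real I/O, so it is a lower bound to benchmark against rather than a sizing answer.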
Another advantage with LevelDB is data compression => storage. If storage is not an issue, you can disable the compression, in which case LevelDB would literally fly.
Riak with secondary indices allows you to make your data structures as document-like as you like => you index only those fields that you care about searching by.
Successful and painless Fail Over is Riak's second name. It really shines here.
We also need a mechanism for the db to notify about the newly written records (some kind of trigger at db level)
You can rely on pre-commit and post-commit hooks in Riak to achieve that behavior, but again, as with any triggers, it comes at a price => performance / maintainability.
The Inserts should be readable right away - almost realtime
Riak writes to disk (no async MongoDB surprises) => reliably readable right away. If you need stronger consistency, you can configure Riak's quorum for inserts, i.e. how many nodes should respond before the insert is treated as successful.
In general, if fault tolerance / concurrency / fail over / scalability are important to you, I would go with data stores that are written in Erlang, since Erlang has been solving these problems successfully for many years now.
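To make the quorum idea concrete, here is a toy model of "how many nodes must acknowledge before the insert counts as successful". The Replica class and quorum_put function are hypothetical stand-ins for illustration only; this is not the Riak client API.

```python
# Toy model of a quorum write: the write succeeds once at least `w`
# replicas acknowledge it. Replica and quorum_put are hypothetical
# names, not part of any real Riak client.

class Replica:
    def __init__(self, up=True):
        self.up = up
        self.data = {}

    def put(self, key, value):
        if not self.up:
            return False          # node unreachable: no acknowledgement
        self.data[key] = value
        return True

def quorum_put(replicas, key, value, w):
    """Send the write to every replica; succeed if at least w acknowledge."""
    acks = sum(1 for r in replicas if r.put(key, value))
    return acks >= w

replicas = [Replica(), Replica(), Replica(up=False)]
print(quorum_put(replicas, "doc1", {"n": 1}, w=2))  # two of three nodes can ack
print(quorum_put(replicas, "doc1", {"n": 1}, w=3))  # one node is down
```

A higher w trades write availability for stronger guarantees that a subsequent read sees the insert.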
I am working on an application, where we are writing lots and lots of key value pairs. On production the database size will run into hundreds of Terabytes, even multiple Petabytes. The keys are 20 bytes and the value is maximum 128 KB, and very rarely smaller than 4 KB. Right now we are using MongoDB. The performance is not very good, because obviously there is a lot of overhead going on here. MongoDB writes to the file system, which writes to the LVM, which further writes to a RAID 6 array.
Since our requirement is very basic, I think using a general purpose database system is hurting performance. I was thinking of implementing a simple database system, where we could put the documents (or 'values') directly on the raw drive (actually the RAID array), and store the keys (and a pointer to where the value lives on the raw drive) in a fast in-memory database backed by an SSD. This would also speed up the reads, as there would be no fragmentation (as opposed to using a filesystem).
Although a document is rarely deleted, we would still have to maintain a pool of free space available on the device (something that the filesystem would have provided).
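A minimal sketch of the layout described above, assuming an ordinary append-only file as a stand-in for the raw device and a plain dict as a stand-in for the in-memory key database. The ValueLog name is hypothetical, and the sketch deliberately leaves out deletes and the free-space pool, which are the hard parts.

```python
import os
import tempfile

class ValueLog:
    """Append-only value file plus an in-memory key -> (offset, length) index.
    A regular file stands in for the raw device (assumption for illustration)."""

    def __init__(self, path):
        self.f = open(path, "a+b")   # append mode: writes always go to the end
        self.index = {}

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()       # where this value will start
        self.f.write(value)
        self.f.flush()
        self.index[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.index[key]
        self.f.seek(offset)
        return self.f.read(length)

    def close(self):
        self.f.close()

path = os.path.join(tempfile.mkdtemp(), "values.log")
log = ValueLog(path)
log.put(b"k1", b"hello")
log.put(b"k2", b"world!")
print(log.get(b"k1"), log.get(b"k2"))
log.close()
```

Note that the index lives only in memory here; a real design would also need to persist it (or rebuild it from the log) and handle crash recovery.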
My question is, will this really provide any significant improvements? Also, are there any document storage systems that do something like this? Or anything similar that we can use as a starting point?
Apache Cassandra jumps to mind. It's the current NoSQL solution of choice where massive scaling is concerned. It sees production usage at several large companies with massive scaling requirements. Having worked a little with it, I can say that it requires a little bit of time to rethink your data model to fit how it arranges its storage engine. The famously cited article "WTF is a supercolumn" gives a sound introduction to this. Caveat: Cassandra really only makes sense when you plan on storing huge datasets and distribution with no single point of failure is a mission critical requirement. With the way you've explained your data, it sounds like a fit.
Also, have you looked into Redis at all, at least for saving key references? Your memory requirements far outstrip what a single instance would be able to handle, but Redis can also be configured to shard. It isn't its primary use case, but it sees production use at both Craigslist and Groupon.
Also, have you done everything possible to optimize mongo, especially investigating how you could improve indexing? Mongo does save out to disk, but should be relatively performant when optimized to keep the hottest portion of the set in memory if able.
Is it possible to cache this data if it's not too transient?
I would totally caution you against rolling your own with this. Just a fair warning. That's not a knock at you or anyone else, it's just that I've personally had to maintain custom "data indexes" written by in-house developers who got in way over their heads before. At my job we have a massive on-disk key-value store that is a major performance bottleneck in our system, written by a developer who has since left the company. It's frustrating to be stuck with such a solution among the exciting NoSQL opportunities of today. Projects like the ones I cited above take advantage of the whole strength of the open source community to proof and optimize their use. That isn't something you will be able to attain working on your own solution unless you make a massive investment of time, effort and promotion. At the very least I'd encourage you to look at all your NoSQL options and maybe find a project you can contribute to rather than rolling your own. Writing a database server itself is definitely a nontrivial task that needs a huge team, especially with the requirements you've given (but should you end up doing so, I wish you luck! =) )
Late answer, but for future reference I think Spider does this
Closed 9 years ago.
I'm wondering if NoSQL is an option for this scenario:
The input is hourly stock data (SKU, amount, price and some more specific fields) from several sources. Older versions will just get dropped, so we won't exceed 1 million data sets in the near future, and there won't be any business intelligence queries like in data warehouses. But there will be aggregations, at least for the minimal price of a group of articles, which has to be updated if the article with the minimal price in a group is sold out. In addition to these frequent bulk writes, there will be single decrements of an article's amount, which can happen at any time.
The database would be part of a service which needs to give fast responses to requests via REST, so there needs to be some kind of caching. There is no need for strong consistency, but durability is required.
Further wishlist:
should scale well for growing request load
inexpensive technologies in terms of money and complexity (no Oracle cluster)
no proprietary languages (no PL/SQL)
MongoDB with its aggregation framework seems promising. Can you think of alternatives? (I am not tied to NoSQL!)
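The aggregation requirement in the question (the minimum price of a group has to change when the cheapest article sells out) can be sketched store-independently as follows. This is an in-memory illustration only; the Inventory class and its method names are hypothetical, and a real service would keep this state in whichever database is chosen.

```python
class Inventory:
    """In-memory sketch: per-article stock and price, with the minimum
    in-stock price per group recomputed on demand."""

    def __init__(self):
        self.articles = {}   # sku -> {"group": ..., "amount": ..., "price": ...}

    def load(self, sku, group, amount, price):
        # Hourly bulk load: newer data simply replaces the older version.
        self.articles[sku] = {"group": group, "amount": amount, "price": price}

    def decrement(self, sku, qty=1):
        # Single decrement of an article's amount, never going below zero.
        art = self.articles[sku]
        art["amount"] = max(0, art["amount"] - qty)

    def min_price(self, group):
        # Only articles that are still in stock count toward the minimum.
        prices = [a["price"] for a in self.articles.values()
                  if a["group"] == group and a["amount"] > 0]
        return min(prices) if prices else None

inv = Inventory()
inv.load("A", "shoes", amount=1, price=5.0)
inv.load("B", "shoes", amount=3, price=7.0)
print(inv.min_price("shoes"))   # cheapest in-stock article
inv.decrement("A")              # article A sells out
print(inv.min_price("shoes"))   # minimum moves to the next article
```

Recomputing on read keeps the write path trivial; with around a million articles a cached per-group minimum that is invalidated on sell-out may be preferable.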
I would start with Redis, and here is why:
"there needs to be some kind of caching" => and that is what Redis is best at. If for any reason you decide that you need "more", you can add "more", but still keep whatever you already developed in Redis as a cache for that "more"
One Redis is fast. Two Redises are faster. Three Redises are a unit faster than two, etc..
Learning curve is quite flat, and fun => since set theory really is fun
Increments / Decrements / Min / Max are Redis' native language
Redis integration with XYZ (you mentioned a need for a REST API) is all over google and github
Redis is honest <= actually one of my favorite features of Redis
MongoDB would work at first, so will ANY other major NoSQL, but why!?
I would go with Redis, and if you decide later you need "more", I would first look at "Redis + SQL db (PostgreSQL/MySQL/etc.)"; it will give you the best of both worlds => "Caching / Speed" and the "Aggregation Power" in case your aggregations need to go above and beyond Min/Max/Incr/Decr.
Whoever tells you PostgreSQL "is not fast enough for writing" does not know it.
Whoever tells you that MySQL "is not scalable enough" does not know it (e.g. Facebook runs on MySQL).
As I am already on a roll :) => whoever tells you MongoDB has "replica sets and sharding" does not wish you well, since replica sets and sharding only look sexy from the docs and hype. Once you need to reshard / reorg replica sets, you'll know the price of a wrong shard key selection and magic chunk movements...
Again => Redis FTW!
Well, it seems to me that MongoDB is the best choice.
It has not only aggregation features but also map/reduce queries for calculating statistics. It can be scaled via replica sets and sharding, and it has atomic updates for increments (a decrement is just a negative increment).
Alternatives:
CouchDB - not fast enough on reading
Redis - is a key/value db; you will need to program the article logic at the application level
MySQL - is not scalable enough
PostgreSQL - could be a good alternative if scaled using pgbouncer but is not fast enough on writing
I am developing a Java-based web application. The primary aim is to have inventory for products being sold on multiple websites called channels. We will act as manager for all these channels.
What we need is:
Queues to manage inventory updates for each channel.
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
Providing a Facebook-like dashboard (XMPP) to keep the seller updated ASAP.
The solutions I am looking at are Postgres (our db until now, in synchronous replication mode) and NoSQL solutions like Cassandra, Redis, CouchDB and MongoDB.
My constraints are:
Inventory updates cannot be lost.
Job Queues should be executed in order and preferably never lost.
Easy/Fast development and future maintenance.
I am open to any suggestions. Thanks in advance.
Queues to manage inventory updates for each channel.
This is not necessarily a database issue. You might be better off looking at a messaging system (e.g. RabbitMQ).
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
Session data should probably be put in a separate database more suitable for the task (e.g. memcached, Redis, etc.).
There is no one-size-fits-all DB
Providing a facebook like dashboard(XMPP) to keep the seller updated asap.
My constraints are:
1. Inventory updates cannot be lost.
There are 3 ways to answer this question:
This feature must be provided by your application. The database can guarantee that a bad record is rejected and rolled back, but not guarantee that every query will get entered.
The app will have to be smart enough to recognize when an error happens and try again.
Some DBs store records in memory and then flush memory to disk periodically, which could lead to data loss in the case of a power failure. (E.g. Mongo works this way by default unless you enable journaling. CouchDB always appends to the records; even a delete is a flag appended to the record, so data loss is extremely difficult.)
Some DBs are designed to be extremely reliable: even if an earthquake, hurricane or other natural disaster strikes, they remain durable. These include Cassandra, HBase, Riak, Hadoop, etc.
Which type of durability are you referring to?
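The first point above (the app must notice a failed write and try again) could look roughly like the sketch below. The write_with_retry helper, its parameters and the flaky writer are all hypothetical; it also assumes the write is idempotent, since a retry after a timeout may otherwise apply the same inventory update twice.

```python
import time

def write_with_retry(write, record, attempts=3, base_delay=0.1):
    """Call write(record); on an exception, retry with exponential backoff.
    Re-raises the last error so a failed update is never silently dropped."""
    for attempt in range(1, attempts + 1):
        try:
            return write(record)
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical writer that fails twice before succeeding.
calls = {"n": 0}

def flaky_write(record):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("db unavailable")
    return "stored"

print(write_with_retry(flaky_write, {"sku": "A", "delta": -1}, base_delay=0.0))
```

If all attempts fail, the caller still has to decide where the update goes (for example a durable local queue), otherwise it is lost anyway.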
Job Queues should be executed in order and preferably never lost.
Most NoSQL solutions prefer to run in parallel, so you have two options here.
1. Use a DB that locks the entire table for every query (slower).
2. Build your app to be smarter or evented (client-side sequential queuing).
Easy/Fast development and future maintenance.
Generally, you will find that SQL is faster to develop with at first, but changes can be harder to implement.
NoSQL may require a little more planning, but makes ad hoc queries and schema changes easier.
The questions you probably need to ask yourself are more like:
"Will I need to have intense queries or deep analysis that a Map/Reduce is better suited to?"
"Will I need to change my schema frequently?"
"Is my data highly relational? In what way?"
"Does the vendor behind my chosen DB have enough experience to help me when I need it?"
"Will I need special features such as geospatial indexing, full text search, etc.?"
"How close to real time will I need my data? Will it hurt if I don't see the latest records show up in my queries until 1 sec later? What level of latency is acceptable?"
"What do I really need in terms of failover?"
"How big is my data? Will it fit in memory? Will it fit on one computer? Is each individual record large or small?"
"How often will my data change? Is this an archive?"
If you are going to have multiple customers (channels?), each with their own inventory schemas, a document-based DB might have its advantages. I remember one time I looked at an ecommerce system with inventory and it had almost 235 tables!
Then again, if you have certain relational data, a SQL solution can really have some advantages too.
I can certainly see how I could build a solution using Mongo, Couch, Riak or OrientDB with the given constraints. But as for which is the best? I would try talking directly to DB vendors, and maybe watch the NoSQL tapes.
Addressing your constraints:
Most NoSQL solutions give you a configurable tradeoff of consistency vs. performance. In MongoDB, for instance, you can decide how durable a write should be. If you want to, you can force the write to be fsync'ed on all your replica set servers. At the other extreme, you can choose to send the command and not even wait for the server's response.
Executing job queues in order seems to be an application code issue. I'd say a timestamp in the db and an order by type of query should do for most applications. If you have multiple application servers and your queues need to be perfect, you'd have to use a truly distributed algorithm that provides ordering, but that is not a typical requirement, and it's very tricky indeed.
We've been using MongoDB for some time now, and I'm convinced this gives your app development speed a real boost. There's no big difference in maintenance, maintaining data is a pain either way. Not having a schema gives you added flexibility (lazy migrations), but it's more elaborate and requires some care.
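For the job queue point above, here is a minimal single-process sketch of "store the jobs in the db and read them back with an ORDER BY". It uses the standard-library sqlite3 module purely as a stand-in for the real database, and orders by an auto-incrementing id rather than a timestamp (a deliberate swap, to avoid ties between jobs inserted in the same instant). Table, column and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,  -- insertion order
        channel TEXT NOT NULL,
        payload TEXT NOT NULL,
        done    INTEGER NOT NULL DEFAULT 0
    )
""")

def enqueue(channel, payload):
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO jobs (channel, payload) VALUES (?, ?)",
                     (channel, payload))

def next_job(channel):
    """Return the oldest pending job for a channel and mark it done, or None."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM jobs "
            "WHERE channel = ? AND done = 0 ORDER BY id LIMIT 1",
            (channel,)).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE jobs SET done = 1 WHERE id = ?", (row[0],))
        return row[1]

enqueue("shop1", "set sku A to 10")
enqueue("shop1", "set sku A to 9")
print(next_job("shop1"), "|", next_job("shop1"), "|", next_job("shop1"))
```

With several application servers consuming the same queue, this select-then-update pattern would need row locking or a single consumer per channel, which is the "truly distributed" difficulty the answer warns about.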
In summary, I'd say you can do it both ways. The NoSQL is more code driven, and transactions and relational integrity are mostly managed by your code. If you're uncomfortable with that, go for a relational DB.
However, if your data grows huge, you'll have to code some of this logic manually because you probably wouldn't want to do real-time joins on a 10B row database. Still, you can implement that with SQL as well.
A good way to find the boundary for different databases is to consider what you can cache. Data that can be cached and reconstructed at any time is a great place to start introducing a new layer, because there are no big risks there. Also, cached data usually doesn't keep any relations, so you're not sacrificing any consistency here.
NoSQL is not correct for this application.
I mean, you can use it sure, but you will end up re-implementing a lot of what SQL offers for you. For example I see a lot of relations there. You also want ACID (although some NoSQL solutions do offer that).
There is no reason you can't use both - keep relational data in relational databases, and non-relational data in key/value stores.
I'm building a system that tracks and verifies ad impressions and clicks. This means that there are a lot of insert commands (about 90/second average, peaking at 250) and some read operations, but the focus is on performance and making it blazing-fast.
The system is currently on MongoDB, but I've been introduced to Cassandra and Redis since then. Would it be a good idea to go to one of these two solutions, rather than stay on MongoDB? Why or why not?
Thank you
For a harvesting solution like this, I would recommend a multi-stage approach. Redis is good at real time communication. Redis is designed as an in-memory key/value store and inherits some very nice benefits of being a memory database: O(1) list operations. For as long as there is RAM to use on a server, Redis will not slow down pushing to the end of your lists, which is good when you need to insert items at such an extreme rate. Unfortunately, Redis can't operate with data sets larger than the amount of RAM you have (it only writes to disk; reading is for restarting the server or in case of a system crash) and scaling has to be done by you and your application. (A common way is to spread keys across numerous servers, which is implemented by some Redis drivers, especially those for Ruby on Rails.) Redis also has support for simple publish/subscribe messaging, which can be useful at times as well.
In this scenario, Redis is "stage one." For each specific type of event you create a list in Redis with a unique name; for example we have "page viewed" and "link clicked." For simplicity we want to make sure the data in each list has the same structure; link clicked may have a user token, link name and URL, while page viewed may only have the user token and URL. Your first concern is just recording the fact that it happened and pushing whatever absolutely necessary data you need.
Next we have some simple processing workers that take this frantically inserted information off of Redis' hands, by asking it to take an item off the end of the list and hand it over. The worker can make any adjustments/deduplication/ID lookups needed to properly file the data and hand it off to a more permanent storage site. Fire up as many of these workers as you need to keep Redis' memory load bearable. You could write the workers in anything you wish (Node.js, C#, Java, ...) as long as it has a Redis driver (most web languages do now) and one for your desired storage (SQL, Mongo, etc.)
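The two stages described above can be sketched in-process as follows. A dict of deques stands in for the Redis lists (with a real Redis server these would be list push and pop commands), and a plain Python list stands in for the permanent store. All function names are hypothetical, and the sketch consumes events in FIFO order, which is an assumption about the intended order.

```python
from collections import defaultdict, deque

# In-process stand-in for Redis lists: list name -> deque of events.
lists = defaultdict(deque)

def record_event(kind, **fields):
    """Stage one: push the bare minimum and return immediately."""
    lists[kind].append(fields)

def drain(kind, store, batch=100):
    """Worker: pop up to `batch` events, drop duplicates within the batch,
    and hand the rest to permanent storage. Returns how many were stored."""
    seen = set()
    moved = 0
    for _ in range(batch):
        if not lists[kind]:
            break
        event = lists[kind].popleft()
        key = tuple(sorted(event.items()))   # hashable fingerprint of the event
        if key in seen:
            continue
        seen.add(key)
        store.append(event)
        moved += 1
    return moved

permanent = []   # stand-in for SQL, Mongo, etc.
record_event("link clicked", user="u1", link="buy", url="/p/1")
record_event("link clicked", user="u1", link="buy", url="/p/1")  # duplicate
record_event("page viewed", user="u1", url="/p/1")
print(drain("link clicked", permanent), drain("page viewed", permanent))
```

Running several such workers in parallel is what keeps the in-memory lists short; the deduplication rule here is only a placeholder for whatever cleanup the real data needs.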
MongoDB is good at document storage. Unlike Redis, it is able to deal with databases larger than RAM and it supports sharding/replication on its own. An advantage of MongoDB over SQL-based options is that you don't have to have a predetermined schema; you're free to change the way data is stored however you want at any time.
I would, however, suggest Redis or Mongo for the "step one" phase of holding data for processing and use a traditional SQL setup (Postgres or MSSQL, perhaps) to store post-processed data. Tracking client behavior sounds like relational data to me, since you may want to go "Show me everyone who views this page" or "How many pages did this person view on this given day" or "What day had the most viewers in total?". There may be even more complex joins or queries for analytic purposes you come up with, and mature SQL solutions can do a lot of this filtering for you; NoSQL (Mongo or Redis specifically) can't do joins or complex queries across varied sets of data.
I currently work for a very large ad network and we write to flat files :)
I'm personally a Mongo fan, but frankly, Redis and Cassandra are unlikely to perform either better or worse. I mean, all you're doing is throwing stuff into memory and then flushing to disk in the background (both Mongo and Redis do this).
If you're looking for blazing fast speed, the other option is to keep several impressions in local memory and then flush them to disk every minute or so. Of course, this is basically what Mongo and Redis do for you. Not a real compelling reason to move.
All three solutions (four if you count flat-files) will give you blazing fast writes. The non-relational (nosql) solutions will give you tunable fault-tolerance as well for the purposes of disaster recovery.
In terms of scale, our test environment, with only three MongoDB nodes, can handle 2-3k mixed transactions per second. At 8 nodes, we can handle 12k-15k mixed transactions per second. Cassandra can scale even higher. 250 reads is (or should be) no problem.
The more important question is, what do you want to do with this data? Operational reporting? Time-series analysis? Ad-hoc pattern analysis? real-time reporting?
MongoDB is a good option if you want the ability to do ad-hoc analysis based on multiple attributes within a collection. You can put up to 40 indexes on a collection, though the indexes will be stored in-memory, so watch for size. But the result is a flexible analytical solution.
Cassandra is a key-value store. You define a static column or set of columns that will act as your primary index right up front. All queries run against Cassandra should be tuned to this index. You can put a secondary on it, but that's about as far as it goes. You can, of course, use MapReduce to scan the store for non-key attribution, but it will be just that: a serial scan through the store. Cassandra also doesn't have the notion of "like" or regex operations on the server nodes. If you want to find all customers where the first name starts with "Alex", you'll have to scan through the entire collection, pull the first name out for each entry and run it through a client-side regex.
I'm not familiar enough with Redis to speak intelligently about it. Sorry.
If you are evaluating non-relational platforms, you might also want to consider CouchDB and Riak.
Hope this helps.
Just found this: http://blog.axant.it/archives/236
Quoting the most interesting part:
This second graph is about Redis RPUSH vs Mongo $PUSH vs Mongo insert, and I find this graph to be really interesting. Up to 5000 entries mongodb $push is faster even when compared to Redis RPUSH, then it becames incredibly slow, probably the mongodb array type has linear insertion time and so it becomes slower and slower. mongodb might gain a bit of performances by exposing a constant time insertion list type, but even with the linear time array type (which can guarantee constant time look-up) it has its applications for small sets of data.
I guess everything depends at least on data type and volume. The best advice would probably be to benchmark on your typical dataset and see for yourself.
According to the "Benchmarking Top NoSQL Databases" report (download here), I recommend Cassandra.
If you have the choice (and need to move away from flat files) I would go with Redis. It's blazingly fast and will comfortably handle the load you're talking about, but more importantly you won't have to manage the flushing/IO code. I understand it's pretty straightforward, but less code to manage is better than more.
You will also get horizontal scaling options with Redis that you may not get with file based caching.
I can get around 30k inserts/sec with MongoDB on a simple $350 Dell. If you only need around 2k inserts/sec, I would stick with MongoDB and shard it for scalability. Maybe also look into doing something with Node.js or something similar to make things more asynchronous.
The problem with inserts into databases is that they usually require writing to a random block on disk for each insert. What you want is something that only writes to disk every 10 inserts or so, ideally to sequential blocks.
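A minimal sketch of that idea, buffering inserts in memory and appending them to a flat file in one sequential write per batch. The BatchedLog class and the JSON-lines format are assumptions for illustration; anything still in the buffer is lost if the process crashes before the next flush.

```python
import json
import os
import tempfile

class BatchedLog:
    """Buffer records in memory and append them to a file in batches."""

    def __init__(self, path, batch_size=10):
        self.path = path
        self.batch_size = batch_size
        self.buffer = []

    def insert(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        with open(self.path, "a", encoding="utf-8") as f:
            # One sequential append for the whole batch.
            f.write("".join(json.dumps(r) + "\n" for r in self.buffer))
            f.flush()
            os.fsync(f.fileno())   # ask the OS to push the batch to disk
        self.buffer.clear()

path = os.path.join(tempfile.mkdtemp(), "impressions.log")
log = BatchedLog(path, batch_size=10)
for i in range(25):
    log.insert({"ad": i % 3, "event": "impression"})
log.flush()   # write out the final partial batch
```

The batch size trades durability for throughput: a larger batch means fewer disk writes but more records at risk on a crash.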
Flat files are good. Summary statistics (eg total hits per page) can be obtained from flat files in a scalable manner using merge-sorty map-reducy type algorithms. It's not too hard to roll your own.
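As a sketch of the map-reduce-style counting mentioned above: each log file is counted independently (map), and the per-file counts are then summed (reduce). The tab-separated log format with the page in the first column is an assumption for illustration.

```python
from collections import Counter

def count_hits(lines):
    """Map step: count hits per page for one log file's lines.
    Assumes each line is 'page<TAB>other fields...'."""
    counts = Counter()
    for line in lines:
        page = line.rstrip("\n").split("\t", 1)[0]
        if page:
            counts[page] += 1
    return counts

def merge(partials):
    """Reduce step: sum the per-file counters into one total."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

file_a = ["/home\tu1\n", "/about\tu2\n", "/home\tu3\n"]
file_b = ["/home\tu4\n", "/pricing\tu1\n"]
totals = merge([count_hits(file_a), count_hits(file_b)])
print(totals.most_common())
```

Because the map step needs only one file at a time, the files can be processed on separate machines or processes and merged afterwards.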
SQLite now supports Write Ahead Logging, which may also provide adequate performance.
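For reference, write-ahead logging is switched on per database with a pragma. A minimal sketch using Python's standard sqlite3 module follows; the file name and table layout are hypothetical, and WAL needs a file-backed database on a filesystem that supports it.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "hits.db")
conn = sqlite3.connect(path)

# The pragma returns the journal mode actually in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

conn.execute("CREATE TABLE IF NOT EXISTS hits (page TEXT, ts REAL)")
with conn:  # one transaction for the whole batch of inserts
    conn.executemany("INSERT INTO hits VALUES (?, ?)",
                     [("/home", 1.0), ("/about", 2.0)])

print(mode, conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0])
```

Grouping many inserts into one transaction, as done here, usually matters at least as much for insert throughput as the journal mode itself.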
I have hands-on experience with MongoDB, CouchDB and Cassandra. I converted a lot of files to base64 strings and inserted these strings into each NoSQL store.
MongoDB was the fastest. Cassandra was the slowest. CouchDB was slow too.
I think mysql would be much faster than all of them, but I didn't try mysql for my test case yet.
Closed 10 years ago.
What is the fastest and most stable non-SQL database to store big data and process thousands of requests during the day (it's for a traffic exchange service)? I've found Kdb+ and Berkeley DB. Are they good? Are there other options?
More details...
Each day the server processes > 100K visits. For each visit I need to read the corresponding stats from the DB, write a log to the DB and update the stats in the DB, i.e. 3 DB operations per visit. Traffic is continuously increasing, so the DB engine should be fast. On one side the DB will be managed by a daemon written in C, Erlang or another low-level language. On the other side the DB will be managed by PHP scripts.
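To put those numbers in perspective, here is the average load they imply. The visit count and the three operations per visit come from the question; assuming the traffic is spread evenly over the day is a simplification, and real peaks will be higher.

```python
VISITS_PER_DAY = 100_000        # from the question
OPS_PER_VISIT = 3               # read stats, write log, update stats
SECONDS_PER_DAY = 24 * 60 * 60

def avg_ops_per_sec(visits_per_day, ops_per_visit=OPS_PER_VISIT):
    """Average DB operations per second, assuming evenly spread traffic."""
    return visits_per_day * ops_per_visit / SECONDS_PER_DAY

print(f"{avg_ops_per_sec(VISITS_PER_DAY):.2f} ops/sec on average")
print(f"{avg_ops_per_sec(10 * VISITS_PER_DAY):.2f} ops/sec at 10x the traffic")
```

Even at ten times the stated traffic the average stays in the tens of operations per second, so headroom for peaks and growth matters more here than raw engine speed.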
The file system itself is faster and more stable than almost anything else. It stores big data seamlessly and efficiently. The API is very simple.
You can store and retrieve from the file system very, very efficiently.
Since your question is a little thin on "requirements" it's hard to say much more.
What about Redis?
http://code.google.com/p/redis/
I haven't tried it yet, but I did read about it, and it seems to be fast and stable enough for data storage.
It also provides you with a decent anti-single-point-failure solution, as far as I understand.
Berkeley DB is tried and tested and hardened and is at the heart of many mega-high transaction volume systems. One example is wireless carrier infrastructure that uses huge LDAP stores (OpenWave, for example) to process more than 2 BILLION transactions per day. These systems also commonly have something like Oracle in the mix too for point in time recovery, but they use Berkeley DB as replicated caches.
Also, BDB is not limited to key value pairs in the simple sense of scalar values. You can store anything you want in the value, including arbitrary structures/records.
What's wrong with SQLite? Since you explicitly stated non-SQL: Berkeley DB is based on key/value pairs, which might not suffice for your needs if you wish to expand the datasets. Even more so, how would you make the datasets relate to one another using key/value pairs?
On the other hand, Kdb+, looking at the FAQ on their website is a relational database that can handle SQL via their programming language Q...be aware, if the need to migrate appears, there could be potential hitches, such as incompatible dialects or a query that uses vendor specifics, hence the potential to get locked into that database and not being able to migrate at all...something to bear in mind for later on...
You need to be careful what you decide here and look at it from a long-term perspective, future upgrades, migration to another database, how easy would it be to up-scale, etc
One obvious entry in this category is Intersystems Caché. (Well, obvious to me...) Be aware, though, it's not cheap. (But I don't think Kdb+ is either.)
MongoDB is the fastest and best nosql database. Have a look at this performance benchmark.