Implement Lucene on Existing .NET / SQL Server stack with multiple webservers - store indexes in the database? - lucene.net

This article offered me a huge amount of information:
Implement Lucene on Existing .NET / SQL Server stack with multiple webservers
I'd like to follow on from this by asking about the notion of implementing a Lucene Directory that would persist the indexes to the database (in my case SQL Server) - if anyone has a SWAG on the effort involved, that would be helpful.
I can see that the Java realm has this (e.g. Compass), and I'm really hoping the Stack Overflow folks might have considered this too? Any feedback would be appreciated.
My rookie thinking is that persisting indexes to the DB would be a way to solve the 'distribution' problem. So instead of implementing messaging (not possible for my software because of deployment restrictions) or scheduling (would be OK'ish - product folks always get jumpy when making decisions about how 'current' indexed data has to be), IndexReader.reopen() would efficiently refresh the index snapshot on whichever server node it runs on.
Does this work if DB concurrency/load is not the heart of the problem being solved? Our use is focused on facilitating different data analysis on fields, which in turn facilitates different forms of matching.
Our deployment architecture/restrictions do not really allow us to insist on dedicated servers à la Solr, so this notion of distribution has been discounted by us.

How many index changes do you expect? When do you want to read the index? (On application startup?) Storing the index in the database and "downloading" it on index creation might consume too many resources.
Not sure about your deployment restrictions, but can you have a shared file space for your machines (e.g. SMB/NFS share or similar, or even a SAN-based solution)?

I would be a bit afraid of performance issues with the indexes in the DB. Have a look at Elasticsearch. It's the successor of Compass. It requires Java, but has a very neat REST interface for your .NET solution. Elasticsearch supports distribution and replication between several nodes. You can run it on the webserver nodes.

This solution will kill performance of the index, since it has to retrieve it from the DB.
I would highly recommend moving to a newer/better alternative, that is, Solr (using SolrNet, for example) or ElasticSearch (using NEST).
Solr is a high-level interface/manager for Lucene indexes, with simplified configuration, clustering, replication, etc. solved for you. The nice thing is that if you have some experience with Lucene, this will not be such a big step.
ElasticSearch is a different approach but it's not hard to learn.

Related

When to use CouchDB over MongoDB and vice versa

I am stuck between these two NoSQL databases.
In my project, I will be creating a database within a database. For example, I need a solution to create dynamic tables.
So users can create tables with columns and rows. I think either MongoDB or CouchDB will be good for this, but I am not sure which one. I will also need efficient paging.
Of C, A & P (Consistency, Availability & Partition tolerance), which two are more important to you? For a quick reference, see the Visual Guide To NoSQL Systems.
MongoDB: Consistency and Partition Tolerance
CouchDB: Availability and Partition Tolerance
A blog post, Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Membase vs Neo4j comparison has 'Best used' scenarios for each NoSQL database compared. Quoting the link,
MongoDB: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB. If you wanted CouchDB, but your data changes too much, filling up disks.
CouchDB : For accumulating, occasionally changing data, on which pre-defined queries are to be run. Places where versioning is important.
A recent (Feb 2012) and more comprehensive comparison by Riyad Kalla,
MongoDB : Master-Slave Replication ONLY
CouchDB : Master-Master Replication
A blog post (Oct 2011) by someone who tried both, A MongoDB Guy Learns CouchDB, commented that CouchDB's paging is not as useful.
A dated (Jun 2009) benchmark by Kristina Chodorow (part of the team behind MongoDB).
I'd go for MongoDB.
The answers above all overcomplicate the story.
If you plan to have a mobile component, or need desktop users to work offline and then sync their work to a server you need CouchDB.
If your code will run only on the server, then go with MongoDB.
That's it. Unless you need CouchDB's (awesome) ability to replicate to mobile and desktop devices, MongoDB has the performance, community and tooling advantage at present.
Very old question, but it's at the top of Google results and I don't quite like the answers I see, so here's my own.
There's much more to CouchDB than the ability to develop CouchApps. Most people use CouchDB in a classical three-tier web architecture.
In practice the deciding factor for most people will be the fact that MongoDB allows ad-hoc querying with a SQL-like syntax while CouchDB doesn't (you've got to create map/reduce views, which turns some people off even though creating these views is Rapid-Application-Development friendly - they have nothing to do with stored procedures).
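To make the contrast concrete, here is a minimal sketch (Python assumed, with pymongo installed and a local CouchDB instance; the blog/posts names are made up): MongoDB answers an arbitrary query directly, while CouchDB wants a view defined first.

    from pymongo import MongoClient
    import requests

    # MongoDB: ad-hoc query, nothing to declare up front.
    posts = MongoClient()["blog"]["posts"]
    for doc in posts.find({"author": "alice", "votes": {"$gt": 10}}):
        print(doc)

    # CouchDB: define a map view in a design document first
    # (assumes the 'blog' database already exists)...
    requests.put("http://localhost:5984/blog/_design/posts", json={
        "views": {"by_author": {"map": "function(doc){ emit(doc.author, doc); }"}}
    })
    # ...then query the view by key (keys are JSON-encoded).
    r = requests.get("http://localhost:5984/blog/_design/posts/_view/by_author",
                     params={"key": '"alice"'})
    print(r.json()["rows"])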
To address points raised in the accepted answer: CouchDB has a great versioning system, but that doesn't mean it is only suited (or better suited) to places where versioning is important. Also, CouchDB is heavy-write friendly thanks to its append-only nature (write operations return in no time while guaranteeing that no data will ever be lost).
One very important thing that no one has mentioned is the fact that CouchDB relies on B-tree indexes. This means that whether you have 1 "row" or 20 billion, the querying time will always remain below 10 ms. This is a game changer which makes CouchDB a low-latency, read-friendly database, and it really shouldn't be overlooked.
To be fair and exhaustive, the advantage MongoDB has over CouchDB is tooling and marketing. They have first-class-citizen tools for all major languages and platforms, making the on-boarding easy, and this, added to their ad-hoc querying, makes the transition from SQL even easier.
CouchDB doesn't have this level of tooling - even though there are many libraries available today - but CouchDB is exposed as an HTTP API, and it is therefore quite easy to create a wrapper in your favorite language to talk to it. I personally like this approach as it avoids bloat and allows you to take only what you want (interface segregation principle).
So I'd say using one or the other is largely a matter of comfort and preference with their paradigms. The CouchDB approach "just fits" for certain people, but if after learning about the database features (in the exhaustive official guide) you don't have your "hell yeah" moment, you should probably move on.
I'd discourage using CouchDB if you just want to use "the right tool for the right job", because you'll find out that you can't just use it that way, and you'll end up being pissed and writing blog posts such as "Where are joins in CouchDB?" and "Where is transaction management?". Indeed, CouchDB is - paradoxically - very transparent, but at the same time it requires a paradigm shift and a change in the way you approach problems to really shine (and really work).
But once you've done that it really pays off. I'd personally need very strong reasons or a major deal breaker on a project to choose another database, but so far I haven't met any.
Update December 2022:
Since this post is still getting a lot of views, I felt it was important to inform people that I have recently moved to using MongoDB as my daily driver, while keeping CouchDB in my tool belt for specialized cases where that database makes more sense (namely cases where views are not needed). There were multiple reasons for this choice, the most important ones being:
Performance: While precomputed indexes are a powerful asset, the main limitation of CouchDB is its QueryServer architecture. Every time a document is updated, it has to be serialized and processed by every view (even though this happens in a deferred manner, namely when the view is accessed). But more importantly, every time a view is updated (for example to add filtering logic for a new field added as part of the implementation of a new feature), ALL documents of the database must be sent to the view. This becomes a big deal when you have millions of documents in the database. You start worrying about the impact of updating your views and it becomes a distraction. Should you decide to create one database per data type to bypass this limitation, you'd then lose the ability to map/reduce across all your documents, since views are scoped per database. MongoDB avoids this by segmenting documents into collections (i.e. data types), so that when an index is updated only a subset of the data of the database is impacted. Moreover, MongoDB uses a binary format, making these operations way more performant (while CouchDB uses JSON sent to the view server in plain text). This point may not be important if you do not design products needing to operate at large scale (hundreds of thousands of daily users or more).
Tooling: the tooling available with MongoDB is comprehensive and mature, whether we are talking about the drivers officially supported for various programming languages, or integration with IDEs.
Advanced querying: A wide range of data types and advanced query capabilities are available out of the box (geo types, GridFS allowing one to store files of arbitrary size directly in the DB etc...). Having easy access to powerful query aggregation capabilities made me realize how much CouchDB had been inhibiting my productivity.
Seamless support for resharding: resharding is easy with MongoDB, while it is a dangerous operation involving moving files by hand with CouchDB.
Many other small items that improve quality of life and really add up.
I have been a big CouchDB fan, but I have to admit that moving to MongoDB as a daily driver felt a lot like moving back to civilization in terms of productivity and quality-of-life improvements. Now I only consider CouchDB for key-value store scenarios (in which no map/reduce views are required and all that is needed is getting a document by key - CouchDB shines quite a lot there), and advanced situations in which per-user databases are needed (for example to support advanced synchronization between devices).
The only drawback I see with MongoDB is that it consumes a lot of memory, to the point that I cannot install it on development machines with low specs (while by comparison CouchDB launches at startup without me noticing and consumes almost no resources). However, I feel this is worth it considering the time saved and the features provided.
As a long-time CouchDB user, the value I see in MongoDB is quite different from the items highlighted in the other answers promoting MongoDB, so I felt it was important to provide this update (and also out of intellectual honesty, when I remembered this post). CouchDB gave me quite a boost in productivity back in the day compared to the SQL products and ORMs I had been using, and at that time there were a lot of horror stories circulating regarding the reliability of MongoDB.
However, as of now, the few concerns I could have (which were probably given disproportionate importance by internet folks - they essentially all boiled down to defaults whose reliability trade-offs may surprise new users in a number of scenarios) no longer stand.
At this point, as a long-time CouchDB user in a great position to compare both products, I would recommend MongoDB to people needing a productive and scalable software development experience for their web app and advise to only pick CouchDB for specific needs.
CouchDB had momentum back in the day, which probably influenced my perception, but development has stalled and no meaningful features have been introduced for a long time; otherwise it would probably have caught up with MongoDB in terms of quality of life. I see two possible reasons for this: the way a now-aborted rewrite of CouchDB diverted resources for a long time, and maybe early architectural decisions (such as the Query Server architecture) that may very well have restricted its future from the start. Neither of these aspects seems to be the priority of the core team.
I do not totally regret choosing CouchDB, because it has been massively helpful, and the mindset it taught me is extremely useful for writing performant code in MongoDB (writing performant code in MongoDB is a breeze compared to the discipline one has to observe to solve business problems using CouchDB). However, if I had to do it again today, I would have transitioned to MongoDB as my daily driver MUCH sooner. I'm usually quite good at picking the winning horse when technologies pop up, but this time it seems I haven't played the game that well. Hope this helps.
Ask yourself these questions, and you will decide your DB selection.
Do you need master-master replication? Then CouchDB. CouchDB supports master-master replication, which anticipates nodes being disconnected for long periods of time. MongoDB would not do well in that environment.
Do you need MAXIMUM R/W throughput? Then MongoDB.
Do you need ultimate single-server durability because you are only going to have a single DB server? Then CouchDB.
Are you storing a MASSIVE data set that needs sharding while maintaining insane throughput? Then MongoDB.
Do you need strong consistency of data? Then MongoDB.
Do you need high availability of the database? Then CouchDB.
Do you want multiple databases and multiple tables/collections? Then MongoDB.
Do you have a mobile app with offline users and want to sync their activity data to a server? Then you need CouchDB.
Do you need a large variety of querying options? Then MongoDB.
Do you need a large community using the DB? Then MongoDB.
I summarize the answers found in that article:
http://www.quora.com/How-does-MongoDB-compare-to-CouchDB-What-are-the-advantages-and-disadvantages-of-each
MongoDB: Better querying, data storage in BSON (faster access), better data consistency, multiple collections
CouchDB: Better replication, with master to master replication and conflict resolution, data storage in JSON (human-readable, better access through REST services), querying through map-reduce.
So in conclusion, MongoDB is faster, CouchDB is safer.
Also: http://nosql.mypopescu.com/post/298557551/couchdb-vs-mongodb
Be aware of an issue with sparse unique indexes in MongoDB. I've hit it and it is extremely cumbersome to workaround.
The problem is this - you have a field, which is unique if present and you wish to find all the objects where the field is absent. The way sparse unique indexes are implemented in Mongo is that objects where that field is missing are not in the index at all - they cannot be retrieved by a query on that field - {$exists: false} just does not work.
The only workaround I have come up with is having a special null family of values, where an empty value is translated to a special prefix (like null:) concatenated with a UUID. This is a real headache, because one has to take care of transforming to/from the empty values when writing/querying/reading. A major nuisance.
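A rough sketch of that workaround in Python (pymongo assumed; the collection and field names are hypothetical):

    import uuid
    from pymongo import MongoClient

    users = MongoClient()["app"]["users"]
    users.create_index("external_id", unique=True, sparse=True)

    NULL_PREFIX = "null:"

    def to_stored(value):
        # An absent value becomes a unique placeholder, so "empty"
        # documents never collide in the unique index.
        return value if value is not None else NULL_PREFIX + uuid.uuid4().hex

    def from_stored(stored):
        # Translate placeholders back to "no value" when reading.
        return None if stored.startswith(NULL_PREFIX) else stored

    users.insert_one({"name": "bob", "external_id": to_stored(None)})
    # "Find documents without the field" becomes a prefix query on the
    # placeholder, since {$exists: false} cannot be used here:
    missing = list(users.find({"external_id": {"$regex": "^null:"}}))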
I have never used server-side JavaScript execution in MongoDB (it is not advised anyway), and their map/reduce has awful performance when there is just one Mongo node. For all these reasons I am now considering checking out CouchDB; maybe it fits my particular scenario better.
BTW, if anyone knows the link to the respective Mongo issue describing the sparse unique index problem - please share.
I'm sure you can with Mongo (I'm more familiar with it), and pretty sure you can with Couch too.
Both are document-oriented (JSON-based), so there would be no "columns" but rather fields in documents - and these can be fully dynamic.
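For instance, here's roughly what the asker's user-defined "tables" could look like in Mongo (a sketch with pymongo; all names are made up):

    from pymongo import MongoClient

    db = MongoClient()["userspace"]
    table = db["user_table_42"]  # one collection per user-created "table"

    # "Rows" with completely different "columns" coexist happily.
    table.insert_one({"name": "widget", "price": 9.99})
    table.insert_one({"name": "gadget", "color": "red", "in_stock": True})

    # Paging maps naturally onto sort/skip/limit.
    page = list(table.find().sort("name", 1).skip(0).limit(20))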
Since they both do it, you may want to look at other factors when deciding which to use: other features you care about, popularity, etc. Google Insights and indeed.com job posts would be ways to gauge popularity.
You could also just try one out; I think you should be able to have Mongo running in 5 minutes.

Disadvantages of CouchDB

I've very recently fallen in love with CouchDB. I'm pretty excited by its enormous benefits and by its beauty. Now I want to make sure that I haven't missed any show-stopping disadvantages.
What comes to your mind? Attached is a list of points that I have collected. Is there anything to add?
Blog posts from as late as 2010 claim "not mature enough" (whatever that's worth).
Slower than in-memory DBMS.
In-place updates require server-side logic (update handlers).
Trades disk vs. speed: Databases can become huge compared to other DBMS (compaction functionality exists, though).
"Only" eventual consistency.
Temporary views on large datasets are very slow.
Replication of large databases may fail.
Map/reduce paradigm requires rethinking (only for completeness).
The only point that worries me is #3 (in-place updates), because it's quite inconvenient.
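For reference, an update handler is a JavaScript function stored in a design document; here is a rough sketch of installing and invoking one over CouchDB's HTTP API (Python with requests; the database and handler names are made up):

    import requests

    BASE = "http://localhost:5984/mydb"  # assumes this database exists

    # Install a handler that bumps a counter server-side, avoiding a
    # full read-modify-write round trip from the client.
    requests.put(BASE + "/_design/demo", json={
        "updates": {
            "bump": "function(doc, req){"
                    "  if (!doc) return [null, 'missing'];"
                    "  doc.counter = (doc.counter || 0) + 1;"
                    "  return [doc, 'bumped'];"
                    "}"
        }
    })

    # Invoke it against an existing document id.
    r = requests.put(BASE + "/_design/demo/_update/bump/some_doc_id")
    print(r.status_code, r.text)  # 'bumped' if the document exists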
The data is in JSON
This means that documents are quite large (big data, network bandwidth, speed), and having descriptive key names actually hurts, since they add to the document size.
No built in full text search
Although there are ways: couchdb-lucene, elasticsearch
plus some more:
It doesn't support transactions
It means that enforcing uniqueness of one field across all documents is not safe - for example, enforcing that a username is unique. Another consequence of CouchDB's inability to support the typical notion of a transaction is that things like incrementing/decrementing a value and saving it back are also dangerous. There aren't many instances where we would want to simply increment/decrement some value and couldn't just store the individual documents separately and aggregate them with a view.
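The one uniqueness guarantee CouchDB does give you is on _id, so the usual workaround is to claim the unique value as the _id of a small side document; a sketch (Python with requests; the database name is hypothetical):

    import requests

    def claim_username(name):
        # PUT with an explicit id (assumes the 'usernames' database
        # exists): 201 means we claimed it, 409 Conflict means
        # someone already owns that name.
        r = requests.put("http://localhost:5984/usernames/" + name,
                         json={"claimed": True})
        return r.status_code == 201

    print(claim_username("alice"))  # True the first time
    print(claim_username("alice"))  # False: 409 Conflict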
Relational data
If the data makes a lot of sense in third normal form and we try to follow that form in CouchDB, we are going to run into a lot of trouble. A possible way to solve this problem is with view collations, but we might constantly be fighting the system. If the data can be reformatted to be much more denormalized, then CouchDB will work fine.
Data warehouse
The problem with this is that temporary views in CouchDB on large datasets are really slow. Using CouchDB with permanent views could work quite well. However, in most cases, a column-oriented database of some sort is a much better tool for the data-warehousing job.
But CouchDB Rocks!
But don't let that discourage you: NoSQL DBs that are written in Erlang (CouchDB, Riak) are the best, since Erlang is meant for distributed systems. Have fun with Couch!
2 more things, which make me cry when using CouchDB (though it's awesome):
It is not designed for frequently updated data
It doesn't have built-in fulltext search
Lack of reader ACLs (does exist for writers, however)
As an old Lotus Domino pro, I was looking at CouchDB as an alternative for a new project I'm kicking off and found the limits on readers to be very weak in Couch vs. Domino. In my app, security is an important consideration and Couch would require a middleware layer to handle reader security.
If you have a database in which it's okay that all defined users can see all the documents, then Couch looks like an interesting platform.
If restricting reads is needed then you'll need to look to a middleware solution or consider another alternative.
Note to CouchDB developers: improve the platform's security options. I realize they will diminish performance when used, but note the cost and make the option available.
Now back to determining which database to use...
Currently no support for ad-hoc queries (might change with the advent of UnQL).
lack of binary protocol support for faster communication
It's nothing to do with CouchDB itself, but being a relative newcomer on the scene means that most sysadmins are still unfamiliar with it and won't allow it anywhere near "their" data centers. If you're in a situation where you're deploying to an environment you don't control yourself, this can be quite the battle.
Lack of support for data archiving - no official support for data archiving is provided with the CouchDB open-source distribution.
Deleting records from the DB is not straightforward.
No option to set an expire (TTL) flag for documents.
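For contrast, MongoDB covers this case with TTL indexes; a minimal pymongo sketch (collection and field names are made up):

    import datetime
    from pymongo import MongoClient

    sessions = MongoClient()["app"]["sessions"]
    # A background task removes documents ~60s after their createdAt time.
    sessions.create_index("createdAt", expireAfterSeconds=60)
    sessions.insert_one({"user": "alice",
                         "createdAt": datetime.datetime.utcnow()})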

Best NoSQL approach to handle 100+ million records

I am working on a project where we are batch-loading and storing a huge volume of data in an Oracle database which is constantly queried via Hibernate against this 100+ million record table (the reads are much more frequent than writes).
To speed things up we are using Lucene for some of the queries (especially geo bounding-box queries) and the Hibernate second-level cache, but that's still not enough. We still have a bottleneck in Hibernate queries against Oracle (we don't cache 100+ million table entities in the Hibernate second-level cache due to the lack of that much memory).
What additional NoSQL solutions (apart from Lucene) can I leverage in this situation?
Some options I am thinking of are:
Use distributed Ehcache (Terracotta) for the Hibernate second-level cache, to leverage more memory across machines and reduce duplicate caches (right now each VM has its own cache).
Switch completely to an in-memory SQL database like H2, but unfortunately such solutions require loading the 100+ million row table into a single VM.
Use Lucene for querying and BigTable (or a distributed hashmap) for entity lookup by id.
What BigTable implementation will be suitable for this? I was considering HBase.
Use MongoDB for storing data and for querying and lookup by id.
I recommend Cassandra with ElasticSearch for a scalable system (100 million is nothing for them). Use Cassandra for all your data and ES for ad-hoc and geo queries. Then you can kill your entire legacy stack. You may need an MQ system like RabbitMQ for data sync between Cassandra and ES.
It really depends on your data sets. The number-one rule of NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data, you can look into the various NoSQL solutions out there. The default unit of distribution is the key. Therefore you need to remember that you must be able to split your data between your node machines effectively, otherwise you will end up with a horizontally scalable system with all the work still being done on one node (albeit with better queries, depending on the case).
You also need to think back to the CAP theorem: most NoSQL databases are CP or AP (often offering only eventual consistency), while traditional relational DBMSs are CA. This will impact the way you handle data and the creation of certain things; for example, key generation can become tricky.
Also remember that in some systems, such as HBase, there is no indexing concept. All your indexes will need to be built by your application logic, and any updates and deletes will need to be managed as such. With Mongo you can actually create indexes on fields and query them relatively quickly; there is also the possibility of integrating Solr with Mongo. You don't just need to query by ID in Mongo as you do in HBase, which is a column-family store (aka a Google BigTable-style database) where you essentially have nested key-value pairs.
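As an illustration of that last point (and of the geo bounding-box queries mentioned in the question), a small pymongo sketch - collection and field names are hypothetical:

    from pymongo import MongoClient

    places = MongoClient()["geo"]["places"]
    places.create_index([("loc", "2d")])  # legacy coordinate-pair index
    places.create_index("category")       # ordinary secondary index

    places.insert_one({"name": "cafe", "category": "food",
                       "loc": [10.1, 52.4]})

    # Bounding-box query: everything inside the rectangle.
    box = {"$geoWithin": {"$box": [[10.0, 52.0], [11.0, 53.0]]}}
    for doc in places.find({"loc": box, "category": "food"}):
        print(doc["name"])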
So once again it comes down to your data: what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising. In the work I am involved with, we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it, stream it, update it, etc. We don't just use one system but many, each best suited to the job at hand. For this process we use different systems at different stages, as this gives us fast access where we need it, provides the ability to stream and analyse data in real time and, importantly, keeps track of everything as we go (as data loss in a prod system is a big deal). I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that productionizing a system using these technologies is a bit harder than installing Oracle on a server; some releases are not as stable and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.
Another path that no one has mentioned thus far is NewSQL - i.e. horizontally scalable RDBMSs... There are a few out there, like MySQL Cluster (I think) and VoltDB, which may suit your cause.
Again, it comes down to understanding your data and the access patterns. NoSQL systems are also non-relational and are therefore better suited to non-relational data sets. If your data is inherently relational and you need SQL query features that really need to do things like Cartesian products (aka joins), then you may well be better off sticking with Oracle and investing some time in indexing, sharding and performance tuning.
My advice would be to actually play around with a few different systems. Look at:
MongoDB - Document - CP
CouchDB - Document - AP
Redis - In memory key-value (not column family) - CP
Cassandra - Column Family - Available & Partition Tolerant (AP)
HBase - Column Family - Consistent & Partition Tolerant (CP)
Hadoop/Hive
VoltDB - A really good-looking product: a relational database that is distributed and might work for your case (it may be an easier move). They also seem to provide enterprise support, which may be more suited to a prod env (i.e. it gives business users a sense of security).
Anyway, that's my 2c. Playing around with the systems is really the only way you're going to find out what really works for your case.
As you suggest, MongoDB (or any similar NoSQL persistence solution) is an appropriate fit for you. We've run tests with significantly larger data sets than the one you're suggesting on MongoDB, and it works fine. Especially if you're read-heavy, MongoDB's sharding and/or distributing reads across replica set members will allow you to speed up your queries significantly. If your use case allows for keeping your indexes right-balanced, your goal of getting close to 20 ms queries should become feasible without further caching.
You should also check out the Lily project (lilyproject.org). They have integrated HBase with Solr. Internally they use message queues to keep Solr in sync with HBase. This allows them to have the speed of Solr indexing (sharding and replication), backed by a highly reliable data storage system.
You could group requests and split them by specific sets of data, having a single server (or a group of servers) process each set; that way the data can be available in a cache to improve performance.
For example:
Say employee and availability data are handled using 10 tables; these can be handled by a small group of servers when you configure the Hibernate cache to load and handle requests.
For this to work you need a load balancer (which balances load by business scenario).
I'm not sure how much of this can be implemented here.
At 100M records your bottleneck is likely Hibernate, not Oracle. Our customers routinely have billions of records in the individual fact tables of our Oracle-based data warehouse, and it handles them fine.
What kind of queries do you execute on your table?

MongoDB for personal non-distributed work

This might be answered here (or elsewhere) before but I keep getting mixed/no views on the internet.
I have never used anything except SQL-like databases, and then I came across NoSQL DBs (MongoDB, specifically). I tried my hand at it. I was doing it just for fun, but everywhere the talk is that it is really great when you are using it across distributed servers. So I wonder: is it helpful (in a non-trivial way) for doing small projects, mainly on a personal computer? Are there real advantages when there is just one server?
Although it would be cool to use MapReduce (and talk about it to peers :D), wouldn't it be overkill for small projects run on single servers? Or are there other advantages to this? I need some clear thoughts. Sorry if I sounded naive here.
Optional: Some examples where/how you have used would be great.
Thanks.
IMHO, MongoDB is perfectly valid for single-server/small projects; it's not a prerequisite that you should only use it for "big data" or multi-server projects.
If MongoDB solves a particular requirement, the scale of the project doesn't matter, so don't let that aspect sway you. Using MapReduce may be a bit of overkill/not the best approach if you truly have low-volume data and just want to do some basic aggregations - these could be done using the group operator (which currently has some limitations with regard to how much data it can return).
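(The group command mentioned above was later superseded by the aggregation pipeline; a basic aggregation in pymongo looks like this - collection and field names are made up:)

    from pymongo import MongoClient

    orders = MongoClient()["shop"]["orders"]
    pipeline = [
        {"$match": {"status": "paid"}},
        {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}},
        {"$sort": {"total": -1}},
    ]
    for row in orders.aggregate(pipeline):
        print(row["_id"], row["total"])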
So I guess what I'm saying in general is, use the right tool for the job. There's nothing wrong with using MongoDB on small projects/single PC. If a RDBMS like SQL Server provides a better fit for your project then use that. If a NoSQL technology like MongoDB fits, then use that.
+1 on AdaTheDev - but there are three more things to note here:
Durability: From version 1.8 onwards, MongoDB has single-server durability when started with --journal, so it's now more applicable to single-server scenarios.
Choosing a NoSQL DB over, say, an RDBMS shouldn't be decided by the single- or multi-server setting, but based on the modelling of the database. See for example 1 and 2 - it's easy to store comment-like structures in MongoDB.
MapReduce: again, it depends on the data modelling and the operation/calculation that needs to occur. Depending on the way you model your data you may or may not need to use MapReduce.

HBase, Cassandra, CouchDB, MongoDB... any fundamental difference?

I just wanted to know if there is a fundamental difference between HBase, Cassandra, CouchDB and MongoDB? In other words, are they all competing in exactly the same market and trying to solve exactly the same problems? Or do they fit best in different scenarios?
All this comes down to the question: what should I choose, and when? Is it a matter of taste?
Thanks,
Federico
Those are some long answers from @Bohzo (but they are good links).
The truth is, they're "kind of" competing. But they definitely have different strengths and weaknesses and they definitely don't all solve the same problems.
For example Couch and Mongo both provide Map-Reduce engines as part of the main package. HBase is (basically) a layer over top of Hadoop, so you also get M-R via Hadoop. Cassandra is highly focused on being a Key-Value store and has plug-ins to "layer" Hadoop over top (so you can map-reduce).
Some of the DBs provide MVCC (Multi-version concurrency control). Mongo does not.
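CouchDB's MVCC shows up directly in its API: every update must carry the _rev it read, or the server rejects it. A quick sketch (Python with requests; assumes a local CouchDB and an existing 'demo' database):

    import requests

    DOC = "http://localhost:5984/demo/item1"
    requests.put(DOC, json={"qty": 1})        # create revision 1

    doc = requests.get(DOC).json()            # carries the current _rev
    doc["qty"] += 1
    ok = requests.put(DOC, json=doc)          # matches _rev: accepted
    stale = requests.put(DOC, json=doc)       # same old _rev: rejected
    print(ok.status_code, stale.status_code)  # 201 409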
All of these DBs are intended to scale horizontally, but they do it in different ways. All of these DBs are also trying to provide flexibility in different ways. Flexible document sizes or REST APIs or high redundancy or ease of use, they're all making different trade-offs.
So to your question: In other words, are they all competing in the exact same market and trying to solve the exact same problems?
Yes: they're all trying to solve the issue of database-scalability and performance.
No: they're definitely making different sets of trade-offs.
What should you start with?
Man, that's a tough question. I work for a large company pushing tons of data, and we've been at it for a few years. We tried Cassandra at one point a couple of years ago and it couldn't handle the load. We're using Hadoop everywhere, but it definitely has a steep learning curve and it hasn't worked out in some of our environments. More recently we've tried to do Cassandra + Hadoop, but it turned out to be a lot of configuration work.
Personally, my department is moving several things to MongoDB. Our reasons for this are honestly just simplicity.
Setting up Mongo on a Linux box takes minutes and doesn't require root access, a change to the file system or anything fancy. There are no crazy config files or Java recompiles required. So from that perspective, Mongo has been the easiest "gateway drug" for getting people onto KV/document stores.
CouchDB and MongoDB are document stores
Cassandra and HBase are key-value based
Here is a detailed comparison between HBase and Cassandra
Here is a (biased) comparison between MongoDB and CouchDB
Short answer: test before you use in production.
I can offer my experience with both HBase (extensive) and MongoDB (just starting).
Even though they are not the same kind of stores, they solve the same problems:
scalable storage of data
random access to the data
low latency access
We were very enthusiastic about HBase at first. It is built on Hadoop (which is rock-solid), it is under Apache, it is active... what more could you want? Our experience:
HBase is fragile
administrator's nightmare (full of configuration settings where the defaults are less than perfect, nontransparent configuration, changes from version to version, ...)
loses data (unless you have set configuration X and changed Y to... you get the point :) - we found that out when HBase crashed and we lost 2 hours (!!!) of data because the WAL was not set up properly
lacks secondary indexes
lacks any way to perform a backup of the database without shutting it down
All in all, HBase was a nightmare. Wouldn't recommend it to anyone except to our direct competitors. :)
MongoDB solves all these problems and more. It is a delight to set up, it makes administration a simple and transparent job, and the default configuration settings actually make sense. You can perform (hot) backups, and you can have secondary indexes. From what I read, I wouldn't recommend MapReduce on MongoDB (JavaScript, one thread per node only), but you can use Hadoop for that.
And it is also VERY active when compared to HBase.
Also:
http://www.google.com/trends?q=HBase%2CMongoDB
Need I say more? :)
UPDATE: many months later I must say MongoDB delivered on all accounts and more. The only real downside is that hosting companies do not offer it the way they offer MySQL. ;)
It also looks like MapReduce is bound to become multi-threaded in 2.2. Still, I wouldn't use MR this way. YMMV.
Cassandra is good for writing data. It has the advantage that "writes never fail". It has no single point of failure.
HBase is very good for data processing. HBase is based on the Hadoop Distributed File System (HDFS), so HBase doesn't need to worry about data replication or data consistency. HBase has a single point of failure. I am not really sure what that means - if it has a single point of failure, then it is somehow similar to an RDBMS, where we also have a single point of failure. I might be wrong, since I am quite new to this.
How about Riak? Does someone have experience using Riak? I read somewhere that you need to pay; I am not sure. An explanation would be appreciated.
One more thing: which one would you prefer when your only concern is reading a lot of data, with no concern for writing? Imagine you have a database with a petabyte of data and you want fast searches - which NoSQL database would you prefer?