Which NoSQL databases support text array columns (and indexes on this columns) like the postgreSQL text[] type? - nosql

I need to move data from a postgreSQL to a NoSQL database, in the process we are evaluating different NoSQL databases and Cassandra came up as a possibility but from the documentation it seems like Cassandra doesn't support having a text array as a column type, is this correct? Which NoSQL databases support this type of columns and support indexes on this type of columns?
For example to store this and have an index on a column with this type of data:
City:['Washington','Washington DC']
Thanks in advance!

Not exactly an answer to your question (not enough reputation to comment (?!?)), but understanding that your problem is scale, and you are coming from PostgreSQL, have you tried PostgresXC yet? That may be a much easier transition than to NoSQL. NoSQL databases, as I assume you know, have very different performance characteristics and nuances that might actually do more harm than good. Postgres-XC is a multi-master write-scalable fork of PostgreSQL and sits somewhere between 9.1 and 9.2 from a PostgreSQL feature standpoint and it is an active project. 9.2 conformance was slated this month or last if I recall correctly. It's relatively easy to set up for what it is - you'll build 2 GTM's, one as a primary and one as a failover, give them enough memory. Then you can scale horizontally by adding pairs of coordinators and data nodes, 1 coordinator and 1 data node per server. Your application tier can talk to any of the coordinators, transactions are shipped to the appropriate coordinators and you can specify the distribution of your data by table - either replicated for small reference tables or distributed for large ones. If you design your queries well, you can get massive performance improvement because your queries can be shipped and executed simultaneously on multiple coordinator/data node pairs.
I know you are looking for NoSQL, but I mention this because we too had a vertical vs horizontal scale problem and in the end I found it was easier to build NoSQL capability into a relational system than it was to build relational capability into a NoSQL system. And of course it all depends on your data, sometimes NoSQL is absolutely the best choice. Sometimes it can be a major headache too, for example some NoSQL databases have problems with filesystem growth so whereas you thought you bought horizontal scalability you wound up eating your SAN out of house and home.
Anyway, hope that helps! I would have left it as a comment but stackoverflow has that strange reputation thing going on.
I forgot to mention also, with Postgres-XC you can specify on which columns you wish to distribute and by what kind of algorithm. I typically distribute by hash, and make sure of two things, first that hash can be generated application-side so that I don't have to do joins on tables that are gadzillions of rows and second that the hash keeps the distribution level across servers correct but while also keeping related information together on the same server so as to increase the shippability of queries. That is, if you have a customer table and a customer orders table, distribute both on a hash of some customer unique information that is in both tables and make sure you can generate that application-side. I hope that makes sense, I'm not sure if I did a good job explaining. If you would like further clarification on that please let me know, the docs are a bit scattered on XC right now, so a lot of what I related is OJT.

Related

Guideline when designing a database in Postgresql

I am designing a database in Postgresql and I would like to have some expert advices before refactorizing my work.
The database naturally contains different parts that I plan to separate into schemas in order to have a mangling of object names that reflect logical organization of them. About 20 tables are for scientific purposes and 20 others are technical and 20 furthers are about administrative tasks.
Is that a good idea or am I misleading myself into a management overhead that I will regret later?
The database contains 3 tables that are huge. By huge, I mean there is more than 60 millions of rows in it and they might grow a little bit. I think I will create special tablespace for that tables. I would like to do it, in order to separate logically the place where data are stored because the rest of the database should be backuped in a different way than that three tables.
Further more one those 3 tables contains binary data that are not heavy but weight a bit when multiplying by the amount of rows and also this table grows faster than the 2 others. Then I will periodically purge it after backuping the table.
Is it a good idea to have more than one tablespace in a database? If so, is there any precaution to be taken when proceeding this way?
Thank you in advance for your advices.
Choosing good names & grouping database stuffs is always a wise choice, and such overheads are not usually considerable.
About separating tablespace of a single database, it also should not cause any special problem, I've a similar database (but in mysql) that has a large file table and I had to move all of it's content to another server for some optimization reasons and i had no problem with it till now.
There is a very more important matter in RDBMS designing and that's CORRECT TABLE INDEXING. I think choosing good indexes is most critical phase of designing a relational database and you'll see it's effect soon (when you begin to write JOIN queries!).
In general, designing and implementing database is an experimental job that depends to your situation and expertness, so you can't seek for a solid instruction.

NoSQL in a single machine

As part of my university curriculum I ended up with a real project which consists in helping a company shifting from their relational data warehouse into a NoSQL data warehouse. The thing is that what they are looking for is better performance in large jobs but so far they have used a single machine and if they indeed migrate to NoSQL they still wish to keep using a single machine for cost reasons.
As far as I know the whole point of NoSQL is to run it in a large distributed system of several machines. So I don't see the point of this migration, specially since I am pretty sure (but not entirely) that if they do install NoSQL, they will probably end having even worst performance.
But still I am not comfortable telling them this since I am still new to this area (less than a month), so I wonder, is there are any situation where using NoSQL in a single machine for a datawarehouse would be justifiable performance wise? Or is it just a plain bad idea?
The answer to your question, like the answer to so many questions, is "it depends."
Ignoring the commentary on the question, I think there may be legitimacy to your client's question. Both relational and non-relational databases ultimately hold data in key-value tuples, with indexes and such to ensure quick and speedy access to the data. The difference is that SQL/relational databases contain an incredible amount of overhead to attempt the optimal way to retrieve results given an unknown set of queries, as well as ensure stable concurrency. This overhead is both computationally expensive and rarely results in the optimal solution. As a result, SQL databases often perform significantly slower for simple repetitive queries.
No-sql databases, on the other hand, are more of a "bare-bones" database, relying on programmers and intelligent design to achieve success. They are optimized to retrieve a value for a given key very quickly, often sub-millisecond. As a result, increased up front investment in the design results in superior and near-optimal performance. It will be necessary to determine the cost-benefit of doing this up-front design, but it is all but guaranteed that the no-sql approach will perform better regardless of the number of machines involved (in fact, SQL databases are very difficult or impossible to cluster together and is one of the main reasons why NoSql was developed).
Eventually we will see relational-like solutions implemented on a no-sql platform. In fact, Mongo, Elasticsearch, and Couchbase (probably others) already have SQL-like query functionality. But right now, you are faced with this dilemma.
For a single machine if the load is write heavy e.g. your logging a lot of events you could do for cassandra. Also a good alternative is hbase but its heavy and not suggested for single node. If they expose api in json you could look into document based dbs such as couchbase, mongo db. If you have an idea about the load then selecting a nosql data store is much easier
If you're in a position where you need to pick one, I think you should look first at MongoDB. If you've never tried it, I really recommend you visit their live demo with tutorial and give it a try. If you like, download and follow the installation guide on their site. It's free, runs well on a single machine, and is incredibly easy to use.
In addition to MongoDB, I've used Oracle, SQL Server, MySQL, SQLite, and HBase. I understand Cassandra should be in the list but I've not tried it. With MongoDB, I was fully deployed and executing reads and writes from an application in like two hours. I attribute most of that to their website's clear and concise instructional content. The biggest learning curve was figuring out how the queries work for things like updating a record or deleting a record without deleting the entire set of similar records.
Regarding NoSQL vs RDBMS, some points to consider:
Adding a new column to RDBMS table can lock the database in or degrade performance in another
MongoDB is schema-less so adding a new field, does not effect old documents and will be instant (think how flexible that really is - throw any dimension of data into this system without maintenance overhead)
You're less likely to require a DBA to solve your schema problems when an application changes
I think problems related to table size are irrelevant, so you won't run into a scaling problem - just a disk space problem on single machine

Shifting from SQL to NoSQL and to which DB?

We recently are having major performance related issues in our current SQL Server DB.
Our application is pretty heavy on a single table we did some analysis and about 90% of our db data is in a single table. We run lot of queries on this table as well for analyticall purposes we are experiencing major performance issues now even with a single column addition sometimes slows our current Sp. Most of our teams are developers and we don't have access to a dba which might help in retuning our current db and make things work faster.
Cause of these constraints we are thinking of moving this part of the app to a NoSQL db.
My Questions are :
If this is the right direction we are heading ? As we are expecting exponential growth on this table. With loads of analytic's running on it.
Which would be best option for us CouchDB , Cassandra , MongoDB ? With stress on scalability and performance
For real time analysis and support similar to SQL how things work in a NoSQL is there a facility through which we can view current data being stored? I had read somewhere about Hadoop’s HIVE can be used to write and retreive data as SQL from NoSQL db's am I right?
What might be things which we would be losing out of while shifting from SQL to NoSQL ?
To your questions:
1.. If this is the right direction we are heading ? As we are expecting exponential growth on this table. With loads of analytic's running on it.
Yes, most of the noSQL systems are developed specifically to address scalability and availability, if you use them in the intended way.
2.. Which would be best option for us CouchDB , Cassandra , MongoDB ? With stress on scalability and performance
This depends entirely on what does your data looks like and how you will use it. The noSQL db you mentioned are implemented and behaves very differently from each other, see this link for a more detailed overview comparing the few you mentioned. Comparisons of noSQL solution
3.. For real time analysis and support similar to SQL how things work in a NoSQL is there a facility through which we can view current data being stored? I had read somewhere about Hadoop’s HIVE can be used to write and retreive data as SQL from NoSQL db's am I right?
This depends on the system you go with, because some noSQL db doesn't support range queries or joins, you are restricted in what you can view and how fast you can view.
4.. What might be things which we would be losing out of while shifting from SQL to NoSQL?
There are two major considerations for noSQL:
Query/Structure: NoSQL means no SQL. If your system actually requires structured and complex queries but you went with one of these cool new solution (especially a key-value storage, which is basically a giant hash table), you may soon find yourself in the middle of re-implementing a amateurish, ill-designed RDBMS, with all of your original problems.
Consistency: If you choose a eventual consistent system to scale horizontally, then you will have to accept your data being outdated, which may be harmless to some applications (forums?) or horrible in some other systems (bank).
I think you should stay relational and tune the table, its indexes, and the tables it joins to. You should also consider the use of aggregated (summarized data). Perhaps a more denormalized design would help or even re-designing the data into more of a star structure. Also, operational processing and decision support (or reporting) analyses should not be run on the same tables.
It might be possible to improve the SQL approach by checking for missing indexes etc and also seeing if the isolation level you are using is optimal. It may be possible to use snapshot isolation etc to improve performance. MSDN link
Read up on OLTP vs OLAP also.
NoSQL may still be a better option but you would still need to learn how to work with the database properly, it will come with another different set of issues.

Example of a task that a NoSQL database can't handle (if any)

I would like to test the NoSQL world. This is just curiosity, not an absolute need (yet).
I have read a few things about the differences between SQL and NoSQL databases. I'm convinced about the potential advantages, but I'm a little worried about cases where NoSQL is not applicable. If I understand NoSQL databases essentially miss ACID properties.
Can someone give an example of some real world operation (for example an e-commerce site, or a scientific application, or...) that an ACID relational database can handle but where a NoSQL database could fail miserably, either systematically with some kind of race condition or because of a power outage, etc ?
The perfect example will be something where there can't be any workaround without modifying the database engine. Examples where a NoSQL database just performs poorly will eventually be another question, but here I would like to see when theoretically we just can't use such technology.
Maybe finding such an example is database specific. If this is the case, let's take MongoDB to represent the NoSQL world.
Edit:
to clarify this question I don't want a debate about which kind of database is better for certain cases. I want to know if this technology can be an absolute dead-end in some cases because no matter how hard we try some kind of features that a SQL database provide cannot be implemented on top of nosql stores.
Since there are many nosql stores available I can accept to pick an existing nosql store as a support but what interest me most is the minimum subset of features a store should provide to be able to implement higher level features (like can transactions be implemented with a store that don't provide X...).
This question is a bit like asking what kind of program cannot be written in an imperative/functional language. Any Turing-complete language and express every program that can be solved by a Turing Maching. The question is do you as a programmer really want to write a accounting system for a fortune 500 company in non-portable machine instructions.
In the end, NoSQL can do anything SQL based engines can, the difference is you as a programmer may be responsible for logic in something Like Redis that MySQL gives you for free. SQL databases take a very conservative view of data integrity. The NoSQL movement relaxes those standards to gain better scalability, and to make tasks that are common to Web Applications easier.
MongoDB (my current preference) makes replication and sharding (horizontal scaling) easy, inserts very fast and drops the requirement for a strict scheme. In exchange users of MongoDB must code around slower queries when an index is not present, implement transactional logic in the app (perhaps with three phase commits), and we take a hit on storage efficiency.
CouchDB has similar trade-offs but also sacrifices ad-hoc queries for the ability to work with data off-line then sync with a server.
Redis and other key value stores require the programmer to write much of the index and join logic that is built in to SQL databases. In exchange an application can leverage domain knowledge about its data to make indexes and joins more efficient then the general solution the SQL would require. Redis also require all data to fit in RAM but in exchange gives performance on par with Memcache.
In the end you really can do everything MySQL or Postgres do with nothing more then the OS file system commands (after all that is how the people that wrote these database engines did it). It all comes down to what you want the data store to do for you and what you are willing to give up in return.
Good question. First a clarification. While the field of relational stores is held together by a rather solid foundation of principles, with each vendor choosing to add value in features or pricing, the non-relational (nosql) field is far more heterogeneous.
There are document stores (MongoDB, CouchDB) which are great for content management and similar situations where you have a flat set of variable attributes that you want to build around a topic. Take site-customization. Using a document store to manage custom attributes that define the way a user wants to see his/her page is well suited to the platform. Despite their marketing hype, these stores don't tend to scale into terabytes that well. It can be done, but it's not ideal. MongoDB has a lot of features found in relational databases, such as dynamic indexes (up to 40 per collection/table). CouchDB is built to be absolutely recoverable in the event of failure.
There are key/value stores (Cassandra, HBase...) that are great for highly-distributed storage. Cassandra for low-latency, HBase for higher-latency. The trick with these is that you have to define your query needs before you start putting data in. They're not efficient for dynamic queries against any attribute. For instance, if you are building a customer event logging service, you'd want to set your key on the customer's unique attribute. From there, you could push various log structures into your store and retrieve all logs by customer key on demand. It would be far more expensive, however, to try to go through the logs looking for log events where the type was "failure" unless you decided to make that your secondary key. One other thing: The last time I looked at Cassandra, you couldn't run regexp inside the M/R query. Means that, if you wanted to look for patterns in a field, you'd have to pull all instances of that field and then run it through a regexp to find the tuples you wanted.
Graph databases are very different from the two above. Relations between items(objects, tuples, elements) are fluid. They don't scale into terabytes, but that's not what they are designed for. They are great for asking questions like "hey, how many of my users lik the color green? Of those, how many live in California?" With a relational database, you would have a static structure. With a graph database (I'm oversimplifying, of course), you have attributes and objects. You connect them as makes sense, without schema enforcement.
I wouldn't put anything critical into a non-relational store. Commerce, for instance, where you want guarantees that a transaction is complete before delivering the product. You want guaranteed integrity (or at least the best chance of guaranteed integrity). If a user loses his/her site-customization settings, no big deal. If you lose a commerce transation, big deal. There may be some who disagree.
I also wouldn't put complex structures into any of the above non-relational stores. They don't do joins well at-scale. And, that's okay because it's not the way they're supposed to work. Where you might put an identity for address_type into a customer_address table in a relational system, you would want to embed the address_type information in a customer tuple stored in a document or key/value. Data efficiency is not the domain of the document or key/value store. The point is distribution and pure speed. The sacrifice is footprint.
There are other subtypes of the family of stores labeled as "nosql" that I haven't covered here. There are a ton (122 at last count) different projects focused on non-relational solutions to data problems of various types. Riak is yet another one that I keep hearing about and can't wait to try out.
And here's the trick. The big-dollar relational vendors have been watching and chances are, they're all building or planning to build their own non-relational solutions to tie in with their products. Over the next couple years, if not sooner, we'll see the movement mature, large companies buy up the best of breed and relational vendors start offering integrated solutions, for those that haven't already.
It's an extremely exciting time to work in the field of data management. You should try a few of these out. You can download Couch or Mongo and have them up and running in minutes. HBase is a bit harder.
In any case, I hope I've informed without confusing, that I have enlightened without significant bias or error.
RDBMSes are good at joins, NoSQL engines usually aren't.
NoSQL engines is good at distributed scalability, RDBMSes usually aren't.
RDBMSes are good at data validation coinstraints, NoSQL engines usually aren't.
NoSQL engines are good at flexible and schema-less approaches, RDBMSes usually aren't.
Both approaches can solve either set of problems; the difference is in efficiency.
Probably answer to your question is that mongodb can handle any task (and sql too). But in some cases better to choose mongodb, in others sql database. About advantages and disadvantages you can read here.
Also as #Dmitry said mongodb open door for easy horizontal and vertical scaling with replication & sharding.
RDBMS enforce strong consistency while most no-sql are eventual consistent. So at a given point in time when data is read from a no-sql DB it might not represent the most up-to-date copy of that data.
A common example is a bank transaction, when a user withdraw money, node A is updated with this event, if at the same time node B is queried for this user's balance, it can return an outdated balance. This can't happen in RDBMS as the consistency attribute guarantees that data is updated before it can be read.
RDBMs are really good for quickly aggregating sums, averages, etc. from tables. e.g. SELECT SUM(x) FROM y WHERE z. It's something that is surprisingly hard to do in most NoSQL databases, if you want an answer at once. Some NoSQL stores provide map/reduce as a way of solving the same thing, but it is not real time in the same way it is in the SQL world.

So... this NoSQL thing

I've been looking at MongoDB and I'm fascinated. It appears (although I have to be suspicious) that in exchange for organizing my database in a slightly different way, I get as much performance as I have CPUs and RAM for free? It seems elegant, and flexible, but I'm not trading that for fast like I am with Rails. So what's the catch? What does a relational database give me that I can't do as well or at all with Mongo? In other words, why (other than immaturity of existing NoSQL systems and resistence to change) doesn't the entire industry jump ship from MySQL?
As I understood it, as you scale, you get MySQL to feed Memcache. Now it appears I can start with something equally performant from the beginning.
I know I can't do transactions across relationships... when would this be a big deal?
I read http://teddziuba.com/2010/03/i-cant-wait-for-nosql-to-die.html but as I understand it, his argument is basically that real businesses which use real tools don't need to avoid SQL, so people who feel a need to ditch it are doing it wrong. But no "enterprise" has to deal with nearly as many concurrent users as Facebook or Google, so I don't really see his point. (Walmart has 1.8 million employees; Facebook has 300 million users).
I'm genuinely curious about this... I promise I'm not trolling.
I am also a big fan of MongoDB. That having been said, it is absolutely not a wholesale replacement for RDBMS. Facebook has 300 million users but if some of your friends don't show up in the list one time, or one of the photo albums is missing on the occasional request, would you notice? Probably not. If your status update doesn't trickle down to all of your friends for a few minutes, does it matter? Hardly. If Wal-Mart's balance sheets are out of sync, would someone lose their head? Definitely.
NoSQL databases are great in "fuzzy" environments where relationships are not strict and data integrity can afford to be out of sync. RDBMS are still important when data sets are extremely complex and relational (hence the name), and they need to be kept pure.
The big push to NoSQL comes from the fact for the last 30 years, we have been using RDMBS systems for both scenarios. We now have a more appropriate tool for many situations. Some would argue most, in fact. But no one would argue all.
I write this but as a dispute to Rex's answer.
I dispute the idea that nosql is relationless and fuzzy.
I had been working with CODASYL many years ago with C and Cobol - entity relationships are very tight in CODASYL.
In contrast, relational database systems have a very liberal policy towards relationships. As long as you can identiy a foreign key, you could form a relationship adhoc.
It is frequently taken for granted that SQL is synonymous with RDBMS, but people have been writing SQL drivers for CODASYL, XML, inverted sets, etc.
RDBMS/SQL do not equal precision in data or relationship. In fact, RDBMS has been a constant cause in imprecision and misperception of relationships. I do not see how RDBMS offer better data and relationship integrity than hadoop, for example. Put on a layer of JDO - and we can construct a network of good and clean relationships between entities in hadoop.
However, I like working with SQL because it gives me the ability to script adhoc relationships, even though I realise that adhoc relationships is a constant cause of relationship adulteration and problems.
Having the opportunity to work with statistical analysis of business and industrial processes, SQL gave me the ability to explore relationships where no relationships had previously been perceived. The opportunity to work with statistical analysis gave me insights that would not normally come the way of SQL programmers.
For example, you would design and normalise your schema to reflect a set of processes. What you might not realise is that relationships change over time. The statistical characteristics would reveal that a schema may no longer be as "properly normalised" as it once had been. That the principal components of the processes have mutated over time. But non-statistical programmers do not understand that and continue to tout RDBMS as the perfect solution for data integrity and relationship precision.
However, in a relationship-linking database, you could link entities in relationships as they appear. When relationships mutate, the linking naturally mutate with the data. Relationships and their mutation are documented within the database system without the expensive need to renormalise the schema. At which point, RDBMS is good only as temp dbs.
But then you might counter that RDBMS too allows you to flexibly mutate your relationships, since that is what SQL does best. True, very true - so long as you perform BCNF or even 4NF. Otherwise, you would begin to see that your queries and data loaders performing replicated operations. But then your many years in the RDBMS business have so far certainly at least made you realise that BCNF is very expensive and operationally inefficient and that we are constantly guilty of 2.5 NFing our schemata.
To say that RDBMS and SQL promotes data and relationship integrity is a gross mis-statement. Either you work in a company that is so tiny or you didn't stay in your positions for more than two years - you would not see the amount of data or the information mutation and the problems caused by RDBMS. The abuse of RDBMS is the cause of executives being restricted in the view by computer applications and the cause of financial failures of companies failing to see changes in market behaviour because their views were restricted by the programmers whose views were restricted to their veneration of their beloved RDBMS schemata.
That is why SQL programmers do not understand why your company statistician refuses to use your application which you crafted meticulously but they employed a college intern to write SQL to download data into their personal servers and that your company executives learn to trust the accountants' and statisticians' spreadsheets rather than your elegant multi-tiered applications because of your applications' inability to mutate with processes.
It might not be possible, but I still urge you to acquire some statistical understanding to perceive how processes mutate over time so that you can make the right technological decision.
The reason people are not moving to SQL-less is lack of a good scripting environment like SQL to perform adhoc relationship analysis. Not because SQL-less technology is deficient in precision or integrity. Adhoc relationship analysis is very important nowadays due to the rapid and agile application development attitudes and strategies we have nowadays.
Let me hit the questions one at a time:
I know I can't do transactions across relationships... when would this be a big deal?
Picture cascading deletes. Or even just basic referential integrity. The concept of "foreign keys" can't really be enforced across "collections" (the Mongo term for tables). You can do atomic writes to only a single "document" (AKA record). So if you have a DB issue, you can orphan data in the DB.
I get as much performance as I have CPUs and RAM for free?
Not free, but definitely with a different set of trade-offs. For example, Mongo is great at running single-record, key/value look-ups. However, Mongo is poor at running relational queries. You'll need to use map-reduce for many of these. Mongo is a "RAM-whore". Mongo basically demands 64-bit for any significant dataset. Mongo will suck up drive space, load up a 140GB DB and you can end up using 200+ GB as the swap file grows during use.
And you're still going to want a fast drive.
In fact, I think it's safe to say the MongoDB is really a DB system that caters to leading-edge hardware (64-bit, lots of RAM, SSDs). I mean, the whole DB is centered around looking up data index data in RAM (hello 64-bit) and then doing focused random lookups on the drive (hello SSD).
why ... doesn't the entire industry jump ship from MySQL?
It's not ACID-compliant. Probably quite bad for the banking system (of course, most of them are still processing flat files, but that's a different issue). However, note that you can force "safe" writes with Mongo and guarantee that data gets to disk, but only one "document" at a time.
It's still very young. Lots of big business are still running old versions of Crystal Reports on their SQL Server 2000 app written in VB6. Or they're building enterprise service buses to manage the crazy heterogeneous environments they've built up over the years.
It's a very different paradigm. Maybe 30% of the questions I regularly see on Mongo mailing lists (and here) are fundamentally tied to "how do I do query X?" or "how do I structure this data?". Using MongoDB typically requires that you denormalize in advance. This is not only a little difficult, it's untrained. Most people only learn "normalization" in school, nobody teaches us how to denormalize for performance.
It's not the right tool for everything. Honestly I think that MongoDB is great tool for reading and writing transactional data. That simple "one-a-time" CRUD that comprises much of modern apps. However, MongoDB is not really great at reporting. In fact, I honestly envision that the next step is not "Mongo for everything" it's "Mongo for transactional" and "MySQL for reporting". When your data gets big enough that you throw out "real-time reporting", then using Map-Reduce to populate a reporting DB doesn't seem that bad.
As I understood it, as you scale, you get MySQL to feed Memcache. Now it appears I can start with something equally performant from the beginning.
Honestly, I'm working towards this on a few of my projects. Again, I think that MongoDB actually does make a valid caching layer. In fact, it makes a file-backed caching layer. So if you're capable of pushing MySQL change to Mongo, then you're getting getting Memcached without cache misses. It also makes it easy to "warm the cache" on new server, just copy files and start Mongo pointing at the correct folder, it really is that easy.
How often do you think Facebook does arbitrary queries against its datastore(s)? Not everything is a web app, and conversely not every set of data needs to be analyzed deeply.
NoSQL in my opinion, is largely a reactionary response to what basically amounted to people using RDBMS for tasks they were not well suited because people didn't actively make a decision based on their needs and chose some default. To "jump ship from MySQL" (or RDBMSs in general) industry-wide would be to make the same mistake all over again and the pendulum will end up swinging back the other way.
If MongoDB works for your use case, by all means go ahead. Just don't assume your use case is all use cases. There is no technology that fits all scenarios. The invention of the supersonic jets didn't eliminate the use of freight trains.
The big backlash against NoSQL is rooted in the mentality of many of the NoSQL advocates. Specifically, the attitude best summarized as "SQL is too hard, I shouldn't have to do it". I dislike NoSQL because it seems in many cases to be elevating ignorance.
I know I can't do transactions across relationships... when would this be a big deal?
More often than you might expect. There are a lot of things that can go wrong when you can't assume a consistent dataset.
I have used MongoDB, Redis (more than key-value pair supports list, set and sorted set), Tokyo Tyrant, Memcached and MySql & PostgreSQL.
The arguments between NoSQL DB And SQL based DB are completely baseless. You need to choose the appropriate model based on your use case.. If you need ACID compliances, go ahead with SQL DB like PostgreSQL, Oracle etc. You need high performance, but you less care about data, then you may consider noSQL DB. They are fundamentally different technologies. You can even use the combination of models. With NoSQL, you will be missing relationships, constraints and sometimes transaction.. In fact, thats is the one of the reason NoSQL are faster..
Once I have lost two months of aggregate data with MongoDB.. No clue how I lost them..But I had backup and I have lost few minutes of data. I brought back MongoDB with backup.. If you use NoSQL, take occasional backup or schedule cron jobs for DB backup. This is applicable for SQL DB also.
Compared to SQL RDBMS, NoSQL DBs are younger and they are currently under full fledged development but NoSQL DBs are matured in their scope ie they meant for high performance, easy replication.
In my website(stacked.in), I have used only redis DB, it works much much faster than MySQL.
Remember, NoSQL isn't exactly new. After all, they had to use something before SQL and relational databases, right? In fact, systems like MUMPS and CODASYL work the same way and are decades old. What relational databases give you is the ability to query data in arbitrary ways.
Say you have a database with customers, their purchases, and what items they purchased. A NoSQL DB might have customers containing purchases and purchases containing items. This makes it easy to find out what items a given customer purchased, but hard to find out what customers purchased a given item. A relational DB would have tables for customers, purchases, items, and a table linking items to purchases. In SQL, both queries are trivial to formulate, and the database engine does all the hard work for you.
Also, keep in mind that part of the NoSQL trend is to sacrifice consistency or reliability for speed, scalability, and cost. Relational DBs can scale, but it's not cheap. If you go to http://tpc.org you can find RDBMSes that run on hundreds of cores simultaneously to deliver millions of transactions per minute, but they cost millions of dollars.
If your data does not take advantage of relational algebra, nor do you need ACID guarantees, then you don't gain anything by using languages that cater exclusively for those uses.