Is Cassandra ready for prime time yet?

I started looking into Cassandra and I am really impressed with what it provides, but at the same time I read about how Reddit had a fire drill after migrating to Cassandra, and about Twitter deciding not to use it for tweets. Although those were about a year ago or so, I am wondering if the latest version is ready for prime time yet?

Netflix has talked extensively about how they are moving from Oracle and SimpleDB entirely to Cassandra.
Twitter was also at the Cassandra Summit a few weeks ago talking about how they use Cassandra for multiple projects; Reddit had some early problems with being under-capacity, but later said, "Our traffic more than tripled [in 2010], and the transparent scalability afforded to us by Apache Cassandra is in large part what allowed us to do it on our limited resources."
There are many other companies using Cassandra (and DataStax customers are the tip of the iceberg).
In short, Cassandra is solving real problems for real companies. Just don't go into it expecting MySQL and you'll be fine. The DataStax documentation is a good starting point.
(Chris is mistaken about API stability: we were clear that after 0.7 we would be strict about maintaining backwards compatibility, and we have, even for "maintenance" operations like schema updates and mixed-version cluster operation for downtime-free upgrades. I would also note that unlike many "NoSQL" databases, Cassandra has always taken data durability seriously.)

Cassandra is still under very heavy development. The API is still changing, and in that respect, no, the product isn't stable. There are still occasional glitches, and a number of kinks to be worked out. It is still a very young product with a long way to go before actual maturity.
Having said that, Cassandra is quite capable, provided that you are able to structure your data in a manner suited to Cassandra's strong points. In other words, if you play to Cassandra's strengths, I think you'll find that it's "mature enough" at this point. There are already a number of large sites that use Cassandra, and in this regard it's certainly ready for "prime time" (whatever that really means).
It will be years (if ever) before it has the same reputation and stability as a traditional DBMS like MySQL.

Related

PostgreSQL 8.4 Win 32Bit, replication strategies to 64 bit Linux

So we have an application that uses PostgreSQL 8.4 on Windows (yeah, I know)...
We have several of these apps in our country.
What we want to do is have a Linux server in a data centre that stores a full copy of the database, and have the data stream into it fairly regularly.
This doesn't need to be real-time 100% consistent, but we want to get as close to that as possible, as we will use it to track sales data through the day.
The "slave" (data centre) doesn't need to do anything other than receive all the data, and then an app will run some reports on it.
I've looked into Slony, pgpool, running 32-bit PostgreSQL on 64-bit Linux, etc., but it's a big area, so I'm looking for some advice on our less-than-ideal setup.
Your basic options are, as Craig pointed out, Bucardo, Londiste, and Slony. These are all somewhat complex to set up compared to streaming replication.
The big thing you can't do is use streaming replication or similar solutions. These apply architecture-specific (and major-version-specific) log files, so going across architectures will on good days just not work and on bad days lead to data corruption on the slave. Don't do it.
These three solutions pull the data out in an architecture-independent format and send it through additional infrastructure to be saved on the slave. There are big trade-offs here, and I would recommend researching each option thoroughly before committing.
One thing to keep in mind is that the PostgreSQL community is typically quite adamant that no one-size-fits-all replication solution is possible, which is why there are many options, each of which is usually quite specialized.
Of these, Slony is probably the most configurable and Londiste is the simplest. They are for very different use cases though. If I have time and nobody beats me to it, I may post a comparison of the three or at least link to others.
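To make the "architecture-independent format" point concrete, here is a deliberately naive sketch (Scala over JDBC, with a made-up replication_queue change-capture table) of what trigger-based logical replication boils down to: changes travel as plain SQL that any platform can apply, rather than as binary WAL segments. The real tools above are vastly more robust; this is only an illustration.

```scala
import java.sql.DriverManager

// Hypothetical replication_queue(id, sql_text) table that triggers on the
// master fill in with the SQL needed to reproduce each change. Because the
// changes travel as plain SQL rather than binary WAL, the replica's
// architecture (32-bit Windows vs 64-bit Linux) does not matter.
object NaiveLogicalReplay {
  def main(args: Array[String]): Unit = {
    // Requires the PostgreSQL JDBC driver; hosts and credentials are made up.
    val master = DriverManager.getConnection("jdbc:postgresql://win-host/app", "repl", "secret")
    val slave  = DriverManager.getConnection("jdbc:postgresql://linux-host/app", "repl", "secret")

    val pending = master.createStatement()
      .executeQuery("SELECT id, sql_text FROM replication_queue ORDER BY id")

    while (pending.next()) {
      val id  = pending.getLong("id")
      val sql = pending.getString("sql_text")
      slave.createStatement().execute(sql)   // replay the change on the replica

      val done = master.prepareStatement("DELETE FROM replication_queue WHERE id = ?")
      done.setLong(1, id)
      done.executeUpdate()                   // drop it from the queue once applied
    }

    master.close(); slave.close()
  }
}
```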
Update: Brief comparison.
Slony-I
Slony-I is the oldest and most powerful logical replication system available. I actually prefer to think of Slony-I as a replication toolkit rather than a solution. The toolkit approach offers incredible flexibility and the ability to solve all kinds of problems in complex environments. The downside is that this flexibility comes with complexity. As I put it, "Slony will happily let you replicate only part of your database. On the other hand, Slony will happily let you replicate only part of your database." It is an extremely helpful solution and makes all kinds of things possible, but the complexity is much higher than that of the other solutions.
One major advantage of Slony, however, is the fact that it has tools for managing DDL changes. Londiste and Bucardo do not, to my knowledge. This means that adding columns to replicated tables is possible with Slony but not so much with the other systems.
Bucardo
This is somewhere between Londiste and Slony in complexity. Its primary useful feature is being able to do multi-master replication between two masters. It uses Perl extensively. I don't know how well it has been tested on Windows, and this may be a drawback.
Londiste
Londiste is Skype's master-slave replication system built on PgQ (basically an event queue connected to PostgreSQL, with events raised on database actions). It has a reputation for being easy to set up but for not readily protecting replicas against modification. This, of course, could be a feature or a bug depending on how you want to look at it.

NoSQL (e.g. MongoDB) or RDBMS (e.g. PostgreSQL) for new Scala project?

I'm developing a brand new project in Scala. It's just an application for a bunch of CRUD operations; however, because of some eccentric requirements, neither Play2 nor Lift fits the bill, so I'm going to develop the application from the ground up. This means that Anorm and ScalaQuery become less obvious choices for database integration, and leaves me with the question: is it time to try something new?
My past technology stacks mostly included Java and PostgreSQL, and I have experience with both ORM and plain SQL. Are NoSQL database management systems like MongoDB a good replacement for a typical RDBMS, or are they special-case application data stores? Also, how does the choice of database affect the greater Scala system design (if at all)? For example, the fact that you are using a JSON-like interface to talk to the database, and JSON between the web and a REST service, does not mean that much if everything in the middle becomes Scala objects, or does it?
I'm basically asking for someone's experience of moving from relational to object/document-type databases, using Scala in particular. I know that good RDBMS integration is promised in the upcoming release of SLICK. So, if a company like TypeSafe decides to make RDBMS integration part of the TypeSafe stack, will I be swimming upstream by integrating with MongoDB using Casbah, for example?
Apologies if this question appears a bit vague. I do hope that someone with the right insights or experience will be able to help though.
Update:
Apologies for not adding links to SLICK (it being fairly new). Here goes:
Quick overview
Project home
Update 2:
My personal first win for a technology is usually developer productivity; this translates to lightweight and simple: quick to learn, easy to maintain, no magic.
I am currently in a similar situation, and since I have some experience with web development and SQL databases, I took it as an opportunity to work with MongoDB, Casbah (and Scalatra). My experience is still very limited, and the project and the amount of data I am working with are pretty small, but here are a few observations I've made.
For the few sets of data I have, performance does not seem to favour either SQL or NoSQL. However, performance in the presence of huge amounts of data is often listed as a reason for using NoSQL, e.g., by Wikipedia.
My documents (entries in the database) arise from benchmarking test suites and mainly have a static structure, so I am optimistic that I could store them in a fixed-schema SQL database. However, a few substructures are not static, e.g., new test cases are added, new statistics are tracked, others are removed. This was my main motivation for trying a schema-free NoSQL database. Another motivation was the feeling that the document approach of MongoDB makes it much more obvious which data belongs together (i.e., to a document), in contrast to entries in a relational database, where the data would be distributed over various tables and rows, and where a full "document" would need to be reconstructed by joins.
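As a small illustration of that "one document instead of several joined tables" point, here is a minimal sketch using Casbah; the collection and field names are invented, and it assumes a local mongod plus a reasonably recent Casbah on the classpath.

```scala
import com.mongodb.casbah.Imports._

// One benchmark run stored as a single document, nested results included.
val runs = MongoClient("localhost", 27017)("benchmarks")("runs")

val run = MongoDBObject(
  "suite"   -> "io-tests",
  "date"    -> "2012-08-01",
  "results" -> MongoDBList(                      // nested, variable-length substructure
    MongoDBObject("case" -> "seq-read",  "ms" -> 42),
    MongoDBObject("case" -> "rand-read", "ms" -> 913)
  )
)
runs.insert(run)

// The whole "document" comes back from one query; no joins needed to reconstruct it.
runs.findOne(MongoDBObject("suite" -> "io-tests")).foreach(println)
```

The nested results array can grow or shrink per document without any schema migration, which is exactly the flexibility described above.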
Tools such as Lift-Json or Rogue allow you to work with regular Scala objects in a type-safe way, although the data is regularly (de-)serialised to (from) JSON. However, this naturally works best if the structure of your data is mainly static; otherwise, you are left with using strings to access your data (e.g., when unpacking the results of a query with Casbah).
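A rough sketch of that type-safe round trip with Lift-Json; the case classes and the JSON are made up for illustration.

```scala
import net.liftweb.json._
import net.liftweb.json.Serialization.write

// Made-up case classes mirroring a mostly static document structure.
case class TestCase(name: String, ms: Int)
case class Run(suite: String, results: List[TestCase])

implicit val formats: Formats = DefaultFormats

val json = """{"suite":"io-tests","results":[{"name":"seq-read","ms":42}]}"""

val run: Run     = parse(json).extract[Run]   // JSON -> regular Scala objects
val back: String = write(run)                 // Scala objects -> JSON again

println(run.results.head.ms)                  // 42
```

As soon as the real structure drifts away from the case classes, you fall back to navigating the parsed JValue (or the raw DBObject in Casbah) with string keys, which is the trade-off mentioned above.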
If you are mainly concerned about a coherent representation of data on the server and client side, languages such as Opa or Haxe might be of interest, since they compile to code that can be executed on both sides. See this page for "multitarget" or "tierless" languages.
Got too long for a comment. I was just trying to relate my short experience with Scala (about 6 months now, since about when Play2 came out; it's quickly become my go-to language).
I've enjoyed using Salat/Casbah with MongoDB in my last few projects; most have been in Play2, but the latest was without a webapp framework. It definitely hasn't felt like swimming upstream.
I would say that there are particular use cases for which I wouldn't use Mongo, but it works nicely as a general-purpose object data store, especially if you expect to query by id or index, don't need transactions, and need only minimal ad-hoc aggregation.
Expect to require a separate set of servers dedicated to MongoDB (or to use a service dedicated to MongoDB), but I guess that's normal for most serious database apps.
I've also used Play2/Anorm, which was surprisingly enjoyable to use for some ad-hoc query dashboard-style report pages. I started trying to go the Squeryl route, but Anorm seemed easier to use for one-off aggregation queries. Haven't looked at SLICK, but it sounds interesting.
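For what it's worth, this is the kind of one-off aggregation query that makes Anorm pleasant for dashboard-style pages. The table and column names are invented; outside Play you just need an implicit java.sql.Connection in scope (and the PostgreSQL JDBC driver on the classpath).

```scala
import java.sql.{Connection, DriverManager}
import anorm._
import anorm.SqlParser.scalar

// Hypothetical reporting database; credentials are placeholders.
implicit val conn: Connection =
  DriverManager.getConnection("jdbc:postgresql://localhost/reports", "app", "secret")

// One-off aggregation: plain SQL in, a single typed value out.
val signupsToday: Long =
  SQL("select count(*) from users where created_at >= current_date")
    .as(scalar[Long].single)

println(s"signups today: $signupsToday")
```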
It's really hard to say without knowing what problems you would like the app to solve.
I've personally found my productivity increased using NoSQL DBs via REST/JSON. Though bear in mind most NoSQL DBs offer REST interfaces which preclude the need for much middleware, Scala or otherwise, unless you intend to write a webapp with a UI.
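To illustrate the "not much middleware needed" point: talking to a REST-style store such as CouchDB is just HTTP plus JSON, which plain JDK classes can handle. A minimal sketch follows; it assumes a local CouchDB with an existing database named testdb, and the document id and body are made up. Any store with an HTTP interface looks much the same.

```scala
import java.net.{HttpURLConnection, URL}

// Hypothetical document URL: http://host:port/<database>/<doc id>
val docUrl = new URL("http://127.0.0.1:5984/testdb/doc1")

// PUT a JSON document straight over HTTP: no driver or middleware involved.
val put = docUrl.openConnection().asInstanceOf[HttpURLConnection]
put.setRequestMethod("PUT")
put.setDoOutput(true)
put.setRequestProperty("Content-Type", "application/json")
val out = put.getOutputStream
out.write("""{"type":"note","text":"hello"}""".getBytes("UTF-8"))
out.close()
println(s"PUT status: ${put.getResponseCode}")   // 201 on first create

// GET it back; the response body is plain JSON, ready for any JSON library.
val get = docUrl.openConnection().asInstanceOf[HttpURLConnection]
val body = scala.io.Source.fromInputStream(get.getInputStream).mkString
println(body)
```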
If this is a learning exercise, I recommend you try multiple things out, as each NoSQL DB has something different to offer to your toolkit. I have personally found that CouchDB, Riak, Neo4j, and MongoDB all have various pluses and drawbacks and are good for different purposes.
Hope this helps, good luck.

Best suited NoSQL database for Content Recommender

I am currently working on a project which includes migrating a content recommender from MySQL to a NoSQL database for performance reasons. Our team has been evaluating some alternatives like MongoDB, CouchDB, HBase and Cassandra. The idea is to choose a database that is capable of running on a single server or in a cluster.
So far we have discarded HBase due to its dependency on a distributed environment. Even though the idea is to scale horizontally, we need to run the DB on a single server for a little while in production. MongoDB was also discarded because it does not support map/reduce features.
We still have 2 alternatives, and we have no solid background on which to decide. Any guidance or help is appreciated.
NOTE: I do not intend to start a religion-like discussion with unfounded arguments. It is a strictly technical question to be discussed in the problem's context.
Graph databases are usually considered best suited for recommendation engines, since a lot of recommendation algorithms are actually graph-based. I recommend looking into Neo4j: it can handle billions of nodes/edges on a single machine, and it supports a so-called high-availability mode, which is a master-slave setup with automatic master selection.
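To see why recommendations map so naturally onto a graph, here is a toy two-hop traversal over made-up "user liked item" edges in plain Scala. A graph database such as Neo4j does essentially this walk, but over billions of edges, with indexes and persistence behind it.

```scala
// Toy data: users and items are nodes, "liked" relationships are the edges
// of a bipartite graph. Everything here is invented for illustration.
object GraphRecommendation {
  val liked: Map[String, Set[String]] = Map(
    "alice" -> Set("article1", "article2"),
    "bob"   -> Set("article2", "article3"),
    "carol" -> Set("article2", "article4")
  )

  // Two hops from `user`: user -> items -> other users who share an item ->
  // their items, minus what `user` already saw, ranked by how often they appear.
  def recommend(user: String): List[(String, Int)] = {
    val mine = liked.getOrElse(user, Set.empty)
    val neighbourItems = liked.collect {
      case (other, items) if other != user && (items & mine).nonEmpty => items
    }
    neighbourItems.flatten
      .filterNot(mine)                       // drop items the user already liked
      .groupBy(identity)
      .map { case (item, hits) => item -> hits.size }
      .toList
      .sortBy(-_._2)
  }

  def main(args: Array[String]): Unit =
    println(recommend("alice"))              // e.g. List((article3,1), (article4,1))
}
```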

Is CouchDB a good persistence layer for Membase?

Membase is great for social games due to its low latency.
As I understand it, CouchDB is an MVCC system using a B+ tree, with a focus on an append-only design.
(http://guide.couchdb.org/draft/btree.html)
One of the most important scenarios for Membase is social games.
Social games have a lot of write operations (50+%),
and a good portion of them are in-place updates.
So why is CouchDB a suitable persistence layer for Membase?
I'd also add that CouchDB's append-only log format really doesn't have much relation to whether application writes are new items or updates. The append-only format gives us much better reliability and performance than an in-place system (like SQLite... which is still quite reliable). It's also much easier to take backups of.
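A toy sketch of the append-only idea, just to make the point that an application-level "update" and an "insert" look identical to an append-only store. This is not CouchDB's actual implementation (which is a copy-on-write B+ tree); it only illustrates the principle.

```scala
// Every write is appended as a new versioned record; nothing is overwritten
// in place, and recovery is a matter of replaying the log.
object AppendOnlyToy {
  final case class Record(key: String, version: Long, value: String)

  private var log = Vector.empty[Record]     // the append-only "file"

  def put(key: String, value: String): Unit = {
    val next = latest(key).map(_.version + 1).getOrElse(1L)
    log = log :+ Record(key, next, value)    // insert and update look identical
  }

  def latest(key: String): Option[Record] =
    log.reverse.find(_.key == key)           // newest version wins

  def main(args: Array[String]): Unit = {
    put("player:1", """{"score":10}""")
    put("player:1", """{"score":25}""")      // an "in-place update" from the app's view
    println(latest("player:1"))              // Some(Record(player:1,2,{"score":25}))
  }
}
```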
Does Membase NEED an append-only log format? Maybe not... Does it NEED CouchDB? YES!
The benefits of map-reduce and indexing as well as eventually consistent replication that CouchDB brings are nothing less than huge for Membase...and the benefits of low-latency, clustering and UI that Membase brings to CouchDB are arguably just as important.
(Disclosure: I work for Couchbase)
Perry Krug
CouchDB has great file formats, great ability to recover from crashes, sophisticated authentication and authorization tools, and a universal, standard, interface: HTTP. CouchDB is poor at low-latency queries, optimized memory utilization, and heavy update speeds (a million per second).
Membase currently has only a simple SQLite file format for persistence, less sophisticated authentication and authorization, and a more obscure protocol. Membase is amazing for low-latency queries, ideal memory utilization, and heavy update speeds.
I think the two complement each other very well. Since the merging effort is coming from core developers in both projects, collaborating together, I expect to see the strengths of both and the weaknesses of neither. Yes, CouchDB is a good persistence layer for Membase.
Money speaks, and if there ever was a vote of confidence then here it is, not only from a new lead investor but also from the existing ones.
http://www.couchbase.com/press-releases/couchbase-series-C
Besides, don't you think that Membase itself is more than qualified to evaluate such a merger decision?

MongoDB versus CouchDB... And any other "major players"

What are the major differences between MongoDB and CouchDB, and are there any other major NoSQL database servers out there worth mentioning?
I know that CERN uses CouchDB somewhere in their LHC back-end; huge stamp of approval. What are MongoDB - and any other major servers' - references?
Update
One of the major selling points of CouchDB, to me, is the REST-based API and seamless JavaScript integration using JSON as a data wrapper. Is this possible with any of the other NoSQL databases mentioned?
There are many more differences, but some quick points:
CouchDB has MVCC (Multi Version Concurrency Control) - each time a document is updated, a NEW version of it is created. Whereas MongoDB is update-in-place.
CouchDB has support for multi-master, so you can write to any server. MongoDB only has 1 server active for writes (master-slave). However, I think this may have changed in the latest release (1.6), so MongoDB may now support multiple servers for writes.
To see who's using MongoDB see here (e.g. foursquare, bit.ly, sourceforge....)
To see who's using CouchDB see here.
The most notable other NoSQL database is Cassandra (facebook, twitter)
Then you have HBase, HyperTable, RavenDB, SimpleDB, and more still...
Welcome to some new ground! @AdaTheDev covered most of the major ones. There's also Project Voldemort, Tokyo Cabinet/Tyrant, and a whole bunch of wrappers around all of these things. So people are also building things like MemcacheDB (memcache with a persistence layer).
MongoDB has several hooks to support "REST" APIs (check out "Sleepy Mongoose" and Node.js support). MongoDB and CouchDB have different ways of handling map-reduces (though they are somewhat similar). MongoDB does not have MVCC, but the two systems really have different ways of storing data each with their own set of trade-offs.
MongoDB uses language-specific drivers where CouchDB uses REST (performance trade-off).
For a more detailed comparison, look here.
MongoDB is probably a little easier for a relational developer to grasp since it uses drivers and has better support for ad hoc queries. CouchDB has very little in common with the old relational ways of doing things.
Both deal with sharding and replication differently.
Having said that, I believe both are conceptually similar enough that it often boils down to personal preference. They are both fun to code with. In fact, we evaluated both for an internal project and went back and forth with our decision.