Basically we are looking at solutions to keep our running costs down that will allow us to keep 3+ nodes in sync.
We were looking at PostgreSQL when Cassandra came onto our radar.
Can anyone tell me from experience whether Dapper.net and fluent-migration work with Cassandra?
Also, any advice on this kind of setup would be appreciated.
We have an application that uses PostgreSQL 8.4 on Windows (yeah, I know).
We have several of these apps in our country.
What we want to do is have a Linux server in a data centre that stores a full copy of the database, and have the data stream into it fairly regularly.
This doesn't need to be real-time and 100% consistent, but we want to get as close to that as possible, as we will use it to track sales data through the day.
The "slave" (data centre) doesn't need to do anything other than receive all the data, and then an app will run some reports on it.
I've looked into it (Slony, pgpool, running 32-bit PostgreSQL on 64-bit Linux, etc.), but it's a big area, so I'm looking for some advice on our less than ideal setup.
Your basic options are, as Craig pointed out, Bucardo, Londiste, and Slony. These are all somewhat complex to set up compared to streaming replication.
The big thing you can't do is use streaming replication or similar solutions. These apply architecture-specific (and major-version-specific) log files, so going across architectures will on good days just not work and on bad days lead to data corruption on the slave. Don't do it.
These three solutions pull the data out in an architecture-independent format and send it through an additional infrastructure to be saved on the slave. There are big tradeoffs here and I would recommend researching each option thoroughly before committing.
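To make the idea of pulling changes out in an architecture-independent form concrete, here is a minimal toy sketch. It uses SQLite from the Python standard library as a stand-in for both servers, and the table names (sales, change_log) and the ship_changes helper are made up for illustration. It is not how Slony, Londiste, or Bucardo are actually implemented; it only shows the general shape of trigger-based change capture followed by replay on a replica.

```python
import sqlite3

# Hypothetical schema: triggers record every change to "sales" as plain values.
PRIMARY_SCHEMA = """
CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, amount REAL);
CREATE TABLE change_log (
    seq    INTEGER PRIMARY KEY AUTOINCREMENT,
    op     TEXT NOT NULL,
    id     INTEGER NOT NULL,
    store  TEXT,
    amount REAL
);
CREATE TRIGGER sales_ins AFTER INSERT ON sales BEGIN
    INSERT INTO change_log (op, id, store, amount)
    VALUES ('I', NEW.id, NEW.store, NEW.amount);
END;
CREATE TRIGGER sales_upd AFTER UPDATE ON sales BEGIN
    INSERT INTO change_log (op, id, store, amount)
    VALUES ('U', NEW.id, NEW.store, NEW.amount);
END;
CREATE TRIGGER sales_del AFTER DELETE ON sales BEGIN
    INSERT INTO change_log (op, id) VALUES ('D', OLD.id);
END;
"""

REPLICA_SCHEMA = "CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, amount REAL);"


def ship_changes(primary, replica, last_seq):
    """Replay logged changes newer than last_seq on the replica; return the new position."""
    rows = primary.execute(
        "SELECT seq, op, id, store, amount FROM change_log WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, op, row_id, store, amount in rows:
        if op == "D":
            replica.execute("DELETE FROM sales WHERE id = ?", (row_id,))
        else:
            # Inserts and updates both become an upsert of the latest row values.
            replica.execute(
                "INSERT OR REPLACE INTO sales (id, store, amount) VALUES (?, ?, ?)",
                (row_id, store, amount),
            )
        last_seq = seq
    replica.commit()
    return last_seq


def demo():
    primary = sqlite3.connect(":memory:")
    primary.executescript(PRIMARY_SCHEMA)
    replica = sqlite3.connect(":memory:")
    replica.executescript(REPLICA_SCHEMA)

    primary.execute("INSERT INTO sales (id, store, amount) VALUES (1, 'north', 10.0)")
    primary.execute("INSERT INTO sales (id, store, amount) VALUES (2, 'south', 25.5)")
    primary.execute("UPDATE sales SET amount = 12.0 WHERE id = 1")
    primary.execute("DELETE FROM sales WHERE id = 2")
    primary.commit()

    position = ship_changes(primary, replica, last_seq=0)
    rows = replica.execute("SELECT id, store, amount FROM sales ORDER BY id").fetchall()
    return position, rows
```

Because the log holds ordinary column values rather than on-disk pages, the replica can run on a different architecture or version. The price is the extra triggers, the log table, and a process that has to keep shipping changes.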
One thing to keep in mind is that the PostgreSQL community is typically quite adamant that no one-size-fits-all replication solution is possible, so there are many solutions, each of which is usually quite specialized.
Of these, Slony is probably the most configurable and Londiste is the simplest. They are for very different use cases though. If I have time and nobody beats me to it, I may post a comparison of the three or at least link to others.
Update: Brief comparison.
Slony-I
Slony-I is the oldest and most powerful logical replication system available. I actually prefer to think of Slony-I as a replication toolkit rather than a solution. The toolkit approach offers incredible flexibility and the ability to solve all kinds of problems in complex environments. The downside is that the flexibility brings complexity. As I put it, "Slony will happily let you replicate only part of your database. On the other hand, Slony will happily let you replicate only part of your database." It is an extremely helpful solution and makes all kinds of things possible, but the complexity is much higher than the other solutions.
One major advantage of Slony, however, is that it has tools for managing DDL changes. To my knowledge, Londiste and Bucardo do not. This means that adding columns to tables is possible on Slony but not so much on the other systems.
Bucardo
This is somewhere between Londiste and Slony in complexity. It has the primary useful feature of being able to do multi-master replication between two masters. It uses Perl extensively. I don't know how well it has been tested on Windows, and this may be a drawback.
Londiste
Londiste is Skype's master-slave replication system built on pgq (basically an event queue connected to PostgreSQL with events raised on database actions). It has a reputation of being easy to set up but not readily protecting replicas against modification. This of course could be a feature or a bug depending on how you want to look at it.
I am currently working on a project which includes migrating a content recommender from MySQL to a NoSQL database for performance reasons. Our team has been evaluating some alternatives like MongoDB, CouchDB, HBase and Cassandra. The idea is to choose a database that is capable of running on a single server or in a cluster.
So far we have discarded HBase due to its dependency on a distributed environment. Even though we intend to scale horizontally, we need to run the DB on a single server for a little while in production. MongoDB was also discarded because it does not support map/reduce features.
We still have 2 alternatives and no solid background on which to decide. Any guidance or help is appreciated.
NOTE: I do not intend to start a religion-like discussion with unfounded arguments. It is a strictly technical question to be discussed in the problem's context.
Graph databases are usually considered best suited for recommendation engines, since a lot of the recommendation algorithms are actually graph-based. I recommend looking into Neo4j: it can handle billions of nodes/edges on a single machine, and it supports a so-called high availability mode, which is a master-slave setup with automatic master selection.
I have read a lot about MongoDB.
I like all the features it provides, but I wonder if it's possible to have it as the only database for my application, including storing sensitive information.
I know that it compromises the durability part of ACID, but as a solution I will have 1 master and 2 slaves in different locations.
If I do that, is it possible to use it as the primary database, storing everything?
UPDATE:
Let's put it this way.
I really need document storage rather than a traditional DBMS to be able to create my flexible application. But is MongoDB reliable enough to store sensitive customer information if I have multiple database replicas and master-slave? Because as far as I know, one major downside is that it compromises the D in ACID. So I solve that with multiple databases.
With that in place, are there no major problems such as data loss?
And someone told me that with MongoDB a customer could be billed twice. Could someone enlighten me on this?
Yes, MongoDB can work as an application's only data store. Use replication, and make sure you know about safe writes and the w option.
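As a rough illustration of what w controls, here is a toy in-memory model, not MongoDB driver code. The ToyReplicaSet class and its methods are hypothetical names invented for this sketch. The idea it tries to capture is that a write is only reported as successful once at least w members of the replica set have stored it.

```python
class ToyReplicaSet:
    """Hypothetical in-memory model of acknowledged writes; not a real driver API."""

    def __init__(self, reachable):
        # reachable[i] says whether member i can currently receive writes.
        # Member 0 plays the role of the primary in this sketch.
        self.reachable = list(reachable)
        self.data = [dict() for _ in self.reachable]

    def write(self, key, value, w=1):
        """Store the value on every reachable member; report success only with >= w acks."""
        if not self.reachable[0]:
            return False  # no primary, so the write is not accepted at all
        acks = 0
        for i, up in enumerate(self.reachable):
            if up:
                self.data[i][key] = value
                acks += 1
        # In this toy model a write that gets too few acks is not rolled back;
        # the caller is simply told that the requested guarantee was not met.
        return acks >= w


def demo():
    rs = ToyReplicaSet([True, True, False])  # one secondary is unreachable
    return (
        rs.write("order-1", "paid", w=1),
        rs.write("order-2", "paid", w=2),
        rs.write("order-3", "paid", w=3),
    )
```

With one of three members unreachable, writes that ask for one or two acknowledgements succeed, while a write that asks for all three is reported as not safely replicated. For sensitive data, the point is to choose w so that an acknowledged write has reached more than one machine.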
And someone told me that with MongoDB a customer could be billed twice. Could someone enlighten me on this?
I'm guessing whoever told you that was talking about eventual consistency. Some NoSQL databases are eventually consistent, meaning that you could have conflicting writes. MongoDB is not, though; it is strongly consistent (just like a relational database).
Your application being flexible or not has absolutely nothing to do with whether you use "nosql", a "document db" or a proper RDBMS. Nobody using your application will care either way.
If you need flexibility while coding, you should look into frameworks, like ActiveRecord for Ruby, which can make DB interfacing much simpler, more generic and more powerful. At that level, you can gain a lot more than by just changing the DB, and you can even become DB-agnostic, meaning you can change DB without changing any code. Indeed, I have found that ActiveRecord boosts my productivity many times over by relieving me of tedious and error-prone "code intermixed with SQL".
Indeed, if you feel you need a schemaless database for critical data, you are doing something wrong. You are putting your own convenience over critical needs of the project, or in ignorance thinking you won't get into problems later. Sooner or later, lack of consistency will bite your ass, hard!
I feel you are hesitant towards RDBMS because you are not really that comfortable with all the jargon, syntax and sound CS principles.
Believe me, if you're going to create an application of any value, you are a hundred times better off learning SQL, ACID and good database principles in the first place. Just read up on the basics, and build your knowledge from wherever you are now. It's the same for each and every one of us: it takes time, but you learn to do things right from the start.
Low-level tools like MongoDB and equivalent just provide you with infinitely more ammunition to shoot yourself in the foot with. They make it seem easy. In reality however, they leave the hard work for you, the coder, to deal with, while an RDBMS will handle more of the cruft for you once you grok the basics.
Why use computers at all? If you want more work, you can just go back to paper. Design will be a breeze, and implementation can be super-quick. Done. Except it won't be right, of course.
In the real world, we can't afford to ignore consistency, database design and many more sound CS principles. Which is why it's a great idea to study them in the first place, and keep learning more and more.
Don't buy into the hype. You ask a question about MongoDB here, but add that you really need its features. With 25 years of computer experience, I simply don't buy it. If you think creatively, an RDBMS can be made to do whatever you want it to, or a framework can be utilized to save you from errors and wasted time.
Crafting ACID properties onto MongoDB seems like more work to me and, from experience, sounds like an exercise in futility compared with using what is already designed for such purposes.
Is it possible? Sure. Why not? It's possible to store everything as XML in text files if you wanted to.
Is it the best idea? It depends on what you're trying to do, the rest of your system architecture, the scale your site is intended to achieve, and so on.
We have thought a bit about running a NoSQL database for our next project. However, we're not sure which platform will give us the best possible availability and has the best built-in replication features to provide this, with the least headache.
Right now, Cassandra appears to be the best candidate, but we would like to hear more about this from someone who has more experience in this area than we do.
Thanks a lot!
High availability will most likely be achieved with a Dynamo clone.
Cassandra is a good option, although it has been bashed recently by several early adopters.
Project Voldemort is also Dynamo-based and therefore easily optimized for high availability; it's what LinkedIn are using.
Another interesting NoSQL option might be Membase. I haven't used it myself, but their notion of virtual buckets for rebalancing, as opposed to just consistent hashing, makes a lot of sense and would appear to provide more robust high availability.
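To show what virtual buckets buy you, here is a small sketch of the general technique. It is not Membase's actual code: the bucket count of 64, the use of CRC32, and the function names are all choices made for this illustration. Keys hash to a fixed set of buckets, and a separate map assigns buckets to servers, so rebalancing only edits the map.

```python
import zlib

NUM_VBUCKETS = 64  # assumption for the sketch; fixed for the life of the cluster


def vbucket_for(key):
    """Map a key to a virtual bucket; this never changes when servers come or go."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def initial_map(servers):
    """Spread the buckets round-robin over the starting servers."""
    return {vb: servers[vb % len(servers)] for vb in range(NUM_VBUCKETS)}


def server_for(key, vb_map):
    return vb_map[vbucket_for(key)]


def add_server(vb_map, new_server):
    """Hand buckets to new_server until it holds an even share; return the moved bucket ids."""
    servers = sorted(set(vb_map.values()) | {new_server})
    target = NUM_VBUCKETS // len(servers)
    counts = {s: 0 for s in servers}
    for owner in vb_map.values():
        counts[owner] += 1
    moved = []
    for vb in sorted(vb_map):
        if counts[new_server] >= target:
            break
        owner = vb_map[vb]
        # Only take buckets from servers that hold more than their fair share.
        if owner != new_server and counts[owner] > target:
            vb_map[vb] = new_server
            counts[owner] -= 1
            counts[new_server] += 1
            moved.append(vb)
    return moved
```

Going from two servers to three in this sketch reassigns only the buckets listed in the return value; every other key keeps its owner, and the unit of data movement is a whole bucket, which can be copied and then switched over in one step.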