Best suited NoSQL database for Content Recommender - nosql

I am currently working in a project which includes migrating a content recommender from MySQL to a NoSQL database for performarce reasons. Our team has been evaluating some alternatives like MongoDB, CouchDB, HBase and Cassandra. The idea is to choose a database that is capable of running in a single server or in a cluster.
So far we have discarded the use of Hbase due to its dependency on a distributed environment. Even having the idea of scaling horizontally, we need to run the DB in a single server for a little while in production. MongoDB was also discarded because it does not support map/reduce features.
We have still 2 alternatives and we have no solid background to decide. Any guidance or help is appreciated
NOTE: I do not pretend to create a religion-like discussion with non-founded arguments. It is a strictly technical question to be discussed in the problem's context

Graph databases are usually considered as best suited for recommendation engines, since a lot of the recommendation algorithms are actually graph based. I recommend looking into Neo4J - it can handle billions of nodes/edges on a single machine and it supports a so-called high availability mode which is a master-slave setup with automatic master selection.

Related

Polyglot persistence for startup app

I'm in the process of developing my next app, and I'm really interested in using polyglot persistence. I like the idea of being able to query different data structures for different services. I'm essentially wanting to sync MongoDB, Neo4j/Titan, SQL, and maybe Cassandra/Hbase.
Currently, I'm wrapping everything in a try/catch block and rolling them all back if one fails. However, this is taxing my write times. I've also looked into AMQP systems like Kafka or ZeroMQ, but these seem more big data centric, whereas my app is still small and I want to keep it efficient.
Has anyone had experience with this? Is a MQ a good idea for a small app or am I prematurely optimizing?
Thanks
I know quite a bit about ZeroMQ, but not a lot about the database servers you mention.
First, you're a bit confused about ZeroMQ. Although it is derived from experience with AMQP, it uses the ZMTP wire protocol. That was custom designed during ZeroMQ development [but other applications do now use it].
ZeroMQ is a small and very fast MQ library that is symmetrical for all nodes; it is very good for small apps. The problem here is that you need something on the other systems that talks ZMTP, whether it's ZeroMQ or a bridge. If you intend making plugins or the like for the other systems then fine.
I presume though, that you are using JMS to talk to the other systems without intending to develop add-ons for them. In which case you're probably stuck with JMS. Kafka is a new one that I haven't caught up with, but RabbitMQ is a good, fast, and small, broker. FWIW. There are a great many broker comparisons out there for you to find. Many are dodgy in the sense that one small tweak of a setting can affect the performance greatly and are not necessarily comparing apples with apples. If want to compare broker performance in your environment, there isn't much of a shortcut to doing it yourself.
One thing that is confusing me is how you expect a broker to help your rollback performance. You'll still need to do the rollback in essentially the same manner, albeit asyncronously via the broker.
I work at CloudBoost.io (https://www.cloudboost.io) and we build a layer that sits on top of databases and give you the power of Polyglot Persistence persistence. We integrate MongoDB, ElasticSearch, Redis, Cassandra, and Neo4j and give you the one single APIwhere you can query / store your data. We automatically shard your data into various databases based on your query / storage request patterns.
Let me know if this helps. :)
For syncing data from MongoDB to Neo4j there is now the Neo4j Doc Manager project. It works by monitoring MongoDB for operations and converts the document operation into a property graph model and immediately writes to Neo4j.

Why mongoDB is used in the MEAN stack

I'm fixing the usability/documentation for the mean stack. I'm starting with Mean.JS. Can someone give me the salient reasons why the authors of the MEAN stack use MongoDB as the database? There are other databases to choose, but MongoDB is used for some reason.
I realize there are questions already covering databases, but I'm wondering specifically why it was used in the MEAN stack scenario.
It think the primary reason is that MongoDB uses the same language Javascript (ECMA Script) for methods and functions API, rather than a separate language (like SQL). Thus MongoDB is a good no SQL database option, and it works much more efficiently as a database for the rest of the stack.
As others have pointed out, there are many other reasons, like that it is the most popular NoSQL database at this point. It has a decent shell and you can write Javascript in it. It is Open Source and well documented.
It is also really easy to setup, and scales fairly well, although not as good as some other NoSQL databases.
It also uses BSON, which is similar to JSON, which is similar to a Javascript object. So it is just plain easy to learn and easy to use this particular database with the rest of a Javascript stack.
There's some pretty good reasons here: http://blog.mongodb.org/post/49262866911/the-mean-stack-mongodb-expressjs-angularjs-and
A Glimpse Into Four Key Components - How MEAN Stack Adds New Dimensions To New-Age Web Applications
All four components of MEAN Stack are popular in the app development space. It offers a platform that enables an effortless development work process. Let’s know about every component and its unique features.
MongoDB – Independent database framework
For any web app building, data storage and management are essential. MongoDB is a popular database with NoSQL document to allow this purpose. The primary use case of this framework is to enable data storage and management of every web application development.(Read More)

Is CouchDB a good persistent layer for Membase?

Membase is great for social game due to it's low latency.
As I understand CouchDB is a MVCC system using b+ tree, with a focus on append only design.
(http://guide.couchdb.org/draft/btree.html)
One of the most important scenario of Membase is social game.
Social game has a lot of write operations (50+%).
And a good portion of them are in-place updates.
So why is CouchDB a suitable persistent layer for Membase?
I'd also add that CouchDB's append-only log format really doesn't have much relation to whether application writes are new items or updates. The append-only format gives us much better reliability and performance than an in-place system (like sqlite...which is still quite reliable). It's also much easier to take backups of.
Does Membase NEED an append-only log format? maybe not...does it NEED CouchDB?...YES!
The benefits of map-reduce and indexing as well as eventually consistent replication that CouchDB brings are nothing less than huge for Membase...and the benefits of low-latency, clustering and UI that Membase brings to CouchDB are arguably just as important.
(Disclosure: I work for Couchbase)
Perry Krug
CouchDB has great file formats, great ability to recover from crashes, sophisticated authentication and authorization tools, and a universal, standard, interface: HTTP. CouchDB is poor at low-latency queries, optimized memory utilization, and heavy update speeds (a million per second).
Membase currently has only a simple SQLite file format for persistence, less sophisticated authentication and authorization, using a more obscure protocol. Membase is amazing for low-latency queries, ideal memory utilization, and heavy update speeds.
I think the two complement each other very well. Since the merging effort is coming from core developers in both projects, collaborating together, I expect to see the strengths of both and the weaknesses of neither. Yes, CouchDB is a good persistence layer for Membase.
Money speaks and if there ever was a vote of confidence then here it is, not only from a new lead investor but also from the existing ones as well.
http://www.couchbase.com/press-releases/couchbase-series-C
Besides, don't you think that Membase itself is more than well enough qualified to make an evaluation for such a merger decision?

MongoDB versus CouchDB... And any other "major players"

What are the major differences between MongoDB and CouchDB, and are there any other major NO-SQL database-servers out there worth mentioning?
I know that CERN uses CouchDB somewhere in their LHC back-end; huge stamp of approval. What are MongoDB - and any other major servers' - references?
Update
One of the major selling points of CouchDB, to me, is the REST-based API and seamless JavaScript integration using JSON as a data-wrapper. Is this possible with any of the other NO-SQL databases mentioned?
There are many more differences, but some quick points:
CouchDB has MVCC (Multi Version Concurrency Control) - each time a document is updated, a NEW version of it is created. Whereas MongoDB is update-in-place.
CouchDB has support for multi-master, so you can write to any server. MongoDB only has 1 server active for write (master-slave) - However: I this this may have changed in the latest release (1.6) so MongoDB may now support multiple servers for writes
To see who's using MongoDB see here (e.g. foursquare, bit.ly, sourceforge....)
To see who's using CouchDB see here.
The most notable other NoSQL database is Cassandra (facebook, twitter)
Then you have HBase, HyperTable, RavenDB, SimpleDB, and more still...
Welcome to some new ground #AdaTheDev covered most of the major ones. There's also Project Voldemort, Tokyo Cabinet/Tyrant, and a whole bunch of wrappers around all of these things. So people are also building MemcacheDB (memcache with a persistence layer).
MongoDB has several hooks to support "REST" APIs (check out "Sleepy Mongoose" and Node.js support). MongoDB and CouchDB have different ways of handling map-reduces (though they are somewhat similar). MongoDB does not have MVCC, but the two systems really have different ways of storing data each with their own set of trade-offs.
MongoDB uses language-specific drivers where CouchDB uses REST (performance trade-off).
For more detailed comparison look here.
MongoDB is probably a little easier for a relational developer to grasp since it uses drivers and has better support for ad hoc queries. CouchDB has very little in common with the old relational ways of doing things.
Both deal with sharding and replication differently.
Having said that, I believe both are conceptually similar enough that it often boils down to personal preference. They are all fun to code with. In fact, we evaluated both for an internal project and went back and forth with our decision.

Choosing noSQL - availability priorited

We have thought a bit about running a noSQL database for our next project. However, we're not sure about which platform that will give us the best possible availability and has the best built-in replication features/functions to provide this - with the least headache.
Right now, Cassandra appears as the best candidate, but we would like to hear more about this from someone that have more experience in this area, then we do.
Thanks a lot!
High availablity will most likely be achieved with a Dynamo clone.
Cassandra is a good option although it has been bashed recently by several early adapters.
Project Voldemort is also Dynamo-based and therefore easily optimized for high-availability, it's what LinkedIn are using.
Another interesting noSQL option might be membase, I myself didn't use it but their notion of virtual buckets for rebalancing as opposed to just consistent hashing makes a lot of sense and would appear to provide more robust high-availability.