I an new to Big Data; obviously most applications using NoSQL frameworks such as MongoDB, CouchDb, and Cassandra require access to huge amount of data. Now, my question is if all these NoSQL tools use Hadoop file system as their storage, or some how file system of their own?
If they use Hadoop file system, then do they have an easy way to integrate with Hadoop file system?
Thanks
No, they do not use HDFS by default. Many of the NoSQL databases have been made to scale out well. That is, the data can be separated onto a bunch of regular non-HDFS machines and if configured correctly (in some cases this could be a big if) they will operate efficiently.
So they do not use HDFS for their scaling systems, but they can be integrated with Hadoop
Documentation and Webinar about MongoDB and Hadoop.
Blog about CouchDB and Hadoop.
Documentation about Cassandra and Hadoop.
Related
I'm in the process of developing my next app, and I'm really interested in using polyglot persistence. I like the idea of being able to query different data structures for different services. I'm essentially wanting to sync MongoDB, Neo4j/Titan, SQL, and maybe Cassandra/Hbase.
Currently, I'm wrapping everything in a try/catch block and rolling them all back if one fails. However, this is taxing my write times. I've also looked into AMQP systems like Kafka or ZeroMQ, but these seem more big data centric, whereas my app is still small and I want to keep it efficient.
Has anyone had experience with this? Is a MQ a good idea for a small app or am I prematurely optimizing?
Thanks
I know quite a bit about ZeroMQ, but not a lot about the database servers you mention.
First, you're a bit confused about ZeroMQ. Although it is derived from experience with AMQP, it uses the ZMTP wire protocol. That was custom designed during ZeroMQ development [but other applications do now use it].
ZeroMQ is a small and very fast MQ library that is symmetrical for all nodes; it is very good for small apps. The problem here is that you need something on the other systems that talks ZMTP, whether it's ZeroMQ or a bridge. If you intend making plugins or the like for the other systems then fine.
I presume though, that you are using JMS to talk to the other systems without intending to develop add-ons for them. In which case you're probably stuck with JMS. Kafka is a new one that I haven't caught up with, but RabbitMQ is a good, fast, and small, broker. FWIW. There are a great many broker comparisons out there for you to find. Many are dodgy in the sense that one small tweak of a setting can affect the performance greatly and are not necessarily comparing apples with apples. If want to compare broker performance in your environment, there isn't much of a shortcut to doing it yourself.
One thing that is confusing me is how you expect a broker to help your rollback performance. You'll still need to do the rollback in essentially the same manner, albeit asyncronously via the broker.
I work at CloudBoost.io (https://www.cloudboost.io) and we build a layer that sits on top of databases and give you the power of Polyglot Persistence persistence. We integrate MongoDB, ElasticSearch, Redis, Cassandra, and Neo4j and give you the one single APIwhere you can query / store your data. We automatically shard your data into various databases based on your query / storage request patterns.
Let me know if this helps. :)
For syncing data from MongoDB to Neo4j there is now the Neo4j Doc Manager project. It works by monitoring MongoDB for operations and converts the document operation into a property graph model and immediately writes to Neo4j.
What I got from the documentation is, It runs as a separate process on other machine and I can communicate with it using the mongo db client driver for java and I can do normal operations.
But I doubt If I can use MongoDB as an embedded db in my java application? I mean, not as a separate process on other machine or not as a separate process on the same machine. It should be part of java application.
Can you please help me out?
No, that's not possible. MongoDB is a native C++ application that uses memory-mapped files, opens sockets, etc. It won't run in a JVM.
Also, MongoDB was made for web scale applications, big data, failover clusters (replica sets) and auto-sharding, none of which really make sense in an embedded application. Also, it's quite aggressive in terms of memory usage which is undesirable for embedded applications.
--EDIT after zero323's comment--
You might want to take a look at db4o an object database for java that was made for embedding.
Also, when embedding databases, the licenses can bite you and force you to release your code under the same license, in case of MongoDB the AGPL.
I understand it may seem a redundant question, but I hope it will clarify all my and other users doubts.
Here are the noSQL I am talking about can someone explain me:
The best use cases (When I should use it)
The pros and the cons (Limitations included)
Their added value (why it is better, possibly a mathematical/scientific explanation)
of MongoDB, Redis, CouchDB, Hadoop
Thanks
g
MongoDb and CouchDb are not key-value storages, but document-stores.
The best way to clarify doubts - reading technical documentation and overview =)
In short - MongoDb and CouchDb are fast enough, reliable key-value storages, that persist
data to disc. MongoDb works over custom TCP/IP protocol, CouchDb use REST approach over HTTP
Redis is another kind of ket/value storage, that stores all data in memory, all writes and reads go directly to memory. This approach has some drawbacks and benefits as well. It persists changes at disc too
Hadoop is not just a key/value storage. It's an engine for the distributed processing of large data. If you are going to write Google you can use Hadoop.
In my opinion, if you are not going to build something specific (i think you won't), go ahead and use MongoDB.
I am currently working in a project which includes migrating a content recommender from MySQL to a NoSQL database for performarce reasons. Our team has been evaluating some alternatives like MongoDB, CouchDB, HBase and Cassandra. The idea is to choose a database that is capable of running in a single server or in a cluster.
So far we have discarded the use of Hbase due to its dependency on a distributed environment. Even having the idea of scaling horizontally, we need to run the DB in a single server for a little while in production. MongoDB was also discarded because it does not support map/reduce features.
We have still 2 alternatives and we have no solid background to decide. Any guidance or help is appreciated
NOTE: I do not pretend to create a religion-like discussion with non-founded arguments. It is a strictly technical question to be discussed in the problem's context
Graph databases are usually considered as best suited for recommendation engines, since a lot of the recommendation algorithms are actually graph based. I recommend looking into Neo4J - it can handle billions of nodes/edges on a single machine and it supports a so-called high availability mode which is a master-slave setup with automatic master selection.
What are the major differences between MongoDB and CouchDB, and are there any other major NO-SQL database-servers out there worth mentioning?
I know that CERN uses CouchDB somewhere in their LHC back-end; huge stamp of approval. What are MongoDB - and any other major servers' - references?
Update
One of the major selling points of CouchDB, to me, is the REST-based API and seamless JavaScript integration using JSON as a data-wrapper. Is this possible with any of the other NO-SQL databases mentioned?
There are many more differences, but some quick points:
CouchDB has MVCC (Multi Version Concurrency Control) - each time a document is updated, a NEW version of it is created. Whereas MongoDB is update-in-place.
CouchDB has support for multi-master, so you can write to any server. MongoDB only has 1 server active for write (master-slave) - However: I this this may have changed in the latest release (1.6) so MongoDB may now support multiple servers for writes
To see who's using MongoDB see here (e.g. foursquare, bit.ly, sourceforge....)
To see who's using CouchDB see here.
The most notable other NoSQL database is Cassandra (facebook, twitter)
Then you have HBase, HyperTable, RavenDB, SimpleDB, and more still...
Welcome to some new ground #AdaTheDev covered most of the major ones. There's also Project Voldemort, Tokyo Cabinet/Tyrant, and a whole bunch of wrappers around all of these things. So people are also building MemcacheDB (memcache with a persistence layer).
MongoDB has several hooks to support "REST" APIs (check out "Sleepy Mongoose" and Node.js support). MongoDB and CouchDB have different ways of handling map-reduces (though they are somewhat similar). MongoDB does not have MVCC, but the two systems really have different ways of storing data each with their own set of trade-offs.
MongoDB uses language-specific drivers where CouchDB uses REST (performance trade-off).
For more detailed comparison look here.
MongoDB is probably a little easier for a relational developer to grasp since it uses drivers and has better support for ad hoc queries. CouchDB has very little in common with the old relational ways of doing things.
Both deal with sharding and replication differently.
Having said that, I believe both are conceptually similar enough that it often boils down to personal preference. They are all fun to code with. In fact, we evaluated both for an internal project and went back and forth with our decision.