Use Case: Currently we have "offices" in places around the world with very intermittent internet access. Sometimes it's great, but sometimes it can go off for hours at a time.
Right now we are using CouchDB with a master database in the cloud, and we have documents with an office_id. We then do a filtered sync to all of these "offices" to only send over documents that have that office_id and that are less than a month old.
With CouchDB you can edit these documents and add new ones on the offline CouchDB server in these offices. At this time, we have a cron that does a replication sync to our master database in the cloud every 5 minutes or so.
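For reference, that same filtered, continuous sync can be expressed as a document in CouchDB's `_replicator` database (a minimal sketch, assuming CouchDB 2.x+ selector support; the URLs, database name, and credentials are hypothetical):

```python
import requests

MASTER = "https://master.example.com:5984"  # hypothetical cloud master
OFFICE = "http://localhost:5984"            # hypothetical office server

# Persistent pull replication from the master to the local office server,
# filtered by office_id with a Mango selector. Because it is continuous,
# CouchDB retries on its own when the internet link comes back, which can
# replace the 5-minute cron.
replication = {
    "source": f"{MASTER}/main_db",
    "target": f"{OFFICE}/main_db",
    "continuous": True,
    "selector": {"office_id": "office-042"},
}

resp = requests.post(f"{OFFICE}/_replicator", json=replication,
                     auth=("admin", "secret"))
resp.raise_for_status()
```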
Problem: While CouchDB makes it really easy to sync, I'm afraid of some scalability issues with CouchDB. Before it gets too late, I'm trying to explore different database avenues and ways I could do this.
Amazon seems to be doubling down with their DocumentDB offering that supports MongoDB, so I'm curious: has anyone done multi-master syncing with MongoDB or a NoSQL equivalent?
I don't want to get into a scenario where scaling puts me in a corner.
Amazon DocumentDB uses shared storage, which isn't at all what you are after and doesn't solve your problem. MongoDB would be a very poor choice for your scenario; master-master replication is a really hard problem to deal with. CouchDB (which you already use) is the first that comes to mind, and you should search explicitly for that feature if you are looking for a replacement. Also note that many multi-master setups assume that when a partition occurs between masters, clients can still connect to all or some of them. That isn't your case: your clients only ever have a single reachable master.
Another option would be to build the replication yourself using a queue system or something similar, but that would require even more infrastructure in each location (since the problem is the internet connection going down), and it would only be "easy" if different offices rarely or never edit the same documents, because dealing with merges manually is a pain.
You don't explain what your scaling worries are, but I don't see MongoDB or any other NoSQL database having much different scalability traits from CouchDB.
EDIT: What you are actually after is Optimistic Replication (or Lazy Replication, Eventual Consistency)
Since you mentioned a "NoSQL equivalent", I would like to explain how Couchbase can accomplish this in 2 different ways:
1) Cross Data Center Replication (XDCR) - Allows clusters of different sizes to synchronize data between them. Replication can be paused and resumed without any issues (conflicts can be resolved via timestamp or document revision), can be uni- or bidirectional, and you can also filter which documents should be synchronized between clusters using Filtering Expressions (a simplified query system): https://docs.couchbase.com/server/6.5/xdcr-reference/xdcr-filtering-expressions.html
2) Sync Gateway - Middleware originally designed to synchronize your main database with databases on the edge. In this architecture, we assume the connection will be down most of the time. https://blog.couchbase.com/couchbase-mobile-embedded-java-write-throughput/ Your application could simply consume the Sync Gateway stream and insert the changes into the replica cluster. (Although I think XDCR should be enough for you.)
Couchbase started as a fork of CouchDB a long time ago and most of the code has been rewritten, but some core concepts are still present in Couchbase.
Finally, a big plus of moving to Couchbase is that it is a distributed and highly scalable database (from 1 to more than 100 nodes in a single cluster), and you will be able to query your data using N1QL: https://query-tutorial.couchbase.com/tutorial/#1
I want to implement MongoDB as a distributed database, but I cannot find good tutorials for it. Whenever I search for distributed databases with MongoDB, I only get links about sharding, so I am confused: are the two the same thing?
Generally speaking, if you have a read-heavy system, you may want to use replication: one primary with up to 50 secondaries. The secondaries share the read load while the primary handles writes. Replication also provides automatic failover: when the primary goes down, one of the secondaries takes over and becomes the new primary.
Sharding, however, is more flexible. All the shards share both write and read load; that is to say, data is distributed across different shards. Each shard can itself consist of a replica set, with automatic failover working as described above.
I would choose replication first because it's simple and is enough for most scenarios. Once it's no longer enough, you can convert from replication to sharding.
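A minimal sketch of what the replication option looks like from the application side with PyMongo (hostnames, database, and collection names are hypothetical):

```python
from pymongo import MongoClient, ReadPreference

# Connect to a three-member replica set named "rs0" (hypothetical hosts).
client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com/?replicaSet=rs0"
)

# Prefer secondaries for reads to spread the read load; writes always go
# to the primary, and the driver follows failover automatically when a
# new primary is elected.
users = client.mydb.get_collection(
    "users", read_preference=ReadPreference.SECONDARY_PREFERRED
)

users.insert_one({"name": "alice"})       # handled by the primary
doc = users.find_one({"name": "alice"})   # may be served by a secondary,
                                          # which can lag slightly behind
```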
There is also another discussion of differences between replication and sharding for your reference.
Just some perspective on distributed databases:
In the early nineties, a lot of applications were desktop-based and had a local database containing MBs or GBs of data.
Now, with the advent of web-based applications, there can be millions of users storing data that runs into GBs, TBs, or PBs. Storing all of it on a single server is economically expensive, so the data is spread across a cluster of servers (often commodity hardware) on which it is horizontally partitioned. Sharding is another term for horizontal partitioning of data.
For example, say you have a Customer table which contains 100 rows and you want to shard it across 4 servers. With 'key'-based sharding, customers would be distributed as follows: SHARD-1 (1-25), SHARD-2 (26-50), SHARD-3 (51-75), SHARD-4 (76-100).
Sharding can be done in 2 ways (a toy sketch of both follows the list):
Hash based
Key based
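Here is that sketch, applied to the 100-customer example (illustrative only; real systems keep range metadata in a catalog rather than hard-coding the arithmetic):

```python
import hashlib

NUM_SHARDS = 4

def key_based_shard(customer_id: int) -> int:
    # Key/range based: ids 1-25 -> shard 0, 26-50 -> shard 1, and so on,
    # matching SHARD-1 (1-25) ... SHARD-4 (76-100) above.
    return (customer_id - 1) * NUM_SHARDS // 100

def hash_based_shard(customer_id: int) -> int:
    # Hash based: a stable hash spreads ids evenly across shards, at the
    # cost of range locality (ids 1 and 2 likely land on different shards).
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(key_based_shard(30))   # 1, i.e. SHARD-2
print(hash_based_shard(30))  # depends on the hash value
```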
I am working on a solution deployed on an Amazon EC2 server instance that has its region set to US WEST. The solution uses mongodb for data storage and contains a web service that is used by a mobile application. The user base of the mobile application is split 40:60 between US and Asia, as such I need to set up another EC2 instance in the Asia Pacific region to lower their latency and connection time.
Since the data storage is located on an instance in US WEST, how would I set up a new instance in Asia Pacific that can share the same data with the US WEST instance? I am open to moving the MongoDB database elsewhere, but I do not want to change to a different NoSQL solution.
There are various different solutions here. I will try to provide a few.
Replica Sets
Perhaps the easiest solution would be to use a Replica Set, where you have two servers in US-WEST and one in ASIA. Replica sets in MongoDB require a minimum of three nodes to work, and as your existing deployment is in US-WEST it makes sense to put the majority of nodes there.
Now, with just those three nodes, you only solve having the data available closer to ASIA with one of the nodes. You then need to use Read Preferences to instruct your application to read from either the US-WEST or the ASIA nodes. I have written an article about how PHP deals with those read preferences at http://derickrethans.nl/readpreferences.html — other language drivers have similar solutions.
All drivers maintain connections to each of the replica set nodes, so connection overhead should not be too much of a problem, and you can at least read from the nearest node to reduce latency. Writes, however, always have to go to the primary (which will likely be in US-WEST).
Pros: Fairly easy to set up, only three nodes required
Cons: Only good for directing reads, but not writes
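As a rough sketch of that read-preference idea with PyMongo (hostnames and the "dc" tags are hypothetical, and the tags must match tags configured on the replica set members):

```python
from pymongo import MongoClient

# Hypothetical hosts; assumes the members were configured with tags such
# as {"dc": "us-west"} and {"dc": "asia"} in the replica set config.
# Additional readPreferenceTags entries can be appended as fallbacks.
client = MongoClient(
    "mongodb://us1.example.com,us2.example.com,asia1.example.com"
    "/?replicaSet=rs0"
    "&readPreference=nearest"
    "&readPreferenceTags=dc:asia"
)

# App servers in Asia read from the tagged Asia node when it is
# available; writes still go through the primary in US-WEST.
doc = client.appdb.profiles.find_one({"user_id": 42})
```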
Sharding
Sharding is a method in MongoDB that allows you to separate your whole data set into smaller pieces, making it possible to fit a huge dataset into MongoDB without being constrained by the resources of one server. Typically, a sharded set-up consists of at least two shards, each containing a (3-node) replica set, but it is also possible to have a replica set consist of only one node, which means you'd end up with two shards, each containing one data node.
Sharding in MongoDB supports "Tag Aware Sharding" (http://mongodb.onconfluence.com/display/DOCS/Tag+Aware+Sharding), which makes it possible to direct specific documents to specific shards depending on a field in your document. If your documents have, for example, a range of user IDs or country codes, you can use that to route documents to the correct shard.
Setting this up is however not very easy as it requires quite a good understanding of sharding with MongoDB. There is a really nice introduction at http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
Pros: Allows you to have data localized to one specific location for both read and write
Cons: Not easy to set up; you need two data nodes, config servers, and proxy servers.
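A rough sketch of tag-aware sharding from a driver's point of view, using the zone commands of more recent MongoDB versions (shard, zone, and namespace names are hypothetical; older versions used the sh.addShardTag()/sh.addTagRange() shell helpers instead):

```python
from collections import OrderedDict
from pymongo import MongoClient

# Connect to a mongos query router (hypothetical host).
admin = MongoClient("mongodb://mongos.example.com:27017").admin

# Assign shards to named zones.
admin.command(OrderedDict([("addShardToZone", "shard0000"), ("zone", "US")]))
admin.command(OrderedDict([("addShardToZone", "shard0001"), ("zone", "ASIA")]))

# Pin a shard-key range to a zone: documents with country "JP" end up on
# the ASIA shard (min is inclusive, max is exclusive).
admin.command(OrderedDict([
    ("updateZoneKeyRange", "appdb.users"),
    ("min", {"country": "JP"}),
    ("max", {"country": "JQ"}),
    ("zone", "ASIA"),
]))
```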
Hope this helps!
If your application is read-heavy, I would use MongoDB's replica sets.
What are the strengths and weaknesses of the various NoSQL databases available?
In particular, it seems like Redis is weak when it comes to distributing write load over multiple servers. Is that the case? Is it a big problem? How big does a service have to grow before that could be a significant problem?
The strengths and weaknesses of the NoSQL databases (and also SQL databases) is highly dependent on your use case. For very large projects, performance is king; but for brand new projects, or projects where time and money are limited, simplicity and time-to-market are probably the most important. For teaching yourself (broadening your perspective, becoming a better, more valuable programmer), perhaps the most important thing is simple, solid fundamental concepts.
What kind of project do you have in mind?
Some strengths and weaknesses, off the top of my head:
Redis
Very simple key-value "global variable server" (see the sketch after this list)
Very simple (some would say "non-existent") query system
Easily the fastest in this list
Transactions
Data set must fit in memory
Immature clustering, with unclear future (I'm sure it'll be great, but it's not yet decided.)
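A minimal sketch of the first two bullets with the redis-py client (assumes a local Redis server):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# The "global variable server" style: plain SET/GET on keys.
r.set("greeting", "hello")
print(r.get("greeting"))  # b'hello'

# Transactions: a pipeline with transaction=True wraps the queued
# commands in MULTI/EXEC so they are applied atomically.
pipe = r.pipeline(transaction=True)
pipe.incr("page_views")
pipe.lpush("recent_visitors", "user:42")
pipe.execute()
```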
Cassandra
Arguably the most community momentum of the BigTable-like databases
Probably the easiest of this list to manage in big/growing clusters
Support for map/reduce, good for analytics, data warehousing
Multi-datacenter replication
Tunable consistency/availability
No single point of failure
You must know what queries you will run early in the project, to prepare the data shape and indexes
CouchDB
Hands-down the best sync (replication) support, supporting master/slave, master/master, and more exotic architectures
HTTP protocol, browsers/apps can interact directly with the DB partially or entirely. (Sync is also done over HTTP)
After a brief learning curve, a pretty sophisticated query system using JavaScript and map/reduce
Clustered operation (no SPOF, tunable consistency/availability) is currently a significant fork (BigCouch). It will probably merge into Couch but there is no roadmap.
Similarly, clustering and multi-datacenter are theoretically possible (the "exotic" thing I mentioned) however you must write all that tooling yourself at this time.
Append-only file format (for both databases and indexes) consumes disk surprisingly quickly, and you must manually run compaction (vacuuming), which makes a full copy of all records in the database. The same is required for each index file. Again, you have to be your own toolsmith.
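That manual compaction is plain HTTP against CouchDB's API; a sketch (server, credentials, and names are hypothetical):

```python
import requests

BASE = "http://localhost:5984"   # hypothetical CouchDB server
AUTH = ("admin", "secret")
HEADERS = {"Content-Type": "application/json"}

# Compact the database file itself (rewrites it, dropping old revisions).
requests.post(f"{BASE}/mydb/_compact", auth=AUTH, headers=HEADERS)

# Each design document's view index files are compacted separately.
requests.post(f"{BASE}/mydb/_compact/my_design_doc", auth=AUTH, headers=HEADERS)

# Remove index files for views that no longer exist.
requests.post(f"{BASE}/mydb/_view_cleanup", auth=AUTH, headers=HEADERS)
```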
Take a look at http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis He does a good job summing up why you would use one over the other.
I am a complete noob when it comes to the NoSQL movement. I have heard lots about MongoDB and CouchDB. I know there are differences between the two. Which do you recommend learning as a first step into the NoSQL world?
See the following links:
CouchDB Vs MongoDB
MongoDB or CouchDB - fit for production?
DB-Engines - Comparison CouchDB vs. MongoDB
Update: I found a great comparison of NoSQL databases.
MongoDB (3.2)
Written in: C++
Main point: JSON document store
License: AGPL (Drivers: Apache)
Protocol: Custom, binary (BSON)
Master/slave replication (auto failover with replica sets)
Sharding built-in
Queries are javascript expressions
Run arbitrary javascript functions server-side
Has geospatial indexing and queries
Multiple storage engines with different performance characteristics
Performance over features
Document validation
Journaling
Powerful aggregation framework (see the sketch after this list)
On 32bit systems, limited to ~2.5Gb
Text search integrated
GridFS to store big data + metadata (not actually an FS)
Data center aware
Best used: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB. If you wanted CouchDB, but your data changes too much, filling up disks.
For example: For most things that you would do with MySQL or PostgreSQL, but having predefined columns really holds you back.
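As a sketch of the "dynamic queries" and aggregation points above with PyMongo (server, database, and field names are hypothetical):

```python
from pymongo import MongoClient, ASCENDING, DESCENDING

orders = MongoClient("mongodb://localhost:27017").shop.orders

# Define an index up front, then run ad-hoc ("dynamic") queries against it.
orders.create_index([("customer_id", ASCENDING), ("created_at", ASCENDING)])
recent = orders.find({"customer_id": 7}).sort("created_at", DESCENDING).limit(10)

# The aggregation framework covers many jobs map/reduce was once used for.
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]
for row in orders.aggregate(pipeline):
    print(row)
```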
CouchDB (1.2)
Written in: Erlang
Main point: DB consistency, ease of use
License: Apache
Protocol: HTTP/REST
Bi-directional (!) replication, continuous or ad-hoc, with conflict detection; thus, master-master replication (!)
MVCC - write operations do not block reads
Previous versions of documents are available
Crash-only (reliable) design
Needs compacting from time to time
Views: embedded map/reduce
Formatting views: lists & shows
Server-side document validation possible
Authentication possible
Real-time updates via '_changes' (!) (see the sketch below)
Attachment handling
Best used: For accumulating, occasionally changing data, on which pre-defined queries are to be run. Places where versioning is important.
For example: CRM, CMS systems. Master-master replication is an especially interesting feature, allowing easy multi-site deployments.
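The '_changes' feed flagged above is also just HTTP; a sketch of following it from Python (server, database, and credentials are hypothetical):

```python
import json
import requests

# feed=continuous keeps the HTTP connection open and streams one JSON
# line per change as documents are created, updated, or deleted.
resp = requests.get(
    "http://localhost:5984/mydb/_changes",
    params={"feed": "continuous", "since": "now", "include_docs": "true"},
    stream=True,
    auth=("admin", "secret"),
)
for line in resp.iter_lines():
    if line:  # heartbeat lines are empty
        change = json.loads(line)
        print(change["id"], change.get("doc"))
```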
If you are coming from the MySQL world, MongoDB is going to "feel" a lot more natural to you because of its query-like language support.
I think that is what makes it so friendly for a lot of people.
CouchDB is fantastic if you want to utilize the really great master-master replication support with a multi-node setup, possibly in different data centers or something like that.
MongoDB's replication (replica sets) is a master-slave-slave-slave-* setup, you can only write to the master in a replica set and read from any of them.
For a standard site configuration, that is fine. It maps to MySQL usage really well.
But if you are trying to create a global service like a CDN that needs to keep all global nodes in sync while reading from and writing to all of them, something like CouchDB's replication is going to be a huge boon to you.
While MongoDB has a query-like language that you can use and that feels very intuitive, CouchDB takes a map/reduce approach and this concept of views. It feels odd at first, but as you get the hang of it, it really starts feeling intuitive.
Here is a quick overview so it makes some sense:
CouchDB stores all your data in a b-tree
You cannot "query" it dynamically with something like "SELECT * FROM user WHERE..."
Instead, you define discrete "views" of your data... "here is a view of all my users", "here is a view of all users older than 10" "here is a view of all users older than 30" and so on.
These views are defined using map-reduce approach and are defined as JavaScript functions.
When you define a view, the DB feeds every document of the database the view belongs to through your functions, recording the results as the "index" on that data.
There are some basic queries you can do on the views like asking for a specific key (ID) or range of IDs regardless of what your map/reduce function does.
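A sketch of the "users older than 30" view from the list above: the design document is created over HTTP, the map function itself is JavaScript run by CouchDB, and "older than" becomes a key-range query (server, database, and names are hypothetical):

```python
import requests

BASE = "http://localhost:5984"  # hypothetical CouchDB server
AUTH = ("admin", "secret")

# A design document holding one view. Emitting doc.age as the key means
# the view's B-tree index is sorted by age.
design = {
    "views": {
        "by_age": {
            "map": "function (doc) { if (doc.type === 'user') emit(doc.age, doc.name); }"
        }
    }
}
requests.put(f"{BASE}/mydb/_design/users", json=design, auth=AUTH)

# "All users older than 30" is a range query on the view's keys.
resp = requests.get(
    f"{BASE}/mydb/_design/users/_view/by_age",
    params={"startkey": "31"},
    auth=AUTH,
)
print(resp.json()["rows"])
```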
Read through these slides, it's the best clarification of map/reduce in Couch I've seen.
So both of these databases store JSON documents, but CouchDB follows a more "every server is a master and can sync with the world" approach, which is fantastic if you need it, while MongoDB is really the MySQL of the NoSQL world.
So if that sounds more like what you need/want, go for that.
Little differences like Mongo's binary protocol vs the RESTful interface of CouchDB are all minor details.
If you want raw speed and to hell with data safety, you can make Mongo run faster than CouchDB as you can tell it to operate out of memory and not commit things to disk except for sparse intervals.
You can do the same with Couch, but its HTTP-based communication protocol is going to be 2-4x slower than raw binary communication with Mongo in this "speed over everything!" scenario.
Keep in mind that raw crazy insane speed is useless if a server crash or disk failure corrupts and toasts your DB into oblivion, so that data point isn't as amazing as it might seem (unless you are doing real-time trading systems on Wall Street, in which case look at Redis).
Hope that all helps!
Have a look at these links:
MongoDB vs CouchDB (from MongoDB side)
CouchDB vs MongoDB: An attempt for a More Informed Comparison
CouchDB vs. MongoDB Benchmark (performance comparison)
There are now many more NoSQL databases on the market than ever before. I suggest even having a look at the Gartner Magic Quadrant if you're looking for a database that will also be great for enterprise applications based on support, expandability, management, and cost.
http://www.gartner.com/technology/reprints.do?id=1-23A415Q&ct=141020&st=sb
I would like to suggest Couchbase to anyone who's not tried it yet, but not based on the version that is shown in the report (2.5.1) because it is nearly 2 revisions behind where CB Server is today, nearing release of 4.0 in 2H15.
http://www.couchbase.com/coming-in-couchbase-server-4-0
The other part about Couchbase as a vendor/product is that it is a multi-use type of DB. It can act as a pure K/V store, Document Oriented Database with multi-dimensional scaling, Memcached, cache-aside with persistence, and supports ANSI 92 compliant SQL with automatic joins, replication to DR clusters with the push of a button, and even has a mobile component built-in to the ecosystem.
If nothing else, it's worth checking out the latest benchmarks:
http://info.couchbase.com/Benchmark_MongoDB_VS_CouchbaseServer_HPW_BM.html
http://info.couchbase.com/NoSQL-Technical-Comparison-Report.html
Having understood some of the advantages that NoSQL offers (scalability, availability, etc.), I am still not clear why a website would want to use a non-relational database.
Can I get some help on this, preferably with an example?
Better performance
NoSQL databases sometimes have better performance, although this depends on the situation and is disputed.
Adaptability
You can add and remove "columns" without downtime. In most SQL servers, this takes a long time and puts heavy load on the server.
Application design
It is desirable to separate the data storage from the logic. If you join and select things in SQL queries, you are mixing business logic with storage.
NoSQL databases are there to solve several things, mainly:
(buzz) BigData => think TB, PB, etc..
Working with distributed systems / datasets => say you have 42 products: 13 of them will live in the Chicago datacenter, 21 in New York's, and the other 8 somewhere in Japan. But when you query across all 42 products, you don't need to know where they are located: the NoSQL DB does. This also allows you to engage a lot more brainpower (servers) to solve hard computational problems [it does not seem this fits your use case, but it is interesting to note].
Partitioning => having your DB be easily distributed, besides placing those cool 8 products in Japan, also allows for easy data replication, so those 42 products can be replicated with a factor of 3, for example, which means your DB would have 3 copies of every product. Hence if something goes down, no problem => here is a replica available. This is where NoSQL databases actually shine vs. RDBMS. Granted, you can shard, partition, and cluster Oracle / MySQL / PostgreSQL / etc., BUT it is a process several magnitudes more complicated and usually a maintenance headache for most people you'd employ.
BUT to your question:
why a website would want to use a non-relational database
Unfortunately, most of the people I have worked with, met, or chatted with choose NoSQL for their "website" NOT for the reasons above, but simply because it is COOLER to do so. And in fact many projects FAIL / have extreme difficulties for this very reason.
If most NoSQL gurus take their masks off, they will all agree that MOST of the problems (or, as people call them, websites) that developers solve day to day can, and rather should, be solved with a SQL solution such as PostgreSQL, MySQL, etc., with some cool Redis cache layer on top. Only a small subset of problems would REALLY benefit from NoSQL.
I personally love Riak, as I am a firm believer that a NoSQL, fault tolerant DB should have an extremely strong, flexible and naturally distributed foundation => such as Erlang OTP. Plus I am a fan of simplicity. But again, given the problem, I would choose whatever works best, and most of the time I will NEED that consistency ( especially if we are talking about money / financial world / mission critical / etc.. ).
The main reason not to use an SQL database is scalability. The transactional guarantees and the relational model make it almost impossible to scale a database usefully across more than a few machines, especially given the write-heavy workloads generated by modern web applications.
An app like Facebook can't be made to work on a straightforward SQL database, except by massive partitioning and sharding, which requires significant adjustments to the app logic as well. That's why Facebook developed Cassandra.
NoSQL basically means you make do without some SQL-typical features like immediate consistency or easy joins, in exchange for being able to use a database that scales much better.
Conversely, there is no point in using NoSQL if your website never has more than a dozen concurrent users (which is true for the vast majority of all sites).
First, we need to understand what the problem is in your current application:
Transactions
Amount of data
Data structure
NoSQL solves the problems of scalability and availability at the expense of atomicity and consistency.
This basically drives us to the CAP theorem: Eric Brewer noted that of the three properties of shared-data systems – consistency, availability, and tolerance to network partitions – only two can be achieved at any given moment in time.
NOSQL Approach
Schemaless data representation:
Most of them offer schemaless data representation & allow storing semi-structured data.
Data can continue to evolve over time, including adding new fields or even nesting the data (for example, in the case of a JSON representation).
Development time:
No complex SQL queries.
No JOIN statements.
Speed:
Very high-speed delivery, mostly with built-in entity-level caching
Plan ahead for scalability:
Avoiding rework
There are many types of NoSQL databases; web applications typically use document-based databases. A document DB lets us store JSON, XML, YAML, and even Word documents and manipulate them. So NoSQL is an obvious choice, and MongoDB, a document database that supports the JSON format natively, is the most preferred choice of developers and designers.