I am working on a solution deployed on an Amazon EC2 instance whose region is set to US West. The solution uses MongoDB for data storage and contains a web service that is used by a mobile application. The user base of the mobile application is split 40:60 between the US and Asia, so I need to set up another EC2 instance in the Asia Pacific region to lower latency and connection time for those users.
Since the data storage is located on an instance in US West, how would I set up a new instance in Asia Pacific that can share the same data with the US West instance? I am open to moving the MongoDB database elsewhere, but I do not want to change to a different NoSQL solution.
There are various solutions here. I will try to describe a few.
Replica Sets
Perhaps the easiest solution would be to use a Replica Set, with two servers in US West and one in Asia. Replica Sets in MongoDB require a minimum of three nodes to work, and since your data (and the primary, which handles all writes) already lives in US West, it makes sense to keep two members there.
Now, with just these three nodes you only get the data closer to Asia on one of the nodes. You then need to use Read Preferences to instruct your application to read from either a US West node or the Asia node. I have written an article about how PHP deals with those Read Preferences at http://derickrethans.nl/readpreferences.html; other language drivers offer similar options.
All drivers maintain connections to each of the Replica Set nodes, so connection overhead should not be much of a problem, and at least you can read from the node closest to the client to reduce latency. Writes always have to go to the primary, which will likely be in US West.
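For example, here is a minimal sketch of such a read preference in Python with PyMongo (the host names, replica set name, and tag values are hypothetical; your driver and tagging scheme may differ):

    # Hypothetical hosts/tags; directs reads to the nearest eligible member.
    from pymongo import MongoClient
    from pymongo.read_preferences import Nearest

    client = MongoClient(
        "mongodb://us-west-1.example.com:27017,us-west-2.example.com:27017,"
        "ap-southeast-1.example.com:27017/?replicaSet=rs0",
        # Prefer a member tagged for the Asia data centre if one is available,
        # otherwise fall back to whichever member has the lowest latency.
        read_preference=Nearest(tag_sets=[{"dc": "ap-southeast"}, {}]),
    )

    db = client.myapp
    doc = db.users.find_one({"username": "example"})  # read served by a nearby node
    print(doc)

Writes issued through the same client are still routed to the primary, as described above.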
Pros: Fairly easy to set up; only three nodes required
Cons: Only helps with directing reads, not writes
Sharding
Sharding is a method in MongoDB that allows you to split your whole data set into smaller pieces, so that a huge dataset can fit into MongoDB without being constrained by the resources of a single server. Typically, a sharded set-up consists of at least two shards, each containing a (3-node) replica set, but it is also possible for a replica set to consist of only one node, in which case you would end up with two shards, each containing one data node.
Sharding in MongoDB supports "Tag Aware Sharding" (http://mongodb.onconfluence.com/display/DOCS/Tag+Aware+Sharding), which makes it possible to route specific documents to specific shards depending on a field in your document. If your documents, for example, contain user IDs or country codes, you can use that field to direct documents to the correct shard.
Setting this up is, however, not very easy, as it requires quite a good understanding of sharding with MongoDB. There is a really nice introduction at http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
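To give a rough idea, here is a hedged sketch using PyMongo against a mongos router. It uses the zone sharding commands from later MongoDB releases, which superseded the tag-aware sharding commands linked above; the shard names, namespace, and country code are purely illustrative:

    # Run against a mongos router; shard names and namespace are hypothetical.
    from pymongo import MongoClient
    from bson.min_key import MinKey
    from bson.max_key import MaxKey

    admin = MongoClient("mongodb://mongos.example.com:27017").admin

    admin.command("enableSharding", "myapp")
    admin.command("shardCollection", "myapp.users", key={"country": 1, "_id": 1})

    # Associate each shard with a zone...
    admin.command("addShardToZone", "shard-us", zone="US")
    admin.command("addShardToZone", "shard-asia", zone="ASIA")

    # ...and pin a shard key range (here: all documents with country "JP") to a zone.
    admin.command(
        "updateZoneKeyRange", "myapp.users",
        min={"country": "JP", "_id": MinKey()},
        max={"country": "JP", "_id": MaxKey()},
        zone="ASIA",
    )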
Pros: Allows you to have data localized to one specific location for both read and write
Cons: Not easy to set up; you need at least two data nodes, config servers, and mongos proxy servers.
Hope this helps!
If your application is read-heavy, I would use MongoDB's Replica Sets.
Use Case: Currently we have "offices" in places around the world with very intermittent internet access. Sometimes it's great, but sometimes it can go off for hours at a time.
Right now we are using CouchDB that has a master database in the cloud and we have documents with an office_id. We then do a filtered sync to all of these "offices" to only send over documents that have that office_id and that are less than a month old.
With CouchDB you can edit these documents and add new ones on the offline CouchDB server in these offices. At this time, we have a cron that does a replication sync to our master database in the cloud every 5 minutes or so.
Problem: While CouchDB makes it really easy to sync, I'm afraid of some scalability issues with CouchDB. Before it gets too late, I'm trying to explore different database avenues and ways I could do this.
Amazon seems to be doubling down on their DocumentDB offering, which supports the MongoDB API, so I'm curious: has anyone done multi-master syncing with MongoDB or a NoSQL equivalent?
I don't want to get into a scenario where scaling puts me in a corner.
Amazon DocumentDB uses shared storage, which isn't at all what you are after and doesn't solve your problem. MongoDB would be a very poor choice for your scenario, as master-master replication is a really hard problem to deal with. CouchDB (which you already use) is the first that comes to mind, but you should really search for that explicit feature in any replacement you consider. Also note that a lot of multi-master setups assume that a partition occurs between the masters while clients can still connect to all or some of them, which isn't your case, because your clients only have a single reachable master.
Another option would be to build the replication yourself using a queue system or something similar, but that would require even more infrastructure at each location (since the problem is the internet connection going down), and it would only be "easy" if different offices rarely or never edit the same documents, because having to merge conflicts manually is a pain.
You don't explain what your scaling worries are, but I don't really see MongoDB or any other NoSQL database having scalability traits that are much different from CouchDB's.
EDIT: What you are actually after is Optimistic Replication (also known as Lazy Replication or Eventual Consistency).
Since you mentioned a "NoSQL equivalent", I would like to explain how Couchbase can accomplish this in two different ways:
1) Cross Data Center Replication (XDCR) - allows clusters of different sizes to synchronize data between them. The replication can be paused and resumed without any issues (conflicts can be resolved via timestamp or document revision). Replications can be unidirectional or bidirectional, and you can also filter which documents should be synchronized between clusters using Filtering Expressions (a simplified query system): https://docs.couchbase.com/server/6.5/xdcr-reference/xdcr-filtering-expressions.html
2) Sync Gateway - a middleware originally designed to synchronize your main database with databases on the edge. In this architecture, we assume that the connection will be down most of the time. https://blog.couchbase.com/couchbase-mobile-embedded-java-write-throughput/ Your application could simply consume the Sync Gateway stream and insert the changes into the cluster replica (although I think XDCR should be enough for you).
Couchbase started as a fork of CouchDB a long time ago and most of the code has been rewritten, but some core concepts are still present in Couchbase.
Finally, a big plus of moving to Couchbase is that it is a distributed and highly scalable database (from 1 to more than 100 nodes in a single cluster), and you will be able to query your data using N1QL: https://query-tutorial.couchbase.com/tutorial/#1
I'm reading "Seven Databases in Seven Weeks". Could you please explain the text below to me?
One downside of a distributed system can be the lack of a single coherent filesystem. Say you operate a website where users can upload images of themselves. If you run several web servers on several different nodes, you must manually replicate the uploaded image to each web server’s disk or create some alternative central system. Mongo handles this scenario by its own distributed filesystem called GridFS.
Why do you need to replicate the uploaded images manually? Do they mean that some of the servers will run Linux and some of them Windows?
Do all replicated data stores tend to implement their own filesystem?
On the need for data distribution and its intricacies
Let us dissect the example in a bit more detail. Say you have a web application where people can upload images. You fire up your server, save the images to the local machine in /home/server/app/uploads, the users use the application. So far, so good.
Now, your application becomes the next big thing, you have tens of thousands of concurrent users and your single server simply can not handle that load any more. Luckily, aside from the fact that you store the images in the local file system, you implemented the application in a way that you could easily put up another instance and distribute the load between them. But now here comes the problem: the second instance of your application would not have access to the images stored on the first instance – bad thing.
There are various ways to overcome that. Let us take NFS as an example. Now your second instance can access the images, and even store new ones, but that puts all the images on one machine, which sooner or later will run out of disk space.
Scaling storage capacity can easily become a very expensive part of an application. This is where GridFS comes in. It uses MongoDB's fairly simple means of distributing data across many machines, a process called sharding. Basically, it works like this: instead of accessing the local filesystem, you access GridFS (and the files contained within) via the MongoDB database driver.
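As a rough illustration, the sketch below (database and file names are hypothetical) stores and retrieves an upload through PyMongo's gridfs module instead of the local filesystem:

    # Hypothetical database and file names; any app instance can read the file back.
    from pymongo import MongoClient
    import gridfs

    client = MongoClient("mongodb://localhost:27017")
    fs = gridfs.GridFS(client.myapp)  # backed by the fs.files / fs.chunks collections

    # Store an uploaded image; GridFS splits it into chunks behind the scenes.
    with open("avatar.png", "rb") as f:
        file_id = fs.put(f, filename="avatar.png", metadata={"contentType": "image/png"})

    # Any application instance connected to the same deployment sees the same file.
    image_bytes = fs.get(file_id).read()
    print(len(image_bytes), "bytes retrieved")

Every application instance that can reach the MongoDB deployment sees the same files, which is exactly what the local-disk approach was missing.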
As for the OS: usually, I would avoid mixing different OSes within a deployment, if at all possible. Nowadays, there is little to no reason for the average project to do so. I assume you are referring to the "different nodes" part of that text. This only refers to the fact that multiple machines are involved; they can perfectly well run the same OS.
Sharding vs. replication
Note The following is vastly simplified, because going into details would well exceed the scope of one or more books.
The excerpt you quoted mixes two concepts a bit and is not clear enough on how GridFS works.
Let's first make the two concepts involved a bit clearer.
Replication is roughly comparable to a RAID1: The data is stored on two or more machines, and each machine holds all data.
Sharding (also known as "data partitioning") is roughly comparable to a RAID0: each machine only holds a subset of the data, but you can access the whole data set (files in this case) transparently, and the distributed storage system takes care of finding the data you requested (and decides where to store the data when you save a file).
Now, MongoDB allows you to have a mixed form, roughly comparable to RAID10: the data is distributed ("partitioned" or "sharded") between two or more shards, but each shard may (and almost always should) consist of a replica set, which is an odd number of MongoDB instances that all hold the same data. This mixed form is called a "sharded cluster with a replication factor of X", where X denotes the number of non-hidden members per replica set.
The advantage of a sharded cluster is that there is no single point of failure any more:
Depending on your replication factor, one or more replica set members can fail, and the cluster is still working
There are servers which hold the metadata (which part of the data is stored on which shard, for example). Those are called config servers. As of MongoDB 3.2, they form a replica set themselves, so a failing node is not much of a problem.
You access a sharded cluster via the mongos sharded cluster query router, of which you usually have one per instance of your application (most often on the same server as that application instance). But most drivers can be given multiple mongos instances to connect to, so if one of those mongos instances fails, the driver will happily use the next one you configured.
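As a small illustration (host names are hypothetical), most drivers accept a list of mongos routers, for example with PyMongo:

    # Hypothetical mongos hosts; losing one router does not take the application down.
    from pymongo import MongoClient

    client = MongoClient(
        "mongodb://mongos-a.example.com:27017,mongos-b.example.com:27017,"
        "mongos-c.example.com:27017"
    )
    print(client.myapp.users.estimated_document_count())  # served by any reachable mongos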
Another advantage is that in case you need to add additional storage or have more IOPS than your current system can handle, you can add another shard: MongoDB will take care of distributing the existing data between the old shards and the new shard automagically. The details on how this is done are covered in the introduction to Sharding in the MongoDB docs.
The third advantage – and the one that has the most impact, imho – is that you can distribute (and replicate) data on relatively cheap commodity hardware, whereas most other technologies offering the benefits of GridFS on a sharded cluster require you to have specialized and expensive hardware.
A disadvantage is, of course, that this setup is only feasible if you have a lot of data, since quite a few machines are necessary to set up a sharded cluster:
At least 3 config servers
At least one shard, which should consist of a replica set; the minimal setup would be two data-bearing nodes plus an arbiter
But: in order to use GridFS in general, you do not even need a replica set ;).
To stay within our above example: Both instances of your application could well access the same MongoDB instance holding a GridFS.
Do all replicated data stores tend to implement their own filesystem?
Replicated? Not necessarily. There is DRBD, for example, which could be described as "RAID1 over Ethernet".
Assuming we have the same mixup of concepts here as we had above: Distributed file systems by their very definition implement a file system.
In this case, IMHO, the author was stating that each web server has its own disk storage, not shared with the others. Given that, the upload path could be /home/server/app/uploads, and since it is part of that server's filesystem it is not shared at all (possibly as a security measure on the service provider's side). To populate the other servers, you would need a script or job that syncs the data to the other locations behind the scenes.
This scenario could be a case to use GridFS with mongo.
How GridFS works:
GridFS divides the file into parts, or chunks, and stores each chunk as a separate document. By default, GridFS uses a chunk size of 255 kB; that is, GridFS divides a file into chunks of 255 kB with the exception of the last chunk. The last chunk is only as large as necessary. Similarly, files that are no larger than the chunk size only have a final chunk, using only as much space as needed plus some additional metadata.
In reply to comment:
BSON is a binary format, and Mongo has a special replication mechanism for replicating collection data (GridFS is a special set of 2 collections). It uses the oplog to send diffs to the other servers in the replica set. More here
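As a small, hedged illustration (the database name is hypothetical), you can inspect those two ordinary collections directly; they are replicated through the oplog like any other collection:

    # Hypothetical database name; fs.files holds metadata, fs.chunks holds the data.
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").myapp

    file_doc = db["fs.files"].find_one()  # per-file metadata (length, chunkSize, filename, ...)
    if file_doc:
        chunk_count = db["fs.chunks"].count_documents({"files_id": file_doc["_id"]})
        print(file_doc["filename"], file_doc["length"], "bytes in", chunk_count, "chunks")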
Any comments welcome!
Sitecore uses MongoDB for tracking and analytics. If the production environment is split into several geographic locations, particularly in different continents, how should xDB be configured? If xDB can only have one writeable primary instance in any replica set, does this not force all front-end CD servers globally to write to the same node in one particular data centre? This doesn't seem ideal.
You are correct that eventually all of the data comes down to one Mongo replica set, but that replica set itself can be geographically distributed as well. You can also shard MongoDB if you wish to. See the MongoDB manual on geographic redundancy for information on this type of scaling. From the manual:
While replica sets provide basic protection against single-instance failure, replica sets whose members are all located in a single facility are susceptible to errors in that facility. Power outages, network interruptions, and natural disasters are all issues that can affect replica sets whose members are colocated. To protect against these classes of failures, deploy a replica set with one or more members in a geographically distinct facility or data center to provide redundancy.
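As a rough sketch of what such a deployment could look like (host names and priorities are hypothetical, and the xDB-specific configuration is out of scope here), the replica set configuration might place one member in a second data centre:

    # Hypothetical hosts; the remote member gets priority 0 so it provides
    # redundancy without being elected primary across the WAN.
    from pymongo import MongoClient

    config = {
        "_id": "xdb-rs",
        "members": [
            {"_id": 0, "host": "dc1-a.example.com:27017", "priority": 2},
            {"_id": 1, "host": "dc1-b.example.com:27017", "priority": 1},
            {"_id": 2, "host": "dc2-a.example.com:27017", "priority": 0},
        ],
    }

    # Connect directly to one of the members and initiate the set once.
    client = MongoClient("dc1-a.example.com", 27017, directConnection=True)
    client.admin.command("replSetInitiate", config)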
Also note that with xDB in a geographically distributed environment, you'll need to have a session state server for each CD cluster. This gathers all the user's information during the session, prior to the session-completion flush to the collection database. The Sitecore guide on 'clustered environments' has a diagram and some information on this geographic configuration. From the guide:
Each cluster could contain two or more content delivery instances, each with its own session state server. You could also group clusters together in the same place or spread them across different geographical locations.
I asked the platform developers exactly this question at the Sitecore User Group London a week ago; they responded that all the data for xDB is internally kept in UTC format.
We have also had problems with servers in different time zones previously, but that time it affected the Event Queue (it did not work when servers were in different time zones), so keeping all your servers in the same time zone with their clocks synchronized should do the job.
Here is official guidance for that from Sitecore:
https://doc.sitecore.net/sitecore%20experience%20platform/utc/settings%20supporting%20utc%20implementation
I want to implement MongoDB as a distributed database, but I cannot find good tutorials for it. Whenever I search for "distributed database" with MongoDB, I get links about sharding, so I am confused: are they the same thing?
Generally speaking, if you have a read-heavy system, you may want to use replication, which is 1 primary with at most 50 secondaries. The secondaries share the read load while the primary takes care of writes. It is an auto-failover system, so when the primary goes down, one of the secondaries takes over and becomes the new primary.
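A minimal sketch of that read/write split with PyMongo (the URI and collection names are hypothetical):

    # Writes go to the primary; reads may be served by secondaries.
    from pymongo import MongoClient, ReadPreference

    client = MongoClient(
        "mongodb://node1.example.com,node2.example.com,node3.example.com/?replicaSet=rs0"
    )

    db = client.shop
    db.orders.insert_one({"item": "book", "qty": 1})  # always routed to the primary

    reporting = db.orders.with_options(read_preference=ReadPreference.SECONDARY_PREFERRED)
    print(reporting.count_documents({}))  # served by a secondary when one is available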
Sharding, however, is more flexible. All the shards share the write load and the read load; that is to say, the data is distributed across different shards. And each shard can itself consist of a replication system, with auto-failover working as described above.
I would choose replication first because it's simple and basically enough for most scenarios. Once it's no longer enough, you can convert from replication to sharding.
There is also another discussion of differences between replication and sharding for your reference.
Just some perspective on distributed databases:
In the early nineties, a lot of applications were desktop based and had a local database containing MBs or GBs of data.
Now, with the advent of web-based applications, there can be millions of users who use and store their data, and this data can run into GBs, TBs or PBs. Storing all of it on a single server is economically expensive, so it is spread across a cluster of servers (commodity hardware) and horizontally partitioned. Sharding is another term for horizontal partitioning of data.
For example, you have a Customer table which contains 100 rows and you want to shard it across 4 servers. You can pick 'key'-based sharding, in which customers will be distributed as follows: SHARD-1 (1-25), SHARD-2 (26-50), SHARD-3 (51-75), SHARD-4 (76-100).
Sharding can be done in two ways (a short sketch of both follows this list):
Hash based
Key based
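A tiny, hypothetical sketch of the two approaches for the 100-row Customer example above (the ranges and shard count are illustrative only):

    # Contrasts key/range-based and hash-based shard selection.
    import hashlib

    NUM_SHARDS = 4

    def range_shard(customer_id: int) -> int:
        """Key/range based: ids 1-25 -> shard 0, 26-50 -> shard 1, and so on."""
        return (customer_id - 1) // 25

    def hash_shard(customer_id: int) -> int:
        """Hash based: a hash spreads ids evenly, at the cost of range locality."""
        digest = hashlib.md5(str(customer_id).encode()).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    for cid in (1, 26, 51, 76):
        print(cid, "range ->", range_shard(cid), "hash ->", hash_shard(cid))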
We are currently designing a SaaS application that has a subscriber/user-based mode of operation. For example, a single subscriber can have 5, 10 or up to 25 users in their account, depending on what type of package they are on.
At the moment we are going with a single database per tenant approach. This has several advantages for us from the standpoint of the application.
I have read about the connection limits associated with Mongo and I am a little confused and worried. I was hoping someone could clarify it for me in simple terms, as I haven't worked much with Mongo.
From what I understand, there is a hard limit of 20,000 connections available on the mongod process and the mongos processes.
How does that translate to this multi-tenant approach? I am basically trying to assess how I would deploy the application in terms of replica sets, and whether sharding is necessary, so that I don't hit these limits. How does one handle a scenario where, for example, you have 10,000 tenants with multiple users each, which would exceed the limit?
Our application generally wouldn't need sharding, as each tenant's collections wouldn't reach the point where they would need to be sharded. From what I understand, though, MongoDB will create databases in a round-robin fashion on each shard and will distribute the load, which may help with the connection issues.
This is me just trying to make sense of what I've read and I'm hoping someone can clear this up for me.
Thanks in advance
EDIT
If I just add replica sets, will this alleviate this problem? Even though only the primary can accept writes from what I understand?
You just have to store a database connection in a pool and reuse it if you access the same database again. This will limit the number of connections to a reasonable figure, unless you are using tens of thousands of databases, which wouldn't be a good idea anyway.
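For instance, a minimal sketch (tenant names and the URI are hypothetical) that reuses a single client, and therefore a single connection pool, for all tenant databases:

    # One MongoClient per process; it maintains an internal connection pool
    # that is shared across all tenant databases.
    from pymongo import MongoClient

    client = MongoClient("mongodb://db.example.com:27017", maxPoolSize=100)

    def tenant_db(tenant_id: str):
        """Return a handle to the tenant's database; no new connections are opened here."""
        return client[f"tenant_{tenant_id}"]

    tenant_db("acme").users.insert_one({"name": "alice"})
    tenant_db("globex").users.insert_one({"name": "bob"})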