I'm reading "Seven Databases in Seven Weeks". Could you please explain me the text below:
One downside of a distributed system can be the lack of a single
coherent filesystem. Say you operate a website where users can upload
images of themselves. If you run several web servers on several
different nodes, you must manually replicate the uploaded image to
each web server’s disk or create some alternative central system.
Mongo handles this scenario by its own distributed filesystem called
GridFS.
Why do you need replicate manually uploaded images? Does they mean some of the servers will have linux and some of them Windows?
Do all replicated data storages tend to implement own filesystem?
On the need for data distribution and its intricacies
Let us dissect the example in a bit more detail. Say you have a web application where people can upload images. You fire up your server, save the images to the local machine in /home/server/app/uploads, the users use the application. So far, so good.
Now, your application becomes the next big thing, you have tens of thousands of concurrent users and your single server simply can not handle that load any more. Luckily, aside from the fact that you store the images in the local file system, you implemented the application in a way that you could easily put up another instance and distribute the load between them. But now here comes the problem: the second instance of your application would not have access to the images stored on the first instance – bad thing.
There are various ways to overcome that. Let us take NFS as an example. Now your second instance can access the images, and even store new ones, but that puts all the images on one machine, which sooner or later will run out of disk space.
Scaling storage capacity can easily become a very expensive part of an application. And this is where GridFS comes to help. It uses the rather easy means of MongoDB to distribute data across many machines, a process which is called sharding. Basically, it works like this: Instead of accessing the local filesystem, you access GridFS (and the contained files within) via the MongoDB database driver.
As for the OS: Usually, I would avoid mixing different OSes within a deployment, if at all possible. Nowadays, there is little to no reason for the average project to do so. I assume you are referring to the "different nodes" part of that text. This only refers to the fact that you have multiple machines involved. But they perfectly can run the same OS.
Sharding vs. replication
Note The following is vastly simplified, because going into details would well exceed the scope of one or more books.
The excerpt you quoted mixes two concepts a bit and is not clear enough on how GridFS works.
Lets first make the two involved concepts a bit more clear.
Replication is roughly comparable to a RAID1: The data is stored on two or more machines, and each machine holds all data.
Sharding (also known as "data partitioning") is roughly comparable to a RAID0: Each machine only holds a subset of the data, albeit you can access the whole data set (files in this case) transparently and the distributed storage system takes care of finding the data you requested (and decides where to store the data when you save a file)
Now, MongoDB allows you to have a mixed form, roughly comparable to RAID10: The data is distributed ("partitioned" or "sharded") between two or more shards, but each shard may (and almost always should) consist of a replica set, which is an uneven number of MongoDB instances which all hold the same data. This mixed form is called a "sharded cluster with a replication factor of X", where X denotes the non-hidden members per replica set.
The advantage of a sharded cluster is that there is no single point of failure any more:
Depending on your replication factor, one or more replica set members can fail, and the cluster is still working
There are servers which hold the metadata (which part of the data is stored on which shard, for example). Those are called config servers. As of MongoDB version 3.0.x (iirc), they form a replica set themselves – not much of a problem if a node fails.
You access a sharded cluster via a the mongos sharded cluster query router of which you usually have one per instance of your application (and most often on the same server as your application instance). But: most drivers can be given multiple mongos instances to connect to. So if one of those mongos instances fails, the driver will happily use the next one you configured.
Another advantage is that in case you need to add additional storage or have more IOPS than your current system can handle, you can add another shard: MongoDB will take care of distributing the existing data between the old shards and the new shard automagically. The details on how this is done are covered in the introduction to Sharding in the MongoDB docs.
The third advantage – and the one that has the most impact, imho – is that you can distribute (and replicate) data on relatively cheap commodity hardware, whereas most other technologies offering the benefits of GridFS on a sharded cluster require you to have specialized and expensive hardware.
A disadvantage is of course that this setup only is feasible if you have a lot of data, since many machines are necessary to set up a sharded cluster:
At least 3 config servers
At least a single shard, which should consist of a replica set. The minimal setup would be two data bearing nodes plus an arbiter
But: in order to use GridFS in general, you do not even need a replica set ;).
To stay within our above example: Both instances of your application could well access the same MongoDB instance holding a GridFS.
Do all replicated data storages tend to implement own filesystem?
Replicated? Not necessarily. There is DRBD for example, which could be described as "RAID1 over ethernet".
Assuming we have the same mixup of concepts here as we had above: Distributed file systems by their very definition implement a file system.
In this case,IMHO, author was stating that each web server has own disk storage, not shared with others - having that - upload path could be /home/server/app/uploads and as it is part of server filesystem is not shared at all as a kind of security with service provider. To populate those we need to have a script/job which will sync data to other places behind the scenes.
This scenario could be a case to use GridFS with mongo.
How gridFS works:
GridFS divides the file into parts, or chunks 1, and stores each
chunk as a separate document. By default, GridFS uses a chunk size of
255 kB; that is, GridFS divides a file into chunks of 255 kB with the
exception of the last chunk. The last chunk is only as large as
necessary. Similarly, files that are no larger than the chunk size
only have a final chunk, using only as much space as needed plus some
additional metadata.
In reply to comment:
BSON is binary format, and mongo has special replication mechanism for replicating collection data (gridFS is a special set of 2 collections). It uses OpLog to send diffs toother servers in replica set. More here
Any comments welcome!
Related
I am still not able to relate in real-time how nosql is beneficial whereas we have indexes too in traditional RDBMS's. If someone can suggest columnar databases advantages in real application particularly in terms of using structure, semistructured or unstructured data.
Largely, it depends on what you want your datastore to do. If you want to be able to scale to meet storage or operational demands, a RDBMS can only take you so far.
It comes down how you can scale to meet demand. A RDBMS is really only capable of scaling vertically. That is, add more RAM, add more disk, etc. A distributed (NoSQL) database makes scaling easier by allowing you to add more machine instances. This is known as scaling horizontally.
Here's an example using Cassandra:
Let's say I have a 3 node cluster, and my keyspace (database) is also configured with a replication factor (RF) of 3. This means that each node is responsible for 100% of the data. I load my data, and it takes up 100GB of disk space (on each node). Now, while I might have 300GB of data total in my cluster, a single copy of my data is 100GB.
So my product team comes to me and says they need to double the amount of data they have. I know that I built their 3 node cluster with 200GB drives. If I did nothing, those drives would pretty much fill-up (and if they didn't they wouldn't leave room for much else).
Now it's up to me to scale the cluster to meet their space demands. I'll start by adding 3 new nodes to the cluster (for a total of 6), but I'll leave my RF at 3. This makes each node responsible for 50% of the data, or 50GB. When my product team loads more data to meet their "doubling" requirement, each node should climb back up to about 100GB. A single copy of the data is now 200GB. But with each node responsible for 50%, each 200GB drive still only has 100GB.
Example #2:
Let's say that the cluster above with 6 nodes is capable of supporting an operational load of 10,000 operations per second (ops). My product team comes to me again, saying that for the holiday season they project needing to support 20,000 ops. As the current cluster can only support half of that, it will choke under the intense throughput, and one or more nodes may crash.
As Cassandra scales linearly, the way to achieve this is to (again) double the size of the cluster. So I increase it from 6 nodes to 12 nodes, while still maintaining my RF of 3. After running some performance testing, they verify that it can indeed support 20,000 ops. As a single copy of my data is 200GB, the total data footprint remains 600GB. With 12 nodes, each node is now responsible for only 25% of the data, or 50GB.
So scalability is the advantage. But how about modeling the data? The main idea in distributed database modeling, is two-fold:
Build a table structure which is keyed to distribute well. We don't want uneven amounts of data on each node.
Build the key on the table so that it matches our query requirements.
One of the drawbacks of a NoSQL database, is that your query patterns become restricted. In an effort to cut down on network time, you want to ensure that your query can be served by a single node.
This usually means using natural keys, as those are more in-line with what you are asking of your data. Surrogate keys (alpha, numerical, or both) distribute well, but aren't really useful for querying. User "Bob Jones" might be id "3582346556230" in my system. But when I want to query Bob's data, I'll probably never want to ask for it by "3582346556230," because that doesn't mean anything to the application or the context in which the data is used.
Also, you want your data to have structure. Unstructured data is un-queryable data. Simple as that. If you want unstructured data to be queryable, you need to parse-out its identifying aspects to be used as keys. You don't want to "search" or run SELECT * FROM queries. Full table scans in NoSQL databases are even more resource consuming than their RDBMS counterparts, because they have to check each node, sort through replicas, and thus incurs extra network time.
NoSQL databases give you the ability to scale (for increases in data or demand). But it's important to note that their scalability can make some things (which a RDBMS might be good at), more difficult than you're used to.
The R in RDBMS, relational, is the biggest thing missing from Mongo. There's very little to no way to make the database understand how entries in different tables collections relate to each other. One of the big strengths of RDBMSs is the ability to define constraints which the database will enforce, most typically foreign key constraints which ensure that an id in one table refers to an existing id in another table.
One requirement for the database to be able to enforce such constraints is obviously that everything needs to go through one source of truth and there needs to be one central entity cross-checking the data; it cannot be decentralised since discrepancies between two different primary sources can lead to data inconsistencies.
In Mongo, each data blob is pretty much independent. It doesn't refer to other entries in any way enforced by the database. Mongo also has weak to no ACID guarantees, meaning there's little protection against race condition inserts or updates. In a word: Mongo makes little guarantees with regards to data consistency and mostly offloads these kinds of concerns to the application layer. That allows it to work more decentralised.
E.g. a good way to scale Mongo is to have many secondary servers which replicate a primary server for read-only access. There's no guarantee that the primary and secondaries will be in sync at any given time, it may take a couple of seconds for data written to the primary to trickle to the secondaries. But this allows you to have a virtually unlimited number of secondary read-only servers, which is great for scaling a database under heavy read load.
The way specifically Mongo handles its clusters also allows it to have a very high uptime, as the cluster will reorganise itself into primaries and secondaries automatically if a server goes down. This even allows for rolling maintenance without any client downtime.
Not having to enforce complex constraints or transaction consistency during writing also allows a more fire-and-forget style of writing to the database, which can be much faster. Again, at the cost of allowing inconsistent data. Which is why most writing pretty much means atomically updating a single document in a collection with no guarantees about other documents, which is something of a different paradigm than RDBMS transactional updates across many tables.
I would not recommend Mongo for storing things like a financial ledger, which heavily relies on transactional guarantees for consistency. However, things like Twitter are a perfect case for it: many independent snippets of data which must be read by a massive number of clients.
For my application I need to move old data periodically from a mongodb server to another one (ie, two distinct servers). I also want to be able to query those data as if they were the same database.
In short terms, I want to be able to see two mongodb instances (on two different servers) as one and be able to control when and where the data is stored.
I read about the concept of sharding and chunks and rapidly saw the moveChunk function which can easily do what I want.
The problem is that it seems to be impossible to configure such architecture in mongodb. Am I missing something here?
Archiving Deleted Documents
For the problem of keeping the deleted documents, you have no option to achieve this with build-in features/mechanisms like sharding or replication. The only way to do it is to handle that case manually, e.g. holding a separate collection for deleted documents, and simply move documents to that collections instead of deleting them.
For your global problem of moving data you have the following two options:
Sharding
Using sharding you will split your data into pieces which will be stored on two (in your case) different servers. In this scenario you can use the moveChunk method as you have mentioned. But this method is very tricky, as for that you will need to disable the built-in automatic balancer to have a full manual control over your chunks. Anyway, this is not recommended by the MongoDB:
Only use the moveChunk in special circumstances such as preparing your sharded cluster for an initial ingestion of data, or a large bulk import operation. In most cases allow the balancer to create and balance chunks in sharded clusters
Besides this will only allow to split data, and finally, to get to your goal, you will end up with one full and one empty shard.
Replication
The replication approach is much more safe and easy to achieve. You can simply configure a replica set and add your second server to that set.
If the data is too big, you can configure your second server as hidden. So that no reads will be performed towards that server, so no inconsistent data will be received. After the data replication is finished, you will have the copy of your data on both servers.
As for using both servers as a single server, if you need to balance the requests between these tow, you can configure your readPreference to secondary, which will assure that all the reads are being sent to the secondary server, and writes by default are done on the primary.
In this case your code will be unaware of what server you are querying. You will just run your client methods, and the rest will be done behind the drivers.
Conclusion
So my advice would be to use the replication approach as more clean, pain-less and safe solution.
I am working on an application in which there is a pretty dramatic difference in usage patterns between "hot" data and other data. We have selected MongoDB as our data repository, and in most ways it seems to be a fantastic match for the kind of application we're building.
Here's the problem. There will be a central document repository, which must be searched and accessed fairly often: it's size is about 2 GB now, and will grow to 4GB in the next couple years. To increase performance, we will be placing that DB on a server-class mirrored SSD array, and given the total size of the data, don't imagine that memory will become a problem.
The system will also be keeping record versions, audit trail, customer interactions, notification records, and the like. that will be referenced only rarely, and which could grow quite large in size. We would like to place this on more traditional spinning disks, as it would be accessed rarely (we're guessing that a typical record might be accessed four or five times per year, and will be needed only to satisfy research and customer service inquiries), and could grow quite large, as well.
I haven't found any reference material that indicates whether MongoDB would allow us to place different databases on different disks (were're running mongod under Windows, but that doesn't have to be the case when we go into production.
Sorry about all the detail here, but these are primary factors we have to think about as we plan for deployment. Given Mongo's proclivity to grab all available memory, and that it'll be running on a machine that maxes out at 24GB memory, we're trying to work out the best production configuration for our database(s).
So here are what our options seem to be:
Single instance of Mongo with multiple databases This seems to have the advantage of simplicity, but I still haven't found any definitive answer on how to split databases to different physical drives on the machine.
Two instances of Mongo, one for the "hot" data, and the other for the archival stuff. I'm not sure how well Mongo will handle two instances of mongod contending for resources, but we were thinking that, since the 32-bit version of the server is limited to 2GB of memory, we could use that for the archival stuff without having it overwhelm the resources of the machine. For the "hot" data, we could then easily configure a 64-bit instance of the database engine to use an SSD array, and given the relatively small size of our data, the whole DB and indexes could be directly memory mapped without page faults.
Two instances of Mongo in two separate virtual machines Would could use VMWare, or something similar, to create two Linux machines which could host Mongo separately. While it might up the administrative burden a bit, this seems to me to provide the most fine-grained control of system resource usage, while still leaving the Windows Server host enough memory to run IIS and it's own processes.
But all this is speculation, as none of us have ever done significant MongoDB deployments before, so we don't have a great experience base to draw upon.
My actual question is whether there are options to have two databases in the same mongod server instance utilize entirely separate drives. But any insight into the advantages and drawbacks of our three identified deployment options would be welcome as well.
That's actually a pretty easy thing to do when using Linux:
Activate the directoryPerDB config option
Create the databases you need.
Shut down the instance.
Copy over the data from the individual database directories to the different block devices (disks, RAID arrays, Logical volumes, iSCSI targets and alike).
Mount the respective block devices to their according positions beyond the dbpath directory (don't forget to add the according lines to /etc/fstab!)
Restart mongod.
Edit: As a side note, I would like to add that you should not use Windows as OS for a production MongoDB. The available filesystems NTFS and ReFS perform horribly when compared to ext4 or XFS (the latter being the suggested filesystem for production, see the MongoDB production notes for details ). For this reason alone, I would suggest Linux. Another reason is the RAM used by rather unnecessary subsystems of Windows, like the GUI.
I'm planning a product that will process updates from multiple data feeds. Input-data is guesstimated to be a total of 100Mbps stream containing 100 byte sized messages. These messages contain several data fields that needs to be checked for correlation with the existing data set within the application. If a input-message correlates with an existing data record, then the input-message will update the existing data-record, if not: it will create a new record. It is assumed that data are updated every 3 seconds in average.
The correlation process is assumed to be a bottleneck, and thus I intend to make our product able to run balanced in multiple processes if needed (most likely on a separate hardware or VM). Somewhat in the vicinity of Space-based architecture. I'd then like a shared storage between my processes so that all existing data records are visible to all the running processes. The shared storage will have to fetch possible candidates for correlation through a query/search based on some attributes (e.g. elevation). It will have to offer configuring warm redundancy, and a possibility to store snapshots every 5 minutes for logging.
Everything seems to be pointing towards MongoDB, but I'd like a confirmation from you that MongoDB will meet my needs. So do you think it is a go?
-Thank you
NB: I am not considering a relational database because we want to focus all coding in our application, instead of having to make 'stored procedures'/'functions' in a separate environment to optimize the performance of our system. Further, the data is diverse and I don't want to try normalize it into a schema.
Yes, MongoDB will meet your needs. I think the following aspects of your description are particularly relevant in your DB selection decision:
1. An update happens every 3 seconds
MongoDB has a database level write-lock (usually short lived) that blocks read operations. This means that you want will want to ensure that you have enough memory to fit your working set, and you will generally not run into any write-lock issues. Note that bulk inserts will hold the write lock for longer.
If you are sharding, you will want to consider shard keys that allow for write scaling i.e. distribute writes on different shards.
2. Shared storage for multiple processes
This is a pretty common scenario; in fact, many MongoDB deployments are expected be accessed from multiple processes concurrently. Unlike the write-lock, the read-lock does not block other reads.
3. Warm redundancy
Supported through MongoDB replication. If you'd like to read from secondary server(s) you will need to set the Read Preference to secondaryPreferred in your driver.
I'm reading about ArangoDB and it is more interesting but I can't find where in the documentation how ArangoDB scales. Does ArangoDB scale and can it use sharding like MongoDB or CouchDB?
EDIT
ArangoDB supports sharding since Version 2.0.
Version 3.0 will bring VelocyPack, which is a binary JSON representation optimized for compactness, parseability and composeability. It supersedes the shape concept / shaped JSON.
/EDIT
I am the chief architect of ArangoDB.
monkegjinni is right, ArangoDB did not support sharding, but replication. Why?
Short version:
Offering a support for fairly complex data models like graphs and documents gets into conflicts with how sharding works. However, with the efficiency of modern SSD and computers, we believe that almost all projects no longer need sharding. Today's computer will easily store all data on a single nodes. What these projects need is replication for load distribution which is supported by ArangoDB.
Long version:
There are actually to separate scaling issues.
The first issue is distributing the request over several servers to balance the request load.
ArangoDB will support this through synchronous replication of writes and distribution of the read requests.
Note that most database systems follow a very similar path, i.e., they support distributing the requests either with restricted consistency guarantees or they allow writes only on one node and distribute the read requests. They have this restriction because distributing write requests and supporting full consistency is impossible to do efficiently. And doing it inefficiently will negate the gain that we wanted to achieve through distribution.
The second issue is distributing the data over several servers to allow larger datasets.
ArangoDB does not support distributing the data over several servers.
We have made this decision, because distributing the data over several servers always comes at a price.
This price can be very explicit. For example it can be that the data model is very limited. This is the route that key value stores such as Dynamo or RIAK have taken. Here the data model and the supported queries are so simple, that it is always possible to direct a query to the server (or the small number of servers) on which the requested value live.
Note that we do believe that this approach is valid for some applications (e.g. Amazons database). But we believe that the number of applications that truly need to store so much data that they must distribute it over a large number of servers and must therefore restrict the access pattern to key-value is very small.
Or the price can be hidden. This is for example the case if the data is distributed and the database system allows general queries. In that case the query must be distributed over all servers (because the data you are looking for may live on any of the servers). That makes the queries inefficient.
The ArangoDB approach is rather to squeeze the most onto one server (well ArangoDB supports multiple servers - but to support availability). For this it uses two main strategies.
One strategy is to make use of SSDs. Note that the capacity of SSDs is growing larger at an incredible rate (you can buy a Terabyte of SSD for by far less money that a second server would cost you). And endurance (the total amount of data that can be written to a SSD) goes up to Petabytes (now that vendors finally get the wear leveling algorithms right) - so reliability of SSDs is no longer an issue. And the performance of those SSDs is very nice (closer to main memory than to ordinary disks).
The other strategy is to store the data efficiently. ArangoDB uses shapes to store documents: A shape is the information which attributes and attribute types a document has - all document with the same shape share the representation of this information. This means that documents can be stored in less space than the JSON or BSON representation would require.
As I understand, it does not allow sharding (prior to version 2.0), but replication.
From the link
AvocadoDB effortlessly permits replications. We like the “zero-admin principle”. Making replications with AvocadoDB is really easy: Insert IP address and go!
Following replication types are intended for version 2:
master-master synchronous,
master-master asynchronous,
master-slave synchronous,
master-slave asynchronous