MongoDB as file storage - mongodb

I'm trying to find the best solution for scalable storage of big files. File sizes can vary from 1-2 megabytes up to 500-600 gigabytes.
I have found some information about Hadoop and its HDFS, but it looks a little complicated, because I don't need Map/Reduce jobs or many of its other features. Now I'm thinking of using MongoDB and its GridFS as the file storage solution.
And now the questions:
What will happen with GridFS when I try to write a few files concurrently? Will there be any lock for read/write operations? (I will use it only as file storage)
Will files from GridFS be cached in RAM, and how will it affect read-write performance?
Maybe there are some other solutions that can solve my problem more efficiently?
Thanks.

I can only answer for MongoDB here, I will not pretend I know much about HDFS and other such technologies.
The GridFS implementation is entirely client side, within the driver itself. This means there is no special loading or understanding of the context of file serving within MongoDB itself; effectively, MongoDB does not even understand that they are files ( http://docs.mongodb.org/manual/applications/gridfs/ ).
This means that querying for any part of the files or chunks collection will result in the same process as it would for any other query, whereby it loads the data it needs into your working set ( http://en.wikipedia.org/wiki/Working_set ) which represents a set of data (or all loaded data at that time) required by MongoDB within a given time frame to maintain optimal performance. It does this by paging it into RAM (well technically the OS does).
Another point to take into consideration is that this is driver implemented. This means that the specification can vary, however, I don't think it does. All drivers will allow you to query for a set of documents from the files collection which only houses the files meta data allowing you to later serve the file itself from the chunks collection with a single query.
However that is not the important thing, you want to serve the file itself, including its data; this means that you will be loading the files collection and its subsequent chunks collection into your working set.
With that in mind we have already hit the first snag:
Will files from GridFS be cached in RAM, and how will it affect read-write performance?
The read performance of small files could be awesome, directly from RAM; the writes would be just as good.
For larger files, not so. Most computers do not have 600 GB of RAM, yet it is quite normal to house a 600 GB partition of a single file on a single mongod instance. This creates a problem: in order to be served, that file needs to fit into your working set, but it is far bigger than your RAM. At this point you could get page thrashing ( http://en.wikipedia.org/wiki/Thrashing_%28computer_science%29 ), whereby the server is page faulting 24/7 trying to load the file. The writes are no better.
The only way around this is to start spreading a single file across many shards :\.
Note: one more thing to consider is that the default size of a "chunk" in the chunks collection is 256 KB, so that's a lot of documents for a 600 GB file. This setting can be changed in most drivers.
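To put "a lot of documents" into numbers, here is a back-of-the-envelope sketch. It assumes binary units (1 GB = 1024^3 bytes) and the 256 KB chunk size mentioned above; the function name is made up for illustration.

```python
def gridfs_chunk_count(file_size_bytes: int, chunk_size_bytes: int = 256 * 1024) -> int:
    """Number of chunk documents needed to hold a file of the given size."""
    if chunk_size_bytes <= 0:
        raise ValueError("chunk size must be positive")
    # Integer ceiling division: a partially filled last chunk still needs a document.
    return (file_size_bytes + chunk_size_bytes - 1) // chunk_size_bytes


print(gridfs_chunk_count(600 * 1024**3))  # 2457600 chunk documents for one 600 GB file
```

So a single 600 GB file would be roughly 2.5 million documents in the chunks collection at that chunk size.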
What will happen with GridFS when I try to write a few files concurrently? Will there be any lock for read/write operations? (I will use it only as file storage)
GridFS, being only a specification, uses the same locks as any other collection: both read and write locks, at the database level (2.2+) or at the global level (pre-2.2). The two do interfere with each other; after all, how could you ensure a consistent read of a document that is being written to?
That being said, how much contention you actually see depends on your scenario specifics: traffic, the number of concurrent reads and writes, and many other things we have no idea about.
Maybe there are some other solutions that can solve my problem more efficiently?
I personally have found that S3 (as @mluggy said) in reduced redundancy format works best, storing only a small portion of metadata about the file within MongoDB. It is much like using GridFS but without the chunks collection: let S3 handle all the distribution, backup and other stuff for you.
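As a rough sketch of that split, the metadata document kept in MongoDB might look like the one built below. The field names, the `storage` sub-document, and the function name are all assumptions made for illustration; the actual upload to S3 and the insert into MongoDB (done with whatever client libraries you use) are deliberately left out.

```python
import hashlib
from datetime import datetime, timezone


def build_file_metadata(data: bytes, filename: str, bucket: str, key: str) -> dict:
    """Build the small metadata record; the bytes themselves would live in S3."""
    return {
        "filename": filename,
        "length": len(data),
        # A content hash lets you verify the object later or spot duplicates.
        "sha256": hashlib.sha256(data).hexdigest(),
        # Hypothetical pointer to where the object was uploaded.
        "storage": {"provider": "s3", "bucket": bucket, "key": key},
        "uploadDate": datetime.now(timezone.utc),
    }


doc = build_file_metadata(b"hello", "a.txt", "my-bucket", "files/a.txt")
print(doc["length"], doc["storage"]["key"])
```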
Hopefully I have been clear, hope it helps.
Edit: Unlike what I accidentally said, MongoDB does not have a collection-level lock; it is a database-level lock.

Have you considered saving meta data onto MongoDB and writing actual files to Amazon S3? Both have excellent drivers and the latter is highly redundant, cloud/cdn-ready file storage. I would give it a shot.

I'll start by answering the first two:
There is a write lock when writing in to GridFS, yes. No lock for reads.
The files won't be cached in memory when you query them, but their metadata will.
GridFS may not be the best solution for your problem. Write locks can become something of a pain when you're dealing with this type of situation, particularly for huge files. There are other databases out there that may solve this problem for you. HDFS is a good choice, but as you say, it is very complicated. I would recommend considering a storage mechanism like Riak or Amazon's S3. They're more oriented around being storage for files, and don't end up with major drawbacks. S3 and Riak both have excellent admin facilities, and can handle huge files. Though with Riak, last I knew, you had to do some file chunking to store files over 100 MB. Despite that, it is generally a best practice to do some level of chunking for huge files. There are a lot of bad things that can happen when transferring files into DBs, from network timeouts to buffer overflows. Either way, your solution is going to require a fair amount of tuning for massive file sizes.

Related

Is it efficient to store images inside MongoDB using GridFS?

I know how to do it, but I wonder if it's effective. As I know MongoDB has very efficient clusters and I can flexibly control the collections and the servers they reside on. The only problem is the size of the files and the speed of accessing them through MongoDB.
Should I explore something like Apache Hadoop or if I intelligently cluster MongoDB, will I get similar access speed results?
GridFS is provided for convenience, it is not designed to be the ultimate binary blob storage platform.
MongoDB imposes a limit of 16 MB on each document it stores. This is unlike, for example, many relational databases which permit much larger values to be stored.
Since many applications deal with large binary blobs, MongoDB's solution to this problem is GridFS, which roughly works like this:
For each blob to be inserted, a metadata document is inserted into the metadata collection.
Then, the actual blob is split into chunks (255 KB each by default, and always below the 16 MB document limit) and uploaded as a sequence of documents into the blob collection.
MongoDB drivers provide helpers for writing and reading the blobs and the metadata.
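The steps above can be sketched in plain Python without a database. This is only an illustration of the layout, not a driver implementation: the field names `files_id`, `n`, `data`, `length` and `chunkSize` follow the GridFS specification, while the function names and the use of a UUID in place of a driver-generated ObjectId are assumptions.

```python
import uuid


def split_blob(data: bytes, filename: str, chunk_size: int):
    """Return (metadata document, list of chunk documents) for one blob."""
    file_id = uuid.uuid4().hex  # stand-in for the ObjectId a driver would generate
    files_doc = {
        "_id": file_id,
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
    }
    chunk_docs = [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]
    return files_doc, chunk_docs


def join_blob(chunk_docs) -> bytes:
    """Reassemble a blob by concatenating its chunks in order of `n`."""
    ordered = sorted(chunk_docs, key=lambda chunk: chunk["n"])
    return b"".join(chunk["data"] for chunk in ordered)


files_doc, chunks = split_blob(b"x" * 1000, "example.bin", chunk_size=300)
print(len(chunks), join_blob(chunks) == b"x" * 1000)  # 4 True
```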
Thus, on first glance, the problem is solved - the application can store arbitrarily large blobs in a straightforward manner. However, digging deeper, GridFS has the following issues/limitations:
On the server side, documents storing blob chunks aren't stored separately from other documents. As such they compete for cache space with the actual documents. A database which has both content documents and blobs is likely to perform worse than a database that has only content documents.
At the same time, since the blob chunks are stored in the same way as content documents, storing them is generally expensive. For example, S3 is much cheaper than EBS storage, and GridFS would put all data on EBS.
To my knowledge there is no support for parallel writes or parallel reads of the blobs (writing/reading several chunks of the same blob at a time). This can in principle be implemented, either in MongoDB drivers or in an application, but as far as I know this isn't provided out of the box by any driver. This limits I/O performance when the blobs are large.
Similarly, if a read or write fails, the entire blob must be re-read or re-written as opposed to just the missing fragment.
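As with parallel I/O, partial recovery could in principle be built at the application level. A minimal sketch, assuming the application can list which chunk indexes were already stored for a file (the helper name and its inputs are hypothetical, not part of any driver):

```python
def missing_chunks(file_length: int, chunk_size: int, stored_indexes) -> list:
    """Return the chunk indexes that still need to be written for a file."""
    expected = (file_length + chunk_size - 1) // chunk_size  # ceiling division
    present = set(stored_indexes)
    return [n for n in range(expected) if n not in present]


# A 1000-byte file in 300-byte chunks needs indexes 0..3; index 2 was lost.
print(missing_chunks(1000, 300, [0, 1, 3]))  # [2]
```

Only the listed chunks would then have to be re-sent instead of the whole blob.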
Despite these issues, GridFS may be a fine solution for many use cases:
If the overall data size isn't very large, the negative cache effects are limited.
If most of the blobs fit in a single document, their storage should be quite efficient.
The blobs are backed up and otherwise transferred together with the content documents in the database, improving data consistency and reducing the risk of data loss/inconsistencies.
A good practice is to upload the image somewhere (your own server or the cloud) and then store only the image URL in MongoDB.
Anyway, I did a little investigating. The short conclusion is: if you need to store user avatars you can use MongoDB, but only if it's a single avatar (you can't store many blobs inside MongoDB); if you need to store videos or just many heavy files, then you need something like CephFS.
Why do I think so? When I was testing MongoDB with media files on a slow instance, files of up to 10 MB (usually about 1 MB) took up to 3000 milliseconds to come back. That's an unacceptably long time. When there were a lot of files (100+), it could turn into a pain. A real pain.
Ceph is designed just for storing files. To store petabytes of information. That's what's needed.
How do you implement this in a real project? If you use an object modeling layer for MongoDB (Mongoose), you can just add methods to the database objects that access Ceph and do what you need. You can make methods such as "load file", "delete file", "count quantity" and so on, and then use it all together as usual. Don't forget to maintain Ceph and add servers as needed, and everything will work perfectly. The files themselves should be accessed only through your web server, not directly; i.e. when a user needs a file, the web server should send the request to Ceph and return Ceph's response to the user.
I hope I helped more than just myself. I'll go add Ceph to my tags. Good luck!

Is MongoDB a good choice for storing a huge set of text files?

I'm currently building a system (with GCP) for storing large set of text files of different sizes (1kb~100mb) about different subjects. One fileset could be more than 10GB.
For example:
dataset_about_some_subject/
- file1.txt
- file2.txt
...
dataset_about_another_subject/
- file1.txt
- file2.txt
...
The files are for NLP. Because the pre-processed data are saved separately, the original files will not be accessed frequently after pre-processing, so saving all files in MongoDB seems unnecessary.
I'm considering:
saving all files to some cloud storage,
saving file information such as name and path to MongoDB as JSON.
The folders above turn into:
{
  name: dataset_about_some_subject,
  path: path_to_cloud_storage,
  files: [
    {
      name: file1.txt
      ...
    },
    ...
  ]
}
When any fileset is needed, search its name in MongoDB and read the files from cloud storage.
Is this a valid way? Will there be any I/O speed problem?
Or is there any better solution for this?
And I've read about Hadoop. Maybe this is a better solution?
Or maybe not. My data is not that big.
As far as I remember, MongoDB has a maximum object size of 16 MB, which is below the maximum size of the files (100 MB). This means that, unless one splits, storing the original files in plaintext JSON strings would not work.
The approach you describe, however, is sensible. Storing the files on cloud storage such as S3 or Azure is common, not very expensive, and does not require a lot of maintenance compared to running your own HDFS cluster. I/O is best when you perform the computations on machines of the same provider and make sure those machines are in the same region as the data.
Note that document stores, in general, are very good at handling large collections of small documents. Retrieving file metadata in the collection would thus be most efficient if you store the metadata of each file in a separate object (rather than in an array of objects in the same document), and have a corresponding index for fast lookup.
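A small sketch of that restructuring: it turns one nested fileset document (in the shape shown in the question) into one metadata document per file. The function name, the `dataset` field, and the way the per-file path is derived are assumptions for illustration; in MongoDB you would then index the fields you look up by, for example the dataset and file name.

```python
def flatten_fileset(fileset: dict) -> list:
    """Turn one fileset document with a files array into one document per file."""
    return [
        {
            "dataset": fileset["name"],
            "name": entry["name"],
            # Assumed layout: files sit directly under the fileset's storage path.
            "path": f"{fileset['path']}/{entry['name']}",
        }
        for entry in fileset["files"]
    ]


nested = {
    "name": "dataset_about_some_subject",
    "path": "path_to_cloud_storage",
    "files": [{"name": "file1.txt"}, {"name": "file2.txt"}],
}
for doc in flatten_fileset(nested):
    print(doc)
```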
Finally, there is another aspect to consider, namely, whether your NLP scenario will process the files by scanning them (reading them all entirely) or whether you need random access or lookup (for example, a certain word). In the first case, which is throughput-driven, cloud storage is a very good option. In the latter case, which is latency-driven, there are document stores like Elasticsearch that offer good fulltext search functionality and can index text out of the box.
I recommend storing large files with one of the storage services below. They also support multi-regional access through a CDN to ensure fast file access.
AWS S3: https://aws.amazon.com/tw/s3/
Azure Blob: https://azure.microsoft.com/zh-tw/pricing/details/storage/blobs/
GCP Cloud Storage: https://cloud.google.com/storage
You can rest assured that for the metadata storage you propose in mongodb, speed will not be a problem.
However, for storing the files themselves, you have various options to consider:
Cloud storage: fast setup, low initial cost, medium cost over time (compare vendor prices), data transfer over the public network for every access (might be a performance problem).
MongoDB GridFS: already in place, operation cost varies, data transfer is just as fast as from MongoDB itself.
Hadoop cluster: high initial hardware and setup cost, lower cost over time. Data transfer in the local network (provided you build it on-premise). Specialized administration skills needed. Possibility to use the cluster for parallel calculations (i.e. this is not only storage, it is also computing power). As a rule of thumb: if you are not going to store more than 500 TB, this is not worthwhile.
If you are not sure about the amount of data you will have to cover, and just want to get started, I recommend starting out with GridFS, but encapsulate it in a way that lets you easily swap out the storage.
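One possible way to do that encapsulation is a small storage interface that the rest of the application codes against. The sketch below is an assumption about how this could look, not a prescribed design: the class and method names are invented, and the two backends shown (in-memory and local directory) are stand-ins; a GridFS or cloud-storage backend would implement the same three methods.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class FileStore(ABC):
    """Minimal interface the application depends on, whatever the backend."""

    @abstractmethod
    def save(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def load(self, name: str) -> bytes: ...

    @abstractmethod
    def delete(self, name: str) -> None: ...


class InMemoryStore(FileStore):
    """Backend for tests and experiments; nothing is persisted."""

    def __init__(self):
        self._files = {}

    def save(self, name, data):
        self._files[name] = data

    def load(self, name):
        return self._files[name]

    def delete(self, name):
        del self._files[name]


class LocalDirStore(FileStore):
    """Backend that keeps each file in one directory (flat names only)."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, name, data):
        (self.root / name).write_bytes(data)

    def load(self, name):
        return (self.root / name).read_bytes()

    def delete(self, name):
        (self.root / name).unlink()


store: FileStore = InMemoryStore()  # swap the backend here, callers stay unchanged
store.save("file1.txt", b"some text")
print(store.load("file1.txt"))
```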
I have another answer: as you say, 10GB is really not big at all. You may want to also consider the option of storing it on your local computer (or locally on one single machine in the cloud), simply on your regular file system, and executing in parallel on your cores (Hadoop, Spark will do this too).
One way of doing it is to save the metadata as a single large text file (or JSON Lines, Parquet, CSV...), the metadata for each file on a separate line, then have Hadoop or Spark parallelize over this metadata file, and thus process the actual files in parallel.
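For a single machine, the same idea can be sketched without Hadoop or Spark. The example below swaps them for a plain thread pool from the Python standard library and counts whitespace-separated tokens per file; the JSON Lines field names (`name`, `path`) and all function names are assumptions. For CPU-heavy processing, a process pool would be the more natural choice than threads.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def read_text(path: str) -> str:
    """Default loader: read a text file from the local file system."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def process_entry(line: str, load_text=read_text) -> dict:
    """Process the file described by one JSON Lines metadata record."""
    meta = json.loads(line)
    text = load_text(meta["path"])
    return {"name": meta["name"], "tokens": len(text.split())}


def process_all(metadata_lines, load_text=read_text, workers: int = 4) -> list:
    """Run process_entry over every non-empty metadata line in parallel."""
    lines = [line for line in metadata_lines if line.strip()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input lines.
        return list(pool.map(lambda line: process_entry(line, load_text), lines))


# Usage with an in-memory stand-in for the file system:
fake_files = {"a.txt": "one two three", "b.txt": "four"}
metadata = ['{"name": "a", "path": "a.txt"}\n', '{"name": "b", "path": "b.txt"}\n']
print(process_all(metadata, load_text=fake_files.__getitem__))
```

With real data you would pass an open metadata file as `metadata_lines` and keep the default loader.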
Depending on your use case, this might turn out to be faster than on a cluster, or not exceedingly slower, especially if your execution is CPU-heavy. A cluster has clear benefits when the problem is that you cannot read from the disk fast enough, and for workloads executed occasionally, this is a problem that one starts having from the TB range.
I recommend this excellent paper by Frank McSherry:
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

Which is an advantage of storing images directly in MongoDB instead of serverside folder

I suppose that storing images (or any binary data - pdfs, movies, etc. ) outside of DB (MongoDB in my case) and putting them in public server folder can be at least faster (no encoding, decoding and things around that).
But since there is such an option in MongoDB, I'd like to know the advantages of using it, and the use cases where that approach is recommended.
Replication: It is pretty easy to set up a highly available replica set. So even if one machine goes down, the files would still be available. While this is possible to achieve by various means for a simple filesystem as well, the overhead for this might well eliminate the performance advantage (if there is any: MongoDB has quite sophisticated internal caching going on). Furthermore, setting up DRBD and ensuring consistency and availability requires considerably more knowledge and administrative effort than with MongoDB. Plus, you'd need to have your DB be highly available as well.
Scalability: It can get quite complicated and/or costly when your files exceed the storage capacity of a single node. While in theory you can scale vertically, there is a certain point where the bang you get for the buck decreases and scaling horizontally makes more sense. However, with a filesystem approach, you'd have to manage which file is located at which node, how and when to balance and whatnot. MongoDB's GridFS in a sharded environment does this for you automatically and – more important – transparently. You neither have to reinvent the wheel nor maintain it.
Query by metadata: While in theory you can do this with a database and links to a filesystem, GridFS comes with means to insert arbitrary metadata and query by it. Again, this saves you from reinventing the wheel. An interesting example is that finding duplicates is quite easy with GridFS: a hash sum is automatically calculated for each file in GridFS. With a rather simple aggregation, you can find dupes and then deal with them accordingly.
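A sketch of what such a duplicate search could look like. It assumes the files documents carry an `md5` field, as the answer describes (newer drivers may not populate it). The pipeline is shown only as data, using the standard `$group`, `$sum`, `$push` and `$match` operators; the pure-Python function mirrors the same grouping so the logic can be tried without a database. Names are illustrative.

```python
from collections import defaultdict

# Aggregation pipeline one might run against the GridFS files collection.
DUPLICATES_PIPELINE = [
    {"$group": {"_id": "$md5", "count": {"$sum": 1}, "ids": {"$push": "$_id"}}},
    {"$match": {"count": {"$gt": 1}}},
]


def find_duplicates(files_docs) -> dict:
    """Pure-Python equivalent: map each repeated hash to the ids sharing it."""
    groups = defaultdict(list)
    for doc in files_docs:
        groups[doc["md5"]].append(doc["_id"])
    return {digest: ids for digest, ids in groups.items() if len(ids) > 1}


docs = [
    {"_id": 1, "md5": "aaa"},
    {"_id": 2, "md5": "bbb"},
    {"_id": 3, "md5": "aaa"},
]
print(find_duplicates(docs))  # {'aaa': [1, 3]}
```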
When you have a large amount of binary data and want to take advantage of sharding, you can store the binary data in MongoDB using GridFS. But from a performance point of view, as you pointed out, storing the images in a file system is obviously the better way.

mongoDB vs relational databases when data can't fit into memory?

First of all, I apologize for my potentially shallow understanding of NoSQL architecture (and databases in general) so try to bear with me.
I'm thinking of using MongoDB to store resources associated with a UUID. The resources can be things such as large image files (tens of megabytes), so it makes sense to store them as files and store just links in my database along with the associated metadata. There's also the added flexibility of decoupling the actual location of the resource files, so I can use a different third party to store the files if I need to.
Now, one document which describes a resource would be about 1 kB. At first I expect a couple hundred thousand resource documents, which would equal some hundreds of megabytes in database size, easily fitting into server memory. But in the future I might have to scale this to the order of tens of MILLIONS of documents. This would be tens of gigabytes which I can't squeeze into server memory anymore.
Only the index could still fit in memory, being around a gigabyte or two. But if I understand correctly, I'd have to read from disk every time I did a lookup on a UUID. Is there a substantial speed benefit from MongoDB over a traditional relational database in such a situation?
BONUS QUESTION: is there an existing, established way of doing what I'm trying to achieve? :)
MongoDB doesn't suddenly become slow the second the entire database no longer fits into physical memory. MongoDB currently uses a storage engine based on memory mapped files. This means data that is accessed often will usually be in memory (OS managed, but assume a LRU scheme or something similar).
As such it may not slow down at all at that point, or only slightly; it really depends on your data access patterns. It is a similar story with indexes: if you balance your index appropriately (right-balanced) and your use case allows it, you can have a huge index with only a fraction of it in physical memory and still get very decent performance, with the majority of index hits happening in physical memory.
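The effect of access patterns can be illustrated with a toy simulation. It assumes a strict LRU cache, which is only an approximation of what the OS page cache does, and a synthetic access pattern in which every tenth access touches a never-before-seen "cold" key while the rest cycle over a small "hot" set; all names and numbers are made up for illustration.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity: int) -> float:
    """Fraction of accesses served from an LRU cache of the given capacity."""
    cache = OrderedDict()
    hits = total = 0
    for key in accesses:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / total if total else 0.0


def skewed_accesses(count: int):
    """Every tenth access is a unique cold key; the others cycle over hot keys."""
    for i in range(count):
        if i % 10 == 0:
            yield ("cold", i)
        else:
            yield ("hot", i % 50)


# Cache larger than the hot set: most accesses hit, although the full
# data set (hot plus cold keys) is far larger than the cache.
print(lru_hit_rate(skewed_accesses(10_000), capacity=100))
# Cache smaller than the hot set: the same workload thrashes.
print(lru_hit_rate(skewed_accesses(10_000), capacity=10))
```

What matters in this model is whether the frequently accessed part fits in memory, not whether the whole data set does.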
Because you're talking about UUIDs this might all be a bit hard to achieve, since there's no guarantee that the same limited group of users generates the vast majority of throughput. In those cases sharding really is the most appropriate way to maintain quality of service.
This would be tens of gigabytes which I can't squeeze into server memory anymore.
That's why MongoDB gives you sharding to partition your data across multiple mongod instances (or replica sets).
In addition to considering sharding, or maybe even before, you should also try to use covered indexes as much as possible, especially if that fits your use cases.
This way you do not HAVE to load entire documents into memory. Your indexes can help out.
http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields#RetrievingaSubsetofFields-CoveredIndexes
If you have to display your entire document all the time based on the id, then the general rule of thumb is to attempt to keep the working set in memory.
http://blog.boxedice.com/2010/12/13/mongodb-monitoring-keep-in-it-ram/
This is one of the resources that talks about that. There is a video on mongodb's site too that speaks about this.
By sizing the RAM so that the working set fits in memory, and also looking at sharding, you can improve the scalability of your app over time. You do not have to shard right away; you can always add sharding later.
Again, these are not absolute statements but general guidelines; you should think through your usage patterns and make sure that they are relevant to what you are doing.
Personally, I have not had the need to fit everything in ram.

Is GridFS fast and reliable enough for production?

I develop a new website and I want to use GridFS as storage for all user uploads, because it offers a lot of advantages compared to a normal filesystem storage.
Benchmarks with GridFS served by nginx indicate that it's not as fast as a normal filesystem served by nginx.
Benchmark with nginx
Is anyone out there, who uses GridFS already in a production environment, or would use it for a new project?
I use GridFS at work on one of our servers, which is part of a price-comparison website with honorable traffic stats (around 25k visitors per day). The server doesn't have much RAM (2 GB), and the CPU isn't really fast either (Core 2 Duo 1.8 GHz), but the server has plenty of storage space: 10 TB (SATA) in a RAID 0 configuration. The job the server is doing is very simple:
Each product on our price-comparer has an image (there are around 10 million products according to our product db). If the image is not yet present in the grid, the server's job is to download it, resize it, store it in GridFS, and deliver it to the visitor's browser; if it is already stored in the grid, the server simply delivers it. So this could be called a 'traditional CDN schema'.
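The flow described here is a cache-or-fetch pattern. A minimal sketch in Python (not the author's PHP script), with a plain dict standing in for GridFS and the download and resize steps passed in as hypothetical callables:

```python
def get_image(product_id, store: dict, download, resize) -> bytes:
    """Serve from the store if present; otherwise fetch, resize, store, serve."""
    cached = store.get(product_id)
    if cached is not None:
        return cached
    image = resize(download(product_id))
    store[product_id] = image  # next request for this product is served directly
    return image


downloads = []

def fake_download(product_id):
    downloads.append(product_id)
    return b"original-image-bytes"

def fake_resize(raw: bytes) -> bytes:
    return raw[:8]  # pretend the first bytes are the thumbnail

grid = {}
get_image("p1", grid, fake_download, fake_resize)
get_image("p1", grid, fake_download, fake_resize)
print(downloads)  # ['p1']: the second request never hit the origin
```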
We have stored and processed 4 million images on this server since it went up. The resize and store work is done by a simple PHP script... but for sure, a Python script, or something like Java, could be faster.
Current data size : 11.23g
Current storage size : 12.5g
Indices : 5
Index size : 849.65m
About the reliability: this is very reliable. The server isn't under load, the index size is OK, and queries are fast.
About the speed: for sure, it is not as fast as local file storage, maybe 10% slower, but fast enough to be used in real time even when the image needs to be processed, which in our case is very PHP dependent. Maintenance and development times have also been reduced: it became so simple to delete one or multiple images, just query the db with a simple delete command. Another interesting thing: when we rebooted our old server, with local file storage (so millions of files in thousands of folders), it sometimes hung for hours because the system was performing a file integrity check (this really took hours...). We do not have this problem any more with GridFS; our images are now stored in big MongoDB chunks (2 GB files).
So... in my opinion... yes, GridFS is fast and reliable enough to be used in production.
As mentioned, it might not be as fast as an ordinary filesystem, but it gives you many advantages over ordinary filesystems which I think are worth giving up a bit of speed for.
Ultimately, with sharding, you might reach a point however where the GridFS storage actually becomes the faster option as opposed to an ordinary filesystem and a single node.
Heads-up on repairs for larger DBs though - a new system we're developing, mongo didn't cleanly exit, and repairing the 7TB GridFS looks like it will take 130 hrs.
Because of this, I think I'll look at switching to OpenStack Swift or Ceph.
Still, until then it was good. And the nginx-gridfs module is sweet.
mdirolf's nginx-gridfs module is great and fairly easy to get set up. We're using it in production at paint.ly to serve all of the paintings and there have been no problems so far.
I don't recommend using gridfs unless you know what you are doing.
GridFS is just an abstraction layer which splits files into chunks and stores them in two collections. More files, more overhead. If you expect files to be roughly the same size, not exceeding 32 MB or so, you are on the right track.
Do not try to store large files on gridfs. Why?
Drivers in different languages may read the whole file (i.e. all chunks) when reading only a small part of it.
Modifying the file may affect all chunks and increase database load.
If your file store keeps growing, you will have to decide whether to shard GridFS. Be careful! Consistency is not guaranteed while sharding is being initialized!
If you are planning a read-heavy project, consider loading the files into documents directly (if 16 MB or less in size) or choose another cluster filesystem, and link the filename/inode to your logic.
Hope this helps.