Does Snappy/Zlib compression in MongoDB only reduce network bandwidth, or does it also reduce the size of documents saved in the DB?

Recently, for one of our projects, we needed compression for MongoDB documents and found that there are multiple compression algorithms such as snappy, zlib, etc.
Most of the articles on these only talk about network compression and reduced bandwidth usage, but don't mention the space saved due to compression.
My question is: do these compression options save disk space too, or is the data decompressed before being stored in the DB?

The compression that you configure on the MongoClient is responsible only for network traffic between client and server. To control how the data is stored on the server side, you have to configure the storage engine, see here. Also you can check this question.
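To illustrate the difference, here is a minimal sketch in Python with PyMongo (connection string, database and collection names are placeholders): the `compressors` option only affects the wire protocol, while the WiredTiger `block_compressor` setting controls how documents are compressed on disk.

```python
from pymongo import MongoClient

# Wire compression: only client<->server traffic is compressed.
client = MongoClient("mongodb://localhost:27017", compressors="snappy")
db = client["mydb"]

# On-disk compression: a per-collection WiredTiger block compressor.
# (The server-wide default is set via storage.wiredTiger.collectionConfig.blockCompressor.)
db.create_collection(
    "events",
    storageEngine={"wiredTiger": {"configString": "block_compressor=zlib"}},
)
```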

Related

Is it efficient to store images inside MongoDB using GridFS?

I know how to do it, but I wonder whether it's effective. As far as I know, MongoDB has very efficient clustering, and I can flexibly control the collections and the servers they reside on. The only concern is the size of the files and the speed of accessing them through MongoDB.
Should I explore something like Apache Hadoop, or, if I cluster MongoDB intelligently, will I get similar access-speed results?
GridFS is provided for convenience, it is not designed to be the ultimate binary blob storage platform.
MongoDB imposes a limit of 16 MB on each document it stores. This is unlike, for example, many relational databases which permit much larger values to be stored.
Since many applications deal with large binary blobs, MongoDB's solution to this problem is GridFS, which roughly works like this:
For each blob to be inserted, a metadata document is inserted into the metadata collection.
Then, the actual blob is split into chunks (each well under the 16 MB document limit) and uploaded as a sequence of documents into the blob collection.
MongoDB drivers provide helpers for writing and reading the blobs and the metadata (a minimal example follows below).
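A minimal sketch of that flow using PyMongo's GridFS helper (the database and file names here are made up for illustration):

```python
from pymongo import MongoClient
import gridfs

db = MongoClient("mongodb://localhost:27017")["media"]
fs = gridfs.GridFS(db)  # backed by the fs.files and fs.chunks collections

# Writing: the driver inserts one metadata document into fs.files and
# splits the payload into chunk documents in fs.chunks.
with open("report.pdf", "rb") as f:
    file_id = fs.put(f, filename="report.pdf", contentType="application/pdf")

# Reading: the driver looks up the metadata document, then streams the chunks back.
data = fs.get(file_id).read()
```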
Thus, at first glance, the problem is solved: the application can store arbitrarily large blobs in a straightforward manner. However, digging deeper, GridFS has the following issues/limitations:
On the server side, documents storing blob chunks aren't stored separately from other documents. As such they compete for cache space with the actual documents. A database which has both content documents and blobs is likely to perform worse than a database that has only content documents.
At the same time, since the blob chunks are stored in the same way as content documents, storing them is generally expensive. For example, S3 is much cheaper than EBS storage, and GridFS would put all data on EBS.
To my knowledge there is no support for parallel writes or parallel reads of the blobs (writing/reading several chunks of the same blob at a time). This can in principle be implemented, either in MongoDB drivers or in an application, but as far as I know this isn't provided out of the box by any driver. This limits I/O performance when the blobs are large.
Similarly, if a read or write fails, the entire blob must be re-read or re-written as opposed to just the missing fragment.
Despite these issues, GridFS may be a fine solution for many use cases:
If the overall data size isn't very large, the negative cache effects are limited.
If most of the blobs fit in a single document, their storage should be quite efficient.
The blobs are backed up and otherwise transferred together with the content documents in the database, improving data consistency and reducing the risk of data loss/inconsistencies.
Good practice is to upload the image somewhere (your own server or the cloud) and then store only the image URL in MongoDB.
Anyway, I did a little investigating. The short conclusion is: if you need to store user avatars, you can use MongoDB, but only if it's a single avatar (you shouldn't store many blobs inside MongoDB); if you need to store videos, or just many large files, then you need something like CephFS.
Why do I think so? When I was testing MongoDB with media files on a slow instance, files weighing up to 10 MB (usually about 1 MB) were coming back in up to 3,000 milliseconds. That's an unacceptably long time, and when there were a lot of files (100+), it could turn into a pain. A real pain.
Ceph is designed precisely for storing files, petabytes of them. That's what's needed.
How do you implement this in a real project? If you use an ODM for MongoDB (such as Mongoose), you can just add methods to the database objects that access Ceph and do what you need. You can add methods like "load file", "delete file", "count quantity" and so on, and then use it all together as usual (a rough sketch of the pattern follows below). Don't forget to maintain Ceph and add servers as needed, and everything will work perfectly. The files themselves should be accessed only through your web server, not directly, i.e. the web server should send a request to Ceph when the user needs the file and return Ceph's response to the user.
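A rough sketch of that pattern, shown here in Python rather than Mongoose: the file bytes go to an S3-compatible Ceph RGW endpoint via boto3, and only the object key plus metadata is stored in MongoDB. The endpoint URL, bucket name, credentials and field names are placeholders, not anything from the original answer.

```python
import boto3
from pymongo import MongoClient

# Hypothetical S3-compatible Ceph RGW endpoint and credentials (placeholders).
s3 = boto3.client(
    "s3",
    endpoint_url="http://ceph-rgw.internal:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)
files = MongoClient()["app"]["files"]

def save_file(key: str, data: bytes, owner_id: str) -> None:
    """Store the blob in Ceph and only its key/metadata in MongoDB."""
    s3.put_object(Bucket="uploads", Key=key, Body=data)
    files.insert_one({"key": key, "owner_id": owner_id, "size": len(data)})

def load_file(key: str) -> bytes:
    """The web server fetches from Ceph and relays the bytes to the user."""
    return s3.get_object(Bucket="uploads", Key=key)["Body"].read()
```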
I hope I helped more than just myself. I'll go add Ceph to my tags. Good luck!
GridFS
Ceph File System
More Ceph

Performance of MongoDB using GlusterFS

We have several disk arrays that are shared in a distributed file system across multiple servers using GlusterFS. It works really well.
The problem is, we have no available storage that is not already allocated to the distributed file system. As a result, I have stored our MongoDB data within the distributed file system.
For now, I have no performance benchmarks, since it is the only available solution for my setup. However, I've been thinking of dedicating a disk array and server to Mongo only, where I would plug the disk array directly into the server.
Does anyone know why you should, or should not, store Mongo data on top of a distributed file system? I know Mongo has its own sharding solution for precisely this reason, so I'm thinking it's not ideal. If Mongo thinks multiple blocks of data are in the same location while they are actually on different storage media, can this cause a performance issue?

Mongodb base64 image vs gridfs

I'm using MongoDB and I want to store some thumbnails on my server. Which is better: using GridFS, or converting those images to base64 and storing them directly inside a document?
As always, there are some advantages and disadvantages:
Pros:
Fewer database requests if only the document + thumbnail is needed.
Fewer client requests. (Of course you could fetch the thumbnails from GridFS and put them in the response, but that would result in more database requests.)
Neutral:
Storage requirements are equal
Cons:
You can't easily reuse the very same image thumbnail in another document, because there's no id to reference. (For us, that's not an issue, because the server responses are gzip-compressed and you can't really tell the difference between 1 and 5 identical images.)
With MongoDB and NoSQL it's all about knowing your use cases!
If lots of your documents share the same image, you should use GridFS and just provide links to those files, because 1. sharing data is more space-efficient and 2. the client can cache the image request and only has to retrieve it once.
If your clients will always need the thumbnail, you should perhaps consider embedding the files as base64 within the response. This is especially nice if 1. images are not shared between documents and/or 2. images change often and caching is useless or not possible.
Base64 of course means more traffic on the wire, because it needs 8 bits to transfer 6 bits, i.e. 75% efficiency. This only affects client-server communication, because within MongoDB you can always store your data as a binary field (see the sketch below).
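A minimal sketch of that point with PyMongo (collection and field names are just illustrative): the thumbnail is embedded as raw BSON binary, and base64 encoding only happens at the edge when a JSON response is built.

```python
import base64
from pymongo import MongoClient
from bson.binary import Binary

articles = MongoClient()["app"]["articles"]

with open("thumb.jpg", "rb") as f:
    raw = f.read()

# Stored as binary inside the document, so there is no base64 inflation on disk.
articles.insert_one({"title": "Example", "thumbnail": Binary(raw)})

# Encode to base64 only when building the JSON payload for the client.
doc = articles.find_one({"title": "Example"})
payload = {"title": doc["title"],
           "thumbnail": base64.b64encode(doc["thumbnail"]).decode("ascii")}
```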
Do you prefer more database requests (= using GridFS)? Or bigger data/document size on the wire (= embedded)?
What we did:
We use embedded thumbnails, even though we potentially have duplicate images. After activating gzip compression on the server, the server-client transfer size didn't matter anymore. But, as said before, it's a tradeoff: now we have fewer client requests and fewer database requests, but because embedding makes caching the images impossible, we send more data over the wire.
Conclusion:
There's no one size fits all solution.
It really depends on your server-side technology and personal preference. 10gen suggests you use documents unless you are storing files larger than the document limit (16 MB). I would suggest that you do whatever is easier given the language you are working with. If you have other documents to model after, follow the document approach; otherwise give GridFS a shot.
I suggest you use GridFS. With GridFS you can take advantage of the MongoDB REST API, so there won't be any overhead for retrieving documents through the MongoDB API; the REST API will do all of the hard work and save you time.

MongoDB as file storage

I'm trying to find the best solution for creating scalable storage for big files. File sizes can vary from 1-2 megabytes up to 500-600 gigabytes.
I have found some information about Hadoop and its HDFS, but it looks a little complicated, because I don't need any Map/Reduce jobs and many other features. Now I'm thinking of using MongoDB and its GridFS as the file storage solution.
And now the questions:
What will happen with GridFS when I try to write a few files concurrently? Will there be any lock for read/write operations? (I will use it only as file storage.)
Will files from GridFS be cached in RAM, and how will that affect read/write performance?
Maybe there are some other solutions that can solve my problem more efficiently?
Thanks.
I can only answer for MongoDB here, I will not pretend I know much about HDFS and other such technologies.
The GridFS implementation is totally client side, within the driver itself. This means there is no special loading or understanding of the context of file serving within MongoDB itself; effectively, MongoDB does not even understand that they are files ( http://docs.mongodb.org/manual/applications/gridfs/ ).
This means that querying for any part of the files or chunks collection will result in the same process as it would for any other query, whereby it loads the data it needs into your working set ( http://en.wikipedia.org/wiki/Working_set ) which represents a set of data (or all loaded data at that time) required by MongoDB within a given time frame to maintain optimal performance. It does this by paging it into RAM (well technically the OS does).
Another point to take into consideration is that this is driver implemented. This means that the specification can vary, however, I don't think it does. All drivers will allow you to query for a set of documents from the files collection which only houses the files meta data allowing you to later serve the file itself from the chunks collection with a single query.
However, that is not the important thing; you want to serve the file itself, including its data. This means that you will be loading the files collection and its subsequent chunks collection into your working set.
With that in mind we have already hit the first snag:
Will files from GridFS be cached in RAM, and how will that affect read/write performance?
The read performance of small files could be awesome, directly from RAM; the writes would be just as good.
For larger files, not so. Most computers will not have 600 GB of RAM, and it is likely, quite normal in fact, to store a 600 GB file on a single mongod instance. This creates a problem, since that file, in order to be served, needs to fit into your working set, yet it is far bigger than your RAM; at this point you could have page thrashing ( http://en.wikipedia.org/wiki/Thrashing_%28computer_science%29 ) whereby the server is just page faulting 24/7 trying to load the file. The writes here are no better.
The only way around this is to start splitting a single file across many shards :\.
Note: one more thing to consider is that the default size of a GridFS chunk is 256 KB, so that's a lot of documents for a 600 GB file. This setting is adjustable in most drivers (see the sketch below).
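A minimal sketch of both knobs in Python with PyMongo (database name and chunk size are arbitrary): GridFSBucket lets you choose a larger chunk size, and the chunks collection can be sharded so a single huge file is spread across shards.

```python
from pymongo import MongoClient
from gridfs import GridFSBucket

client = MongoClient("mongodb://localhost:27017")
db = client["media"]

# Larger chunks mean far fewer chunk documents per file (default is roughly 256 KB).
bucket = GridFSBucket(db, chunk_size_bytes=8 * 1024 * 1024)
with open("huge.bin", "rb") as f:
    bucket.upload_from_stream("huge.bin", f)

# Spreading one file's chunks across shards (sharding must be enabled for the db;
# {files_id: 1, n: 1} is a commonly recommended shard key for fs.chunks).
client.admin.command("enableSharding", "media")
client.admin.command("shardCollection", "media.fs.chunks",
                     key={"files_id": 1, "n": 1})
```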
What will happen with GridFS when I try to write a few files concurrently? Will there be any lock for read/write operations? (I will use it only as file storage.)
GridFS, being only a specification, uses the same locks as any other collection: both read and write locks at the database level (2.2+) or at the global level (pre-2.2). The two do interfere with each other as well, i.e. how can you ensure a consistent read of a document that is being written to?
That being said, the possibility of contention exists, depending on your scenario specifics: traffic, number of concurrent writes/reads, and many other things we have no idea about.
Maybe there are some other solutions that can solve my problem more efficiently?
I have personally found that S3 (as @mluggy said) in reduced-redundancy mode works best: store only a small portion of metadata about the file in MongoDB, much like using GridFS but without the chunks collection, and let S3 handle all the distribution, backup and other stuff for you.
Hopefully I have been clear, hope it helps.
Edit: Contrary to what I accidentally said, MongoDB does not have a collection-level lock; it is a database-level lock.
Have you considered saving the metadata in MongoDB and writing the actual files to Amazon S3? Both have excellent drivers, and the latter is highly redundant, cloud/CDN-ready file storage. I would give it a shot.
I'll start by answering the first two:
There is a write lock when writing to GridFS, yes. No lock for reads.
The files won't be cached in memory when you query them, but their metadata will.
GridFS may not be the best solution for your problem. Write locks can become something of a pain when you're dealing with this type of situation, particularly for huge files. There are other databases out there that may solve this problem for you. HDFS is a good choice, but as you say, it is very complicated. I would recommend considering a storage mechanism like Riak or Amazon's S3. They're more oriented around being storage for files and don't come with major drawbacks. S3 and Riak both have excellent admin facilities and can handle huge files, though with Riak, last I knew, you had to do some file chunking to store files over 100 MB. Despite that, it is generally a best practice to do some level of chunking for huge file sizes. There are a lot of bad things that can happen when transferring files into DBs: from network timeouts to buffer overflows, etc. Either way, your solution is going to require a fair amount of tuning for massive file sizes.

Is GridFS fast and reliable enough for production?

I'm developing a new website and I want to use GridFS as storage for all user uploads, because it offers a lot of advantages compared to normal filesystem storage.
Benchmarks with GridFS served by nginx indicate that it's not as fast as a normal filesystem served by nginx.
Benchmark with nginx
Is anyone out there, who uses GridFS already in a production environment, or would use it for a new project?
I use GridFS at work on one of our servers, which is part of a price-comparison website with respectable traffic stats (around 25k visitors per day). The server doesn't have much RAM (2 GB), and even the CPU isn't really fast (Core 2 Duo 1.8 GHz), but the server has plenty of storage space: 10 TB (SATA) in a RAID 0 configuration. The job the server is doing is very simple:
Each product on our price comparer has an image (there are around 10 million products according to our product DB), and the server's job is to download the image, resize it, store it in GridFS, and deliver it to the visitor's browser if it's not yet present in the grid... or deliver it straight from the grid if it's already stored there. So this could be called a 'traditional CDN schema'.
We have stored and processed 4 million images on this server since it went live. The resize-and-store work is done by a simple PHP script... but for sure, a Python script, or something like Java, could be faster (a rough sketch of the flow follows below).
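The answer's production code is PHP; the following is only a rough Python sketch of the same cache-or-generate flow, using PyMongo, Pillow and requests, with made-up database names, sizes and URLs.

```python
from io import BytesIO

import gridfs
import requests
from PIL import Image
from pymongo import MongoClient

db = MongoClient()["cdn"]
fs = gridfs.GridFS(db)

def serve_thumbnail(product_id: str, source_url: str) -> bytes:
    """Return the resized image, generating and storing it on a cache miss."""
    filename = f"{product_id}.jpg"
    if fs.exists(filename=filename):
        return fs.get_last_version(filename).read()

    # Cache miss: download, resize, store in GridFS, then serve.
    original = requests.get(source_url, timeout=10).content
    img = Image.open(BytesIO(original)).convert("RGB")
    img.thumbnail((300, 300))
    buf = BytesIO()
    img.save(buf, format="JPEG")
    data = buf.getvalue()
    fs.put(data, filename=filename, contentType="image/jpeg")
    return data
```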
Current data size: 11.23 GB
Current storage size: 12.5 GB
Indexes: 5
Index size: 849.65 MB
About reliability: this is very reliable. The server isn't under heavy load, the index size is OK, and queries are fast.
About speed: for sure, it is not as fast as local file storage, maybe 10% slower, but fast enough to be used in real time even when the image needs to be processed, which in our case is very PHP-dependent. Maintenance and development time has also been reduced: it became so simple to delete a single image or multiple images, just query the DB with a simple delete command. Another interesting thing: when we rebooted our old server with local file storage (so millions of files in thousands of folders), it sometimes hung for hours because the system was performing a file integrity check (this really took hours...). We do not have this problem any more with GridFS; our images are now stored in big MongoDB chunks (2 GB files).
So... in my opinion... yes, GridFS is fast and reliable enough to be used in production.
As mentioned, it might not be as fast as an ordinary filesystem, but it gives you many advantages over ordinary filesystems which I think are worth giving up a bit of speed for.
Ultimately, with sharding, you might reach a point however where the GridFS storage actually becomes the faster option as opposed to an ordinary filesystem and a single node.
Heads-up on repairs for larger DBs though: on a new system we're developing, Mongo didn't exit cleanly, and repairing the 7 TB GridFS looks like it will take 130 hours.
Because of this, I think I'll look at switching to OpenStack Swift or Ceph.
Still, until then it was good. And the nginx-gridfs module is sweet.
mdirolf's nginx-gridfs module is great and fairly easy to set up. We're using it in production at paint.ly to serve all of the paintings, and there have been no problems so far.
I don't recommend using gridfs unless you know what you are doing.
GridFS is just an abstraction layer that splits files into chunks and stores them in two collections. More files means more overhead. If you expect your files to be roughly the same size, not exceeding 32 MB or so, you're on the right track.
Do not try to store large files on gridfs. Why?
Drivers in different languages may read the whole file (i.e. all chunks) when reading only a small part of the file.
Modifying a file may affect all chunks and increase database load.
If your file storage keeps growing, you will have to decide to shard GridFS. Be careful! Consistency is not guaranteed while sharding is initializing!
If your project is read-heavy, consider loading the files into documents directly (if they are 16 MB or less) or choosing another cluster filesystem, and linking the filename/inode into your logic.
Hope this helps.