How should I understand MongoDB GridFS?

I am now using a Perl script client to store some big data into MongoDB. But I have hit a problem: some documents exceed the size limit of 16MB, so I have to use GridFS. From the GridFS documentation, I read this:
GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16MB.
Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document. By default GridFS limits chunk size to 256k.
This really confuses me. What does it mean by "file"? "Instead of storing a file in a single document" means that, without GridFS, MongoDB stores a file in a single document, right? But I think it should say: "Instead of storing a document in a single file, ...". So the relationship and difference between "file" and "document" confuse me.

What does it mean by "file"?
A file. A Word document, an Excel spreadsheet, an HTML file, anything that is a file. GridFS is designed for file storage.
it means that, without GridFS, MongoDB stores a file in a single document
The MongoDB server does not do anything here; it does not even manage GridFS. The documentation assumes you come to GridFS after encountering the limited size of a single document, as you have.
Instead of storing a document in a single file,...
Nope, that is incorrect. What is a document? MongoDB's own records are called documents; how could you store those within files in the database? You store data within documents in the database.
So the relationship and difference between "file" and "document" confuse me.
A file is a physical file; a document is basically a row.
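To make the distinction concrete, here is a minimal sketch using the Python driver (pymongo); the database, collection, and file names are purely illustrative:

import gridfs
from pymongo import MongoClient

db = MongoClient()["testDB"]  # hypothetical database name

# A "document" is a single BSON record in a collection, much like a row.
db.users.insert_one({"name": "Alice", "age": 30})

# A "file" is an actual file on disk (PDF, image, ...). GridFS splits it into
# chunk documents in fs.chunks plus one metadata document in fs.files.
fs = gridfs.GridFS(db)
with open("report.pdf", "rb") as f:  # hypothetical file
    file_id = fs.put(f, filename="report.pdf")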

Related

Upload pdf with express to mongo db

I'm trying to implement PDF upload with Express and save it to MongoDB, but it only saves locally on my PC and I can't send it to MongoDB. Can anyone help me with how to upload or access PDF files using multer or any other library?
As far as I know, you have two, maybe three, ways of using "files" with MongoDB; the right approach will depend on your use case (and the size of the documents):
Store the files directly in the document
As you know, you can store anything you want in a JSON/BSON document; you just need to store the bytes and your PDF will be part of the document.
You just need to be careful about the document size limit of 16MB.
You can add metadata for the file in the JSON document, and it is stored in the same place.
For example, in Java you would just store a byte[] in an attribute; look at this test:
https://github.com/mongodb/mongo-java-driver/blob/master/src/test/com/mongodb/ByteTest.java#L188
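If you are not using Java, the same idea in Python with pymongo looks roughly like this (the collection and field names are placeholders); the bytes are wrapped in a BSON Binary value and the whole document must stay under the 16MB limit:

from bson.binary import Binary
from pymongo import MongoClient

db = MongoClient()["testDB"]  # hypothetical database name

with open("invoice.pdf", "rb") as f:  # hypothetical PDF, well under 16MB
    pdf_bytes = f.read()

# The file bytes are stored inline, in the same document as the metadata.
db.documents.insert_one({
    "filename": "invoice.pdf",
    "contentType": "application/pdf",
    "data": Binary(pdf_bytes),
})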
Use GridFS
GridFS allows you to store files of "any size" in MongoDB. The file you are storing is divided into chunks by the driver and stored as smaller documents in MongoDB; when you read it back, the chunks are reassembled into a single file. With this approach, you do not have any size limit.
In this case, if you want to add metadata, you create a JSON document that you store with all the attributes and a reference to the GridFS file.
You can find information about this here: http://docs.mongodb.org/manual/core/gridfs/
and in this Java test:
https://github.com/mongodb/mongo-java-driver/blob/master/src/test/com/mongodb/gridfs/GridFSTest.java
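A rough Python equivalent, for readers not on Java (the file and metadata names are just examples):

import gridfs
from pymongo import MongoClient

db = MongoClient()["testDB"]
fs = gridfs.GridFS(db)

# Write: the driver splits the file into chunk documents behind the scenes.
with open("big_report.pdf", "rb") as f:
    file_id = fs.put(f, filename="big_report.pdf",
                     metadata={"contentType": "application/pdf"})

# Read: the chunks are fetched and reassembled into a single stream.
pdf_bytes = fs.get(file_id).read()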
Create a reference to external storage
This one is not directly a "MongoDB" use case, but I think it is important to mention it. You can obviously store the files in some specialised storage and use MongoDB only for the metadata and a reference to the file. To take a simple example: suppose you want to create a video application; you could store the videos on YouTube and reference them from the document that holds all your application metadata.
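In that pattern the MongoDB document only carries the metadata and the external reference, for example (all field values are invented):

from pymongo import MongoClient

db = MongoClient()["videoApp"]  # hypothetical database

db.videos.insert_one({
    "title": "Product demo",
    "duration_seconds": 312,
    "uploaded_by": "alice",
    # The actual content lives outside MongoDB; only the reference is stored.
    "youtube_url": "https://www.youtube.com/watch?v=XXXXXXXXXXX",
})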
So, to stay with your use case/question: you can use approaches 1 and 2, and the choice will depend on the size of the files and how you access them. If you can give us more information about your application, people may have a stronger opinion on the best approach.
If you are looking for people doing this, you can look at this presentation from MongoDB World: http://www.mongodb.com/presentations/translational-medicine-platform-sanofi-0

How to store lookup values in MongoDB?

I have a collection in the database which represents media files.
Among other info, I should store the format name. I wonder if there are best practices for storing info like that. Is it better to create a new collection for file formats and link to that collection, or to store the format name right in the file documents as plain text? What about performance and compression? There are supposed to be more than a billion documents in the database. What would Mongo experts suggest in this situation?
Embedded documents are the preferred approach.
In your case, it means it is better to store the file format in the same collection.
Putting the file format into a separate collection means creating a new file on disk.
It is a slower option and should only be used if any of your documents exceeds 16 MB in size.
See these links for more information
6 Rules of Thumb for MongoDB Schema Design
and
How to Program with MongoDB Using the .NET Driver
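As a sketch of the two options in Python (the names are illustrative, not a recommendation): embedding stores the format name directly in each media document, while referencing keeps a separate formats collection and stores only its _id:

from pymongo import MongoClient

db = MongoClient()["mediaDB"]  # hypothetical database

# Option 1: embed the lookup value directly in the media document.
db.mediafiles.insert_one({"name": "intro.mp4", "format": "mp4"})

# Option 2: reference a separate lookup collection.
fmt_id = db.formats.insert_one({"name": "mp4"}).inserted_id
db.mediafiles.insert_one({"name": "intro.mp4", "format_id": fmt_id})
# Reading the format name now requires a second query (or a $lookup).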
I've done some benchmarks and found that, in my case, storing "lookup values" as plain text is more efficient in terms of disk space than an embedded document or a reference to a separate collection. Sorry for the poor terminology.

Mongodb to Mongodb GridFS

I'm new to MongoDB. I wanted to know: if I initially code my app using MongoDB and later want to switch to MongoDB GridFS, will switching (with an already filled, large database) be possible?
So, if I am using MongoDB initially and, after some time of running the app, the database documents exceed the size of 16MB, I guess I will have to switch to GridFS. I want to know how easy or difficult it will be to switch to GridFS, and whether it will be possible at all.
Thanks.
GridFS is used to store large files. It internally divides data into chunks (255 KB by default). Let me give you an example of saving a PDF file in MongoDB both ways. I am assuming a PDF size of 10 MB so that we can compare the normal way and the GridFS way.
Normal Way:
Say you want to store it in the normal_book collection in the testDB database. The whole PDF is stored in a single document in this collection, and when you fetch it using db.normal_book.find(), the whole PDF is loaded into memory.
GridFS way:
In GridFS, we have two collections: one for storing the data and the other for storing its metadata. It stores the data in the fs.chunks collection and the metadata in the fs.files collection. Now, the beauty of GridFS is that you can fetch the whole file at once or fetch the chunks individually.
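For example, with the Python driver you can either read the reassembled file through the GridFS API or query the two collections directly (the file name is made up):

import gridfs
from pymongo import MongoClient

db = MongoClient()["testDB"]
fs = gridfs.GridFS(db)

# Whole file at once through the GridFS API.
grid_out = fs.find_one({"filename": "book.pdf"})
pdf_bytes = grid_out.read()

# Or inspect the underlying collections yourself.
meta = db.fs.files.find_one({"filename": "book.pdf"})
first_chunk = db.fs.chunks.find_one({"files_id": meta["_id"], "n": 0})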
Now coming to your question: there is no direct way or property to tell MongoDB that you now want to switch to GridFS. You need to reinsert the data into GridFS using the mongofiles command-line tool or one of MongoDB's drivers.
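A minimal migration sketch with the Python driver, assuming your existing documents keep the raw bytes in a data field (the field and collection names are assumptions):

import gridfs
from pymongo import MongoClient

db = MongoClient()["testDB"]
fs = gridfs.GridFS(db)

# Re-insert each inline binary field as a GridFS file,
# then keep only the GridFS id in the original document.
for doc in db.normal_book.find({"data": {"$exists": True}}):
    file_id = fs.put(doc["data"], filename=doc.get("filename", str(doc["_id"])))
    db.normal_book.update_one(
        {"_id": doc["_id"]},
        {"$set": {"gridfs_id": file_id}, "$unset": {"data": ""}},
    )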

GridFS and Cloning to another server

I have a local MongoDB database that I am starting to put some files into GridFS for caching purposes. What I want to know is:
Can I use db.cloneCollection() on another server to clone my fs.* collections? If I do that, will the GridFS system on that server work properly? Essentially I have to "pull" data from another machine that has the files in GridFS; I can't directly add them easily to the production box.
Edit: I was able to get on my destination server and use the following commands from the mongo shell to pull the GridFS system over from another mongo system on our network.
use DBName
db.cloneCollection("otherserver:someport","fs.files")
db.cloneCollection("otherserver:someport","fs.chunks")
For future reference.
The short answer is: of course you can. They are only collections, and there is nothing special about them at all. The longer answer is an explanation of what GridFS actually is.
So the very first sentence on the manual page:
GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16MB.
GridFS is not something that "MongoDB does". Internally to the server it is basically just two collections: one for the reference information and one for the "chunks" that are used to break up the content so no individual document exceeds the 16MB limit. But most important here is the word "specification".
So the server itself does no magic at all. The implementation that stores the reference data and the chunks is all done at the "driver" level, where in fact you can name the collections you wish to use rather than just accept the defaults. So when reading and writing data, it is the "driver" that does the work, pulling the "chunks" associated with the reference document or creating new "chunks" as data is sent to the server.
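For instance, with the Python driver you can point GridFS at a different collection prefix instead of the default fs (the name attachments is just an example):

import gridfs
from pymongo import MongoClient

db = MongoClient()["testDB"]

# Uses attachments.files and attachments.chunks instead of fs.files / fs.chunks.
fs = gridfs.GridFS(db, collection="attachments")

file_id = fs.put(b"hello world", filename="hello.txt")
print(fs.get(file_id).read())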
The other common misconception is that GridFS is the only method for dealing with "files" when sending content to MongoDB. Again in that first sentence, it actually exists as a way to store content that exceeds the 16MB limit for BSON documents.
MongoDB has no problem directly storing binary data in a document as long as the total document does not exceed the 16MB limit. So in most use cases (small image files used on websites) the data would be better stored in ordinary documents, thus avoiding the overhead of needing to read and write multiple collections.
So there is no internal server "magic". These are just ordinary collections that you can query, aggregate, mapReduce and even copy or clone.

MongoDB GridFS Size Limit

I am using MongoDB as a convenient way of storing a dataset as a series of columns, where there is one document that stores the values for a given column and another document that stores the details of the dataset and a mapping to the other documents with the associated column values. The issue I'm now facing as things get bigger is that I can no longer store an entire column in a single document.
I'm aware that there is also the GridFS option. The only downside is that I believe it stores the files as blobs, meaning I would lose random access to a chunk of the column, or to the value at a specified index, something that was incredibly useful with the document store; however, I may not have any other option.
So my question is: does GridFS also impose an upper limit on the size of documents, and if so, does anyone know what it is? I've looked in the docs and haven't found anything, but it may be that I'm not looking in the correct place, or that there is a limit but it's not well documented.
Thanks,
Vackar
GridFS
Per the GridFS documentation:
Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document. By default GridFS limits chunk size to 256k. GridFS uses two collections to store files. One collection stores the file chunks, and the other stores file metadata.
GridFS will allow you to store arbitrarily large files; however, this really won't help your use case. A file in GridFS is effectively a large binary blob, and you will not get any of the benefits of structured documents and indexing.
Schema Design
The fundamental challenge you have is your approach to schema design. If you are creating documents that are likely to grow beyond the 16MB document limit, they will also have a significant impact on your database storage and fragmentation as they grow in size.
The appropriate solution would be to rethink your schema approach so that you do not have unbounded document growth. This probably means flattening the array of "columns" that you are growing so it is represented by a collection of documents rather than an array.
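As a sketch of what that flattening could look like in Python (the collection and field names are invented): one small document per column value, keyed by dataset, column, and row index, so any cell remains individually addressable and no single document grows without bound:

from pymongo import MongoClient

db = MongoClient()["analyticsDB"]  # hypothetical database

# One small document per value instead of one huge array per column.
db.column_values.insert_one({
    "dataset_id": "sales_2024",
    "column": "revenue",
    "row_index": 41823,
    "value": 1250.75,
})

# A compound index keeps random access by column and row index fast.
db.column_values.create_index([("dataset_id", 1), ("column", 1), ("row_index", 1)])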
A better (and separate) question to ask would be how to refactor your schema given the expected data growth patterns.