MapReduce on GridFS file with MongoDB

I want to store BSON documents in GridFS because they quickly grow beyond 16 MB, but I also have to run some map-reduce analytics on them. Is that possible, or do I have to split each document into multiple documents to do that? Tutorials and other material always talk about binary data like pictures and videos, but not about BSON documents.
Thanks.

GridFS is only meant to store binary files. It is not meant to split normal documents (which you call BSON documents). If your BSON documents are too large, you need to rethink your data schema. If you provide that schema, I can update my answer with hints and tips.
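As a rough illustration of that advice (collection and field names are invented): instead of letting one document grow past 16 MB, each growing array entry can become its own small document in a regular collection, which map-reduce (or the aggregation framework) can then process directly. A minimal pymongo sketch, assuming a server version that still supports the mapReduce command:

```python
from pymongo import MongoClient
from bson.code import Code
from bson.son import SON

client = MongoClient()
db = client.test

# Hypothetical schema: rather than one huge document per sensor with an
# ever-growing "readings" array, store one small document per reading.
db.readings.insert_many([
    {"sensor_id": "s1", "value": 10},
    {"sensor_id": "s1", "value": 15},
    {"sensor_id": "s2", "value": 7},
])

# Map-reduce now runs over ordinary documents; GridFS is not involved.
mapper = Code("function () { emit(this.sensor_id, this.value); }")
reducer = Code("function (key, values) { return Array.sum(values); }")
db.command(SON([
    ("mapReduce", "readings"),
    ("map", mapper),
    ("reduce", reducer),
    ("out", "reading_totals"),
]))

print(list(db.reading_totals.find()))  # e.g. {'_id': 's1', 'value': 25.0}
```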

Related

Is it possible to use MongoDB geospatial indexes with GridFS

I have a large GeoJSON feature collection which is over 16 MB. I am hoping to insert the data into MongoDB so that I can utilize the geospatial functionality that MongoDB offers ($geoIntersects, $geoWithin, etc.). Due to the large size of the file, I cannot store the data in one MongoDB document.
I have used GridFS to break the file up into several chunks within MongoDB, but I am unsure whether I can still utilize the geospatial features that I would like to.
Does anyone know if this is possible and, if so, what's the best way to do something like this?
One way you should be able to achieve what you are describing is to extract the data to be indexed into a separate collection, and add indexes on that collection.
GridFS essentially takes the data, splits it into small chunks (around 256 kB each by default) and stores each chunk as a separate document. I don't see how those binary chunks could be covered by a geospatial index.
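For example (a sketch with pymongo and an invented features collection): each feature of the oversized FeatureCollection becomes its own document, which keeps every document well under 16 MB and lets a 2dsphere index cover the geometry field:

```python
import json
from pymongo import MongoClient

client = MongoClient()
db = client.test

# Hypothetical: split the >16 MB FeatureCollection into one document per
# feature instead of pushing the whole file into GridFS.
with open("features.geojson") as f:
    feature_collection = json.load(f)
db.features.insert_many(feature_collection["features"])

# A 2dsphere index on the per-feature geometry enables $geoIntersects,
# $geoWithin, $near, and so on.
db.features.create_index([("geometry", "2dsphere")])

polygon = {
    "type": "Polygon",
    "coordinates": [[[-74.1, 40.6], [-73.8, 40.6], [-73.8, 40.9],
                     [-74.1, 40.9], [-74.1, 40.6]]],
}
hits = db.features.find({"geometry": {"$geoIntersects": {"$geometry": polygon}}})
```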

How to store lookup values in MongoDB?

I have a collection in the database which represents media files.
Among other info, I should store the format name. I wonder whether there are best practices for storing info like that. Is it better to create a new collection for file formats and reference that collection, or to store the format name right in the file documents as plain text? What about performance and compression? There are supposed to be more than a billion documents in the database. What would MongoDB experts suggest in this situation?
Embedded documents are the preferred approach.
In your case, that means it is better to store the file format in the same collection.
Putting the file format into a separate collection means creating a new file on disk.
It is the slower option and should only be used if any of your documents would exceed 16 MB in size.
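A quick illustration of the two shapes (field names are hypothetical); the first, embedded form is the one recommended above:

```python
# Embedded / denormalized: the format lives right in the media document.
media_embedded = {
    "_id": 1,
    "title": "clip.avi",
    "format": {"name": "AVI", "mime": "video/x-msvideo"},
}

# Referenced: a separate formats collection plus an extra lookup on every read.
format_doc = {"_id": "avi", "name": "AVI", "mime": "video/x-msvideo"}
media_referenced = {"_id": 2, "title": "clip.avi", "format_id": "avi"}
```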
See these links for more information:
6 Rules of Thumb for MongoDB Schema Design
How to Program with MongoDB Using the .NET Driver
I've done some benchmarks and found that, in my case, storing "lookup values" as plain text is more efficient in terms of disk space than either an embedded document or a reference to a separate collection. Sorry for the poor terminology.

How to store data with variable size (1-30 MB) in MongoDB?

I am using the MongoDB Java driver 2.13. I want to store PDF files of varying size (1 MB to 30 MB).
Note: I know I can't store a document larger than 16 MB in MongoDB; for those, I need to use GridFS.
I want to save small PDF files (<16 MB) in a BookPDFs collection in the normal way. For larger files, I need to store them using GridFS (in fs.chunks and fs.files). When I want to retrieve all PDFs, I then need to access the BookPDFs collection as well as the fs.chunks and fs.files collections. That find operation loses atomicity, and it also takes more time to gather data from different collections.
While fetching data I don't need it in chunks, so GridFS is not of much use here. Which of these would be the best approach:
Save all of my data in GridFS (with title, author, etc. fields in fs.files as metadata).
Save data in different collections according to size.
Any other approach?
Thanks in advance.
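A sketch of the first option, assuming pymongo/gridfs is acceptable (the question uses the Java driver, so this is only illustrative): every PDF goes through GridFS regardless of size, with title and author stored as metadata on the fs.files document so a single query path covers all books:

```python
import gridfs
from pymongo import MongoClient

client = MongoClient()
db = client.library
fs = gridfs.GridFS(db)  # uses the default fs.files / fs.chunks collections

# Store every PDF the same way, whatever its size; title and author go
# into the fs.files document as metadata.
with open("book.pdf", "rb") as f:
    file_id = fs.put(f, filename="book.pdf",
                     metadata={"title": "Some Book", "author": "A. Author"})

# Retrieval is a single code path: query fs.files metadata, then read.
grid_out = fs.find_one({"metadata.author": "A. Author"})
pdf_bytes = grid_out.read()
```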

MongoDB GridFS Size Limit

I am using MongoDB as a convenient way of storing a dataset as a series of columns: there is one document that stores the values for a given column, and another document that stores the details of the dataset plus a mapping to the documents holding the associated column values. The issue I'm now facing as things get bigger is that I can no longer store an entire column in a single document.
I'm aware that there is also the GridFS option. The only downside is that I believe it stores the files as blobs, meaning I would lose random access to a chunk of the column, or to the value at a specified index, something that was incredibly useful with the document store; however, I may not have any other option.
So my question is: does GridFS also impose an upper limit on the size of documents, and if so, does anyone know what it is? I've looked in the docs and haven't found anything, but it may be that I'm not looking in the correct place, or that there is a limit that isn't well documented.
Thanks,
Vackar
GridFS
Per the GridFS documentation:
Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document. By default GridFS limits chunk size to 256k. GridFS uses two collections to store files. One collection stores the file chunks, and the other stores file metadata.
GridFS will allow you to store arbitrarily large files; however, this really won't help your use case. A file in GridFS is effectively a large binary blob, and you will not get any of the benefits of structured documents and indexing.
Schema Design
The fundamental challenge here is your approach to schema design. If you are creating documents that are likely to grow beyond the 16 MB document limit, those documents will also have a significant impact on your database storage and fragmentation as they grow in size.
The appropriate solution would be to rethink your schema approach so that you do not have unbounded document growth. This probably means flattening the array of "columns" that you are growing so it is represented by a collection of documents rather than an array.
A better (and separate) question to ask would be how to refactor your schema given the expected data growth patterns.
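One possible flattened shape (a sketch with invented field names): one small document per (dataset, column, row index) cell, so no document ever grows and the value at a given index stays directly addressable:

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient()
db = client.datasets

# Instead of one ever-growing document per column, store one document per
# cell; documents never grow, and random access stays a point query.
db.cells.insert_many([
    {"dataset": "d1", "column": "price", "row": 0, "value": 9.5},
    {"dataset": "d1", "column": "price", "row": 1, "value": 10.25},
])
db.cells.create_index(
    [("dataset", ASCENDING), ("column", ASCENDING), ("row", ASCENDING)],
    unique=True,
)

# Value at a specific index: a single indexed lookup.
cell = db.cells.find_one({"dataset": "d1", "column": "price", "row": 1})
```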

Question on GridFS

As one can see in the GridFS docs, BSON objects are limited in size, so if I want to store something extremely big, I need to split it into chunks; it then becomes a document in the fs.files collection. My question is: is there a way to have huge fields in a document, so that it can be found without looking in the fs.files collection?
Thank you in advance!
No. BSON documents have a hard 16 MB limit, so individual fields can never exceed this size limitation. It is exactly that limitation GridFS works around by transparently chunking a larger file across multiple smaller segments.
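A small sketch of that limit in practice, assuming pymongo and a hypothetical collection name; the driver itself refuses a document whose single field pushes it over 16 MB:

```python
from pymongo import MongoClient
from pymongo.errors import DocumentTooLarge

client = MongoClient()
db = client.test

# A single field (and therefore the whole document) may not exceed the
# 16 MB BSON limit; the driver rejects the insert before sending it.
huge = {"payload": b"x" * (17 * 1024 * 1024)}
try:
    db.stuff.insert_one(huge)
except DocumentTooLarge as exc:
    print("rejected:", exc)
```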