GridFS and Cloning to another server - mongodb

I have a local MongoDB database in which I am starting to put some files into GridFS for caching purposes. What I want to know is:
Can I use db.cloneCollection() on another server to clone my fs.* collections? If I do that, will the GridFS system on that server work properly? Essentially I have to "pull" data from another machine that has the files in GridFS; I can't easily add them directly to the production box.
Edit: I was able to get on my destination server and use the following commands from the mongo shell to pull the GridFS system over from another mongo system on our network.
use DBName
db.cloneCollection("otherserver:someport","fs.files")
db.cloneCollection("otherserver:someport","fs.chunks")
For future reference.

The short answer is yes, of course you can: these are only collections, and there is nothing special about them at all. The longer answer explains what GridFS actually is.
So the very first sentence on the manual page:
GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16MB.
GridFS is not something that "MongoDB does". Internally to the server it is basically just two collections, one for the reference information and one for the "chunks" that are used to break up the content so that no individual document exceeds the 16MB limit. But the most important word here is "specification".
So the server itself does no magic at all. The implementation that stores reference data and chunks lives entirely at the "driver" level, where you can in fact name the collections you wish to use rather than just accept the defaults. So when reading and writing data, it is the "driver" that does the work, either by pulling the "chunks" that belong to the reference document or by creating new "chunks" as data is sent to the server.
The other common misconception is that GridFS is the only method for dealing with "files" when sending content to MongoDB. Again in that first sentence, it actually exists as a way to store content that exceeds the 16MB limit for BSON documents.
MongoDB has no problem storing binary data directly in a document as long as the total document does not exceed the 16MB limit. So in many use cases (small image files used on websites, for example) the data would be better stored in ordinary documents, which avoids the overhead of reading and writing across multiple collections.
So there is no internal server "magic". These are just ordinary collections that you can query, aggregate, mapReduce and even copy or clone.
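To make the "two ordinary collections" point concrete, here is a minimal sketch in plain Python, with no MongoDB connection, of the kind of splitting a driver performs. The function names are hypothetical; the field names (`length`, `chunkSize`, `files_id`, `n`, `data`) follow the GridFS specification, and optional fields such as `uploadDate` are left out for brevity.

```python
CHUNK_SIZE = 255 * 1024  # 255 KB, the usual driver default


def split_into_gridfs_docs(file_id, filename, data, chunk_size=CHUNK_SIZE):
    """Return (files_doc, chunk_docs) shaped like fs.files / fs.chunks entries."""
    files_doc = {
        "_id": file_id,
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
    }
    # One chunk document per slice; "n" records the slice order.
    chunk_docs = [
        {"files_id": file_id, "n": n, "data": data[start:start + chunk_size]}
        for n, start in enumerate(range(0, len(data), chunk_size))
    ]
    return files_doc, chunk_docs


def reassemble(files_doc, chunk_docs):
    """Rebuild the original bytes from the chunks that reference files_doc."""
    own = [c for c in chunk_docs if c["files_id"] == files_doc["_id"]]
    own.sort(key=lambda c: c["n"])
    data = b"".join(c["data"] for c in own)
    if len(data) != files_doc["length"]:
        raise ValueError("chunks do not add up to the recorded file length")
    return data
```

Because both halves are plain documents linked by `files_id`, copying the two collections to another server carries over everything a driver needs to read the files back.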

Related

How to store lookup values in MongoDB?

I have a collection in my db which represents media files.
Among other info, I should store the format name. I wonder if there are best practices for storing info like that. Is it better to create a new collection for file formats and link to that collection, or to store the format name right in the file documents as plain text? What about performance and compression? There are expected to be more than a billion documents in the db. What would Mongo experts suggest in this situation?
Embedded documents are the preferred approach.
In your case, that means it is better to store the file format in the same collection as the media files.
Putting the file format into a separate collection means an extra lookup whenever you read a file document.
It is the slower option and is mainly worth using if embedding would push a document (any of them) past 16 MB in size.
See these for more information: "6 Rules of Thumb for MongoDB Schema Design" and "How to Program with MongoDB Using the .NET Driver".
I've done some benchmarks and figured out that, in my case, storing "lookup values" as plain text is more efficient in terms of disk space than either an embedded document or a reference to a separate collection. Sorry for the poor terminology.

Can I store data that won't affect query performance in MongoDB?

We have an application that needs to save data in documents, for querying and sorting purposes. The data should be schemaless, as some of the fields will be known only through usage. For this, MongoDB is a great solution and it works great for us.
Part of the data in each document is for display purposes, meaning the data can be objects (let's say JSON) that the client side uses to plot diagrams.
I tried to save this data using GridFS, but for our use case it was not responsive enough. Also, the documents won't exceed the 16 MB limit even with the diagram data inside them. In fact, when we saved this data directly within the documents, we got better results.
This data is used only for client side responses, meaning we should never query it. So my question is, can I insert this data into MongoDB and mark it as 'not for query' data? Meaning, can I insert this data without affecting Mongo's performance? The data is strict: once a document is inserted, existing fields might be updated, but no new ones will be added.
I've noticed there is a Binary Data type in Mongo, and I am wondering if I should use this type for objects that are not binary. Can this give me what I'm looking for?
Also, I would love to know what is the advantage in using this type inside my documents. Can it save me disk space?
As of MongoDB 3.4, read and write operations are atomic at the level of a single document from the storage/memory point of view. If the MongoDB server needs to fetch a document from memory or disk (even when projecting a subset of fields to return), the full document generally has to be loaded into memory on the mongod. The only exception is a covered query, where all of the fields returned are also included in the index used.
This data is used only for client side responses, meaning we should never query it.
Data fields which aren't queried directly do not need to be in any indexes. However, there is currently no concept like "not for query" fields in MongoDB. You can query or project any field (with or without an index).
Meaning, can I insert this data without affecting Mongo's performance?
Data with very different access or growth patterns (such as your infrequently requested client data) is a recommended candidate for storing separately from a parent document with frequently accessed data. This will improve the efficiency of memory usage for mongod by avoiding unnecessary retrieval of data when working with documents in the parent collection.
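As a rough illustration of storing the display data separately, the sketch below splits one application document into a frequently queried "core" part and a "display" part that share the same `_id`. It is plain Python with no database calls; the function name and the `diagram` field are hypothetical, and writing the two parts to two collections is left to your driver.

```python
def split_document(doc, display_fields):
    """Split doc into (core, display) parts linked by the same _id."""
    core = {k: v for k, v in doc.items() if k not in display_fields}
    display = {k: v for k, v in doc.items() if k in display_fields}
    display["_id"] = doc["_id"]  # lets the display part be fetched by key later
    return core, display


# Hypothetical document: "diagram" is only ever read by the client.
doc = {
    "_id": 42,
    "name": "run-7",
    "score": 0.93,
    "diagram": {"type": "line", "points": [[0, 1], [1, 3]]},
}
core, display = split_document(doc, {"diagram"})
```

Queries and sorts then touch only the small core documents, and the display part is loaded by `_id` when a client actually asks for it.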
I've noticed there is a Binary Data type in Mongo, and I am wondering if I should use this type for objects that are not binary. Can this give me what I'm looking for? Also, I would love to know what is the advantage in using this type inside my documents. Can it save me disk space?
You should use the type that is most appropriate for the data you are storing. Storing text data as binary will not gain you any obvious efficiencies in server storage. However, storing a complex object as a single value (for example, a JSON document serialized as a string) could save some serialization overhead if that object will only be interpreted by your client-side code. Binary data stored in MongoDB will be an opaque blob as far as indexing or querying is concerned, which sounds fine for your purposes.
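The "single serialized value" idea can be sketched with the standard `json` module, assuming the display object is JSON-compatible. The helper names and the `diagram` field are hypothetical, and whether this actually saves anything depends on your data, so treat it as something to measure rather than a guaranteed win.

```python
import json


def pack_field(doc, field):
    """Return a copy of doc with doc[field] replaced by a compact JSON string."""
    packed = dict(doc)
    packed[field] = json.dumps(doc[field], separators=(",", ":"))
    return packed


def unpack_field(doc, field):
    """Reverse pack_field: parse the JSON string back into an object."""
    unpacked = dict(doc)
    unpacked[field] = json.loads(doc[field])
    return unpacked


original = {"_id": 1, "name": "run-7", "diagram": {"type": "bar", "values": [3, 1, 4]}}
stored = pack_field(original, "diagram")
```

The server then sees `diagram` as one string value rather than a nested document, and the client calls `unpack_field` (or parses the string itself) before plotting.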

Mongodb to Mongodb GridFS

I'm new to MongoDB. I wanted to know: if I initially code my app using plain MongoDB and later want to switch to MongoDB GridFS, will switching (with a large, filled database) be possible?
So, if I am using MongoDB initially and, after the app has been running for some time, the documents exceed 16 MB in size, I guess I will have to switch to GridFS. I want to know how easy or difficult it will be to switch to GridFS, and whether that will be possible at all.
Thanks.
GridFS is used to store large files. Internally it divides the data into chunks (255 KB by default). Let me give you an example of saving a PDF file in MongoDB both ways. I am assuming the PDF is 10 MB in size, so that we can look at both the normal way and the GridFS way.
Normal Way:
Say you want to store it in the normal_book collection in the testDB database. The whole PDF is stored in a single document in this collection, and when you fetch it using db.normal_book.find(), the whole PDF is loaded into memory.
GridFS way:
In GridFS, we have two collections: one stores the data and the other stores its metadata. The data goes into the fs.chunks collection and the metadata into the fs.files collection. Now, the beauty of GridFS is that you can fetch the whole file at once or fetch its chunks individually.
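For the 10 MB PDF in this example, the number of fs.chunks documents can be worked out with a little arithmetic. The sketch below assumes the 255 KB default chunk size and takes 1 MB to mean 1024 * 1024 bytes; the function names are hypothetical.

```python
DEFAULT_CHUNK_SIZE = 255 * 1024  # bytes


def chunk_count(file_size, chunk_size=DEFAULT_CHUNK_SIZE):
    """Number of chunk documents needed for a file of file_size bytes."""
    return (file_size + chunk_size - 1) // chunk_size  # integer ceiling division


def last_chunk_size(file_size, chunk_size=DEFAULT_CHUNK_SIZE):
    """Size in bytes of the final (possibly partial) chunk; 0 for an empty file."""
    if file_size == 0:
        return 0
    remainder = file_size % chunk_size
    return remainder if remainder else chunk_size


pdf_size = 10 * 1024 * 1024
chunks_needed = chunk_count(pdf_size)
tail_bytes = last_chunk_size(pdf_size)
```

Under these assumptions the PDF needs 41 chunk documents: 40 full chunks plus a final one of 40960 bytes.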
Now coming to your question: there is no direct way or property to tell MongoDB that you now want to switch to GridFS. You need to reinsert the data into GridFS, using either the mongofiles command-line tool or one of MongoDB's drivers.

Using MongoDB for storing files of approx. 500KB

The GridFS FAQ says that GridFS is meant for files larger than 16MB. I have a lot of files of about 500KB.
Question is: which approach is more efficient - storing files' content inside document or storing file itself in GridFS? Should I consider other approaches?
As for efficiency, the two approaches are about the same. GridFS is implemented at the driver level by paging your >16MB data across multiple documents. MongoDB is unaware that you're storing a "file"; it just knows how to store documents and doesn't ask questions.
So, depending on your driver (PHP/NodeJS/Ruby), you may find some metadata features nice and opt to use GridFS because of that. Otherwise, if you are absolutely sure a document will not be larger than 16MB, storing the raw content in the document should be fairly simple and just as fast (or faster).
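The size rule above can be written down as a small helper. This is only a sketch: the function name is hypothetical, and the 64 KB allowance for the other fields in the document is an arbitrary assumption that you should adjust to your own schema.

```python
MAX_BSON_DOCUMENT_BYTES = 16 * 1024 * 1024  # 16 MB BSON document limit


def choose_storage(content_bytes, other_fields_bytes=64 * 1024):
    """Return "inline" if content plus an assumed allowance fits in one document.

    other_fields_bytes is a guessed margin for the remaining fields and
    BSON overhead, not a value taken from MongoDB itself.
    """
    if content_bytes + other_fields_bytes < MAX_BSON_DOCUMENT_BYTES:
        return "inline"
    return "gridfs"
```

For files of around 500KB, `choose_storage(500 * 1024)` returns "inline", which matches the advice above as long as the sizes really do stay far below the limit.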
Generally, I'd recommend against storing files in the database. It can have a negative impact on your working set and overall speed.

how does mongodb do a 42T drive per node

We had heard MongoDB had one client with 42T per node and I am wondering more about this. I know Cassandra has Bloom filters that let it skip hitting disk to find out which file a row might be in.
Does MongoDB have something similar to Bloom filters?
Is MongoDB using something similar to SSTables?
I did read that MongoDB does compaction just like Cassandra. I would think this would be an awfully long process with a 42T node?
I guess I don't know what terms to search for as I research mongodb here(in cassandra they are called SSTables).
thanks,
Dean
MongoDB does not support online compaction. In fact, data fragmentation is a current problem in systems with many document updates. To limit fragmentation, MongoDB tries to calculate an automatic padding factor, which minimizes the number of document moves.
The compact command blocks the entire database until it finishes. Besides, MongoDB does not support dictionary compression, so field names take up space in every object stored. I guess the layout used by MongoDB is not any fancy data structure: each record is simply composed of a header (offset, length...), the BSON data and padding.
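A rough sketch of the field-name overhead mentioned above, assuming the BSON encoding in which every element stores its key as UTF-8 text followed by a null byte. The function name and the sample documents are hypothetical, and only top-level keys are counted.

```python
def field_name_bytes(doc):
    """Bytes spent on top-level field names in one BSON document.

    Each key costs its UTF-8 length plus one terminating null byte.
    """
    return sum(len(key.encode("utf-8")) + 1 for key in doc)


# Hypothetical documents with the same values but different key lengths.
verbose = {"customer_identifier": 1, "transaction_timestamp": 2}
terse = {"cid": 1, "ts": 2}
saved_per_doc = field_name_bytes(verbose) - field_name_bytes(terse)
```

Multiplying `saved_per_doc` by the number of documents gives an estimate of what shorter keys would save before any storage-level compression is taken into account.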
Since MongoDB is not a key/value or columnar database, it doesn't use SSTables (an efficient data structure for a columnar layout). Its data files are allocated in regions called "extents".
AFAIK, MongoDB doesn't use bloom filters.