How to Query GridFS for particular text in MongoDB [duplicate]

I have a blogging system that stores uploaded files in GridFS. The problem is, I don't understand how to query it!
I am using Mongoose with Node.js, which doesn't yet support GridFS, so I am using the native mongodb module for the GridFS operations. There doesn't SEEM to be a way to query the files' metadata the way you query documents in a regular collection.
Would it be wise to store the metadata in a separate document pointing to the GridFS ObjectId, so it is easy to query?
Any help would be GREATLY appreciated, I'm kind of stuck :/

GridFS works by storing a number of chunks for each file. This way, you can deliver and store very large files without having to hold the entire file in RAM. It also lets you store files that are larger than the maximum document size. The default chunk size in older drivers was 256 kB; current drivers default to 255 kB.
The file metadata field can be used to store additional file-specific metadata, which can be more efficient than storing the metadata in a separate document. This greatly depends on your exact requirements, but the metadata field, in general, offers a lot of flexibility. Keep in mind that some of the more obvious metadata is already part of the fs.files document, by default:
> db.fs.files.findOne();
{
"_id" : ObjectId("4f9d4172b2ceac15506445e1"),
"filename" : "2e117dc7f5ba434c90be29c767426c29",
"length" : 486912,
"chunkSize" : 262144,
"uploadDate" : ISODate("2011-10-18T09:05:54.851Z"),
"md5" : "4f31970165766913fdece5417f7fa4a8",
"contentType" : "application/pdf"
}
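As a quick check of the chunk arithmetic, the length and chunkSize values in the sample document above imply two chunks, the second one only partly filled. A minimal sketch (chunkLayout is a hypothetical helper name, not part of any driver):

```javascript
// Work out how many chunks a file occupies and how large the final chunk is.
function chunkLayout(length, chunkSize) {
  const count = Math.ceil(length / chunkSize);
  // Every chunk except the last is full; the last one holds the remainder.
  const lastChunkSize = count === 0 ? 0 : length - (count - 1) * chunkSize;
  return { count, lastChunkSize };
}

// Values taken from the fs.files sample document.
console.log(chunkLayout(486912, 262144)); // { count: 2, lastChunkSize: 224768 }
```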
To actually read the file from GridFS you'll have to fetch the file document from fs.files and the chunks from fs.chunks. The most efficient way to do that is to stream this to the client chunk-by-chunk, so you don't have to load the entire file in RAM. The chunks collection has the following structure:
> db.fs.chunks.findOne({}, {"data" :0});
{
"_id" : ObjectId("4e9d4172b2ceac15506445e1"),
"files_id" : ObjectId("4f9d4172b2ceac15506445e1"),
"n" : 0, // this is the 0th chunk of the file
"data" : /* loads of data */
}
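The chunk-by-chunk delivery described above can be sketched in plain Node.js. This is only an in-memory illustration under stated assumptions: the chunk objects are hypothetical stand-ins for fs.chunks documents, and writeChunksInOrder is a made-up helper. In a real application you would more likely let the driver do this, for example with GridFSBucket's openDownloadStream in the Node.js driver.

```javascript
// Hand the chunks of one file to `write` in the order given by their `n` field,
// one chunk at a time, instead of building the whole file up front.
function writeChunksInOrder(chunks, write) {
  const ordered = [...chunks].sort((a, b) => a.n - b.n);
  for (const chunk of ordered) {
    write(chunk.data);
  }
  return ordered.length;
}

// Hypothetical chunk documents, deliberately out of order.
const demoChunks = [
  { files_id: "demo", n: 1, data: Buffer.from("world") },
  { files_id: "demo", n: 0, data: Buffer.from("hello ") },
];

const received = [];
writeChunksInOrder(demoChunks, (data) => received.push(data));
console.log(Buffer.concat(received).toString("utf8")); // hello world
```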
If you want to use the metadata field of fs.files for your queries, make sure you understand the dot notation, e.g.
> db.fs.files.find({"metadata.OwnerId": new ObjectId("..."),
"metadata.ImageWidth" : 280});
Also make sure your queries can use an index; explain() will tell you.
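Since the question is about Node.js, here is a minimal sketch of the same query through the native mongodb module. It assumes db is an already connected Db instance and that the default fs bucket is in use; findFilesByMetadata is a hypothetical helper name, and the metadata field names are simply the ones from the shell example above.

```javascript
// Query the GridFS files collection like any other collection.
// `db` is assumed to be a connected Db instance from the `mongodb` driver.
function findFilesByMetadata(db, ownerId, imageWidth) {
  const filter = {
    "metadata.OwnerId": ownerId,
    "metadata.ImageWidth": imageWidth,
  };
  // Returns a promise that resolves to the matching fs.files documents.
  return db.collection("fs.files").find(filter).toArray();
}

// Possible usage, inside an async function:
//   const files = await findFilesByMetadata(db, someOwnerId, 280);
//   files.forEach((file) => console.log(file._id, file.filename));
```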

As the specification says, you can store whatever you want in the metadata field.
Here's what a document from the files collection looks like:
Required fields
{
"_id" : <unspecified>, // unique ID for this file
"length" : data_number, // size of the file in bytes
"chunkSize" : data_number, // size of each of the chunks. Default is 256k
"uploadDate" : data_date, // date when object first stored
"md5" : data_string // result of running the "filemd5" command on this file's chunks
}
Optional fields
{
"filename" : data_string, // human name for the file
"contentType" : data_string, // valid mime type for the object
"aliases" : data_array of data_string, // optional array of alias strings
"metadata" : data_object, // anything the user wants to store
}
So store anything you want in the metadata and query it normally like you would in MongoDB:
db.fs.files.find({"metadata.some_info" : "sample"});

I know the question doesn't ask about the Java way of querying for metadata, but here it is, assuming you add gender as a metadata field:
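// Assumes the legacy MongoDB Java driver and an existing MongoClient named mongoClient
import com.mongodb.DB;
import com.mongodb.DBObject;
import com.mongodb.gridfs.GridFS;
import com.mongodb.gridfs.GridFSDBFile;
import com.mongodb.util.JSON;
import java.util.List;
// Get your database's GridFS (the constructor takes a DB, not a database name)
DB db = mongoClient.getDB("myDatabase");
GridFS gfs = new GridFS(db);
// Write out your JSON query within JSON.parse() and cast it as a DBObject.
// Dot notation matches on gender even if metadata holds other fields too.
DBObject dbObject = (DBObject) JSON.parse("{'metadata.gender': 'Male'}");
// Querying action (find)
List<GridFSDBFile> gridFSDBFiles = gfs.find(dbObject);
// Loop through the results
for (GridFSDBFile gridFSDBFile : gridFSDBFiles) {
    System.out.println(gridFSDBFile.getFilename());
}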

Metadata is stored in the metadata field. You can query it with dot notation, which matches even when metadata holds other keys as well:
db.fs.files.find({"metadata.content_type": "text/html"})
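The form of the filter matters here. In MongoDB, a filter such as {metadata: {...}} is an equality match on the whole embedded document (fields and their order), while a dotted key such as "metadata.content_type" constrains only that one field. The sketch below is a simplified in-memory model of that difference, not the server's matching code; getPath, matchesDotted, and matchesExact are hypothetical helper names.

```javascript
// Resolve a dotted path such as "metadata.content_type" against a document.
function getPath(doc, path) {
  return path
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), doc);
}

// Dot notation: only the named field has to match.
function matchesDotted(doc, path, expected) {
  return getPath(doc, path) === expected;
}

// Embedded-document filter: the whole subdocument must be equal,
// including which fields it has and the order they appear in.
function matchesExact(doc, field, expectedSubdoc) {
  return JSON.stringify(doc[field]) === JSON.stringify(expectedSubdoc);
}

// Hypothetical fs.files document with two metadata keys.
const file = {
  filename: "index.html",
  metadata: { content_type: "text/html", owner: "alice" },
};

console.log(matchesDotted(file, "metadata.content_type", "text/html")); // true
console.log(matchesExact(file, "metadata", { content_type: "text/html" })); // false
```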

Related

MongoDB / mongoose split large documents

We are extending an existing Node + MongoDB app. We need to add what could be large documents, but we do not currently know how big they could get.
MongoDB has a default limit of 16 MB per document. I am aware we can increase this but would rather not.
Has anyone ever seen an automatic document-splitting module? Something to split documents into partials automatically once they exceed a certain size?
If you have large CSV data to be stored in MongoDB, then there are two approaches which will both work well in different ways:
1: Save in MongoDB format
This means that you have your application read the csv, and write it to a MongoDB collection one row at a time. So each row is saved as a separate document, perhaps something like this:
{
"filename" : "restaurants.csv",
"version" : "2",
"uploadDate" : ISODate("2017-06-15"),
"name" : "Ace Cafe",
"cuisine" : "British",
etc
},
{
"filename" : "restaurants.csv",
"version" : "2",
"uploadDate" : ISODate("2017-06-15"),
"name" : "Bengal Tiger",
"cuisine" : "Bangladeshi",
etc
}
This takes work on your application's part, to render the data into this format and to decide how and where to save the metadata.
You can index and query the data, field by field and row by row.
You have no worries about any single document getting too large.
2: Save in CSV format using GridFS
This means that your file is uploaded as an unanalysed blob and automatically divided into chunks (255 kB by default), each saved as a separate MongoDB document, so the 16 MB document limit is never hit.
This is easy to do, and does not disturb your original CSV structure.
However, the data is opaque to MongoDB: you cannot query it or read it row by row.
To work with the data, your application will have to read the file back from MongoDB and parse it itself.
Hopefully one of these approaches will suit your needs.
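Approach 1 above could be sketched roughly as follows. This is a deliberately naive illustration: csvToDocuments is a hypothetical helper, it assumes a header row and no quoted or embedded commas, and a real application would probably use a proper CSV parser. The resulting array could then be passed to the driver's insertMany.

```javascript
// Turn CSV text into one document per row, copying the file-level
// information (filename, version, ...) onto every row document.
function csvToDocuments(csvText, fileInfo) {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) {
    return [];
  }
  const headers = lines[0].split(",").map((header) => header.trim());
  return lines.slice(1).map((line) => {
    const values = line.split(",").map((value) => value.trim());
    const row = {};
    headers.forEach((header, index) => {
      row[header] = values[index];
    });
    return { ...fileInfo, ...row };
  });
}

const docs = csvToDocuments(
  "name,cuisine\nAce Cafe,British\nBengal Tiger,Bangladeshi\n",
  { filename: "restaurants.csv", version: "2" }
);
console.log(docs.length); // 2
console.log(docs[0].name); // Ace Cafe
```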

MongoDB check the time of document log

I have a collection which stores an array of strings as part of each document, plus an _id. Is there any way I can check the timestamp at which a given document was logged?
The document structure is:
{ "_id" : NumberLong(1370891970), "k" : [ "argos","test"]}
Appreciate your help in advance.
-V
If this is your document structure, then there is no way to check it. None of your fields contains this information, and because you overwrite the _id field with your own value, you also lose the creation timestamp that a default ObjectId would carry.
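For documents that keep the default ObjectId as _id, the creation time can be recovered, because the first four bytes of an ObjectId are a Unix timestamp in seconds. A minimal sketch (objectIdToDate is a hypothetical helper and the id is just an example value; the shell and the drivers expose the same thing as getTimestamp() on an ObjectId):

```javascript
// Decode the creation time embedded in a 24-character ObjectId hex string.
function objectIdToDate(hexId) {
  if (!/^[0-9a-f]{24}$/i.test(hexId)) {
    throw new TypeError("expected a 24-character hex string");
  }
  // The first 8 hex digits (4 bytes) are seconds since the Unix epoch.
  const seconds = parseInt(hexId.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

const created = objectIdToDate("4f9d4172b2ceac15506445e1");
console.log(created.getTime() / 1000); // 1335705970
```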

Is MongoDB GridFS a good solution for storing a large number of files (pdf, xls, doc, dwg etc)?

Currently, we store some files for our ERP system in the file system, but it's cumbersome to build a folder structure and query them. There are tens of thousands of files. All ERP modules use MySQL.
I'm hoping that a 'bucket' type of storage with some metadata, such as GridFS, would make things easier. For example:
{"module" : "quotation", "id" : 57894, "file" : "acme_inc_rfq.pdf"}
{"module" : "quotation", "id" : 57894, "file" : "machine_dwg.dwg"}
{"module" : "quotation", "id" : 57894, "file" : "data_sheet.xls"}
{"module" : "po", "id" : 74896, "file" : "our_rfq.xls"}
So I can query module=quotation where id=57894, get a list of these 3 files, and display links to them and perform other operations on them.
Thanks.
GridFS is perfectly suitable for storing large files. MongoDB automatically generates ObjectId values for new documents, including GridFS files, so each file gets a unique identifier. Any sort of virtual folder structure, as presented by your application, would have to be handled in a separate collection.
See MongoDB GridFS for more information.
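As an alternative sketch for the lookup described in the question: if the module name and record id were stored under each file's metadata at upload time (an assumption, the field names below are hypothetical), the three quotation files could be listed with one query. Here bucket is assumed to be a GridFSBucket from the Node.js mongodb driver, and listModuleFiles is a made-up helper name.

```javascript
// List the GridFS file documents attached to one ERP record.
// `bucket` is assumed to be a GridFSBucket, e.g. `new GridFSBucket(db)`.
function listModuleFiles(bucket, moduleName, recordId) {
  const filter = {
    "metadata.module": moduleName, // hypothetical metadata field
    "metadata.id": recordId, // hypothetical metadata field
  };
  // Returns a promise that resolves to the matching file documents.
  return bucket.find(filter).toArray();
}

// Possible usage, inside an async function:
//   const files = await listModuleFiles(bucket, "quotation", 57894);
//   files.forEach((file) => console.log(file.filename));
```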

MongoDB - DBRef to a DBObject

Using Java ... not that it matters.
Having a problem and maybe it is just a design issue.
I assign "_id" field to all of my documents, even embedded ones.
I have a parent document ( and the collection for those ) which has an embedded document
So I have something like:
{ "_id" : "49902cde5162504500b45c2c" ,
"name" : "MongoDB" ,
"type" : "database" ,
"count" : 1 ,
"info" : { "_id" : "49902cde5162504500b45c2y",
"x" : 203 ,
"y" : 102
}
}
Now I want another document to reference my "info" via a DBRef; I don't want a copy. So I create a DBRef which points to the collection of the parent document and specifies the _id as xxxx5c2y. However, calling fetch() on the DBRef returns null.
Does that mean that DBRef and fetch() only work on the top-level "_id" fields of a collection's documents?
I would have expected fetch() to consider all keys and values within the braces of the document, but maybe that is asking too much. Does anyone know? Is there no way to create cross-document references except at the top level?
Thanks
Yes, your DBRef _id references need to be to documents in your collection, not to embedded documents.
If you want to find the embedded document you'll need to do a query on info._id and you'll need to add an index on that too (for performance) OR you'll need to store that embedded document in a collection and treat the embedded one as a copy. Copying is OK in MongoDB ... 'one fact one place' doesn't apply here ... provided you have some way to update the copy when the main one changes (eventual consistency).
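The first option, querying on info._id with a supporting index, might look like this with the Node.js mongodb driver (the question uses Java, so treat this purely as an illustration of the query shape). It assumes collection is a Collection instance for the parent documents; both function names are hypothetical.

```javascript
// Find the parent document that contains the embedded info with this _id.
function findByEmbeddedInfoId(collection, infoId) {
  return collection.findOne({ "info._id": infoId });
}

// Create the index that keeps the lookup above from scanning the collection.
function ensureInfoIdIndex(collection) {
  return collection.createIndex({ "info._id": 1 });
}

// Possible usage, inside an async function:
//   await ensureInfoIdIndex(collection);
//   const parent = await findByEmbeddedInfoId(collection, "49902cde5162504500b45c2y");
//   const info = parent ? parent.info : null;
```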
BTW, on DBRefs, the official guidance says "Most developers only use DBRefs if the collection can change from one document to the next. If your referenced collection will always be the same, the manual references outlined above are more efficient."
Also, why do you want to reference info within a document? If it was an array I could understand why you might want to refer to individual entries but since it doesn't appear to be an array in your example, why not just refer to the containing document by its _id?