In my Spring Boot application I use GridFS to store large files in my database. To find certain files, I use normal queries on the files collection:
GridFSFile file = gridFsTemplate.findOne(Query.query(Criteria.where(ID).is(id)));
but with this approach I'm getting the entire file.
My question is: how can I run queries without loading the whole file into memory?
My stored files are books (in PDF format), and suppose I want to get the content of a certain page without loading the entire book into memory.
I'm guessing I'll have to use the chunks collection and perform some operations on the chunks, but I cannot find out how to do that.
GridFS is described here. Drivers do not provide a standardized API for retrieving parts of the file, but you can read that spec and construct your own queries that would retrieve portions of the written chunks.
Your particular driver may provide partial file retrieval functionality; consult its docs for that.
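For illustration, here is a rough sketch of that manual approach with the plain MongoDB Java sync driver. It assumes the default fs bucket and the files_id / n / data fields from the spec, and pulls only the chunks that cover a given byte range:

    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Sorts;
    import org.bson.Document;
    import org.bson.types.Binary;
    import org.bson.types.ObjectId;
    import java.io.ByteArrayOutputStream;

    public class GridFsRangeRead {

        // Reads [offset, offset + length) of a stored file without fetching all of its chunks.
        // Assumes the requested range lies within the file.
        static byte[] readRange(MongoDatabase db, ObjectId fileId, long offset, int length) {
            Document fileDoc = db.getCollection("fs.files")
                    .find(Filters.eq("_id", fileId)).first();
            int chunkSize = fileDoc.getInteger("chunkSize");

            int firstChunk = (int) (offset / chunkSize);
            int lastChunk = (int) ((offset + length - 1) / chunkSize);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (Document chunk : db.getCollection("fs.chunks")
                    .find(Filters.and(
                            Filters.eq("files_id", fileId),
                            Filters.gte("n", firstChunk),
                            Filters.lte("n", lastChunk)))
                    .sort(Sorts.ascending("n"))) {
                byte[] data = chunk.get("data", Binary.class).getData();
                out.write(data, 0, data.length);
            }

            // Trim the covering chunks down to the exact byte range that was requested.
            byte[] covering = out.toByteArray();
            int start = (int) (offset - (long) firstChunk * chunkSize);
            byte[] result = new byte[Math.min(length, covering.length - start)];
            System.arraycopy(covering, start, result, 0, result.length);
            return result;
        }
    }

Keep in mind that GridFS only understands byte ranges; mapping a PDF page to a byte range still needs PDF-level logic on top of this.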
I know how to do it, but I wonder whether it's efficient. As far as I know, MongoDB clusters very efficiently, and I can flexibly control the collections and the servers they reside on. The only concern is the size of the files and the speed of accessing them through MongoDB.
Should I explore something like Apache Hadoop, or will I get similar access speeds if I cluster MongoDB intelligently?
GridFS is provided for convenience; it is not designed to be the ultimate binary blob storage platform.
MongoDB imposes a limit of 16 MB on each document it stores. This is unlike, for example, many relational databases which permit much larger values to be stored.
Since many applications deal with large binary blobs, MongoDB's solution to this problem is GridFS, which roughly works like this:
For each blob to be inserted, a metadata document is inserted into the metadata collection.
Then the actual blob is split into chunks (255 kB each by default, and in any case below the 16 MB document limit) and uploaded as a sequence of documents into the blob collection.
MongoDB drivers provide helpers for writing and reading the blobs and the metadata.
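As an illustration of those helpers, a minimal sketch with the MongoDB Java sync driver (connection string and file names are placeholders):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.gridfs.GridFSBucket;
    import com.mongodb.client.gridfs.GridFSBuckets;
    import org.bson.types.ObjectId;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    public class GridFsHelperExample {
        public static void main(String[] args) throws Exception {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoDatabase db = client.getDatabase("media");
                GridFSBucket bucket = GridFSBuckets.create(db); // uses fs.files / fs.chunks

                // Upload: the driver writes the metadata document and the chunk documents.
                ObjectId id;
                try (FileInputStream in = new FileInputStream("blob.bin")) {
                    id = bucket.uploadFromStream("blob.bin", in);
                }

                // Download: the driver reads the chunks back in order and reassembles the blob.
                try (FileOutputStream out = new FileOutputStream("copy.bin")) {
                    bucket.downloadToStream(id, out);
                }
            }
        }
    }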
Thus, at first glance, the problem is solved: the application can store arbitrarily large blobs in a straightforward manner. However, digging deeper, GridFS has the following issues/limitations:
On the server side, documents storing blob chunks aren't stored separately from other documents. As such, they compete for cache space with the actual content documents. A database which has both content documents and blobs is likely to perform worse than a database that has only content documents.
At the same time, since the blob chunks are stored in the same way as content documents, storing them is generally expensive. For example, S3 is much cheaper than EBS storage, and GridFS would put all data on EBS.
To my knowledge there is no support for parallel writes or parallel reads of the blobs (writing/reading several chunks of the same blob at a time). This can in principle be implemented, either in MongoDB drivers or in an application, but as far as I know this isn't provided out of the box by any driver. This limits I/O performance when the blobs are large.
Similarly, if a read or write fails, the entire blob must be re-read or re-written as opposed to just the missing fragment.
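As noted above, parallel chunk reads can in principle be implemented at the application level. A rough sketch (assuming the standard fs.chunks layout and a known chunk count) might fan the chunk fetches out over a thread pool:

    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;
    import org.bson.types.Binary;
    import org.bson.types.ObjectId;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelChunkReader {

        // Fetches all chunks of one blob concurrently; each task loads one chunk by (files_id, n).
        static byte[][] readChunksInParallel(MongoCollection<Document> chunks, ObjectId fileId,
                                             int chunkCount, int threads)
                throws InterruptedException, ExecutionException {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                List<Future<byte[]>> futures = new ArrayList<>();
                for (int n = 0; n < chunkCount; n++) {
                    final int chunkIndex = n;
                    futures.add(pool.submit(() -> {
                        Document c = chunks.find(Filters.and(
                                Filters.eq("files_id", fileId),
                                Filters.eq("n", chunkIndex))).first();
                        return c.get("data", Binary.class).getData();
                    }));
                }

                byte[][] result = new byte[chunkCount][];
                for (int n = 0; n < chunkCount; n++) {
                    result[n] = futures.get(n).get(); // preserves the original chunk order
                }
                return result;
            } finally {
                pool.shutdown();
            }
        }
    }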
Despite these issues, GridFS may be a fine solution for many use cases:
If the overall data size isn't very large, the negative cache effects are limited.
If most of the blobs fit in a single document, their storage should be quite efficient.
The blobs are backed up and otherwise transferred together with the content documents in the database, improving data consistency and reducing the risk of data loss/inconsistencies.
Good practice is to upload the image somewhere (your own server or a cloud service) and then store only the image URL in MongoDB.
Anyway, I did a little investigating. The short conclusion: if you need to store user avatars, you can use MongoDB, but only if it's a single avatar (you shouldn't store many blobs inside MongoDB); if you need to store videos or just many large files, you need something like CephFS.
Why do I think so? When I was testing MongoDB with media files on a slow instance, files weighing up to 10 MB (usually about 1 MB) were coming back in up to 3000 milliseconds. That's an unacceptably long time, and when there were a lot of files (100+), it could turn into a real pain.
Ceph is designed specifically for storing files, up to petabytes of information. That's what's needed here.
How do you implement this in a real project? If you use an OOP layer over MongoDB (Mongoose), you can just add methods to the database objects that access Ceph and do what you need: "load file", "delete file", "count files", and so on, and then use it all together as usual. Don't forget to maintain Ceph and add servers as needed, and everything will work perfectly. The files themselves should be accessed only through your web server, not directly; that is, the web server should send a request to Ceph when the user needs a file and return Ceph's response to the user.
I hope I helped more than just myself. I'll go add Ceph to my tags. Good luck!
GridFS
Ceph File System
More Ceph
For a large dataset with images and videos, I would like to use Apache Xindice. There are very few tutorials and guides on the web for Apache Xindice. How do I store image and video files in Apache Xindice? Is Apache Xindice suitable for storing a large set of data? Is there any current repository that can store a large data set in XML format (not an SQL-type database; it should handle TB-sized data)? Can I use MongoDB for storing a large dataset?
I suggest storing external documents (images/videos, XML files) in MongoDB, using the GridFS file system. GridFS consists of two collections: the chunks collection, where the binary data is stored, and the files collection, which holds information about the files, including custom user-defined metadata. From the FAQ:
In some situations, storing large files may be more efficient in a MongoDB database than on a system-level filesystem.
If your filesystem limits the number of files in a directory, you can use GridFS to store as many files as needed.
When you want to keep your files and metadata automatically synced and deployed across a number of systems and facilities: when using geographically distributed replica sets, MongoDB can distribute files and their metadata automatically to a number of mongod instances and facilities.
When you want to access information from portions of large files without having to load whole files into memory, you can use GridFS to recall sections of files without reading the entire file into memory.
For large data sets, GridFS can be sharded (see http://docs.mongodb.org/manual/core/sharded-cluster-internals/#sharding-gridfs-stores).
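For reference, the sharding setup described on that page boils down to sharding the chunks collection on { files_id: 1, n: 1 }. Issued from Java against a mongos it might look roughly like this (database name is a placeholder):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import org.bson.Document;

    public class ShardGridFsChunks {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
                // Enable sharding for the database, then shard the GridFS chunks collection
                // on the compound key { files_id: 1, n: 1 }.
                client.getDatabase("admin")
                        .runCommand(new Document("enableSharding", "media"));
                client.getDatabase("admin")
                        .runCommand(new Document("shardCollection", "media.fs.chunks")
                                .append("key", new Document("files_id", 1).append("n", 1)));
            }
        }
    }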
For fast delivery of GridFS data, there are modules for nginx (nginx-gridfs) and Apache (mod_gridfs). See also http://nosql.mypopescu.com/post/28085493064/mongodb-gridfs-over-http-with-mod-gridfs for a quick comparison.
At the moment, we store a huge amount of logs (30 GB/day x 3 machines = ~100 GB on average) on a filer. The logs are zipped.
The current tool for searching those logs finds the relevant log files (according to a time range), copies them locally, unzips them, then searches the XML for information and displays it.
We are studying the possibility of building a Splunk-like tool to search those logs (they are the output of the message bus: XML messages sent to other systems).
What is the advantage of relying on a Mongo-like DB instead of querying the zipped log files directly?
We could also index some data in a DB and let the program search only the targeted zip files...
What more does MongoDB, or Hadoop, bring?
I have worked with MongoDB and am currently working with Hadoop, so I can list some differences you might find interesting.
MongoDB will need you to store your files as documents (instead of raw text data). HDFS can store them as plain files and lets you use custom MapReduce programs to process them.
MongoDB will require you to choose a good shard key in order to distribute the load efficiently across the cluster. Since you are storing log files, that might be difficult.
If you can store the logs formatted as documents in MongoDB, it will allow you to query the data with very low latency across huge amounts of logs. My last project had built-in logging based on MongoDB, and analysis was extremely fast compared to MapReduce analysis of raw text logs. But the logging has to be designed that way from the ground up.
In Hadoop you have technologies like Hive, HBase and Impala which will help you analyze text-format logs, but the latency of MapReduce needs to be kept in mind (there are ways to optimize the latency, though).
To summarize: if you can implement MongoDB-based logging across the entire stack, go for MongoDB; if you already have text-format logs, go for Hadoop. If you can convert your XML data into MongoDB documents in real time, you can get a very efficient solution.
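To make the last point concrete, a rough sketch of converting one XML bus message into a MongoDB document at write time (the field and attribute names are assumptions, not part of the original setup):

    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import org.w3c.dom.Element;
    import javax.xml.parsers.DocumentBuilderFactory;
    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Date;

    public class XmlLogIngest {

        // Parses one XML message from the bus and stores it as a structured, queryable document.
        static void ingest(MongoCollection<Document> logs, String xmlMessage) throws Exception {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlMessage.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();

            logs.insertOne(new Document("timestamp", new Date())
                    .append("system", root.getAttribute("system"))    // assumed attribute name
                    .append("messageType", root.getAttribute("type")) // assumed attribute name
                    .append("raw", xmlMessage));                      // keep the original if needed
        }
    }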
My knowledge of Hadoop is limited, so I will focus on MongoDB.
You could store each log entry in MongoDB. When you create an index on the time field, you can easily get a specific time range. MongoDB will have support for full-text search in version 2.4, which would certainly be an interesting feature for your use case, but it isn't production-ready yet. Until then, searching for substrings is a very slow operation, so you would have to convert the XML trees that are relevant for your searches into MongoDB documents and create indexes on the most-searched fields.
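For the time-range part, a minimal sketch with the Java driver (collection and field names are assumptions):

    import com.mongodb.client.FindIterable;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Indexes;
    import com.mongodb.client.model.Sorts;
    import org.bson.Document;
    import java.util.Date;

    public class LogTimeRangeQuery {

        // With an index on the time field, a range query only touches the matching entries.
        static FindIterable<Document> entriesBetween(MongoCollection<Document> logs,
                                                     Date from, Date to) {
            logs.createIndex(Indexes.ascending("timestamp")); // no-op if it already exists
            return logs.find(Filters.and(
                            Filters.gte("timestamp", from),
                            Filters.lt("timestamp", to)))
                    .sort(Sorts.ascending("timestamp"));
        }
    }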
But you should be aware that storing your logs in MongoDB means you will need a lot more hard drive space. MongoDB does not compress the payload data and also adds some metadata overhead of its own, so it will require even more disk space than the unzipped logs. And if you use the new text search feature, it will take even more disk space. In a presentation I saw, the text index was twice as large as the data it was indexing. Sure, this feature is still work in progress, but I wouldn't bet on that ratio shrinking much in the final version.
I'm working on a video server, and I want to use a database to keep video files.
Since I only need to store simple video files with metadata I tried to use MongoDB in Java, via its GridFS mechanism to store the video files and their metadata.
However, there are two major features I need, and that I couldn't manage using MongoDB:
I want to be able to add to a previously saved video, since saving a video might be performed in chunks. I don't want to delete the binary I have so far, just append bytes to the end of an item.
I want to be able to read from a video item while it is being written. "Thread A" will update the video item, adding more and more bytes, while "Thread B" will read from the item, receiving all the bytes written by "Thread A" as soon as they are written/flushed.
I tried writing the straightforward code to do that, but it failed. It seems MongoDB doesn't allow multi-threaded access to the binary (even if only one thread is doing all the writing), nor could I find a way to append to a binary file: the Java GridFS API only gives an InputStream from an already existing GridFSDBFile; I cannot get an OutputStream to write to it.
Is this possible via MongoDB, and if so how?
If not, do you know of any other DB that might allow this (preferably nothing too complex such as a full relational DB)?
Would I be better off using MongoDB to keep only the metadata of the video files, and manually handle reading and writing the binary data from the filesystem, so I can implement the above requirements on my own?
Thanks,
Al
I've used Mongo GridFS for storing media files in a messaging system we built on Mongo, so I can share what we ran into.
Before I get into this: for your use-case scenario I would recommend not using GridFS, and instead using something like Amazon S3 (with excellent REST APIs for multipart uploads) while storing only the metadata in Mongo. This is the approach we settled on in our project after first implementing it with GridFS. It's not that GridFS isn't great; it's just not well suited for chunking, appending, and rewriting small portions of files. For more info, here's a quick rundown on what GridFS is and isn't good for:
http://www.mongodb.org/display/DOCS/When+to+use+GridFS
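Roughly, the pattern we settled on looks like this sketch: the binary goes to S3 (the SDK's TransferManager handles multipart uploads), and only a small metadata document goes to Mongo. Bucket, key, and collection names here are placeholders:

    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import java.io.File;
    import java.util.Date;

    public class S3MediaStore {

        // Uploads the binary to S3 (multipart for large files) and keeps only metadata in Mongo.
        static void storeMedia(TransferManager tm, MongoCollection<Document> mediaMeta,
                               File media, String s3Key) throws InterruptedException {
            tm.upload("my-media-bucket", s3Key, media).waitForCompletion();

            mediaMeta.insertOne(new Document("s3Key", s3Key)
                    .append("filename", media.getName())
                    .append("length", media.length())
                    .append("uploadDate", new Date()));
        }

        public static void main(String[] args) {
            TransferManager tm = TransferManagerBuilder.standard().build();
            // ... obtain the Mongo collection and call storeMedia(...), then:
            tm.shutdownNow();
        }
    }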
Now, if you are set on using GridFS, you need to understand how the driver and read/write concurrency work.
In Mongo (2.2) there is one write lock per database. This means that while you are writing, other threads are essentially locked out of operating on that database. In real-life usage this is still very fast, because the lock yields after each chunk (256 kB) is written, so your reader thread can get some data back. Please look at this concurrency video/presentation for more details:
http://www.10gen.com/presentations/concurrency-internals-mongodb-2-2
So if you look at my two links, question 2 is essentially answered. You should also understand a little about how Mongo writes large data sets and how yielding on page faults gives reader threads a chance to get information.
Now let's tackle your first question. The Mongo driver does not provide a way to append data to a GridFS file; it is meant to be a fire-and-forget, write-once kind of operation. However, if you understand how the data is stored in chunks and how the checksum is calculated, you can do it manually by working with the fs.files and fs.chunks collections, as this poster describes here:
Append data to existing gridfs file
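Sketching the manual approach from that thread (field names per the GridFS spec): you insert a new document into fs.chunks with the next n and bump length in fs.files. This bypasses the driver, so treat it as a hack. It assumes the current length is an exact multiple of chunkSize (otherwise the last chunk would have to be rewritten, since only the final chunk may be shorter than chunkSize), and the stored md5 no longer matches the new content, so you have to recompute or drop it.

    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Updates;
    import org.bson.Document;
    import org.bson.types.Binary;
    import org.bson.types.ObjectId;

    public class GridFsManualAppend {

        // Appends one chunk (data must be at most chunkSize bytes) to an existing GridFS file
        // by writing fs.chunks and fs.files directly.
        static void appendChunk(MongoDatabase db, ObjectId fileId, byte[] data) {
            MongoCollection<Document> files = db.getCollection("fs.files");
            MongoCollection<Document> chunks = db.getCollection("fs.chunks");

            Document fileDoc = files.find(Filters.eq("_id", fileId)).first();
            long length = ((Number) fileDoc.get("length")).longValue();
            int chunkSize = fileDoc.getInteger("chunkSize");
            int nextN = (int) (length / chunkSize); // index of the chunk being appended

            chunks.insertOne(new Document("files_id", fileId)
                    .append("n", nextN)
                    .append("data", new Binary(data)));

            // The stored md5 no longer matches the new content; drop it (or recompute it yourself).
            files.updateOne(Filters.eq("_id", fileId),
                    Updates.combine(Updates.inc("length", (long) data.length),
                            Updates.unset("md5")));
        }
    }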
So, going through those, you can see that it is possible to do what you want, but my general recommendation is to use a service (such as Amazon S3) that is designed for this type of interaction instead of doing extra work to make Mongo fit your needs. Of course you can also go to the filesystem directly, which would be the poor man's choice, but you lose the redundancy, sharding, replication, etc. that you get with GridFS or S3.
Hope that helps.
-Prasith
I need a suggestion on how to work with a large amount of data on the iPhone. Let's say I have an XML file with ~120k text records, and I need to perform searches on this data. The solution I have tried is to use Core Data to store the information in sorted order in caches, and then use binary search, which works fast. But the problem is building these caches: on first launch the application takes about 15-25 seconds to build them. Maybe I need a different approach to searching the data?
Thanks in advance.
If you're using an XML file with the requirement that you can't cache, then you're not going to succeed unless you somehow carefully format your XML file to have useful data traversal properties -- but then you may as well use a binary file that's more useful unless you have some very esoteric requirements.
Really what you want is one of the typical indexing algorithms (on disk hash, B-tree, etc) from the get-go.
However...
If you have to read in and parse your XML text file, then you can skip the typical big, slow, generic XML parser and write a fast, hackish version, since most of the data records you'll need to recognize are probably formatted the same way over and over. Nothing special: just find where the relevant data field starts, grab the data until it ends, and move on to the next field.
Honestly, 120k text records isn't very much; it sounds like whatever XML parser you're using is just slow. (I use this trick all the time for autogenerated XML data that just represents things like tables or simple data records; my own parser is faster than any generic XML parser.)
This is probably the solution you actually want, since you sound fairly attached to the XML file format. It won't be as error-proof as a generic XML parser if you're not careful; however, it will eat that file up like nobody's business. And it's entry-level CS work: read in a file with a specific, known format and grab the data values from it. Regexps are your friend if you have access to them.
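If the records really are that uniform, the "hackish" extraction can be as simple as a line-by-line regex scan. A sketch (written in Java for brevity, with a made-up <record> format standing in for your real one):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class QuickRecordScan {

        // Assumes every record sits on one line like: <record id="42" name="...">text</record>
        private static final Pattern RECORD =
                Pattern.compile("<record id=\"(\\d+)\" name=\"([^\"]*)\">([^<]*)</record>");

        static List<String[]> scan(String path) throws IOException {
            List<String[]> records = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    Matcher m = RECORD.matcher(line);
                    if (m.find()) {
                        records.add(new String[] { m.group(1), m.group(2), m.group(3) });
                    }
                }
            }
            return records;
        }
    }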
Try storing the data and doing your searches in the cloud (using a database stored on a server somewhere).
That is, unless you specifically need ALL of the information on the device.