Saving image in database - database-performance

Is it good to save an image in the database as a BLOB, or to save only the path and copy the image to a specific directory?
Which way is best (i.e. gives good performance for both the database and the application), and why?

What are your requirements?
In the vast majority of cases saving the path will be better, simply because of the sheer size of the files compared to the rest of the data (including the images can bloat the DB by gigabytes). Consider adding a level of indirection: e.g. save the path as a file name plus a reference to a storage resource (e.g. a storage_id referencing a row in a storages table), with the base path attached to the 'storage'. This way you can easily move files (copy all the files, then update the storage path, rather than updating a million individual paths).
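As a rough sketch of that indirection (the table and column names here are made up for illustration), the path lookup and the bulk move could look like this:

```python
import sqlite3

# Illustrative schema: 'storages' holds the base path, 'images' holds a relative
# file name plus a storage_id, so relocating files means updating a single row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE storages (
        storage_id INTEGER PRIMARY KEY,
        base_path  TEXT NOT NULL
    );
    CREATE TABLE images (
        image_id   INTEGER PRIMARY KEY,
        storage_id INTEGER NOT NULL REFERENCES storages(storage_id),
        file_name  TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO storages (storage_id, base_path) VALUES (1, '/mnt/imgstore1')")
conn.execute("INSERT INTO images (storage_id, file_name) VALUES (1, 'cats/1234.jpg')")

# Resolve the full path by joining the two tables.
full_path = conn.execute("""
    SELECT s.base_path || '/' || i.file_name
    FROM images i JOIN storages s USING (storage_id)
    WHERE i.image_id = 1
""").fetchone()[0]
print(full_path)  # /mnt/imgstore1/cats/1234.jpg

# After copying the files to a new volume, a single UPDATE 'moves' every image.
conn.execute("UPDATE storages SET base_path = '/mnt/imgstore2' WHERE storage_id = 1")
```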
However, if your requirements include consistent backup/restore and/or disaster recoverability, it is often better to store the images in the DB. It is not easier, nor more convenient, but it is simply going to be required. Each DB has its own way of dealing with this problem; e.g. in SQL Server you would use the FILESTREAM type, which allows remote access via the file access API. See FILESTREAM MVC: Download and Upload images from SQL Server for an example.
Also, a somewhat dated but nonetheless interesting paper on the topic: To BLOB or Not to BLOB.

Related

Is it efficient to store images inside MongoDB using GridFS?

I know how to do it, but I wonder whether it is efficient. As far as I know, MongoDB clusters are very efficient and I can flexibly control the collections and the servers they reside on. My only concern is the size of the files and the speed of accessing them through MongoDB.
Should I explore something like Apache Hadoop, or will I get similar access speeds if I cluster MongoDB intelligently?
GridFS is provided for convenience; it is not designed to be the ultimate binary blob storage platform.
MongoDB imposes a limit of 16 MB on each document it stores. This is unlike, for example, many relational databases which permit much larger values to be stored.
Since many applications deal with large binary blobs, MongoDB's solution to this problem is GridFS, which roughly works like this:
For each blob to be inserted, a metadata document is inserted into the metadata collection.
Then the actual blob is split into chunks (255 kB each by default, so every chunk fits comfortably within the document size limit) and uploaded as a sequence of documents into the chunks collection.
MongoDB drivers provide helpers for writing and reading the blobs and the metadata.
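For reference, here is a minimal sketch of that flow using the pymongo driver (the database name and file name are placeholders):

```python
import gridfs
from pymongo import MongoClient

db = MongoClient()["media_db"]   # placeholder database name
fs = gridfs.GridFS(db)           # backed by the fs.files / fs.chunks collections

# Writing: the driver creates the metadata document and splits the blob into chunks.
with open("photo.tif", "rb") as f:
    file_id = fs.put(f, filename="photo.tif")

# Reading: the driver reassembles the chunks transparently.
data = fs.get(file_id).read()
print(len(data), "bytes read back")
```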
Thus, at first glance, the problem is solved: the application can store arbitrarily large blobs in a straightforward manner. However, digging deeper, GridFS has the following issues/limitations:
On the server side, documents storing blob chunks aren't stored separately from other documents. As such they compete for cache space with the actual documents. A database which has both content documents and blobs is likely to perform worse than a database that has only content documents.
At the same time, since the blob chunks are stored in the same way as content documents, storing them is generally expensive. For example, S3 is much cheaper than EBS storage, and GridFS would put all data on EBS.
To my knowledge there is no support for parallel writes or parallel reads of the blobs (writing/reading several chunks of the same blob at a time). This can in principle be implemented, either in MongoDB drivers or in an application, but as far as I know this isn't provided out of the box by any driver. This limits I/O performance when the blobs are large.
Similarly, if a read or write fails, the entire blob must be re-read or re-written as opposed to just the missing fragment.
Despite these issues, GridFS may be a fine solution for many use cases:
If the overall data size isn't very large, the negative cache effects are limited.
If most of the blobs fit in a single document, their storage should be quite efficient.
The blobs are backed up and otherwise transferred together with the content documents in the database, improving data consistency and reducing the risk of data loss/inconsistencies.
Good practice is to upload the image somewhere (your server or a cloud storage service) and then store only the image URL in MongoDB.
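A minimal sketch of that pattern with pymongo; the upload step is whatever your server or cloud storage provides, and the URL and collection names below are placeholders:

```python
from pymongo import MongoClient

users = MongoClient()["app_db"]["users"]   # placeholder database/collection

# Upload the file to your server or cloud storage first (not shown here),
# then keep only the resulting URL in the document.
avatar_url = "https://cdn.example.com/avatars/42.jpg"   # placeholder URL
users.update_one({"_id": 42}, {"$set": {"avatar_url": avatar_url}}, upsert=True)
```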
Anyway, I did a little investigating. The short conclusion is: if you need to store user avatars, you can use MongoDB, but only for a single avatar (you can't store many blobs inside MongoDB); if you need to store videos, or simply many large files, you need something like CephFS.
Why do I think so? When I was testing MongoDB with media files on a slow instance, files of up to 10 MB (usually about 1 MB) were coming back in up to 3000 milliseconds. That's an unacceptably long time. When there were a lot of files (100+), it could turn into a pain. A real pain.
Ceph is designed precisely for storing files, up to petabytes of information. That's what's needed.
How do you implement this in a real project? If you use the OOP layer over MongoDB (Mongoose), you can simply add methods to your model objects that access Ceph and do what you need. You can create methods such as 'load file', 'delete file', 'count files' and so on, and then use them all together as usual. Don't forget to maintain Ceph and add servers as needed, and everything will work perfectly. The files themselves should be accessed only through your web server, not directly, i.e. when the user needs a file, the web server should send a request to Ceph and return Ceph's response to the user.
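The answer above describes this with Mongoose; as a rough Python sketch of the same idea, assuming Ceph is exposed through its S3-compatible RADOS Gateway and that the endpoint, bucket, credentials and collection names below are placeholders:

```python
import boto3
from pymongo import MongoClient

# Ceph reached via its S3-compatible RADOS Gateway; all values are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://ceph-rgw.internal:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)
files = MongoClient()["app_db"]["files"]   # metadata documents live in MongoDB


def load_file(file_id):
    """Look up the object key in MongoDB, then fetch the bytes from Ceph."""
    meta = files.find_one({"_id": file_id})
    obj = s3.get_object(Bucket="media", Key=meta["key"])
    return obj["Body"].read()


def delete_file(file_id):
    """Remove both the object in Ceph and its metadata document."""
    meta = files.find_one({"_id": file_id})
    s3.delete_object(Bucket="media", Key=meta["key"])
    files.delete_one({"_id": file_id})
```

The web server would call these helpers and stream the result back to the user, so clients never talk to Ceph directly.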
I hope I helped more than just myself. I'll go add Ceph to my tags. Good luck!
GridFS
Ceph File System
More Ceph

Why does the moodledata directory have this structure?

I know Moodle's internal files, such as uploaded images, are stored in the moodledata directory.
Inside, there are several directories:
moodledata/filedir/1c/01/1c01d0b6691ace075042a14416a8db98843b0856
moodledata/filedir/63/
moodledata/filedir/63/89/
moodledata/filedir/63/89/63895ece79c4a91666312d0a24db82fe3017f54d
moodledata/filedir/63/3c/
moodledata/filedir/63/37/
moodledata/filedir/63/a7/
What are these hashes?
What are the reasons behind this design, as opposed to, for example, WordPress's /year/month/file.jpg structure?
https://docs.moodle.org/dev/File_API_internals#File_storage_on_disk
Simple answer - files are stored based on the hash of their content (inspired by the way Git stores files internally).
This means that if you have the same file in multiple places (e.g. the same PDF or image in multiple courses), it is stored only once on the disk, even if the original filename is different.
On real sites this can involve a huge reduction in disk usage (obviously dependent on how much duplication there is on your site).
Moodledata files are stored according to the SHA1 hash of their content, in order to prevent duplication of content (for example, when the same file is uploaded twice under different names).
For further explanation of how to handle such files, you can read the official documentation of the File API:
https://docs.moodle.org/dev/File_API_internals
especially the File storage on disk part.
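As a rough illustration of how such a content-hash path is derived (following the pattern visible in the example paths above, where the first two pairs of hex characters become subdirectories):

```python
import hashlib


def filedir_path(content: bytes) -> str:
    """Return a moodledata-style path derived from the SHA1 of the content."""
    digest = hashlib.sha1(content).hexdigest()
    return f"moodledata/filedir/{digest[0:2]}/{digest[2:4]}/{digest}"


# Two uploads with identical content map to the same path, so the bytes are
# stored only once regardless of the original file names.
print(filedir_path(b"example file content"))
```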

How to use RavenDB to store a database of images

I am redesigning a .NET system that stores >1000 multipage .TIF images per day for retrieval by desktop and web systems. The current system uses SQL server to store the image IDs and metadata, but the actual TIF files are stored in NTFS directories on a separate file server.
Would a NoSQL database like RavenDB be a good choice for storing the TIFs as a 'document', and the image ID as the key?
Images are usually written once, but read many times. I am hoping RavenDB would be a good choice for storing images since I am trying to improve the current system with:
- redundancy (using replication to automatically make another copy of the repository)
- performance (using key/value nature of the images)
- reliability (the current NTFS file system 'database' is error prone and fragile)
Pete,
What you are talking about are not documents, because the images are binary data.
What you can do is use RavenDB's attachments feature to store those images.
Attachments can take part in replication as well.

Storing large numbers of images in a database? A good experience?

I'm writing an app which will store a large number of image (and possibly video) files. After they're uploaded, they will be immediately pushed out to a cloud CDN for actual serving to the public. The idea is to keep the images in a reliable store that can be backed up. I would anticipate on the order of 200,000 objects of up to 10 KB each, and possibly a smaller number of video files of a few MB.
By default I would go with Postgres, which the documentation suggests would be OK.
Is this a sensible idea?
Will it make backing up the database a complete nightmare? Any experiences?
Any reliability issues?
Will this affect the performance of other parts of the DB? Bear in mind that the DB will only be hit once or twice for each image.
I've got experience with storing images in a database this way in Oracle and MySQL. Performance and reliability are not an issue. Backing up is. Your backup will get very large. Since backing up is time consuming and expensive, it might be a good idea to save space. If that means you can shrink your database by 80% by just removing the images from the database, it might be a good idea to store them elsewhere. Backing up separate files is more efficient, because you can easily create incremental backups containing only new and modified images.
I have experience with PostgreSQL, both storing images as bytea (a BLOB-like datatype), which was a good experience, and storing images in a "dual solution" (images on the filesystem, metadata in a database like MySQL or PostgreSQL), which I do not recommend.
There are three aspects, or architectural considerations, that can help with the decision:
Unified solution or not? Today, as image volume (both the size and the number of images) keeps growing in every application, "unified solutions" are the goal. Example: Wikimedia is a unified and specialized solution for Wikipedia.
Direct or indirect store? Like the old "dual solutions" that do not store the image in the SQL table, some solutions use an external database or an external data pointer. In PostgreSQL, the large-object (BLOB) facility is an indirect store (it generates a separate backup), while the bytea datatype is a direct store (backed up with the tables). The choice needs technical and performance consideration.
Original or processed images? We need to distinguish between the "original image" and "processed images", such as thumbnails, which need database storage (for caching!) but do not need backup.
I recommend:
to store as blob (Binary Large OBject, indirect store) in your table: for original image storage, but with a separate backup. See Ivan's answer, the PostgreSQL additional supplied modules, how-tos, etc.
to store as bytea (or blob) in a separate database (accessed with dblink): for original image storage in another (unified) database. In this case I prefer bytea, but blob is nearly the same. Separating the database is the best way to build a "unified image web service".
to store as bytea (a byte array, direct store) in your table: for caching processed images (typically thumbnails). Cache the small images so they can be sent quickly to the web browser (avoiding rendering problems) and to reduce server processing. Also cache the essential metadata, such as width and height. Database caching is the easiest way, but check your needs and server configuration (e.g. Apache modules): storing thumbnails on the file system may be better, so compare performance. Remember that this is a (unified) web service, so it can live in a separate database with no backups, serving many tables. See also the PostgreSQL binary data types manual, tests with a bytea column, etc.
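For the bytea route, a minimal psycopg2 sketch (the connection settings, table and file names are placeholders):

```python
import psycopg2

# Placeholder connection settings and table; the 'data' column is of type bytea.
conn = psycopg2.connect("dbname=media user=app")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS thumbnails (
            id     serial PRIMARY KEY,
            name   text NOT NULL,
            width  int,
            height int,
            data   bytea NOT NULL
        )
    """)
    # Store the thumbnail bytes together with the essential metadata.
    with open("thumb.jpg", "rb") as f:
        cur.execute(
            "INSERT INTO thumbnails (name, width, height, data) VALUES (%s, %s, %s, %s)",
            ("thumb.jpg", 120, 80, psycopg2.Binary(f.read())),
        )
    # Read it back; psycopg2 returns bytea as a memoryview.
    cur.execute("SELECT data FROM thumbnails WHERE name = %s", ("thumb.jpg",))
    image_bytes = bytes(cur.fetchone()[0])
```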
My experience is limited to SQL Server, but I have several million PDF files larger than 10 KB in a database, and it is still performing quite nicely. Of course, indexes are required. A full database backup takes no longer than expected with that amount of data. Again, this is for MS SQL Server!

Core Data vs file access

I have hundreds of files which need to be accessed to display content on the iPhone. They are all plists.
Which is faster, Core Data or file access? Which is more secure?
You have to consider the file size first. A nice rule of thumb found on these boards is: if the file is under 100 kB you can store it as a BLOB attribute on an entity; if it is larger than that you may want to create an ad-hoc entity for it; and if it exceeds 1 MB in size you should access it through the filesystem.
Secondly, you should also evaluate the cost of the operation. 100 files may seem like a lot, but if you access them only a few times, file access may be the way to go; on the other hand, if you need the stored information frequently, you can create ad-hoc entities for Core Data and load the files at start-up. And so on.
This is a nice book on Core Data. You can find many guidelines by reading it, but also keep in mind the general guidelines for designing databases.
If they are static files I would recommend pre-loading them into a Core Data SQLite file. That would yield far better performance, especially if you structure your model properly.