Storing large numbers of images in a database? A good experience? - postgresql

I'm writing an app which will store a large number of image (and possibly video) files. After they're uploaded they will be immediately pushed out to some cloud serving CDN for actual serving to the public. The idea is to have the images stored in a reliable, back-uppable store. I would anticipate of the order of 200,000 objects of up to 10KB each and possibly fewer video files of a few MB.
By default I would go to Postgres which the documentation suggests would be ok.
Is this is a sensible idea?
Will it make backing up the database a complete nightmare. Experiences?
Any reliability issues?
Will this affect the performance for other parts of the db? Bear in mind that the db will only be hit once or twice for each image.

I've got experience with storing images in a database this way in Oracle and MySQL. Performance and reliability are not an issue. Backing up is. Your backup will get very large. Since backing up is time consuming and expensive, it might be a good idea to save space. If that means you can shrink your database by 80% by just removing the images from the database, it might be a good idea to store them elsewhere. Backing up separate files is more efficient, because you can easily create incremental backups containing only new and modified images.

I have experiences with PostgreSQL, storing images as ByteA (a BLOB-like datatype), a good experience, and storing images in "dual solution" (images at filesystem, metadata at databases like MySQL and PostgreSQL), that I not recommend.
There are 3 aspects, or architecture considerations, that can help us in our decision:
Unify solution or not? Today, when we see that image volume (sizes and number of images) are growing and growing, in all applications, the "unified solutions" are the goal. Example: Wikimedia is a unified and specialized solution for Wikipedia.
Direct or indirect store? Like old "dual solutions", that not store image into the SQL table, some solutions can use external database or external data pointer... On PostgreSQL BLOB datatypes have indirect store (generates a separated backup), and BYTEA datatype is direct (backup-ed with tables). The choice need technical and performance considerations.
Original or processed images? We need some distinction between "original image" and "processed image", like thumbnail, that need database store (for caching!), but not need backup.
I recommend:
to store as blob (Binary Large OBject with indirect store) at your table: for original image store, but separated backup. See Ivan's answer, PostgreSQL additional supplied modules, How-tos etc.
to store as bytea (or blob), at a separated database (with DBlink): for original image store, at another (unified) database. In this case, I preffer bytea, but blob is near the same. Separating database is the best way for a "unified image webservice".
to store as bytea (BYTE Array with direct store) at your table: for caching processed images (typically thumbnails). Cache the little images to send it fast to the web-browser (avoiding renderization problems) and reduce server processing. Cache also the essential metadata, like width and height. Database caching is the easiest way, but check your needs and server configs (ex. Apache modules): store thumbnails at file system may be better, compare performances. Remember that it is a (unified) web-service, then can be stored at a separete database with no backups, serving many tables. See also PostgreSQL binary data types manual, tests with bytea column, etc.

My experience is limited to SQL server, but I have several million PDF-files that are larger than 10KB in a database, which is still performing quite nicely. Of course indexes are required. Full database backup takes no longer than expected with such an amount of data. Again, this is for MS-SQL server!

Related

Is it efficient to store images inside MongoDB using GridFS?

I know how to do it, but I wonder if it's effective. As I know MongoDB has very efficient clusters and I can flexibly control the collections and the servers they reside on. The only problem is the size of the files and the speed of accessing them through MongoDB.
Should I explore something like Apache Hadoop or if I intelligently cluster MongoDB, will I get similar access speed results?
GridFS is provided for convenience, it is not designed to be the ultimate binary blob storage platform.
MongoDB imposes a limit of 16 MB on each document it stores. This is unlike, for example, many relational databases which permit much larger values to be stored.
Since many applications deal with large binary blobs, MongoDB's solution to this problem is GridFS, which roughly works like this:
For each blob to be inserted, a metadata document is inserted into the metadata collection.
Then, the actual blob is split into 16 MB chunks and uploaded as a sequence of documents into the blob collection.
MongoDB drivers provide helpers for writing and reading the blobs and the metadata.
Thus, on first glance, the problem is solved - the application can store arbitrarily large blobs in a straightforward manner. However, digging deeper, GridFS has the following issues/limitations:
On the server side, documents storing blob chunks aren't stored separately from other documents. As such they compete for cache space with the actual documents. A database which has both content documents and blobs is likely to perform worse than a database that has only content documents.
At the same time, since the blob chunks are stored in the same way as content documents, storing them is generally expensive. For example, S3 is much cheaper than EBS storage, and GridFS would put all data on EBS.
To my knowledge there is no support for parallel writes or parallel reads of the blobs (writing/reading several chunks of the same blob at a time). This can in principle be implemented, either in MongoDB drivers or in an application, but as far as I know this isn't provided out of the box by any driver. This limits I/O performance when the blobs are large.
Similarly, if a read or write fails, the entire blob must be re-read or re-written as opposed to just the missing fragment.
Despite these issues, GridFS may be a fine solution for many use cases:
If the overall data size isn't very large, the negative cache effects are limited.
If most of the blobs fit in a single document, their storage should be quite efficient.
The blobs are backed up and otherwise transfered together with the content documents in the database, improving data consistency and reducing the risk of data loss/inconsistencies.
The good practice is to upload image somewhere (your server or cloud), and then only store image url in MongoDB.
Anyway, I did a little investigating. The short conclusion is: if you need to store user avatars you can use MongoDB, but only if it's a single avatar (You can't store many blobs inside MongoDB) and if you need to store videos or just many and heavy files, then you need something like CephFS.
Why do I think so? The thing is, when I was testing with MongoDB and media files on a slow instance, files weighing up to 10mb(Usually about 1 megabyte) were coming back at up to 3000 milliseconds. That's an unacceptably long time. When there were a lot of files (100+), it could turn into a pain. A real pain.
Ceph is designed just for storing files. To store petabytes of information. That's what's needed.
How do you implement this in a real project? If you use the OOP implementation of MongoDB(Mongoose), you can just add methods to the database objects that access Ceph and do what you need. You can make methods "load file", "delete file", "count quantity" and so on, and then just use it all together as usual. Don't forget to maintain Ceph, add servers as needed, and everything will work perfectly. The files themselves should be accessed only through your web server, not directly, i.e. the web server should throw a request to Ceph when the user needs to give the file and return the response from Ceph to the user.
I hope I helped more than just myself. I'll go add Ceph to my tags. Good luck!
GridFS
Ceph File System
More Ceph

Saving image in database

Is it good to save an image in database with type BLOB?
or save only the path and copy the image in specific directory?
Which way is the best (I mean good performance for the database and the application) and why?
What are your requirements?
In the vast majority of cases saving the path will be better, simply because of the sheer size of the files compared to the rest of data (bulge the DB by GBs due to image inclusion). Consider adding an indirection, eg. save the path as a name and a reference to a storage resource (eg. a storage_id referencing a row in storages tables) and the path attached to the 'storage'. This way you can easily move files (copy all files, then update the storage path, rather than update 1MM individual paths).
However, if your requirements include consistent backup/restore and/or disaster recoverability, is often better to store images in the DB. Is not easier, nor more convenient, but is simply going to be required. Each DB has its own way of dealing with this problem, eg. in SQL Server you would use a FILESTREAM type which allows remote access via file access API. See FILESTREAM MVC: Download and Upload images from SQL Server for an example.
Also, a somehow dated but none the less interesting paper on the topic: To BLOB or Not to BLOB.

Getting large rows out of SQL Azure - but where to go? Tables, Blob or something like MongoDB?

I read through a lot of comparisons between Azure Table/Blob/SQL storage and I think I have a good understanding of all of those ... but still, I'm unsure where to go for my specific needs. Maybe someone with experience in similar scenarios and is able to make a recommendation.
What I have
A SQL Azure DB that stores articles in raw HTML inside a varchar(max) column. Each row also has many metadata columns and many indexes for easy querying. The table contains many references to Users, Subscriptions, Tags and more - so a SQL DB will always be needed for my project.
What's the problem
I already have about 500,000 articles in this table and I expect it to grow by millions of articles per year. Each article's HTML content can be anywhere between a few KB and 1 MB or, in very few cases, larger than 1 MB.
Two problems arise: as Azure SQL storage is expensive, rather earlier than later I'll shoot myself in the head with the costs for storing this. Also, I will hit the 150 GB DB size limit also rather earlier than later. Those 500,000 articles already consume 1,6 GB DB space now.
What I want
It's clear those HTML content has to get out of the SQL DB. While the article table itself has to remain for joining it to users, subscriptions, tags and more for fast relational discovery of the needed articles, at least the colum that holds the HTML content could be outsourced to a cheaper storage.
At first sight, Azure Table storage seems like the perfect fit
Terabytes of data in one large table for very cheap prices and fast queries - sounds perfect to have a singe Table Storage table holding the article contents as an add-on to the SQL DB.
But reading through comparisons here shows it might not even be an option: 64 KB per column would be enough for 98 % of my articles, but there are those 2 % left where for some single articles even the whole 1 MB of the row limit might not be enough.
Blob storage sounds completely wrong, but ...
So there's just one option on Azure left: Blobs. Now, it might not be as wrong as it sounds. In most of the cases, I would need the content of only a single article at once. This should work fine and fast enough with Blob storage.
But I also have queries where I would need 50, 100 or even more rows at once INCLUDING even the content. So I would have to run the SQL query to fetch the needed articles and then fetch every single article out of the Blob storage. I have no experience with that but I can't believe I'd be able to remain in millisecond timespan for the queries when doing that. And queries that take multiple seconds are an absolute no-go for my project.
So it also does not seem to be to be an appropriate solution.
Do I look like a guy with a plan?
At least I have something like a plan. I thought about only "exporting" appropriate records into SQL Table Storage and/or Blob Storage.
Something like "as long as the content is < 64 KB export it to table storage, else keep it in the SQL table (or even export this single XL record into BLOB storage)"
That might work good enough. But it makes things complicated and maybe unnecessary error-prone.
Those other options
There are some other NoSQL DBs like MongoDB and CouchDB that seem to better fit my needs (at least from my naive point of view as someone who just read the specs on paper, I don't have experience with them). But they'd require self-hosting, some thing I'd like to get out of it's way if possible. I'm on Azure to do as little as needed in terms of self-hosting servers and services.
Did you really read until here?
Then thank you very much for your valuable time and thinking about my problems :)
Any suggestions would be greatly appreciated. As you see, I have my ideas and plans, but nothing beats experience from someone who walked down the road before :)
Thanks,
Bernhard
I signed up just solely to help with this question. In the past, I have found useful answers to my problems from Stackoverflow - thank you community - so I thought it would just be fair (perhaps fair is an understatement) to attempt to give something back with this question, as it falls on my alley.
In short, while considering all factors stated in the question, table storage may be the best option - iif you can properly estimate transactions per month: a nice article on this.
You can solve the two limitations that you mentioned, row and column limit, by splitting (plain text method or serializing it) the document/html/data. Speaking from experience with 40 GB+ data stored in Table Storage, where frequently our app retrieves more than 10 rows per each page visit in milliseconds - no argument here! If you need 50+ rows at times, you are looking at low single digits second(s), or you can do them in parallel (and further by splitting the data in different partitions), or in some async fashion. Or, read suggested multi level caching below.
A bit more detail. I tried with SQL Azure, Blob (both page and block), and Table Storage. I can not speak for Mongo DB since, partially for the reasons already mentioned here, I did not want to go that route.
Table Storage is fast; in the range of 20-50 milliseconds, or even faster sometimes (depends, for instance in the same data center i have seen it gone as low as 10 milliseconds), when querying with partition and row key. You may also further have several partitions, in some fashion based on your data and your knowledge about it.
It scales better, in terms of GB's but not transactions
Row and column limitations that you mentioned are a burden, agreed, but not a show stopper. I have written my own solution to split entities, you can too easily, or you can see this already-written-solution (does not solve the whole problem but it is a good start): https://code.google.com/p/lokad-cloud/wiki/FatEntities
Also need to keep in mind that uploading data to table storage is time consuming, even when batching entities due to other limitations (i.e., request size less than 4 MB, upload bandwidth, etc).
But using solely just TableStorage may not be the best solution (thinking about growth and economics). The best solution that we ended up implementing used multi-level caching/storage, starting from static classes, Azure Role Based Cache, Table Storage, and Block Blobs. Lets call this, for readability purposes, level 1A, 1B, 2 and 3 respectively. Using this approach, we are using a medium single instance (2 CPU Cores and 3.5 GB Ram - my laptop has better performance), and are able to process/query/rank 100GB+ of data in seconds (95% of cases in under 1 second). I believe this is fairly impressive given that we check all "articles" before displaying them (4+ million "articles").
First, this is tricky and may or may not be possible in your case. I do not have sufficient knowledge about the data and its query/processing usage, but if you can find a way to organize the data well this may be ideal. I will make an assumption: it sounds like you are trying to search through and find relevant articles given some information about a user and some tags (a variant of a news aggregator perhaps, just got a hunch for that). This assumption is made for the sake of illustrating the suggestion, so even if not correct, I hope it will help you or trigger new ideas on how this could be adopted.
Level 1A data.
Identify and add key entities or its properties in a static class (periodically, depending on how you foresee updates). Say we identify user preferences (e.g., demographics and interest, etc) and tags (tech, politics, sports, etc). This will be used to retrieve quickly who the user is, his/her preferences, and any tags. Think of these as key/value pair; for instance key being a tag, and its value being a list of article IDs, or a range of it. This solves a small piece of a problem, and that is: given a set of keys (user pref, tags, etc) what articles are we interested in! This data should be small in size, if organized properly (e.g., instead of storing article path, you can only store a number). *Note: the problem with data persistence in a static class is that application pool in Azure, by default, resets every 20 minutes or so of inactivity, thus your data in the static class is not persistent any longer - also sharing them across instances (if you have more than 1) can become a burden. Welcome level 1B to the rescue.
Leval 1B data
A solution we used, is to keep layer 1A data in a Azure Cache, for its sole purpose to re-populate the static entity when and if needed. Level 1B data solves this problem. Also, if you face issues with application pool reset timing, you can change that programmatically. So level 1A and 1B have the same data, but one is faster than the other (close enough analogy: CPU Cache and RAM).
Discussing level 1A and 1B a bit
One may point out that it is an overkill to use a static class and cache, since it uses more memory. But, the problem we found in practice, is that, first it is faster with static. Second, in cache there are some limitations (ie., 8 MB per object). With big data, that is a small limit. By keeping data in a static class one can have larger than 8 MB objects, and store them in cache by splitting them (i.e., currently we have over 40 splits). BTW please vote to increase this limit in the next release of azure, thank you! Here is the link: www.mygreatwindowsazureidea.com/forums/34192-windows-azure-feature-voting/suggestions/3223557-azure-preview-cache-increase-max-item-size
Level 2 data
Once we get the values from the key/value entity (level 1A), we use the value to retrieve the data in Table Storage. The value should tell you what partition and Row Key you need. Problem being solved here: you only query those rows relevant to the user/search context. As you can see now, having level 1A data is to minimize row querying from table storage.
Level 3 data
Table storage data can hold a summary of your articles, or the first paragraph, or something of that nature. When it is needed to show the whole article, you will get it from Blob. Table storage, should also have a column that uniquely identifies the full article in blob. In blob you may organize the data in the following manner:
Split each article in separate files.
Group n articles in one file.
Group all articles in one file (not recommended although not as bad as the first impression one may get).
For the 1st option you would store, in table storage, the path of the article, then just grab it directly from Blob. Because of the above levels, you should need to read only a few full articles here.
For the 2nd and 3rd option you would store, in table storage, the path of the file and the start and end position from where to read and where to stop reading, using seek.
Here is a sample code in C#:
YourBlobClientWithReferenceToTheFile.Seek(TableStorageData.start, SeekOrigin.Begin);
int numBytesToRead = (int)TableStorageData.end - (int)TableStorageData.start;
int numBytesRead = 0;
while (numBytesToRead > 0)
{
int n = YourBlobClientWithReferenceToTheFile.Read(bytes,numBytesRead,numBytesToRead);
if (n == 0)
break;
numBytesRead += n;
numBytesToRead -= n;
}
I hope this didn't turn into a book, and hope it was helpful. Feel free to contact me if you have follow up questions or comments.
Thanks!
The proper storage for a file is a blob. But if your query needs to return dozens of blobs at the same time, it will be too slow as you are pointing out. So you could use a hybrid approach: use Azure Tables for 98% of your data, and if it's too large, use a Blob instead and store the Blob URI in your table.
Also, are you compressing your content at all? I sure would.
My thoughts on this: Going the MongoDB (or CouchDB) route is going to end up costing you extra Compute, as you'll need to run a few servers (for high availability). And depending on performance needed, you may end up running 2- or 4-core boxes. Three 4-core boxes is going to run more than your SQL DB costs (plus then there's the cost of storage, and MongoDB etc. will back their data in an Azure blob for duable storage).
Now, as for storing your html in blobs: this is a very common pattern, to offload large objects to blob storage. The GETs should be doable in a single call to blob storage (single transaction) especially with the file size range you mentioned. And you don't have to retrieve each blob serially; you can take advantage of TPL to download several blobs to your role instance in parallel.
One more thing: How are you using the content? If you're streaming it from your role instances, then what I said about TPL should work nicely. If, on the other hand, you're injecting href's into your output page, you can just put the blob url directly into your html page. And if you're concerned about privacy, make the blobs private and generate a short-TTL "shared access signature" granting access for a small time window (this only applies if inserting blob url's into some other html page; it doesn't apply if you're downloading to the role instance and then doing something with it there).
You could use MongoDB's GridFS feature: http://docs.mongodb.org/manual/core/gridfs/
It splits the data into 256k chunks by default (configurable up to 16mb) and lets you use the sharded database as a filesystem which you can use to store and retrieve files. If the file is larger than the chunk size, the mongo db drivers handle splitting up / re-assembling the data when the file needs to be retrieved. To add additional disk space, simply add additional shards.
You should be aware, however that only some mongodb drivers support this and it is a driver convention and not a server feature that allows for this behavior.
A few comments:
What you could do is ALWAYS store HTML content in blob storage and store the blob's URL in table storage. I personally don't like the idea of storing data conditionally i.e. if content of HTML file is more than 64 KB only then store it in blob storage otherwise use table storage. Other advantage you get out of this approach is that you can still query the data. If you store everything in blob storage, you would lose querying capability.
As far as using other NoSQL stores are concerned, only problem I see with them is that they are not natively supported on Windows Azure thus you would be responsible for managing them as well.
Another option would be to store your files as a VHD image in blob storage. Your roles can mount the VHD to their filesystem, and read the data from there.
The complication seems to be that only one VM can have read/write access to the VHD. The others can create a snapshot and read from that, but they won't see updates. Depending on how frequently your data is updated that could work. eg, if you update data at well-known times you could have all the clients unmount, take a new snapshot, and remount to get the new data.
You can also share out a VHD using SMB sharing as described in this MSDN blog post. This would allow full read/write access, but might be a little less reliable and a bit more complex.
you don't say, but if you are not compressing your articles that probably solves your issue then just use table storage.
Otherwise just use table storage and use a unique partition key for each article. If an article's too big put it in 2 rows, as long as you query by partition key you'll get both rows, then use the row key as the index indicating how the articles fit back together
One idea that i have would be to use CDN to store your article content, and link them directly from the client side, instead of any multi phase, operation of getting data from sql then going to some storage.
It would be something like
http://<cdnurl>/<container>/<articleId>.html
Infact same thing can be done with Blob storage too.
The advantage here is that this becomes insanely fast.
Disadvantage here is that security aspect is lost.
Something like Shared Access Signature can be explored for security, but I am not sure how helpful would it be for client side links.

Is there a way to configure Heroku PostgreSQL to not bother loading a particular column into RAM?

This may be a long shot, but I thought I'd ask anyway.
I am looking at using Heroku's new Crane Postgres DB (400 MB RAM Cache) in conjunction with an app I'm deploying on Heroku. The 400 MB cache size should be plenty for our needs... except for one column of one table, in which we store a cached PDF file as a string. The PDF's could easily use up the 400MB RAM pretty quickly if Heroku uses its Cache for them.
If I were on an actual server, I'd just store the PDF as a file, but given Heroku's ephemeral file system, my life is much simpler if I just store the pdf in the DB rather than rigging up a connection to S3 just for this one thing. (It further complicates that we're looking at deploying multiple heroku instances, one for each client ... so using the DB's is simpler than creating a new bucket for each one.) I don't really care about the speed on this. If people are getting the file, they will expect speeds as if it were coming from a file system anyhow, since thats how most file downloads are done. Is there any way to tell PostGRES to not bother caching this column?
Or maybe I'm asking the wrong question, and there is some other way to solve the problem or design alternatives that make it irrelevant.
You don't have to do anything. PostgreSQL will automatically use TOAST on values larger than 8 kB.
From http://www.postgresql.org/docs/9.1/static/storage-toast.html
PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or "the best thing since sliced bread").
PostgreSQL caching is also done at the page level so TOAST does not have to be cached with the rest of the row (http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf).
The fact that Postgres can TOAST large field values, it doesn't mean it's the best thing to do.
If you store big fields in your main database, it will make many things harder, such as creating forks or followers, and creating and restoring backups in particular. I would strongly reconsider utilizing S3 to store the PDF files, and simply invest in automated onboarding of new clients (create heroku app, provision database, provision/create S3 bucket).
I'm not quite sure how you're managing to store large PDF's, since Postgres imposes a maximum field size (or at least a maximum page size). However, you might be able to get around this by using TOAST. TOASTed items are stored in a separate (physical) table, so if you're not selecting them frequently they shouldn't be cached.
If you are selecting them frequently, then I'm not sure if what you want is possible. Remember that Postgres only supplies one "level" of caching - the Linux VFS does caching also.

Does PostgreSQL support transparent compressing of tables (fragments)?

I'm going to store large amount of data (logs) in fragmented PostgreSQL tables (table per day). I would like to compress some of them to save some space on my discs, but I don't want to lose the ability to query them in the usual manner.
Does PostgreSQL support such a transparent compression and where can I read about it in more detail? I think there should be some well-known magic name for such a feature.
Yes, PostgreSQL will do this automatically for you when they go above a certain size. Compression is applied at each individual data value though - not at the full table level. Meaning that if you have a billion rows that are very narrow, they won't get compressed. Or if you have very many columns each with only a small value in it, they won't get compressed. Details about this scheme in the manual.
If you need it on the full table level, a solution is to create a TABLESPACE for those tables that you want compressed, and point it to a compressed filesystem. As long as the filesystem still obeys fsync() and standard POSIX semantics, this should be perfectly safe. Details about this in the manual.
Probably not what you have in mind but still useful info - Chapter 53. Database Physical Storage of the fine manual. The TOAST section warrants further attention.