Best DB solution for storing large files - PostgreSQL

I need to provide a solution where users can upload files that are stored together with some metadata, and this may grow really big.
Access to these files must be controlled, so they want me to just store them as BLOBs in the database, but I fear PostgreSQL won't handle that well over time.
My first idea was to use some NoSQL solution, but I couldn't find any that would replace a good RDBMS and also store files elegantly. Then I thought about just saving these files on disk somewhere the web server won't serve them directly, naming them after their table ID, loading them into RAM, and sending them back with the proper content type.
Could anyone suggest a better solution for this?

I had a requirement to store many images (with some metadata) and allow controlled access to them; here is what I did.
To the cloud™
I save the image files in Amazon S3. My local database holds the metadata with the S3 location of the file as one column. When an authenticated and authorized user needs to see the file they hit a URL in my system (where the authentication and authorization checks occur) which then generates a pre-signed, expiring URL for the image and sends a redirect back to the browser. The browser is then able to load the image for a given amount of time (as specified in the signature within the URL.)
With this solution I have user level access to the resources and I don't have to store them as BLOBs or anything like that which may grow unwieldy over time. I also don't use MY bandwidth to stream the files to the client and get cheap, redundant storage for them. Obviously the suitability of this solution will depend on the nature of the binary files you are looking to store and your level of trust in Amazon. The world doesn't end if there is a slip and someone sees an image from my system they shouldn't. YMMV.
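A minimal sketch of that redirect endpoint, assuming boto3 and Flask; the bucket name and the check_access()/lookup_s3_key() helpers are hypothetical placeholders for whatever auth checks and metadata lookup you already have:

    # Sketch of the redirect endpoint described above (boto3 + Flask).
    # The bucket name and helper functions are hypothetical.
    import boto3
    from flask import Flask, abort, redirect

    app = Flask(__name__)
    s3 = boto3.client("s3")
    BUCKET = "my-user-files"  # hypothetical bucket

    @app.route("/files/<int:file_id>")
    def serve_file(file_id):
        # Authentication/authorization checks happen here (hypothetical helper).
        if not check_access(file_id):
            abort(403)

        # Look up the S3 key stored alongside the metadata (hypothetical helper).
        key = lookup_s3_key(file_id)

        # Pre-signed URL that expires after 5 minutes; the browser follows the redirect.
        url = s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": BUCKET, "Key": key},
            ExpiresIn=300,
        )
        return redirect(url, code=302)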

Find out how long the cache has been stored in IndexedDB?

I am using a web application for data entry which has a mechanism for storing the data entry form (an HTML form) in the browser's IndexedDB cache.
I am able to see the form in the browser dev tools.
I want to know how long IndexedDB will be able to keep the form in the browser. Is it possible that the cached form has been the same for months? Will closing the browser clear the keys, or is this storage persistent enough to last for a few months?
Is it possible to find out when (the exact date or time) the cache entry was made in IndexedDB?
I am asking because I suspect some discrepancy in the form for some of our users, as the data being sent is a little different from what we expect.
Any help is appreciated.
Thanks
DHIS2, the application you are referring to, has an application you and other users can use to clear any cached data. This app is named "Browser Cache Cleaner", and gives you a list of different things to clear. I would try this app and see if your users still have these issues.
Databases don't expose a timestamp of when a record was last modified. That's something the developer has to make the application store in the records itself. For example, one could have created_at and modified_at columns to track when the record was created and when it was last modified.
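IndexedDB itself is driven from JavaScript in the browser, but the pattern is storage-agnostic; here is a small sketch of it in Python with SQLite purely to illustrate the idea (the table and column names are made up):

    # Illustration only: the created_at / modified_at idea works the same way
    # for records an application writes to IndexedDB or any other store.
    import sqlite3
    from datetime import datetime, timezone

    conn = sqlite3.connect("forms.db")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS cached_forms (
               form_id     TEXT PRIMARY KEY,
               payload     TEXT NOT NULL,
               created_at  TEXT NOT NULL,
               modified_at TEXT NOT NULL
           )"""
    )

    def save_form(form_id, payload):
        now = datetime.now(timezone.utc).isoformat()
        conn.execute(
            """INSERT INTO cached_forms (form_id, payload, created_at, modified_at)
               VALUES (?, ?, ?, ?)
               ON CONFLICT(form_id) DO UPDATE
               SET payload = excluded.payload, modified_at = excluded.modified_at""",
            (form_id, payload, now, now),
        )
        conn.commit()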
IndexedDB is a persistent client storage API, so yes, data will stay permanently unless the user clears the browser's cache.
If there is some discrepancy in the form being sent, I would look at the caching strategy. Offline data caching is a pretty broad topic (and I don't know much about your application), but Google's Offline Cookbook is a good place to start digging into this topic, as well as into the caching strategies that fit your use case.

REST - GET response with a temporarily uploaded file

I know that the title is not quite right, but I don't know what else to call this problem...
Currently I'm trying to design my first REST API for a conversion service. The user submits an input file to the server and gets back the converted file.
The current problem is that the converted file should be accessible with a simple GET /conversionservice/my/url. However, it is not possible to upload the input file within a GET request. A POST would be necessary (am I right?), but POST isn't cacheable.
Now my question is: what's the right way to design this? I know it would be possible to upload the input file to the server beforehand and then access it with my GET request, but those input files could be anything!
Thanks for your help :)
A POST request is indeed needed for a file upload. The fact that it is not cacheable should not bother the service, because how could any intermediary (the browser, the server, a proxy, etc.) know about the content of the file? If you need cacheability, you would have to implement it yourself, probably with a hash (MD5, SHA-1, etc.) of the uploaded file. This would keep you from having to perform the actual conversion twice, but you would have to hash each uploaded file, which would slow you down on a "cache miss".
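A minimal sketch of that hash-based cache, assuming the converted output is kept on disk keyed by the SHA-1 of the uploaded bytes; CACHE_DIR and convert() are hypothetical:

    # Hash the uploaded file and reuse a previous conversion if one exists.
    # CACHE_DIR and convert() are hypothetical placeholders.
    import hashlib
    import os

    CACHE_DIR = "/var/cache/conversions"

    def convert_with_cache(uploaded_bytes: bytes) -> bytes:
        digest = hashlib.sha1(uploaded_bytes).hexdigest()
        cached_path = os.path.join(CACHE_DIR, digest)

        # Cache hit: skip the expensive conversion.
        if os.path.exists(cached_path):
            with open(cached_path, "rb") as f:
                return f.read()

        # Cache miss: convert, then store the result under its hash.
        result = convert(uploaded_bytes)
        with open(cached_path, "wb") as f:
            f.write(result)
        return result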
The only other way I can think of to solve the problem would be to require the user to pass an accessible URL to the file in the query string; then you could handle GET requests, but your users would have to make the file accessible over the internet. This would allow caching but limit usability.
Perhaps a hybrid approach would be possible where you accept a POST for a file upload and a GET for a URL; this would increase the complexity of the service but maximize usability.
Also, you should look into which caches you are interested in leveraging, as a lot of them have limits on the size of a cache entry, meaning a sufficiently large file would not be cached anyway.
In the end, I would advise you to stick to the standards already established. Accept the POST request for the file upload and, if you are interested in speeding up the user experience, perhaps make the upload persist; this would allow the user to upload a file once and download it in many different formats.
Your sequence of events can be as follows:
Upload your file(s) using POST. For an immediate response, you can return the required information using your own headers (the response should include a document key for accessing the file later).
Then you can use GET for further operations, passing the above-mentioned document key as a query-string parameter.
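A rough sketch of that two-step flow, assuming a Flask service; the endpoint names, the in-memory store, and the convert() function are invented for illustration:

    # Hypothetical two-step flow: POST the file, receive a document key,
    # then GET the converted result using that key as a query parameter.
    import uuid
    from flask import Flask, Response, abort, jsonify, request

    app = Flask(__name__)
    documents = {}  # key -> converted bytes (use durable storage in practice)

    @app.route("/conversionservice/documents", methods=["POST"])
    def upload():
        data = request.get_data()       # raw uploaded bytes
        key = uuid.uuid4().hex          # document key returned to the client
        documents[key] = convert(data)  # convert() is assumed to exist
        return jsonify({"documentKey": key}), 201

    @app.route("/conversionservice/document", methods=["GET"])
    def download():
        key = request.args.get("key")
        if key not in documents:
            abort(404)
        return Response(documents[key], mimetype="application/octet-stream")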

Need advice: How to share a potentially large report with remote users?

I am asking for advice on possibly better solutions for the part of the project I'm working on. I'll first give some background and then my current thoughts.
Background
Our clients can use my company's products to generate potentially large data sets for use in their industry. When the data sets are generated, the clients will file a processing request to us.
We want to send the clients a summary email which contains some statistical charts as well as sampling points from the data sets so they can do some initial quality control work. If the data sets are of bad quality, they don't need to file any request.
One problem is that the charts and sampling points can be potentially too large to be sent in an email. The charts and the sampling points we want to include in the emails are pictures. Although we can use low-quality format such as JPEG to save space, we cannot control how many data sets would be included in the summary email, so the total size could still exceed the normal email size limit.
In terms of technologies, we are mainly developing in Python on Ubuntu 14.04.
Goals of the Solution
In general, we want to present a report-like document to the clients for some initial QA. The report may contain external links but does not need to be very interactive. In other words, a static report should be fine.
We want to reduce the steps our clients must take to read the report. For example, if the report can just be an email, the user only needs to 1) log in and 2) open the email. If they use client software, they may skip 1) and just open it and begin to read.
We also want to minimize the burden of maintaining extra user accounts for both us and our clients. For example, if the solution requires us to register a new user account, this solution is, although still acceptable, not ranked very high.
Security is important because our clients don't want their reports to be read by unauthorized third parties.
We want the process automated. The solution should provide a programming interface so that we can automate the report sending/sharing process.
Performance is NOT a critical issue. Our user base is not large, I think at most in the hundreds. They also don't generate data that frequently, at most once a week. We don't need real-time responses; even a delay of a few hours is acceptable.
My Current Thoughts on a Solution
Possible solution #1: In-house web service. I can set up a server machine and develop our own web service. We put the report into our database and the clients can then query it over the Internet.
Possible solution #2: Amazon Web Services. AWS is quite mature, but I'm not sure whether it could get expensive, because so far we just want to share a report with our remote clients, which doesn't seem like a big enough deal to justify AWS.
Possible solution #3: Google Drive. I know Google Drive provides API to do uploading and sharing programmatically, but I think we need to register a dedicated Google account to use that.
Any better solutions??
You could use AWS S3 and CloudFront. Files can easily be loaded into S3 using the AWS SDKs and API. You can then use the API to generate secure links to the files that can only be opened for a specific time and, optionally, only from a specific IP.
Files on S3 can also be automatically cleaned up after a specific time if needed using lifecycle rules.
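For example, with boto3 a lifecycle rule that expires report objects automatically could look like this; the bucket name, prefix, and seven-day window are placeholders:

    # Expire everything under reports/ seven days after creation.
    # Bucket name, prefix, and expiry window are placeholders.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-report-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-reports",
                    "Filter": {"Prefix": "reports/"},
                    "Status": "Enabled",
                    "Expiration": {"Days": 7},
                }
            ]
        },
    )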
Storage and transfer prices are fairly cheap with AWS, and remember that the S3 storage cost shown is per month, so if you only keep an object for a few days, you only pay for those few days.
S3: http://aws.amazon.com/s3/pricing
Cloudfront: https://aws.amazon.com/cloudfront/pricing/
Here's a list of the SDKs for AWS:
https://aws.amazon.com/tools/#sdk
Or you can use their command-line tools for Windows batch or PowerShell scripting:
https://aws.amazon.com/tools/#cli
Here's some info on how the private content URLs are created:
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PrivateContent.html
I would suggest building this service using a mix of your #1 and #2 options. You can do the processing yourself and leverage AWS S3, which is quite cheap, for transferring the data.
Example: 100 GB costs approximately $3.
AWS S3 is also beneficial because you are covered in case of a disaster in your local environment; your data will be safe in S3.
For security you can leverage data encryption and signed URLs in AWS S3.
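A minimal boto3 sketch of that combination, uploading the report with server-side encryption and handing out a time-limited signed URL; the bucket, key, and file names are made up:

    # Upload a report with server-side encryption (SSE-S3), then create a
    # signed download link that expires after one hour. Names are placeholders.
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "client-reports"
    KEY = "reports/example/summary.pdf"

    with open("summary.pdf", "rb") as f:
        s3.put_object(
            Bucket=BUCKET,
            Key=KEY,
            Body=f,
            ServerSideEncryption="AES256",
        )

    # Time-limited signed URL the client can open without an AWS account.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": KEY},
        ExpiresIn=3600,
    )
    print(url)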

While scaling up, how to make user-uploaded files available across multiple servers?

I have a website in which users upload various files and later access them.
The files are currently stored at a specific path on the server. Now, if I need multiple servers for the website, what is the best way to make the user-uploaded files accessible across all of them? Amazon S3 is one option that has crossed my mind. What other options do I have?
First, you can try using a CDN (http://en.wikipedia.org/wiki/Content_delivery_network).
Also, you can build it in house by having specialized servers set up for static content. You will probably need a lookup server to know on which server each file can be found. It would also contain the logic to determine the best server to use when saving a file. This is more complicated, as you will have to handle the load balancing yourself and take the geographic location of users into account.
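A very small sketch of the lookup idea, with a table that maps each file to a storage server and a trivial "least loaded server" placement rule; all the names are invented:

    # Toy lookup logic: remember which storage server holds each file and
    # place new files on the server that currently holds the fewest.
    file_locations = {}                              # filename -> server
    server_load = {"static1": 0, "static2": 0, "static3": 0}

    def choose_server() -> str:
        # Pick the server that currently holds the fewest files.
        return min(server_load, key=server_load.get)

    def save_file(filename: str) -> str:
        server = choose_server()
        file_locations[filename] = server
        server_load[server] += 1
        return server                                # web tier uploads to this host

    def find_file(filename: str) -> str:
        # Web servers ask the lookup service where to fetch a file from.
        return file_locations[filename]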

Should I save images in database or in a folder?

I want to store some photos that I fetch from a web service on my phone for when I don't have internet connectivity. I am storing data in a database, but I have a question: should I store the URL of the photo in the database and the photo itself in a folder, or store the image in the database? The volume of photos shouldn't be great; something like 200-300 small pics at approximately 30-40 kB each.
If you already have a database, I would organize the photos in the database with only the path to each photo. The photos themselves can be stored on the memory card or on local disk.
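The question is about a phone app, but the pattern itself is platform-agnostic; here is a small Python/SQLite sketch of "path in the database, bytes on disk" purely as an illustration (the folder and table names are made up):

    # Illustration of the path-in-DB, file-on-disk pattern; folder and table
    # names are invented and the same idea maps onto any platform.
    import os
    import sqlite3

    PHOTO_DIR = "photos"
    os.makedirs(PHOTO_DIR, exist_ok=True)

    conn = sqlite3.connect("app.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS photos (id INTEGER PRIMARY KEY, path TEXT NOT NULL)"
    )

    def save_photo(photo_bytes: bytes) -> int:
        cur = conn.execute("INSERT INTO photos (path) VALUES ('')")
        photo_id = cur.lastrowid
        path = os.path.join(PHOTO_DIR, f"{photo_id}.jpg")
        with open(path, "wb") as f:
            f.write(photo_bytes)
        conn.execute("UPDATE photos SET path = ? WHERE id = ?", (path, photo_id))
        conn.commit()
        return photo_id

    def load_photo(photo_id: int) -> bytes:
        (path,) = conn.execute(
            "SELECT path FROM photos WHERE id = ?", (photo_id,)
        ).fetchone()
        with open(path, "rb") as f:
            return f.read()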
The basic rule of thumb is to put big data objects like images directly on disk and only reference their URLs or paths in the database. This might come in handy for loading/processing the images anyway.
30-40 kB per image is not that much, but I'd still consider 6-12 MB for the database quite a lot, especially since it's probably the majority of your database volume.
I'm not really familiar with iOS, but my understanding is that it supports XML files. If the database is just being used to store the paths (instead of the images), why not use an XML file to store the paths?
If you need the DB, I don't see small images being a problem if only the phone is using it. Either way, I don't think it'll be an issue. Someone else can probably give you a better answer as far as efficiency goes; that's outside my jurisdiction.
Store all the pics in the Documents folder, and when there is no internet connection, load them from the Documents folder of your iPhone.