REST - GET response with a temporarily uploaded file

I know the title is not quite right, but I don't know what else to call this problem...
Currently I'm trying to design my first REST API, for a conversion service: the user submits an input file to the server and gets back the converted file.
The current problem is that the converted file should be accessible with a simple GET /conversionservice/my/url. However, it is not possible to upload the input file within a GET request; a POST would be necessary (am I right?), but POST isn't cacheable.
Now my question is: what's the right way to design this? I know it would be possible to upload the input file to the server beforehand and then access it with my GET request, but those input files could be anything!
Thanks for your help :)

A POST request is indeed needed for a file upload. The fact that it is not cacheable should not bother the service, because how could any intermediary (the browser, the server, a proxy, etc.) know about the content of the file? If you need cacheability, you would have to implement it yourself, probably with a hash (MD5, SHA-1, etc.) of the uploaded file. This would keep you from having to perform the actual conversion twice, but you would have to hash each uploaded file, which would slow you down on a "cache miss".
The only other way I can think of to solve the problem would be to require the user to pass an accessible URL to the file in the query string; then you could handle GET requests, but your users would have to make the file accessible over the internet. This would allow caching but limit usability.
Perhaps a hybrid approach would be possible where you accept a POST for a file upload and a GET for a URL; this would increase the complexity of the service but maximize usability.
Also, you should look into which caches you are interested in leveraging, as a lot of them have limits on the size of a cache entry, meaning that a sufficiently large file would not be cached anyway.
In the end, I would advise you to stick to the standards already established. Accept the POST request for the file upload, and if you are interested in speeding up the user experience, maybe make the upload persist; this would allow the user to upload a file once and download it in many different formats.
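The hash-based cache suggested above can be sketched in plain Java. The class and method names here (ConversionCache, getOrConvert) are illustrative, not an existing API, and the real conversion step is stood in for by a function parameter:

```java
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of the hash-based cache idea: key each conversion result by a
// digest of the uploaded bytes, so the same input is only converted once.
public class ConversionCache {
    private final Map<String, byte[]> cache = new HashMap<>();

    // Hex-encoded SHA-256 of the uploaded file's bytes, used as the cache key.
    static String cacheKey(byte[] uploadedBytes) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(uploadedBytes)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Convert only on a cache miss; `convert` stands in for the real conversion.
    byte[] getOrConvert(byte[] input,
                        java.util.function.Function<byte[], byte[]> convert) throws Exception {
        return cache.computeIfAbsent(cacheKey(input), k -> convert.apply(input));
    }
}
```

The trade-off is exactly as described: one digest pass over every upload buys you skipping the (usually much more expensive) conversion on repeated inputs.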

Your sequence of events can be as follows:
Upload your file/files using POST. For an immediate response, you can return the required information in your own headers (the response should include a document key for accessing the file later).
Then you can use GET for further operations, passing the above-mentioned document key as a query string.


REST File upload with resume and metadata support

I want to implement a REST API for file upload, which has the following requirements:
1. Support for resumable and chunked uploads
2. Support for adding and editing metadata after upload, like tags or descriptions
3. Be resource friendly, so uploads should use raw data, i.e. no encoding and/or encapsulation
So, requirements 1) and 3) already rule out the use of multipart forms (I don't like this approach anyway).
I already have a working solution that satisfies requirement 1: I issue a POST request first, which transmits fileName, fileSize and fileModification date as JSON. The upload module then creates a temporary file (if none exists!) and responds with a 206 containing an upload token in the header, which in turn is a hash created over the data sent in the POST. This token has to be used for the actual upload. The response also contains a byte range which instructs the client what part of the file it has to upload. The advantage is that this way it is detectable whether a former partial upload already exists - the trick is to name the temporary file after the token. So the byte range might also start with a value other than 0-. All the user has to do to resume an interrupted upload is to upload the same file again!
The actual upload is done via PUT, with the message body containing only the raw binary data. The response then returns the metadata of the created file as JSON, or responds with another 206 containing an updated byte range header if the upload is still incomplete. This way, chunked uploads are also possible. All in all I like this solution and it works well. At least I see no other way to implement resumable uploads without a two-stage approach.
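The token and byte-range bookkeeping of this two-stage scheme can be sketched in plain Java. These names are illustrative, SHA-1 is assumed as the hash over the POSTed fields, and on the server side the token would be mapped to the temporary file whose current length drives the byte range:

```java
import java.security.MessageDigest;

// Sketch of the two-stage resumable upload described above: the upload token
// is a digest over the metadata sent in the initial POST, and the byte range
// the client still has to send is derived from the partial temp file's size.
public class ResumableUpload {

    // Token = hex SHA-1 over fileName, fileSize and modification date,
    // matching the idea of hashing the data sent in the POST. The same
    // file metadata therefore always maps to the same temp file name.
    static String uploadToken(String fileName, long fileSize, long modified) throws Exception {
        String payload = fileName + "|" + fileSize + "|" + modified;
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(payload.getBytes())) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // The inclusive byte range still missing: starts at the size of the
    // partial temp file (0 if none exists yet) and ends at the last byte.
    static String remainingRange(long partialBytesOnServer, long fileSize) {
        return partialBytesOnServer + "-" + (fileSize - 1);
    }
}
```

Because the token is a pure function of the file's metadata, re-uploading the same file reproduces the same token, which is what makes resumption after an interruption automatic.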
Now, my problem is to make this truly "RESTish".
Let's say we have a collection named /files where I want to upload files. Of course, POST /files seems to be the natural way to do so. But then we would have a subsequent PUT to the same collection, which is of course not REST-compatible in my eyes. The next approach would be for the initial POST to return a new URL that points to the final resource, /files/{fileId}, and for the subsequent PUT(s) to write to this resource instead.
That feels more "RESTish", but is not as straightforward and then it may be possible that there are incomplete file resources floating around when upload is not completed. I think the actual resource should only be created if the upload is complete. Furthermore, if I want to update / add metadata later on, that would be a PUT to the same resource, but the request itself would be quite different. Hmmm.
I could change the PUT to a POST, so the upload would consist of multiple POSTs but this smells fishy as well.
I also thought of splitting the resource in two subresources, like /files/{fileId}/metadata and /files/{fileId}/binary but this feels a bit "overengineered" and also requires the file resource to be created before the upload is complete.
Any ideas how to solve this in a better way?

Remove Google cache of thousands of pages

Google Webmaster Tools has a cache removal option, but it only allows entering one URL at a time. I want to remove the cache of around 50k URLs, and it's a tedious job to do it 50k times.
The URLs look like the following:
Say user profiles are being cached.
The URLs are like "myinfo.com/profiles/1", "myinfo.com/profiles/2", "myinfo.com/profiles/3", "myinfo.com/profiles/4" and so on.
If I enter the path http://myinfo.com/profiles/, will Google remove the cache of all profiles?
Or is there any way to submit URLs in bulk?
Since you've already figured out how to request removal of a single URL, doing that 50k times is out of the question here.
The next step in that process is that the tool lets you request removal of an entire directory, which works if the file URLs are predictable in that particular manner. (Since you mentioned you have thousands of URLs, I'd hope they're at least somewhat organized.) So try it that way once.

How to Update a resource with a large attachment with PUT request in JAX-RS?

I have a large binary file (a log file) that I want to upload to the server using a PUT request. The reason I chose PUT is simply that I can use it either to create a new resource or to update an existing one.
My problem is how to handle the situation when a server or network disruption happens during the PUT request.
That is, say I have a huge file and a network failure happens during its transfer. When the network resumes, I don't want to restart the entire upload. How would I handle this?
I am using JAX-RS API with RESTeasy implementation.
Some people use the Content-Range header to achieve this, but many people (like Mark Nottingham) state that this is not legal for requests. Please read the comments on this answer.
Besides, there is no support in JAX-RS for this scenario.
If you really have a recurring problem with broken PUT requests, I would simply let the client slice the file:
PUT /logs/{id}/1
PUT /logs/{id}/2
PUT /logs/{id}/3
GET /logs/{id} would then return the aggregation of all successful submitted slices.
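The slicing idea can be sketched in plain Java, leaving out the JAX-RS wiring (the class and method names are illustrative, not part of JAX-RS or RESTeasy):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.TreeMap;

// Sketch of the sliced-PUT scheme: each slice is stored under its index,
// and the GET handler returns the slices concatenated in order.
public class SlicedLog {
    // TreeMap keeps the slices sorted by index, so slices may arrive
    // (or be retried) in any order.
    private final TreeMap<Integer, byte[]> slices = new TreeMap<>();

    // Backs PUT /logs/{id}/{n}: idempotent, re-PUTting a slice replaces it,
    // which is exactly what a client retries after a network failure.
    void putSlice(int n, byte[] body) {
        slices.put(n, body);
    }

    // Backs GET /logs/{id}: the aggregation of all successfully submitted slices.
    byte[] aggregate() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] slice : slices.values()) {
            out.write(slice);
        }
        return out.toByteArray();
    }
}
```

The nice property is that a failed PUT only loses one slice: the client retries PUT /logs/{id}/{n} for the missing index instead of restarting the whole upload.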

Best DB solution for storing large files

I must provide a solution where users can upload files that are stored together with some metadata, and this may grow really big.
Access to these files must be controlled, so they want me to just store them in DB BLOBs, but I fear PostgreSQL won't handle that properly over time.
My first idea was to use some NoSQL DB solution, but I couldn't find any that would replace a good RDBMS and also store files elegantly. Then I thought of just saving these files on disk, somewhere the web server won't serve them, naming them after their table ID, and just loading them into RAM and sending them with the proper content type.
Could anyone suggest a better solution for this?
I had the requirement to store many images (with some metadata) and allow controlled access to them; here is what I did.
To the cloud™
I save the image files in Amazon S3. My local database holds the metadata, with the S3 location of the file as one column. When an authenticated and authorized user needs to see the file, they hit a URL in my system (where the authentication and authorization checks occur), which then generates a pre-signed, expiring URL for the image and sends a redirect back to the browser. The browser is then able to load the image for a given amount of time (as specified in the signature within the URL).
With this solution I have user level access to the resources and I don't have to store them as BLOBs or anything like that which may grow unwieldy over time. I also don't use MY bandwidth to stream the files to the client and get cheap, redundant storage for them. Obviously the suitability of this solution will depend on the nature of the binary files you are looking to store and your level of trust in Amazon. The world doesn't end if there is a slip and someone sees an image from my system they shouldn't. YMMV.
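A simplified illustration of the expiring signed-URL idea in plain Java. Note that this is not the real S3 signing scheme (actual pre-signed S3 URLs are generated by the AWS SDK using AWS Signature V4); it only shows the shape of the mechanism: an expiry timestamp plus an HMAC over path and expiry that the issuing server can later verify.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Toy version of an expiring, signed URL: the signature covers the path and
// the expiry time, so neither can be changed without invalidating the URL.
public class SignedUrl {

    static String sign(String path, long expiresEpochSeconds, byte[] secret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret, "HmacSHA256"));
        byte[] sig = mac.doFinal((path + "\n" + expiresEpochSeconds).getBytes());
        StringBuilder hex = new StringBuilder();
        for (byte b : sig) hex.append(String.format("%02x", b));
        return path + "?expires=" + expiresEpochSeconds + "&signature=" + hex;
    }

    // The server recomputes the signature and rejects expired or tampered URLs.
    static boolean verify(String path, long expires, String signature,
                          long nowEpochSeconds, byte[] secret) throws Exception {
        if (nowEpochSeconds > expires) return false;
        return sign(path, expires, secret).endsWith("&signature=" + signature);
    }
}
```

The key point for access control is that only the holder of the secret can mint a valid URL, and the URL stops working on its own once the expiry passes, so there is nothing long-lived to leak.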

Amazon S3 - direct upload, hide action URL so the user can't see it

We use Amazon S3 for storing large files, so we use direct upload from user's browser as described here: http://aws.amazon.com/articles/1434
My question is:
could I somehow hide the form's action URL so the user won't be able to find out where the file is being uploaded? Could it be 100% hidden, or could I at least make it harder for experienced users to find?
Thanks.
Well, you need to give the information to the user so that they can use it to upload to S3... the only way to hide it would be to have them POST to your server, which then re-POSTs to S3, but that defeats the purpose, doesn't it?
What's the concern, exactly? The document you linked shows that you have to sign everything, so S3 knows that only you could have made that form... it's not like anybody can get at your data anyway.
You can't hide it, since the whole point is going from the user's computer directly to S3, the fastest way possible. Even if you removed it from the code and had your form's submit button request a URL to use and then hid it, it would still show up in the browser console's network activity. If you have to use S3 and want to run uploads through your own server instead, you should maybe install a faster uploader than Node (or write one in C? haha), which begs the question: what language/framework can upload to S3 the fastest?