Simple question: I want to upload/download large files around via REST. What is the best practice to do that? Are there any chunk-patterns, do I use multipart on the transport layer, what do you recommend?
Use case: we have an API where you can upload payments (e.g. 500mb) and download large account statement files. I am aware that other protocols exist to do that but how is it done with REST?
See the answers to the following questions. They might help with your problem:
REST design for file uploads
Large file upload though html form (more than 2 GB)
In conclusion:
With REST you can simply use HTTP header fields to describe the upload: Content-Length states the size of the body, and the Content-Type multipart/form-data lets you send a file in a single request, for files up to the server limit (usually 2 GB to 4 GB). For files larger than that, you will have to split the upload into multiple requests.
Also check out this answer to see if byte-serving or chunked encoding makes sense in your application:
Content-Length header versus chunked encoding
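The "split into multiple parts" idea can be sketched on the client side as follows. This is only an illustration: the function name iter_parts and the part size are made up for the example, and how each part is then sent (URL, method, headers) depends on the API you design.

```python
import io


def iter_parts(stream, part_size):
    """Yield (offset, data) pairs that cover a binary stream in order."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    offset = 0
    while True:
        data = stream.read(part_size)
        if not data:
            break  # end of stream reached
        yield offset, data
        offset += len(data)


# Example with an in-memory "file" of 10 bytes and 4-byte parts.
parts = list(iter_parts(io.BytesIO(b"abcdefghij"), 4))
```

Because only one part is held in memory at a time, the same loop works for a file opened with open(path, "rb") regardless of its size.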
Related
I use PhotoSwipe in my project. PhotoSwipe requires the image dimensions to be defined up front. Back when I used PHP that was simple; nowadays, in serverless times, it looks like an issue.
How can I get an image's size on the client? I've checked a few cloud storage providers. They all offer resizing, but that is not exactly what I'm looking for: I need the full image properties when I initialize the app, before the image is loaded.
The easiest thing is to find a hosting service with an API that provides the metadata you need.
Failing that, you can work out the dimensions of an image by looking at just the first few dozen bytes. All modern image formats contain dimensions in a file header. Examples in three languages are linked to in the [Photoswipe FAQ][1].
So, you need to download enough bytes (this varies per file format). To do this you need to use the Range header in the HTTP request. For example:
curl http://i.imgur.com/z4d4kWk.jpg -i -H "Range: bytes=0-1023"
... will get the first kilobyte of that image. Whatever HTTP client you use, it will have a way to set a request header.
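As a rough sketch of the same idea in Python: the request below carries the Range header from the curl example, and png_size reads the dimensions from the first 24 bytes of a PNG. PNG is used here only because its header layout is simple; a JPEG such as the one in the curl example needs a scan for the start-of-frame marker instead. The function names are made up for this illustration, and nothing is sent over the network.

```python
import struct
import urllib.request

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def range_request(url, last_byte=1023):
    """Build (but do not send) a request for bytes 0..last_byte of a URL."""
    return urllib.request.Request(url, headers={"Range": "bytes=0-%d" % last_byte})


def png_size(header):
    """Return (width, height) from the leading bytes of a PNG file."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    # Width and height are two big-endian 32-bit integers right after "IHDR".
    return struct.unpack(">II", header[16:24])


request = range_request("http://i.imgur.com/z4d4kWk.jpg")

# A hand-built PNG header for a 640x480 image, standing in for downloaded bytes.
sample_header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)
width, height = png_size(sample_header)
```

Passing the request to urllib.request.urlopen would perform the actual partial download, provided the server honours range requests.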
I want to implement a REST API for file upload, which has the following requirements:
1) Support for resumable and chunked uploads
2) Support for adding and editing metadata after upload, like tags or descriptions
3) Be resource friendly, so uploads should use raw data, i.e. no encoding and / or encapsulation
So, requirements 1) and 3) already rule out use of multipart forms (I don't like this approach anyways).
I already have a working solution to satisfy requirement 1): the client first issues a POST request which transmits fileName, fileSize and the file modification date as JSON. The upload module then creates a temporary file (if none exists!) and responds with a 206, containing an upload token in the header, which in turn is a hash created over the data sent in the POST. This token has to be used for the actual upload. The response also contains a byte range which tells the client what part of the file it has to upload. The advantage is that this way it is detectable whether a former partial upload already exists. The trick is to name the temporary file the same as the token. So the byte range might also start with a value other than 0. All the user has to do to resume an interrupted upload is to upload the same file again!
The actual upload is done via PUT, with the message body containing only the raw binary data. The response then returns the metadata of the created file as JSON, or responds with another 206 containing an updated byte range header if the upload is still incomplete. This way, chunked uploads are also possible. All in all I like this solution and it works well. At least I see no other way to implement resumable uploads without a two-stage approach.
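The server-side bookkeeping of this two-stage scheme could look roughly like the sketch below. It is not the actual implementation from the question: the helper names are hypothetical, SHA-256 is an assumed choice of hash, and the HTTP layer (status codes, headers) is left out entirely.

```python
import hashlib
import json
import os
import tempfile


def upload_token(file_name, file_size, modified):
    """Hash the data sent in the initial POST; the same file gives the same token."""
    payload = json.dumps(
        {"fileName": file_name, "fileSize": file_size, "fileModification": modified},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def next_offset(tmp_dir, token):
    """Create the temporary file if needed and return the first missing byte."""
    path = os.path.join(tmp_dir, token)  # the temp file is named after the token
    if not os.path.exists(path):
        open(path, "wb").close()
    return os.path.getsize(path)


def append_chunk(tmp_dir, token, offset, data, file_size):
    """Append raw bytes from a PUT; return (bytes received so far, complete?)."""
    path = os.path.join(tmp_dir, token)
    if offset != os.path.getsize(path):
        raise ValueError("chunk does not start at the expected offset")
    with open(path, "ab") as temp_file:
        temp_file.write(data)
    received = offset + len(data)
    return received, received >= file_size


with tempfile.TemporaryDirectory() as tmp_dir:
    token = upload_token("payments.xml", 10, "2020-01-01T00:00:00Z")
    start = next_offset(tmp_dir, token)          # fresh upload starts at 0
    first = append_chunk(tmp_dir, token, start, b"hello", 10)
    resume_at = next_offset(tmp_dir, token)      # a repeated POST resumes here
    second = append_chunk(tmp_dir, token, resume_at, b"world", 10)
```

Since the token depends only on the name, size and modification date, repeating the initial POST for the same file leads back to the same temporary file, which is what makes resuming possible.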
Now, my problem is to make this truly "RESTish".
Let's say we have a collection named /files where I want to upload files. Of course, POST /files seems to be the natural way to do so. But then we would have a subsequent PUT to the same collection, which is not REST compatible in my eyes. The next approach would be that the initial POST returns a new URL that points to the final resource, /files/{fileId}, and the subsequent PUT(s) write to this resource instead.
That feels more "RESTish", but is not as straightforward and then it may be possible that there are incomplete file resources floating around when upload is not completed. I think the actual resource should only be created if the upload is complete. Furthermore, if I want to update / add metadata later on, that would be a PUT to the same resource, but the request itself would be quite different. Hmmm.
I could change the PUT to a POST, so the upload would consist of multiple POSTs but this smells fishy as well.
I also thought of splitting the resource in two subresources, like /files/{fileId}/metadata and /files/{fileId}/binary but this feels a bit "overengineered" and also requires the file resource to be created before the upload is complete.
Any ideas how to solve this in a better way?
I know that the title is not quite correct, but I don't know how to name this problem...
Currently I'm trying to design my first REST-API for a conversion-service. Therefore the user has an input file which is given to the server and gets back the converted file.
My current problem is that the converted file should be accessible with a simple GET /conversionservice/my/url. However, it is not possible to upload the input file within a GET request. A POST would be necessary (am I right?), but POST isn't cacheable.
Now my question is: what's the right way to design this? I know that it would be possible to upload the input file to the server beforehand and then access it with my GET request, but those input files could be anything!
Thanks for your help :)
A POST request is indeed needed for a file upload. The fact that it is not cacheable should not bother the service: how could any intermediary (the browser, the server, a proxy, etc.) know about the content of the file? If you need cacheability, you would probably have to implement it yourself with a hash (MD5, SHA-1, etc.) of the uploaded file. This would keep you from having to perform the actual conversion twice, but you would have to hash every uploaded file, which slows down each "cache miss".
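A minimal sketch of such a hash-keyed cache, assuming an in-memory dictionary and SHA-256 as the hash (both are choices made for this illustration, as is the class name ConversionCache):

```python
import hashlib


class ConversionCache:
    """Remember conversion results by content hash and target format."""

    def __init__(self, convert):
        self._convert = convert  # callable(data, target_format) -> bytes
        self._results = {}
        self.misses = 0

    def get(self, data, target_format):
        # Every upload is hashed, which is the extra cost mentioned above.
        key = (hashlib.sha256(data).hexdigest(), target_format)
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._convert(data, target_format)
        return self._results[key]


# Stand-in "conversion" that just upper-cases the bytes.
cache = ConversionCache(lambda data, target_format: data.upper())
first = cache.get(b"hello", "pdf")
second = cache.get(b"hello", "pdf")   # same content: served from the cache
```

A real service would keep the results on disk or in an external store and would need some eviction policy; the dictionary only shows the lookup logic.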
The only other way I could think of to solve the problem would be to require the user to pass in an accessible url to the file in the query string, then you could handle GET requests, but your users would have to make the file accessible over the internet. This would allow caching but limit the usability.
Perhaps a hybrid approach would be possible where you accepted a POST for a file upload and a GET for a url, this would increase the complexity of the service but maximize usability.
Also, you should look into what caches you are interested in leveraging as a lot of them have limits on the size of a cache entry meaning if the file is sufficiently large it would not cache anyway.
In the end, I would advise you to stick to the standards already established. Accept the POST request for the file upload and if you are interested in speeding up the user experience maybe make the upload persist, this would allow the user to upload a file once and download it in many different formats.
Your sequence of events can be as follows:
Upload your file/files using POST. For immediate response, you can return required information using your own headers. (It should return document key to access the file for future use.)
Then you can use GET for further operations using the above mentioned document key as a query string.
Hi, I'm trying to build a Tornado server that receives very big binary files (~1GB) in the POST body. The following code works for small files, but does not answer if I try to send big files (~100MB).
import tornado.web

class ReceiveLogs(tornado.web.RequestHandler):
    def post(self):
        # The whole request body is already buffered in memory at this point.
        file1 = self.request.body
        with open('./output.zip', 'wb') as output_file:
            output_file.write(file1)
        self.finish("file is uploaded")
Do you know any solutions?
I don't have a real implementation as an answer, but one or two remarks that hopefully point in the right direction.
First of all, there is a 100MB upload limit, which can be increased by calling
self.request.connection.set_max_body_size(size)
in the initialization of the request handler. (taken from this answer)
The problem is that Tornado handles all file uploads in memory (and that HTTP is not a very reliable protocol for handling large file uploads).
This is a quote from a member of the tornadoweb team from 2014 (see GitHub issue here):
... You can adjust this limit with the max_buffer_size argument to the
HTTPServer constructor, although I don't think it would be a good idea
to set this larger than say 100MB.
Tornado does not currently support very large file uploads. Better
support is coming (#1021) and the nginx upload module is a popular
workaround in the meantime. However, I would advise against doing 1GB+
uploads in a single HTTP POST in any case, because HTTP alone does not
have good support for resuming a partially-completed upload (in
addition to the aforementioned error problem). Consider a multi-step
upload process like Dropbox's chunked_upload and commit_chunked_upload
(https://www.dropbox.com/developers/core/docs#chunked-upload)
As stated, I would recommend doing one of the following:
If it is possible to put NGINX in front of Tornado to handle and route requests, look at the NGINX upload module (see the NGINX wiki here).
If it must be a plain Tornado solution, use tornado.web.stream_request_body, which came with Tornado 4. It hands the uploaded data to the handler in chunks, so it can be written to disk instead of first being held entirely in memory. (See the Tornado 4 release notes and this solution on Stack Overflow.)
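For the plain Tornado option, a minimal sketch of the handler from the question rewritten with stream_request_body might look like this. It requires Tornado 4 or later; the 2 GiB limit and the /upload route are assumptions for the example, and the server is not started here.

```python
import tornado.web

UPLOAD_PATH = "./output.zip"     # same target file as the handler in the question
MAX_BODY_SIZE = 2 * 1024 ** 3    # assumed cap of 2 GiB; pick your own limit


@tornado.web.stream_request_body
class ReceiveLogs(tornado.web.RequestHandler):
    def prepare(self):
        # Runs once the headers have arrived, before any body data is read.
        self.request.connection.set_max_body_size(MAX_BODY_SIZE)
        self.output_file = open(UPLOAD_PATH, "wb")
        self.bytes_received = 0

    def data_received(self, chunk):
        # Called repeatedly with pieces of the body; write each one to disk.
        self.output_file.write(chunk)
        self.bytes_received += len(chunk)

    def post(self):
        # Called after the whole body has been passed to data_received().
        self.output_file.close()
        self.finish("file is uploaded (%d bytes)" % self.bytes_received)

    def on_connection_close(self):
        # The client went away mid-upload: release the file handle.
        super().on_connection_close()
        output_file = getattr(self, "output_file", None)
        if output_file is not None and not output_file.closed:
            output_file.close()


def make_app():
    """Build the application; the /upload route is made up for this sketch."""
    return tornado.web.Application([(r"/upload", ReceiveLogs)])
```

To try it, call make_app().listen(8888) (the port is arbitrary) and then start the loop with tornado.ioloop.IOLoop.current().start(). Note that this only removes the memory problem; it does not make a single large POST resumable.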
I have a large binary file (a log file) that I want to upload to a server using a PUT request. The reason I chose PUT is simply that I can use it to create a new resource or update an existing one.
My problem is how to handle the situation when a server or network disruption happens during the PUT request.
That is, say I have a huge file, and a network failure happens during the transfer. When the network comes back, I don't want to restart the entire upload. How would I handle this?
I am using JAX-RS API with RESTeasy implementation.
Some people are using the Content-Range Header to achieve this but many people (like Mark Nottingham) state that this is not legal for requests. Please read the comments to this answer.
Besides there is no support from JAX-RS for this scenario.
If you really have the repeating problem of broken PUT requests I would simply let the client slice the files:
PUT /logs/{id}/1
PUT /logs/{id}/2
PUT /logs/{id}/3
GET /logs/{id} would then return the aggregation of all successfully submitted slices.
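The server-side logic behind these slice URLs can be sketched as below. The class SliceStore and its in-memory dictionary are assumptions for illustration; they stand in for whatever storage the JAX-RS resource would actually use.

```python
class SliceStore:
    """Keep the slices of each log by number and join them on read."""

    def __init__(self):
        self._logs = {}

    def put_slice(self, log_id, index, data):
        # PUT /logs/{id}/{index}: repeating a failed PUT simply overwrites the slice.
        self._logs.setdefault(log_id, {})[index] = data

    def get_log(self, log_id):
        # GET /logs/{id}: concatenate the stored slices in index order.
        slices = self._logs.get(log_id, {})
        return b"".join(slices[index] for index in sorted(slices))

    def missing(self, log_id, total):
        # Slice numbers (1..total) that the client still has to send.
        slices = self._logs.get(log_id, {})
        return [index for index in range(1, total + 1) if index not in slices]


store = SliceStore()
store.put_slice("42", 1, b"first ")
store.put_slice("42", 3, b"third")
still_missing = store.missing("42", 3)
store.put_slice("42", 2, b"second ")   # the interrupted slice is sent again
whole_log = store.get_log("42")
```

After a network failure the client only has to resend the slices that did not arrive, rather than the whole file.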