Processing large downloads with Spray - scala

I'd like to GET a potentially large file with Spray and process it incrementally, rather than loading the whole response entity into memory at once. (Specifically, to process a CSV file line by line.) The request will be to an arbitrary server, so I can't expect a chunked response. Is this possible?

If you set spray.can.client.parsing.incoming-auto-chunking-threshold-size to some finite value, entities bigger than that will be delivered in chunks. See here: https://github.com/spray/spray/blob/master/spray-can/src/main/resources/reference.conf#L372
See this ticket for an overview of features wrt chunked and streaming facilities in spray: https://github.com/spray/spray/issues/281
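A minimal sketch of the consuming side, assuming spray-can 1.2.x, the request-level API, and a threshold set in application.conf (e.g. spray.can.client.parsing.incoming-auto-chunking-threshold-size = 64k); the actor and the line-splitting logic are illustrative, not from the question:

import akka.actor.Actor
import akka.io.IO
import spray.can.Http
import spray.http._

class CsvDownloader(uri: String) extends Actor {
  import context.system

  // request-level API: the response (or its chunked parts) comes back to this actor
  IO(Http) ! HttpRequest(HttpMethods.GET, Uri(uri))

  private val buffer = new StringBuilder

  def receive = {
    case ChunkedResponseStart(_) =>
      // status line and headers are available here; the entity follows in chunks
    case chunk: MessageChunk =>
      buffer.append(chunk.data.asString)
      val parts = buffer.toString.split("\n", -1)
      parts.init.foreach(processLine)             // complete lines
      buffer.clear(); buffer.append(parts.last)   // keep the partial tail
    case _: ChunkedMessageEnd =>
      if (buffer.nonEmpty) processLine(buffer.toString)
      context.stop(self)
    case response: HttpResponse =>
      // entities below the threshold still arrive in one piece
      response.entity.asString.split("\n").foreach(processLine)
      context.stop(self)
  }

  private def processLine(line: String): Unit = println(line)
}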

Related

What REST method should be used to implement a simple inter-process communication?

This is more a theoretical question than a practical one.
We have a backend application that uploads CSV files to a frontend application; only then does the backend send an empty POST request to tell the frontend to start processing those files to update its database.
For this question it doesn't matter whether this is a good design (I think it isn't), what those files are, or what the database is: I only want to understand the REST "syntax" better.
I'm referring to wikipedia and restfulapi.net, but I'm not convinced about any alternative, because:
GET: Request sender doesn't receive data;
POST (currently used): Request sender isn't inserting data carried in the request body (only data from external files, if any; and those may be inserts/updates/deletes);
PUT: Sounds good, but again, the data is not in the request body;
PATCH: Sounds best, but the data is not in the body (also, am I wrong, or is it deprecated/unused?);
DELETE: Doesn't always need to delete.
I know it is common practice to use POST requests to let machines yell "go!" at each other, but I never thought it was right.
What do you think would, in theory, be the proper method?
The actual reference for the semantics of the HTTP methods is RFC 7231, not the ones you referenced in your question.
POST is a catch all method and requests that the target resource process the representation enclosed in the request according to the resource's own specific semantics.
4.3.3. POST
The POST method requests that the target resource process the
representation enclosed in the request according to the resource's
own specific semantics. For example, POST is used for the following
functions (among others):
Providing a block of data, such as the fields entered into an HTML
form, to a data-handling process;
Posting a message to a bulletin board, newsgroup, mailing list,
blog, or similar group of articles;
Creating a new resource that has yet to be identified by the
origin server; and
Appending data to a resource's existing representation(s).
[...]
Responses to POST requests are only cacheable when they include
explicit freshness information. However, POST caching is not widely implemented.
In these scenarios, the receiving application knows where the CSV files will be and monitors that location. When it finds one, it processes it and then deletes or archives it. The application will likely have its own criteria for considering itself ready to process, e.g. time of day, size of file etc.
If the data load on the front end takes a long time, you could "partition" the updates based on "importance". How you define importance would be up to your business rules. You could then POST a list of CSV filenames/locations to the front end, ordered by importance. The front end could then update its database based on that importance, scheduling less important data for a more appropriate time of day.
If the backend knows the difference between new users and updated users you could use PUT and POST. The front end could assign higher priority to PUT requests as they relate to new users, perhaps assigning lower priority and staggered syncing for CSV filenames in POST requests.
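A sketch of what such a prioritized notification could look like (the endpoint and field names are made up for illustration):

POST /frontend/csv-imports
{
  files: [
    { name: 'new_users.csv', importance: 'high' },
    { name: 'updated_users.csv', importance: 'low' }
  ]
}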

REST API containing POST and PUT/PATCH calling a compute server generating results files

The server application I'm implementing generates calculation results and stores these in result files in directories on the server, for example customer/project/scenario/resultfiles. I want to design and implement a resilient REST API to retrieve the result files for display in the client browser, to delete result files, customers, etc., and to create result files within a scenario for calculation parameters sent to the server. And possibly to do sensitivity analysis, generating result files within a scenario by varying the calculation parameters.
I can use GET to retrieve these files using a URL with a query string, appname/?customerId=xxx&projectId=xxx etc., and DELETE on the directory structure and files, also using query strings. What I'm unclear about is the best REST approach for calling the functions that implement the various calculations on the server.
Perhaps this should be a POST for the initial calculation in a scenario as this is creating the results files? Maybe a PUT or a PATCH for the sensitivity analysis or other partial recalculations as this is modifying results in an existing scenario?
There's a fair bit of online discussion about PUT vs PATCH vs POST used for database related activities. I could work up a REST approach based on what I've read for REST database interactions but if there's already standard practice on how to do calculations through a REST API I'd rather use that.
Perhaps this should be a POST for the initial calculation in a scenario as this is creating the results files? Maybe a PUT or a PATCH for the sensitivity analysis or other partial recalculations as this is modifying results in an existing scenario?
You can always just use POST. If we were using HTML representations of resources to guide the client through the protocol, we'd be doing that by following links and submitting forms. In HTML, submitting forms is limited to GET and POST.
PUT and PATCH have more tightly constrained semantics than POST. Specifically, they are methods that request that the server make its representation match the client's representation (for PUT, we send the entire replacement representation; for PATCH, we just send the changes made by the client).
Technically, there's nothing wrong with the server not accepting the offered edits as is:
A successful PUT of a given representation would suggest that a subsequent GET on that same target resource will result in an equivalent representation being sent in a 200 (OK) response. However, there is no guarantee that such a state change will be observable, since the target resource might be acted upon by other user agents in parallel, or might be subject to dynamic processing by the origin server, before any subsequent GET is received. A successful response only implies that the user agent's intent was achieved at the time of its processing by the origin server.
So the server could accept the client's edits, and then immediately apply additional edits of its own.
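Coming back to the original question, "just use POST" could in concrete terms look something like this (URLs, identifiers and the response are illustrative, based on the directory layout in the question):

POST /customers/234/projects/12/scenarios/3/calculations
{ parameters: { ... } }

201 Created
Location: /customers/234/projects/12/scenarios/3/resultfiles/42

A sensitivity analysis that varies parameters within the same scenario could simply be another POST to the same calculations resource with different parameters.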

Limit WsClient download size

I'm fetching and parsing URLs with WsClient. However, I don't want to parse a remote resource that contains a large amount of data (for example, a URL that points to a video).
Is there a built-in option to set a limit on remote content size in WsClient?
Is it possible to do it without Akka Streams? The difficulty with using streams is that they provide ByteStrings, so there is a character-encoding headache (UTF-8, CP1251, etc.).
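One possible approach, sketched under the assumption of Play 2.6+ with the Akka-Streams-backed WSClient (maxBytes and the charset fallback are illustrative): stream the body, fail the download once a byte limit is exceeded, and decode using the charset from the Content-Type header.

import akka.stream.Materializer
import akka.util.ByteString
import play.api.libs.ws.WSClient
import scala.concurrent.{ExecutionContext, Future}

def fetchLimited(ws: WSClient, url: String, maxBytes: Long)
                (implicit mat: Materializer, ec: ExecutionContext): Future[String] =
  ws.url(url).stream().flatMap { response =>
    response.bodyAsSource
      .limitWeighted(maxBytes)(_.length.toLong)   // fails the stream once maxBytes is exceeded
      .runFold(ByteString.empty)(_ ++ _)
      .map { bytes =>
        // decode with the charset advertised by the server, falling back to UTF-8
        val charset = response.header("Content-Type")
          .flatMap("charset=([\\w-]+)".r.findFirstMatchIn(_))
          .map(_.group(1))
          .getOrElse("UTF-8")
        bytes.decodeString(charset)
      }
  }

This doesn't avoid Akka Streams, but it keeps the ByteString handling in one place and turns an oversized resource into a failed Future rather than an unbounded download.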

best approach to design a rest web service with binary data to be consumed from the browser

I'm developing a JSON REST web service that will be consumed from a single-page web app built with Backbone.js.
This API will let the consumer upload files related to some entity, like PDF reports related to a project.
Googling around and doing some research on Stack Overflow, I came up with these possible approaches:
First approach: base64 encoded data field
POST: /api/projects/234/reports
{
author: 'xxxx',
abstract: 'xxxx',
filename: 'xxxx',
filesize: 222,
content: '<base64 encoded binary data>'
}
Second approach: multipart form post:
POST: /api/projects/234/reports
{
author: 'xxxx',
abstract: 'xxxx',
}
As a response I'll get a report id, and with that I'll issue another POST:
POST: /api/projects/234/reports/1/content
enctype=multipart/form-data
and then just send the binary data
(have a look at this: https://stackoverflow.com/a/3938816/47633)
Third approach: post the binary data to a separate resource and save the href
First I generate a random key on the client and POST the binary content there:
POST: /api/files/E4304205-29B7-48EE-A359-74250E19EFC4
enctype=multipart/form-data
and then
POST: /api/projects/234/reports
{
author: 'xxxx',
abstract: 'xxxx',
filename: 'xxxx',
filesize: 222,
href: '/api/files/E4304205-29B7-48EE-A359-74250E19EFC4'
}
(see this: https://stackoverflow.com/a/4032079/47633)
I just wanted to know if there's any other approach I could use, the pros/cons of each, and whether there's an established way to deal with this kind of requirement.
The big con I see with the first approach is that I have to fully load and Base64-encode the file on the client.
Some useful resources:
Post binary data to a RESTful application
What is a good way to transfer binary data to a HTTP REST API service?
How do I upload a file with metadata using a REST web service?
Bad idea to transfer large payload using web services?
https://stackoverflow.com/a/5528267/47633
My research results:
Single request (data included)
The request contains the metadata. The data is a property of the metadata and is encoded (for example: Base64).
Pros:
transactional
always valid (no missing metadata or data)
Cons:
encoding makes the request very large
Examples:
Twitter
GitHub
Imgur
Single request (multipart)
The request contains one or more parts with metadata and data.
Content types:
multipart/form-data
multipart/mixed
multipart/related
Pros:
transactional
always valid (no missing metadata or data)
Cons:
content type negotiation is complex
content type for data is not visible in WADL
Examples:
Confluence (with parts for data and for metadata)
Jira (with one part for data; metadata only in the part headers, for file name and MIME type)
Bitbucket (with one part for data, no metadata)
Google Drive (with one part for metadata and one part for data)
Single request (metadata in HTTP header and URL)
The request body contains the data, and the HTTP headers and the URL contain the metadata.
Pros:
transactional
always valid (no missing metadata or data)
Cons:
no nested metadata possible
Examples:
S3 GetObject and PutObject
Two requests
One request for metadata and one or more requests for data.
Pros:
scalability (for example: the data request could go to a repository server)
resumable (see for example Google Drive)
Cons:
not transactional
not always valid (before the second request, one part is missing)
Examples:
Google Drive
YouTube
I can't think of any other approaches off the top of my head.
Of your 3 approaches, I've worked with method 3 the most. The biggest difference I see is between the first method and the other two: separating the metadata and the content into two resources.
Pro: Scalability
While your solution involves posting to the same server, this can easily be changed to point the content upload at a separate server (e.g. Amazon S3).
In the first method, the same server that serves metadata to users will have a process blocked by a large upload.
Con: Orphaned Data/Added complexity
Failed uploads (of either metadata or content) will leave orphaned data in the server DB.
Orphaned data can be cleaned up with a scheduled job, but this adds code complexity.
Method 2 reduces the orphan possibilities, at the cost of longer client wait time, since you're blocking on the response to the first POST.
The first method seems the most straightforward to code. However, I'd only go with it if you anticipate this service being used infrequently and you can set a reasonable limit on user file uploads.
I believe the best method is number 3 (separate resource), for the main reason that it maximizes the value I get from the HTTP standard, which matches how I think of REST APIs. For example, assuming a well-behaved HTTP client is in use, you get the following benefits:
Content compression: you optimize by allowing servers to respond with a compressed result if clients indicate they support it; your API is unchanged, existing clients continue to work, and future clients can make use of it.
Caching: If-Modified-Since, ETag, etc. Clients can avoid refetching the binary data altogether.
Content-type abstraction: for example, you require an uploaded image; it can be of type image/jpeg or image/png. The HTTP headers Accept and Content-Type give us elegant semantics for negotiating this between clients and servers without having to hardcode it all as part of our schema and/or API.
On the other hand, I believe it's fair to conclude that this method is not the simplest if the binary data in question is not optional, in which case the cons listed in Eric Hu's answer come into play.
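To make the caching point concrete, a conditional GET against the separate file resource could look like this (reusing the URL from the third approach; the ETag value is invented):

GET /api/files/E4304205-29B7-48EE-A359-74250E19EFC4
If-None-Match: "a1b2c3"

304 Not Modified (the client keeps its cached copy, no bytes transferred)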

RESTful way to create multiple items in one request

I am working on a small client-server program to collect orders. I want to do this in a "REST(ful) way".
What I want to do is:
Collect all orderlines (product and quantity) and send the complete order to the server
At the moment I see two options to do this:
Send each orderline to the server: POST qty and product_id
I actually don't want to do this, because I want to limit the number of requests to the server, so on to option 2:
Collect all the orderlines and send them to the server at once.
How should I implement option 2? A couple of ideas I have:
Wrap all orderlines in a JSON object and send this to the server, or use an array to POST the orderlines.
Is it a good idea or good practice to implement option 2, and if so, how should I do it?
What is good practice?
I believe that another correct way to approach this would be to create another resource that represents your collection of resources.
Example, imagine that we have an endpoint like /api/sheep/{id} and we can POST to /api/sheep to create a sheep resource.
Now, if we want to support bulk creation, we should consider a new flock resource at /api/flock (or /api/<your-resource>-collection if you lack a better meaningful name). Remember that resources don't need to map to your database or app models. This is a common misconception.
Resources are a higher-level representation, unrelated to your data. Operating on a resource can have significant side effects, like firing an alert to a user, updating other related data, initiating a long-lived process, etc. For example, we could map a file system or even the Unix ps command as a REST API.
I think it is safe to assume that operating on a resource may also mean creating several other entities as a side effect.
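Continuing the sheep/flock example, a bulk creation request could then look like this (the field names are illustrative):

POST /api/flock
{
  sheep: [
    { name: 'Dolly' },
    { name: 'Shaun' }
  ]
}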
Although bulk operations (e.g. batch create) are essential in many systems, they are not formally addressed by the RESTful architecture style.
I found that POSTing a collection as you suggested basically works, but problems arise when you need to report failures in response to such a request. Such problems are worse when multiple failures occur for different causes or when the server doesn't support transactions.
My suggestion to you is that if there is no performance problem, for example when the service provider is on the LAN (not WAN) or the data is relatively small, it's worth sending 100 POST requests to the server. Keep it simple: start with separate requests, and if you hit a performance problem, try to optimize.
Facebook explains how to do this: https://developers.facebook.com/docs/graph-api/making-multiple-requests
Simple batched requests
The batch API takes in an array of logical HTTP requests represented
as JSON arrays - each request has a method (corresponding to HTTP
method GET/PUT/POST/DELETE etc.), a relative_url (the portion of the
URL after graph.facebook.com), optional headers array (corresponding
to HTTP headers) and an optional body (for POST and PUT requests). The
Batch API returns an array of logical HTTP responses represented as
JSON arrays - each response has a status code, an optional headers
array and an optional body (which is a JSON encoded string).
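A batched request in that format looks roughly like this (the relative URLs are just examples of the shape described above):

POST https://graph.facebook.com
batch=[
  { "method": "GET",  "relative_url": "me" },
  { "method": "POST", "relative_url": "me/feed", "body": "message=Hello" }
]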
Your idea seems valid to me. The implementation is a matter of your preference. You can use JSON or just parameters for this ("order_lines[]" array) and do
POST /orders
Since you are going to create several resources at once in a single action (the order and its lines), it's vital to validate each and every one of them and save them only if all of them pass validation, i.e. you should do it in a transaction.
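For example, using the field names from the question and the order_lines array from above:

POST /orders
{
  order_lines: [
    { product_id: 42, qty: 3 },
    { product_id: 7, qty: 1 }
  ]
}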
I've actually been wrestling with this lately, and here's what I'm working towards.
If a POST that adds multiple resources succeeds, return a 200 OK (I was considering a 201, but the user ultimately doesn't land on a resource that was created) along with a page that displays all the resources that were added, either in read-only or editable fashion. For instance, a user is able to select and POST multiple images to a gallery using a form comprising only a single file input. If the POST request succeeds in its entirety, the user is presented with a set of forms, one for each image resource representation created, that allows them to specify more details about each (name, description, etc.).
In the event that one or more resources fail to be created, the POST handler aborts all processing and appends each individual error message to an array. Then a 409 Conflict is returned and the user is routed to an error page that presents the contents of the error array, as well as a way back to the form that was submitted.
I guess it's better to send separate requests within a single connection. Of course, your web server should support that (e.g. via HTTP keep-alive).
You won't want to send the HTTP headers 100 times for 100 orderlines, nor do you want to generate any more requests than necessary.
Send the whole order in one JSON object to the server, to: server/order or server/order/new.
Return something that points to: server/order/order_id
Also consider using PUT (create) instead of POST.