Limit WsClient download size - scala

I'm parsing URLs fetched with WsClient. However, I don't want to parse a remote resource that contains a large amount of data (for example, a URL that points to a video).
Is there a built-in option to limit the size of remote content in WsClient?
Is it possible to do it without Akka Streams? The difficulty with using streams is that they provide a ByteString, so there is a content-encoding headache (UTF-8, CP1251, etc.).
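One way to cap the download without buffering an unbounded body is to stream it, fail once a byte limit is exceeded, and decode only once at the end using the charset the server advertises, which also sidesteps the ByteString/encoding concern. A minimal sketch, assuming Play WS standalone (where stream() exposes bodyAsSource) and Akka 2.6; maxBytes and the naive charset parsing are illustrative, not a built-in WsClient feature:

import akka.actor.ActorSystem
import akka.stream.Materializer
import akka.util.ByteString
import play.api.libs.ws.ahc.StandaloneAhcWSClient
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem()
implicit val materializer: Materializer = Materializer(system)
import system.dispatcher

val ws = StandaloneAhcWSClient()
val maxBytes = 1L * 1024 * 1024 // illustrative 1 MB cap

val page: Future[String] =
  ws.url("https://example.com/some/page").stream().flatMap { response =>
    response.bodyAsSource
      .limitWeighted(maxBytes)(_.length.toLong) // fails the stream (and the Future) once the cap is exceeded
      .runFold(ByteString.empty)(_ ++ _)
      .map { bytes =>
        // decode once at the end, using the charset from Content-Type (naive parsing), defaulting to UTF-8
        val charset = response.header("Content-Type")
          .flatMap(_.split("charset=").lift(1).map(_.trim))
          .getOrElse("UTF-8")
        bytes.decodeString(charset)
      }
  }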


REST API containing POST and PUT/PATCH calling a compute server generating results files

The server application I'm implementing generates calculation results and stores them in result files in directories on the server, for example customer/project/scenario/resultfiles. I want to design and implement a resilient REST API to retrieve the result files for display in the client browser, to delete result files, customers, etc., and to create result files within a scenario from calculation parameters sent to the server, and possibly to run sensitivity analyses that generate result files within a scenario by varying the calculation parameters.
I can use GET to retrieve these files using a URL with a query string, appname/?customerId=xxx&projectId=xxx etc., and DELETE on the directory structure and files, also using query strings. What I'm unclear about is the best REST approach for calling functions that implement various calculations on the server.
Perhaps this should be a POST for the initial calculation in a scenario as this is creating the results files? Maybe a PUT or a PATCH for the sensitivity analysis or other partial recalculations as this is modifying results in an existing scenario?
There's a fair bit of online discussion about PUT vs PATCH vs POST used for database related activities. I could work up a REST approach based on what I've read for REST database interactions but if there's already standard practice on how to do calculations through a REST API I'd rather use that.
You can always just use POST. If we were using HTML representations of resources to guide the client through the protocol, we'd be doing that by following links and submitting forms. In HTML, submitting forms is limited to GET and POST.
PUT and PATCH have more tightly constrained semantics than POST. Specifically, they are methods that ask the server to make its representation match the client's representation (for PUT, we send the entire replacement representation; for PATCH, we send just the changes made by the client).
Technically, there's nothing wrong with the server not accepting the offered edits as is:
A successful PUT of a given representation would suggest that a subsequent GET on that same target resource will result in an equivalent representation being sent in a 200 (OK) response. However, there is no guarantee that such a state change will be observable, since the target resource might be acted upon by other user agents in parallel, or might be subject to dynamic processing by the origin server, before any subsequent GET is received. A successful response only implies that the user agent's intent was achieved at the time of its processing by the origin server.
So the server could accept the client's edits, and then immediately apply additional edits of its own.
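As a concrete illustration of the plain-POST route, each calculation run could be created as a resource under its scenario. The paths, parameter names, and response below are hypothetical, not an established convention:

POST /appname/customers/123/projects/456/scenarios/789/calculations HTTP/1.1
Content-Type: application/json

{ "parameterA": 1.5, "parameterB": 42 }

HTTP/1.1 201 Created
Location: /appname/customers/123/projects/456/scenarios/789/resultfiles/1

A sensitivity analysis could then simply be another POST to the same collection with varied parameters, creating another set of result files.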

Tarantool shiny dashboard

I want to use Tarantool database for logging user activity.
Are there any out of the box solutions to create web dashboard with nice charts based on the collected data?
A long time ago, using a very old version of Tarantool, I created a draft of tarbon, a time-series database with an interface identical to carbon-cache.
Since that time the protocol has changed, but the general idea is still the same: use spaces to store the data, a compact data layout and the right indexes to access the spaces as time-series rows, and Lua to prepare the resulting JSON.
That solution performed very well (on both reads and writes), but that old version lacked disk storage, and without disk I was quite limited in metrics capacity.
Tarantool has an embedded Lua language, so you could generate JSON from your data and use any charting library. For example, D3.js has a method to load JSON directly from a URL.
d3.json(url[, callback])
Creates a request for the JSON file at the specified url with the mime type "application/json". If a callback is specified, the request is immediately issued with the GET method, and the callback will be invoked asynchronously when the file is loaded or the request fails; the callback is invoked with two arguments: the error, if any, and the parsed JSON. The parsed JSON is undefined if an error occurs. If no callback is specified, the returned request can be issued using xhr.get or similar, and handled using xhr.on.
You could also look at c3.js, a simple facade for D3.

Processing large downloads with Spray

I'd like to GET a potentially large file with Spray and process it incrementally, rather than loading the whole response entity into memory at once. (Specifically, to process a CSV file line by line.) The request will be to an arbitrary server, so I can't expect a chunked response. Is this possible?
If you set spray.can.client.parsing.incoming-auto-chunking-threshold-size to some finite value, entities bigger than that will be delivered in chunks. See here: https://github.com/spray/spray/blob/master/spray-can/src/main/resources/reference.conf#L372
See this ticket for an overview of features with respect to chunked and streaming facilities in Spray: https://github.com/spray/spray/issues/281
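For example, a minimal override in application.conf might look like this (the 1m threshold is just an illustrative value):

# responses with entities above 1 MB are then delivered incrementally
# (as ChunkedResponseStart / MessageChunk / ChunkedMessageEnd message parts) rather than as one entity
spray.can.client.parsing.incoming-auto-chunking-threshold-size = 1m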

What is the maximum size of JWT token?

I need to know the maximum length of a JSON Web Token (JWT). The spec contains no information about it. Could it be that there is no limit on the length?
I've also been trying to find this.
I'd say: try to ensure it's below 7 KB.
While JWT defines no upper limit in the spec (http://www.rfc-editor.org/rfc/rfc7519.txt), we do have some operational limits.
As a JWT is included in an HTTP header, we have an upper limit (SO: Maximum on http header values) of 8 KB on the majority of current servers.
Since this limit covers all request headers, staying around 7 KB leaves a reasonable amount of room for other headers. The biggest risk to that limit is cookies (which are sent in headers and can get large).
As the token is base64-encoded (and signed or encrypted), there is at least 33% overhead over the original JSON string, so do check the length of the final encoded token.
One final point: proxies and other network appliances may apply an arbitrary limit along the way...
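As a rough sanity check on the ~33% figure and the 7 KB budget, here is a small Scala sketch (the claims payload is made up, and a real token also carries the base64url-encoded header and signature segments):

import java.nio.charset.StandardCharsets
import java.util.Base64

// hypothetical claims payload; only illustrates the encoding overhead
val claimsJson = """{"sub":"1234567890","name":"Jane Doe","roles":["admin","editor"]}"""
val encodedClaims = Base64.getUrlEncoder.withoutPadding()
  .encodeToString(claimsJson.getBytes(StandardCharsets.UTF_8))

// base64 output is roughly 4/3 of the input, i.e. about 33% overhead
println(s"raw: ${claimsJson.length} bytes, encoded: ${encodedClaims.length} bytes")
require(encodedClaims.length < 7 * 1024, "keep the whole token comfortably under ~7 KB")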
As you said, there is no maximum length defined in RFC 7519 (https://www.rfc-editor.org/rfc/rfc7519) or the other RFCs related to JWS or JWE.
If you use the JSON Serialization or JSON Flattened Serialization format, there is no limitation and no reason to define one.
But if you use the JSON Compact Serialization format (the most common format), you have to keep in mind that it should be as short as possible, because it is mainly used in a web context. A 4 KB JWT is something you should avoid.
Take care to store only useful claims and header information.
When using Heroku, the header is limited to 8 KB. Depending on how much data you put in the JWT, that limit can be reached. An oversized request will never reach your Node instance; the Heroku router drops it before your API layer.
When processing an incoming request, a router sets up an 8KB receive buffer and begins reading the HTTP request line and request headers. Each of these can be at most 8KB in length, but together can be more than 8KB in total. Requests containing a request line or header line longer than 8KB will be dropped by the router without being dispatched.
See: Heroku Limits

best approach to design a rest web service with binary data to be consumed from the browser

I'm developing a JSON REST web service that will be consumed from a single-page app built with Backbone.js.
This API will let the consumer upload files related to some entity, like PDF reports related to a project.
Googling around and doing some research on Stack Overflow, I came up with these possible approaches:
First approach: base64 encoded data field
POST: /api/projects/234/reports
{
  "author": "xxxx",
  "abstract": "xxxx",
  "filename": "xxxx",
  "filesize": 222,
  "content": "<base64 encoded binary data>"
}
Second approach: multipart form post:
POST: /api/projects/234/reports
{
  "author": "xxxx",
  "abstract": "xxxx"
}
As a response I'll get a report id, and with that I'll issue another POST:
POST: /api/projects/234/reports/1/content
enctype=multipart/form-data
and then just send the binary data
(have a look at this: https://stackoverflow.com/a/3938816/47633)
Third approach: post the binary data to a separate resource and save the href
First I generate a random key on the client and POST the binary content there:
POST: /api/files/E4304205-29B7-48EE-A359-74250E19EFC4
enctype=multipart/form-data
and then
POST: /api/projects/234/reports
{
  "author": "xxxx",
  "abstract": "xxxx",
  "filename": "xxxx",
  "filesize": 222,
  "href": "/api/files/E4304205-29B7-48EE-A359-74250E19EFC4"
}
(see this: https://stackoverflow.com/a/4032079/47633)
I just wanted to know if there's any other approach I could use, the pros and cons of each, and whether there's any established way to deal with this kind of requirement.
The big con I see with the first approach is that I have to fully load and base64-encode the file on the client.
some useful resources:
Post binary data to a RESTful application
What is a good way to transfer binary data to a HTTP REST API service?
How do I upload a file with metadata using a REST web service?
Bad idea to transfer large payload using web services?
https://stackoverflow.com/a/5528267/47633
My research results:
Single request (data included)
The request contains the metadata. The data is a property of the metadata and is encoded (for example, Base64).
Pros:
transactional
always valid (no missing metadata or data)
Cons:
encoding makes the request very large
Examples:
Twitter
GitHub
Imgur
Single request (multipart)
The request contains one or more parts with metadata and data.
Content types:
multipart/form-data
multipart/mixed
multipart/related
Pros:
transactional
always valid (no missing metadata or data)
Cons:
content type negotiation is complex (a raw request sketch follows at the end of this answer)
content type for data is not visible in WADL
Examples:
Confluence (with parts for data and for metadata)
Jira (with one part for the data; metadata only as part headers for file name and MIME type)
Bitbucket (with one part for data, no metadata)
Google Drive (with one part for metadata and one part for the data)
Single request (metadata in HTTP header and URL)
The request body contains the data, and the HTTP headers and the URL contain the metadata.
Pros:
transactional
always valid (no missing metadata or data)
Cons:
no nested metadata possible
Examples:
S3 GetObject and PutObject
Two requests
One request for metadata and one or more requests for data.
Pros:
scalability (for example: data request could go to repository server)
resumable (see for example Google Drive)
Cons:
not transactional
not always valid (until the second request, one part is missing)
Examples:
Google Drive
YouTube
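To make the "Single request (multipart)" option above concrete, a raw request against the question's API could look roughly like this (the boundary, part layout, and field names are illustrative; Google Drive's multipart upload uses essentially this shape):

POST /api/projects/234/reports HTTP/1.1
Content-Type: multipart/related; boundary=report_boundary

--report_boundary
Content-Type: application/json; charset=UTF-8

{ "author": "xxxx", "abstract": "xxxx", "filename": "report.pdf" }

--report_boundary
Content-Type: application/pdf

<binary PDF bytes>

--report_boundary--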
I can't think of any other approaches off the top of my head.
Of your three approaches, I've worked with method 3 the most. The biggest difference I see is between the first method and the other two: separating metadata and content into two resources.
Pro: Scalability
while your solution involves posting to the same server, this can easily be changed to point the content upload to a separate server (e.g. Amazon S3)
In the first method, the same server that serves metadata to users will have a process blocked by a large upload.
Con: Orphaned Data/Added complexity
failed uploads (either metadata or content) will leave orphaned data in the server DB
Orphaned data can be cleaned up with a scheduled job, but this adds code complexity
Method 2 reduces the orphan possibilities, at the cost of longer client wait time, as you're blocking on the response of the first POST.
The first method seems the most straightforward to code. However, I'd only go with it if you anticipate this service being used infrequently and you can set a reasonable limit on user file uploads.
I believe the ultimate method is number 3 (separate resource), mainly because it maximizes the value you get from the HTTP standard, which matches how I think of REST APIs. For example, assuming a well-behaved HTTP client is in use, you get the following benefits:
Content compression: you can let servers respond with a compressed result when clients indicate they support it; your API is unchanged, existing clients continue to work, and future clients can take advantage of it.
Caching: If-Modified-Since, ETag, etc. Clients can avoid refetching the binary data altogether.
Content type abstraction: for example, if you require an uploaded image, it can be image/jpeg or image/png. The HTTP headers Accept and Content-Type give us elegant semantics for negotiating this between clients and servers without having to hardcode it all into our schema and/or API.
On the other hand, I believe it's fair to conclude that this method is not the simplest if the binary data in question is not optional, in which case the cons listed in Eric Hu's answer come into play.
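To illustrate the caching benefit above with the third approach's separate file resource, a conditional GET could look like this (the headers and ETag value are illustrative):

GET /api/files/E4304205-29B7-48EE-A359-74250E19EFC4 HTTP/1.1
Accept: application/pdf
If-None-Match: "a1b2c3"

HTTP/1.1 304 Not Modified
ETag: "a1b2c3"

The client keeps using its cached copy and avoids re-downloading the binary data entirely.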