How to make sure you have read the "full message" when using a "Transfer-Encoding: chunked" HTTP response - HttpURLConnection

I am using HttpURLConnection (Java) to read an HTTP chunked response (Transfer-Encoding: chunked) as follows, and I am able to read the message. But how can I make sure that I have read all the chunks correctly and that the message I read is intact?
BufferedReader in = new BufferedReader(
        new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuffer response = new StringBuffer();
while ((inputLine = in.readLine()) != null) {
    response.append(inputLine);
}
in.close();

Hey I'm sorry for the late response but, as you might guess, I only just came across your question.
You seem to have a small misunderstanding of what chunked transfer encoding means. It does not mean that the message will be sent in CRLF-terminated lines. That seems to be what you're trying to implement, but it's a little more complicated than that.
In case you're not familiar with the term, CRLF just means Carriage-Return Line-Feed (or \r\n in traditional Unix escape sequences).
Chunked transfer encoding means that the message will be sent in chunks, not lines. Each chunk has a format that is even more self-explanatory than a line ending in a CRLF.
How Chunks Work
Each chunk begins with a sequence of characters in the range [0-9, 'a'-'f']. This is a hexadecimal number giving the length, in bytes, of the chunk's data. The hexadecimal number is followed by a CRLF. That is followed by the chunk data itself (hopefully with the specified length), and the chunk data is terminated by another CRLF.
You will get several of these chunks sequentially, and it is your job to concatenate them in the order in which they arrived. You read them in order until you get a special chunk that signals the end of the message body. The only thing special about this last chunk is that it will have a length of zero and contain no data, i.e. the chunk will be 0\r\n\r\n.
Unfortunately, Java's HttpURLConnection class does not do this decoding for you in every case, so you may have to implement it yourself.
Cumbersome as it might be, if you implement this protocol you can be confident about whether you have read the entire message...with one caveat...
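To make that concrete, here is a minimal sketch of such a decoder, assuming you are handed the raw, still-chunked byte stream positioned just after the response headers. The class and method names (ChunkedDecoder, readChunkedBody, readLineCrLf) are mine, not part of any library:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class ChunkedDecoder {
    // Decodes a chunked message body; `raw` must be positioned right after the header block.
    static byte[] readChunkedBody(InputStream raw) throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        while (true) {
            String sizeLine = readLineCrLf(raw);          // e.g. "1a3f" (may carry ";extension" params)
            int semi = sizeLine.indexOf(';');
            String hex = (semi >= 0 ? sizeLine.substring(0, semi) : sizeLine).trim();
            int size = Integer.parseInt(hex, 16);
            if (size == 0) {                              // the special last chunk: "0" CRLF
                break;
            }
            byte[] chunk = new byte[size];
            int read = 0;
            while (read < size) {                         // chunk data may arrive across several reads
                int n = raw.read(chunk, read, size - read);
                if (n == -1) throw new IOException("connection closed mid-chunk");
                read += n;
            }
            body.write(chunk);
            readLineCrLf(raw);                            // consume the CRLF that ends the chunk data
        }
        return body.toByteArray();
    }

    // Reads up to and including a CRLF, returning the line without the CRLF.
    static String readLineCrLf(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int prev = -1, cur;
        while ((cur = in.read()) != -1) {
            if (prev == '\r' && cur == '\n') {
                sb.setLength(sb.length() - 1);            // drop the trailing '\r'
                return sb.toString();
            }
            sb.append((char) cur);
            prev = cur;
        }
        throw new IOException("connection closed before end of line");
    }
}
If readChunkedBody returns normally, you know the zero-length chunk arrived, which is exactly the "have I read the full message?" guarantee you are after; an IOException means the body was truncated. Note that any trailers and the final CRLF are still on the stream at that point (see the next section).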
You Might Have to Look for Trailers
Trailers are the same as headers except they come at the end of an HTTP message, whereas headers come at the beginning. Chunked transfer encoding will allow you to find the end of a message body but the full HTTP message might continue.
The HttpURLConnection class will not parse trailers for you. The class parses the HTTP head (status line and headers) before directing its socket (or whatever it uses) to the connection's InputStream. Once the socket is directed toward that InputStream it doesn't go back; it's up to you to process the data from then on.
Luckily trailers are much more predictable than headers...or at least they are supposed to be. Every trailer the server will send is supposed to be named in a comma-separated list in the Trailer header field, e.g. Trailer: Content-Disposition, Cache-Control, From in the headers means you should look for Content-Disposition, Cache-Control, and From trailer fields after the message body.
Trailers are allowed only in chunked-encoded messages, and they follow the same format as headers: field name, colon, space, field value, CRLF. Just as with headers, a trailer followed by two CRLFs instead of one is the last trailer.
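Continuing the sketch from above, reading trailers just means reading more CRLF-terminated lines after the zero-length chunk until an empty line appears. This method (again an illustrative name) could sit next to readChunkedBody, reusing its readLineCrLf helper:
// Reads "Name: value" trailer lines; an empty line marks the end of the whole HTTP message.
static java.util.Map<String, String> readTrailers(InputStream raw) throws IOException {
    java.util.Map<String, String> trailers = new java.util.LinkedHashMap<>();
    String line;
    while (!(line = readLineCrLf(raw)).isEmpty()) {
        int colon = line.indexOf(':');
        if (colon > 0) {
            trailers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
    }
    return trailers;
}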

Related

javax.mail.Part and writeTo, unable to obtain the same "eml" file as the original one

My application parses many messages via javamail 1.5.6; it listens for incoming messages and then stores some info about them.
Almost all messages contain a digital signature, so my application needs to retrieve the full eml too, that is, the raw file representing an email; this way application users can always prove the validity of these messages.
So, once I have a javax.mail.Message, I have to produce its eml, so I do:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
m.writeTo(baos);
this.originalMessage = baos.toString(StandardCharsets.UTF_8.name());
This approach generally works, but I ran into some multipart messages containing a part like the following:
This is a multi-part message in MIME format.
--------------55D0DAEBFD4BF19F87D16E72
Content-Type: text/plain; charset=iso-8859-15; format=flowed
Content-Transfer-Encoding: 8bit
In allegato si notifica ai sensi e per gli effetti dell'art. 11 R.D.
1611/1993, al messaggio PEC, oltre alla Relata di Notifica e
contestuale attestazione di conformità,
--------------55D0DAEBFD4BF19F87D16E72
word "conformità" is not properly transformed in the resulting string, it becomes "conformit�", opening such eml for example with MS Outlook results in an invalid digital signature, so message appears corrupted, different from the original
Have you same idea? Thank you very much
The raw message is not a UTF-8 encoded string, nor is an "eml" file a UTF-8 encoded file. They are both byte streams, and your digital signature should work on byte streams.
In your particular example, the content of the message part is encoded using the iso-8859-15 charset, not UTF-8.
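So keep the bytes and never round-trip them through a String. A minimal sketch, assuming m is the javax.mail.Message from your code (exception handling omitted):
ByteArrayOutputStream baos = new ByteArrayOutputStream();
m.writeTo(baos);
byte[] rawEml = baos.toByteArray();   // store, sign and verify exactly these bytes
// If you need the eml on disk, write the bytes as-is; do not convert to a String first.
java.nio.file.Files.write(java.nio.file.Paths.get("message.eml"), rawEml);
Only decode to text (using the charset declared by each MIME part) when you need to display something; the stored artifact stays byte-for-byte identical to what was received, so the signature keeps verifying.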

HTTP multipart/form-data. What happens when binary data has no string representation?

I want to write an HTTP implementation.
I've been looking around for a few days about sending files over HTTP with Content-Type: multipart/form-data, and I'm really interested in how browsers (or any HTTP client) create that kind of request.
I already took a look at lots of questions about it here at stackoverflow, like:
How does HTTP file upload work?
What does enctype='multipart/form-data' mean?
I dug into RFCs 2616 (and newer versions), 2046, etc., but I didn't find a clear answer (obviously I did not get the idea behind it). In most articles and answers I found this piece of a request, which is simple for me to interpret; all these things are documented in the RFCs...
POST /upload?upload_progress_id=12344 HTTP/1.1
Host: localhost:3000
Content-Length: 1325
Origin: http://localhost:3000
... other headers ...
Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryePkpFF7tjBAqx29L

------WebKitFormBoundaryePkpFF7tjBAqx29L
Content-Disposition: form-data; name="MAX_FILE_SIZE"

100000
------WebKitFormBoundaryePkpFF7tjBAqx29L
Content-Disposition: form-data; name="uploadedfile"; filename="hello.o"
Content-Type: application/x-object

... contents of file goes here ...
------WebKitFormBoundaryePkpFF7tjBAqx29L--
...and it would be simple to implement an HTTP client that constructs a piece of string that way in any language. The problem comes at ... contents of file goes here ...: there's little information about what the "contents of file" actually is. I know it's binary data with a certain type and encoding, but it's difficult to think beyond string data: how would I put a piece of binary data that has no string representation inside a string?
I would like to see examples of low-level implementations of the HTTP protocol in any language, and maybe in-depth explanations of binary data transfer over HTTP: how the client creates such requests and how the server reads/parses them. P.S. I know this question may look like a duplicate, but most of the answers are not focused on explaining binary data transfer (like media).
You should not try to handle this part of the body as a string; you should send binary data. Think of it as reading bytes from the resource and sending those bytes unaltered.
So no encoding is applied: no UTF-8, no base64. HTTP is not a protocol with a 7-bit ASCII restriction like SMTP, where base64 encoding is applied to ensure only 7-bit ASCII characters are used.
There is, by definition, no string version of this data, and looking at the raw HTTP transfer (with Wireshark, for example) you should see binary data: bytes, stuff.
This is why most HTTP servers use C to manage HTTP: they parse the HTTP communication byte by byte (the protocol headers are 7-bit ASCII only, certainly not multibyte characters), and they can also read/write arbitrary binary data for the body quite easily (or even use system calls like sendfile to let the kernel manage the binary part).
Now, about examples.
When you use Content-Length and no multipart stuff, the body is exactly Content-Length bytes long, so the client parsing the data you sent will just read that number of bytes and treat the whole raw block as the body content (it may have a MIME type and encoding information attached, but that's just information for layers on top of the HTTP protocol).
When you use Transfer-Encoding: chunked, the raw binary body is split into pieces; each piece is prefixed by a hexadecimal number (the size of the chunk) and an end-of-line marker, with a final zero-size chunk at the end.
If we take the wikipedia example:
4\r\n
Wiki\r\n
5\r\n
pedia\r\n
E\r\n
 in\r\n
\r\n
chunks.\r\n
0\r\n
\r\n
We could replace each 7-bit ASCII letter by any byte, even a byte that has no 7-bit ASCII representation. I'll use a * character for each real body byte:
4\r\n
****\r\n
5\r\n
*****\r\n
E\r\n
**************\r\n
0\r\n
\r\n
All the other characters are part of the HTTP protocol (here a chunked body transmission). I could also use a \0 representation of the binary data and send only null bytes for the body; that would be:
4\r\n
\0\0\0\0\r\n
5\r\n
\0\0\0\0\0\r\n
E\r\n
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\r\n
0\r\n
\r\n
That's just a representation; we could also use \xNN or \NN notations. In reality these are bytes, 8 bits each (too lazy to write the 0/1 representation of this body :-) ).
If the text of the example, instead of being:
Wikipedia in\r\n
\r\n
chunks.
It could have been a more complex one, with multibyte characters (here an é in UTF-8):
Wikipédia in\r\n
\r\n
chunks.
This é is in fact 11000011 10101001 in UTF-8, two bytes (\xc3\xa9 in the \xNN representation), instead of the simple 01100101 / \x65 / e character. The HTTP body is now (note that the second chunk size is 6 and not 5):
4\r\n
Wiki\r\n
6\r\n
p\xc3\xa9dia\r\n
E\r\n
 in\r\n
\r\n
chunks.\r\n
0\r\n
\r\n
But this is only valid if the source data was effectively in UTF-8; it could have been another encoding. By default, unless you have some specific configuration in your web server that enforces a conversion of the source document to a specific encoding, it's not really the job of the web server to convert the source document: you send what you have, and you may add a header to tell the client which encoding was declared on the source document.
Finally, we have the multipart way of transmitting the body, as in your question. It's a lot like the chunked version, except that here boundaries and intermediate headers are used; but for the binary data between those boundaries, headers, and line-ending control characters, the same rule applies: everything in between is just bytes...
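As a rough sketch of that last point, this is how a client could write one file part of a multipart/form-data body onto the connection's output stream: the boundary and the part headers go out as ASCII bytes, the file content is copied byte for byte in between, and nothing is ever treated as a string. The class and method names are illustrative, and the boundary/field/file names are simply reused from the example request earlier in this question:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class MultipartSketch {
    // Writes one file part of a multipart/form-data body onto the (socket) output stream.
    static void writeFilePart(OutputStream out, String boundary, String fieldName,
                              String fileName, InputStream fileContent) throws IOException {
        String partHead = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n"   // or whatever MIME type fits the file
                + "\r\n";
        out.write(partHead.getBytes(StandardCharsets.US_ASCII)); // protocol text: plain ASCII bytes
        byte[] buf = new byte[8192];
        int n;
        while ((n = fileContent.read(buf)) != -1) {
            out.write(buf, 0, n);                                // file content: raw bytes, no encoding applied
        }
        out.write(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        // Example usage with the boundary and file name from the request shown earlier.
        try (InputStream file = new FileInputStream("hello.o")) {
            writeFilePart(System.out, "----WebKitFormBoundaryePkpFF7tjBAqx29L",
                    "uploadedfile", "hello.o", file);
        }
    }
}
A real client would also have to send the other form fields and compute Content-Length (or use chunked transfer encoding), but the handling of the file bytes stays exactly the same.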

PlayWS: calculate the size of an HTTP call without consuming the stream

I'm currently using the PlayWS HTTP client, which returns an Akka stream. From my understanding, I can consume the stream and turn it into a Byte[] to calculate the size. However, this also consumes the stream and I can't use it anymore. Any way around this?
I think there are two different aspects related to the question.
1. You want to know the size of the server response in advance, to prepare a buffer. Unfortunately there is no guaranteed way to do this: the HTTP/1.1 spec explicitly allows a transfer mode in which the server does not know the size of the response in advance, via chunked transfer encoding. See also this quote from 3.3.1. Transfer-Encoding:
A recipient MUST be able to parse the chunked transfer coding
(Section 4.1) because it plays a crucial role in framing messages
when the payload body size is not known in advance.
Section 3.3.3. Message Body Length specifies how the length of a message body is determined, and besides the aforementioned chunked transfer encoding it also contains the rather unhelpful
Otherwise, this is a response message without a declared message
body length, so the message body length is determined by the
number of octets received prior to the server closing the
connection.
This is kept for backward compatibility and its use is discouraged, but it is still legal.
Still, in many real-world scenarios you can use the Content-Length header field that the server may return. However, there is a catch here as well: if gzip Content-Encoding is used, then Content-Length will contain the size of the compressed body.
To sum up: in the general case you can't get the size of the message body before you have fully received the server response, i.e. in terms of code, you perform a blocking call on the response. You may try to use Content-Length, and it might or might not help in your specific case.
2. You already have a fully downloaded response (or you are OK with blocking on your StreamedResponse) and you want to process it by first getting the size and only then processing the actual data. In that case you may first use the getBodyAsBytes method, which returns an IndexedSeq[Byte] and thus has a size, and then convert it into a new Source using Source.single, which is actually exactly what the default (i.e. non-streaming) implementation of getBodyAsSource does.

Ensure Completeness of HTTP Messages

I am currently working on an application that is supposed to get a web page and extract information from its content.
As I learned from my research (or as it seems to me at least), there is no ideal way to determine the end of an HTTP message.
Generally, I found two different ways to do so:
Set the O_NONBLOCK flag for the socket and fetch data with recv() in a while loop. Assume that the message is complete and break if, at some point, there are no bytes left in the stream.
Rely on the HTTP Content-Length header and determine the end of the message with it.
Neither way seems completely safe to me. Solution (1) could break out of the recv loop before the message is complete. On the other hand, solution (2) requires the Content-Length header to be set correctly.
What's the best way to proceed in this case? Can I always rely on the Content-Length header to be set?
Let me start here:
Can I always rely on the Content-Length header to be set?
No, you can't. Content-Length is an optional header. However, HTTP messages absolutely must feature a way to determine their body length if they are to be RFC-compliant (cf. RFC 7230, sec. 3.3.3). That being said, get ready to parse chunked encoding whenever a content length isn't specified.
As for your original problem: Ensuring the completeness of a message is actually something that should be TCP's job. But as there are such complicated things like message pipelining around, it is best to check for two things in practice:
Have all reads from the network buffer been successful?
Is the number of received bytes identical to the predicted message length?
Oh, and as #MartinJames noted, non-blocking probably isn't the best idea here.
The end of an HTTP response is defined:
By the final (empty) chunk, if Transfer-Encoding: chunked is used.
By reaching the given length, if a Content-length header is given and no chunked transfer encoding is used.
By the end of the TCP connection, if neither chunked transfer encoding is used nor a Content-length header is given.
In the first two cases you have a well-defined end, so you can verify that the data were fully received. Only in the last case (end of TCP connection) do you not know whether the connection was closed before all the data were sent. But usually you get either case 1 or case 2; a sketch for checking case 2 follows.
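"Fully received" in case 2 simply means you have read exactly Content-Length body bytes. A minimal Java sketch, assuming in is the body stream and contentLength was parsed from the header (the method name is mine):
// Reads exactly contentLength body bytes; anything less means the transfer was cut short.
static byte[] readFixedLengthBody(java.io.InputStream in, int contentLength) throws java.io.IOException {
    byte[] body = new byte[contentLength];
    int read = 0;
    while (read < contentLength) {
        int n = in.read(body, read, contentLength - read);
        if (n == -1) {
            throw new java.io.IOException("connection closed after " + read + " of " + contentLength + " bytes");
        }
        read += n;
    }
    return body;
}
Case 1 can be verified the same way the chunked decoder near the top of this page does it: keep reading chunks until the zero-length chunk arrives.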
To make your life easier, you might want to provide
Connection: close
header when making the HTTP request - the web server will then close the connection after giving you the full page requested, and you will not have to deal with chunks.
This is only a viable option if you are interested in this single page and will not request additional resources (script files, images, etc.) - in the latter case it would be a very inefficient approach for both your app and the server.

Need to find the requests equivalent of urlopen() from urllib2

I am currently trying to modify a script to use the requests library instead of the urllib2 library. I haven't really used it before, and I am looking to do the equivalent of urlopen("http://www.example.org").read(), so I tried requests.get("http://www.example.org").text.
This works fine with normal everyday HTML; however, when I fetch from this URL (https://gtfsrt.api.translink.com.au/Feed/SEQ) it doesn't seem to work.
So I wrote the below code to print out the responses from the same url using both the requests and urllib2 libraries.
import urllib2
import requests
#urllib2 request
request = urllib2.Request("https://gtfsrt.api.translink.com.au/Feed/SEQ")
result = urllib2.urlopen(request)
#requests request
result2 = requests.get("https://gtfsrt.api.translink.com.au/Feed/SEQ")
print result2.encoding
#urllib2 write to text
open("Output.txt", 'w').close()
text_file = open("Output.txt", "w")
text_file.write(result.read())
text_file.close()
open("Output2.txt", 'w').close()
text_file = open("Output2.txt", "w")
text_file.write(result2.text)
text_file.close()
The urlopen().read() call works fine, but requests.get().text doesn't work for the given URL. I suspect it has something to do with encoding, but I don't know what. Any thoughts?
Note: The supplied URL is a feed in the Google protocol buffer format; once I receive the message I give the feed to a Google library that interprets it.
Your issue is that you're making the requests module interpret binary content in a response as text.
A response from the requests library has two main ways to access the body of the response:
Response.content - will return the response body as a bytestring
Response.text - will decode the response body as text and return unicode
Since protocol buffers are a binary format, you should use result2.content in your code instead of result2.text.
Response.content will return the body of the response as-is, in bytes. For binary content this is exactly what you want. For text content that contains non-ASCII characters, it means the content was encoded by the server into a bytestring using a particular encoding, indicated by either an HTTP header or a <meta charset="..." /> tag; to make sense of those bytes, they need to be decoded after receiving, using that charset.
Response.text now is a convenience method that does exactly this for you. It assumes the response body is text, and looks at the response headers to find the encoding, and decodes it for you, returning unicode.
But if your response doesn't contain text, this is the wrong method to use. Binary content doesn't contain characters, because it's not text, so the whole concept of character encoding does not make any sense for binary content - it's only applicable to text composed of characters. (That's also why you're seeing response.encoding == None - it's just bytes, there is no character encoding involved).
See Response Content and Binary Response Content in the requests documentation for more details.