I am currently working on an application that is supposed to get a web page and extract information from its content.
As I learned from my research (or as it seems to me at least), there is no ideal way to determine the end of an HTTP message.
Generally, I found two different ways to do so:
1. Set the O_NONBLOCK flag on the socket and fetch data with recv() in a while loop. Assume the message is complete and break as soon as a read finds no bytes in the stream.
2. Rely on the HTTP Content-Length header and use it to determine the end of the message.
Both ways don't seem to be completely safe to me. Solution (1) could possibly break the recv loop before the message was completed. On the other hand, solution (2) requires the Content-Length header to be set correctly.
What's the best way to proceed in this case? Can I always rely on the Content-Length header to be set?
Let me start here:
Can I always rely on the Content-Length header to be set?
No, you can't. Content-Length is an optional header. However, HTTP messages absolutely must feature a way to determine their body length if they are to be RFC-compliant (cf. RFC 7230, sec. 3.3.3). That being said, be prepared to parse chunked encoding whenever a content length isn't specified.
As for your original problem: ensuring the completeness of a message is actually something that should be TCP's job. But since complications such as message pipelining exist, it is best to check two things in practice:
Have all reads from the network buffer been successful?
Is the number of the received bytes identical to the predicted message length?
Oh, and as @MartinJames noted, non-blocking probably isn't the best idea here.
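The two checks above can be sketched with a blocking socket as follows. This is only a minimal illustration in Python, not the asker's code: the helper name recv_exact is hypothetical, and it assumes the expected length is already known (for example from Content-Length).

```python
import socket


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a blocking socket, or raise if the peer closes early."""
    chunks = []
    remaining = n
    while remaining > 0:
        data = sock.recv(min(remaining, 65536))
        if not data:
            # recv() returning b"" means the peer closed the connection,
            # so the received byte count cannot match the predicted length.
            raise ConnectionError(f"connection closed with {remaining} bytes missing")
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)
```

With a blocking socket, a short read simply means "call recv() again", so there is no need to guess whether an empty read marks the end of the message.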
The end of an HTTP response is defined:
By the final (empty) chunk in case Transfer-Encoding chunked is used.
By reaching the given length if a Content-length header is given and no chunked transfer encoding is used.
By the end of the TCP connection if neither chunked transfer encoding is used nor Content-length is given.
In the first two cases you have a well defined end so you can verify that the data were fully received. Only in the last case (end of TCP connection) you don't know if the connection was closed before sending all the data. But usually you get either case 1 or case 2.
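As a rough sketch of the three cases above, the following Python function reads one response body from a binary file-like stream (for example the result of sock.makefile("rb")). The function name read_body and the shape of the headers argument (a dict with lower-cased names) are assumptions made for illustration. It also ignores responses that never carry a body (such as replies to HEAD, or 204 and 304 statuses).

```python
import io


def read_body(stream, headers):
    """Return (body, verified) for one HTTP/1.1 response body.

    verified is False only for a close-delimited body, whose completeness
    cannot be checked.
    """
    if "chunked" in headers.get("transfer-encoding", "").lower():
        body = bytearray()
        while True:
            size_line = stream.readline()
            if not size_line:
                raise EOFError("stream ended inside a chunked body")
            # The chunk size is hexadecimal; drop any ";extension" suffix.
            size = int(size_line.split(b";", 1)[0].strip(), 16)
            if size == 0:
                # Final (empty) chunk: skip optional trailer fields up to the blank line.
                while stream.readline() not in (b"\r\n", b"\n", b""):
                    pass
                return bytes(body), True
            chunk = stream.read(size)
            if len(chunk) != size:
                raise EOFError("stream ended inside a chunk")
            body += chunk
            stream.readline()  # consume the CRLF that follows the chunk data

    if "content-length" in headers:
        length = int(headers["content-length"])
        body = stream.read(length)
        if len(body) != length:
            raise EOFError("connection closed before Content-Length bytes arrived")
        return body, True

    # Neither chunked nor Content-Length: read until the connection closes.
    return stream.read(), False
```

For example, read_body(io.BytesIO(b"5\r\nhello\r\n0\r\n\r\n"), {"transfer-encoding": "chunked"}) yields the body b"hello" together with True. Chunked encoding is checked first because it takes precedence when both headers are present.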
To make your life easier, you might want to provide
Connection: close
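header when making the HTTP request. The web server will then close the connection after sending you the full requested page, so the end of the connection also marks the end of the response. Note that an HTTP/1.1 server may still send the body with chunked encoding.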
This is only a viable option if you are interested in this single page and will not request additional resources (script files, images, etc.). In the latter case it would be a very inefficient solution for both your app and the server.
My code works fine for GET/POST/PUT to and from restApi1 and restApi2.
However, I also need to implement HEAD/OPTIONS (no body!) and GET uri1.
HEAD/OPTIONS can return 204 or 200 depending on a process status. I am getting the error "Stream closed". It sounds like Camel wants body bytes, but I don't intend to send any. Even if I set ExchangePattern.InOnly, or an optional pattern, the error still occurs.
What is the correct way to see responses and handle requests WITHOUT a body, exchanging just statuses?
How can I see the response from restApi2 in Camel?
rest("/restApi1").head().route().routeId("id1")
.to("direct:restApi2").routeId("/id1").setHeader(Exchange.HTTP_METHOD, constant("HEAD"))
.setExchangePattern(ExchangePattern.OutOptionalIn).recipientList(simple(restApi2));
I figured it out. I need to call '.convertBodyTo(String.class)' even though I don't have a body.
I'm currently using the PlayWS http client which returns an Akka stream. From my understanding, I can consume the stream and turn it into a Byte[] to calculate the size. However, this also consumes the stream and I can't use it anymore. Is there any way around this?
I think there are two different aspects related to the question.
You want to know the size of the server response in advance in order to prepare a buffer. Unfortunately there is no guaranteed way to do this. The HTTP 1.1 spec explicitly allows a transfer mode in which the server does not know the size of the response in advance, namely chunked transfer encoding. See also this quote from 3.3.1. Transfer-Encoding:
A recipient MUST be able to parse the chunked transfer coding
(Section 4.1) because it plays a crucial role in framing messages
when the payload body size is not known in advance.
Section 3.3.3. Message Body Length specifies how the length of a message body is determined. Besides the aforementioned chunked transfer encoding, it also contains this rather unhelpful rule:
Otherwise, this is a response message without a declared message
body length, so the message body length is determined by the
number of octets received prior to the server closing the
connection.
This exists for backward compatibility and its use is discouraged, but it is still legal.
Still, in many real-world scenarios you can use the Content-Length header field that the server may return. However, there is a catch here as well: if gzip Content-Encoding is used, then Content-Length will contain the size of the compressed body.
To sum up: in the general case you can't get the size of the message body before you have fully received the server response, i.e. in terms of code, before performing a blocking call on the response. You may try to use Content-Length, and it might or might not help in your specific case.
You already have a fully downloaded response (or you are OK with blocking on your StreamedResponse) and you want to process it by first getting the size and only then processing the actual data. In that case you may first use the getBodyAsBytes method, which returns IndexedSeq[Byte] and thus has a size, and then convert it into a new Source using Source.single, which is exactly what the default (i.e. non-streaming) implementation of getBodyAsSource does.
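The Content-Length catch with gzip mentioned above can be illustrated with a small Python sketch. The function name wire_and_decoded_sizes is hypothetical, and the sketch only simulates what a server applying gzip Content-Encoding would put on the wire; it does not involve PlayWS.

```python
import gzip


def wire_and_decoded_sizes(body):
    """Return (Content-Length a gzip-encoding server would send, size after decoding)."""
    compressed = gzip.compress(body)
    content_length = len(compressed)  # what the header would describe
    decoded_length = len(gzip.decompress(compressed))  # what the application sees
    return content_length, decoded_length
```

For a repetitive payload such as b"hello world " * 1000, the first number is much smaller than the second, so a buffer sized from Content-Length would be too small for the decoded body.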
I can't get one thing straight. RFC 2616, section 4.4 (item 5), states that the message length can be determined "By the server closing the connection.".
This implies that it is valid for a server to respond (e.g. returning a large image) with a response that has no Content-Length in the header, and the client is supposed to keep fetching until the connection is closed and then assume all data has been downloaded.
But how is a client to know for sure that the connection was closed intentionally by the server? A server app could have crashed in the middle of sending the data, and the server's OS would most likely send a FIN packet to gracefully close the TCP connection with the client.
You are absolutely right, that mechanism is totally unreliable. This is covered in RFC 7230:
Since there is no way to distinguish a successfully completed,
close-delimited message from a partially received message interrupted
by network failure, a server SHOULD generate encoding or
length-delimited messages whenever possible. The close-delimiting
feature exists primarily for backwards compatibility with HTTP/1.0.
Fortunately, most HTTP traffic today is HTTP/1.1, with Content-Length or "Transfer-Encoding" explicitly defining the end of the message.
The lesson is that a message must have its own way of termination; we cannot repurpose the underlying transport layer's EOF as the message's EOF.
On that note, a (well-formed) HTML document, or a .gif, .avi etc., does define its own termination; we will know if we received an incomplete document. Therefore it is not so much of a problem to transmit it over HTTP/1.0 without Content-Length.
However, for plain text documents, JavaScript, CSS etc., EOF is used to mark the end of the document, so transmitting them over HTTP/1.0 is problematic.
I wrote a TCP socket program and defined a text protocol format like "length|content".
To keep it simple, "length" is always 1 byte long and defines the number of bytes in "content".
My problem is:
When attackers send packets like "1|a51", the data stays in TCP's receive buffer.
The program will parse it wrong, and the next packet will appear to start like "5|1XXXX".
Then all the remaining packets in the buffer will be parsed wrong as well.
How can I solve this problem?
If you get garbage, just close the connection. It's not your problem to figure out what they meant, if anything.
Instead of length|content only, you also need to provide a checksum. If the checksum is not correct, you should drop the connection to avoid acting on a partial receive.
This is a typical problem with protocols built on TCP, since TCP is stream-based. HTTP, which is an application protocol on top of TCP, has a request/response structure to make sure each end of the connection knows when the data has been fully transferred.
Your scenario is a little less dangerous than it sounds, though, since the attacker can only affect his own connection. He cannot change the data on other connections unless he controls a router or switch between your application and the users.
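A minimal Python sketch of the two suggestions above (drop the connection on garbage, and add a checksum) might look like this. The frame layout is an assumption for illustration: one ASCII digit for the length, a "|", the content, and one trailing checksum byte (sum of the content bytes modulo 256). The names encode_frame, parse_frames and ProtocolError are hypothetical.

```python
class ProtocolError(Exception):
    """Raised on malformed input; the caller should close the connection."""


def encode_frame(content: bytes) -> bytes:
    if len(content) > 9:
        raise ValueError("content too long for a one-digit length")
    return str(len(content)).encode() + b"|" + content + bytes([sum(content) % 256])


def parse_frames(buf: bytes):
    """Return (complete messages, unconsumed tail of buf)."""
    messages = []
    pos = 0
    while len(buf) - pos >= 2:
        # Header: one ASCII digit followed by '|'.
        if not (48 <= buf[pos] <= 57) or buf[pos + 1:pos + 2] != b"|":
            raise ProtocolError("malformed header")
        length = buf[pos] - 48
        end = pos + 2 + length + 1  # header + content + checksum byte
        if end > len(buf):
            break  # incomplete frame: wait for more data
        content = buf[pos + 2:pos + 2 + length]
        if buf[end - 1] != sum(content) % 256:
            raise ProtocolError("checksum mismatch")
        messages.append(content)
        pos = end
    return messages, buf[pos:]
```

With this layout, the input b"1|a51" from the question fails the checksum test immediately instead of silently shifting every later frame. A simple additive checksum only catches accidental desynchronization; a deliberate sender can compute it, so closing the connection on any ProtocolError remains the main defense.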
I have a client/server system that performs communication using XML transferred using HTTP requests and responses with the client using Perl's LWP and the server running Perl's CGI.pm through Apache. In addition the stream is encrypted using SSL with certificates for both the server and all clients.
This system works well, except that periodically the client needs to send really large amounts of data. An obvious solution would be to compress the data on the client side, send it over, and decompress it on the server. Rather than implement this myself, I was hoping to use Apache's mod_deflate's "Input Decompression" as described here.
The description warns:
If you evaluate the request body yourself, don't trust the Content-Length header! The Content-Length header reflects the length of the incoming data from the client and not the byte count of the decompressed data stream.
So if I provide a Content-Length value which matches the compressed data size, the data is truncated. This is because mod_deflate decompresses the stream, but CGI.pm only reads to the Content-Length limit.
Alternatively, if I try to outsmart it and override the Content-Length header with the decompressed data size, LWP complains and resets the value to the compressed length, leaving me with the same problem.
Finally, I attempted to hack the part of LWP which does the correction. The original code is:
    # Set (or override) Content-Length header
    my $clen = $request_headers->header('Content-Length');
    if (defined($$content_ref) && length($$content_ref)) {
        $has_content = length($$content_ref);
        if (!defined($clen) || $clen ne $has_content) {
            if (defined $clen) {
                warn "Content-Length header value was wrong, fixed";
                hlist_remove(\@h, 'Content-Length');
            }
            push(@h, 'Content-Length' => $has_content);
        }
    }
    elsif ($clen) {
        warn "Content-Length set when there is no content, fixed";
        hlist_remove(\@h, 'Content-Length');
    }
And I changed the push line to:
    push(@h, 'Content-Length' => $clen);
Unfortunately this causes some problem where content (truncated or not) doesn't even get to my CGI script.
Has anyone made this work? I found this which does compression on a file before uploading, but not compressing a generic request.
Although you said you didn't want to do the compression yourself, there are lots of Perl modules which will handle both sides for you, Compress::Zlib for example.
I have a cheat (with a .net part of the company) where I get passed XML as a separate parameter posted in, then can handle it as if it was a string rather than faffing about with SOAP like stuff.
I don't think you can change the Content-Length like that. It would confuse Apache, because mod_deflate wouldn't know how much compressed data to read. What about having the client add an X-Uncompressed-Length header, and then use a modified version of CGI.pm that uses X-Uncompressed-Length (if present) instead of Content-Length? (Actually, you probably don't need to modify CGI.pm. Just set $ENV{'CONTENT_LENGTH'} to the appropriate value before initializing the CGI object or calling any CGI functions.)
Or, use a lower-level module that uses the bucket brigade to tell how much data to read.
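The X-Uncompressed-Length idea can be sketched independently of CGI.pm. The following Python illustration is not the asker's setup: the function name read_request_body is hypothetical, and it assumes a CGI-style environment in which the client's X-Uncompressed-Length header shows up as HTTP_X_UNCOMPRESSED_LENGTH.

```python
import io


def read_request_body(environ, stdin):
    """Read the (already decompressed) request body from stdin.

    Prefers the client-supplied uncompressed length over CONTENT_LENGTH,
    which only describes the compressed bytes on the wire.
    """
    length = (
        environ.get("HTTP_X_UNCOMPRESSED_LENGTH")
        or environ.get("CONTENT_LENGTH")
        or "0"
    )
    return stdin.read(int(length))
```

For example, with environ = {"CONTENT_LENGTH": "4", "HTTP_X_UNCOMPRESSED_LENGTH": "11"} and stdin = io.BytesIO(b"hello world"), the full 11 bytes are read instead of being truncated to 4.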
I am not sure I follow what you want, but I have a custom get/post module that I use to do some non-standard stuff. The code below will read in anything sent via POST, i.e. on STDIN.
read(STDIN, $query_string, $ENV{'CONTENT_LENGTH'});
Instead of using $ENV's value, use your own. I hope this helps, and sorry if it doesn't.