HTTP Headers as String: Which delimiter to use? - rest

If I am getting a bunch of HTTP headers as a single string, what is the best delimiter to use to separate each header name/value(s) pair? I have thought of using commas, but they seem to occur within the value of certain HTTP headers. Is there any character that is not allowed in HTTP headers that I can use?

One straightforward choice would be to use the same delimiter used by HTTP messages themselves. The grammar for messages can be found in RFC 7230 Section 3:
HTTP-message = start-line
*( header-field CRLF )
CRLF
[ message-body ]
where CRLF is defined as a carriage return followed by a line feed.
So my suggestion is to use that.
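A minimal Python sketch of that approach (the header names and values here are just illustrations): commas inside a value such as Accept survive intact, because CRLF, not the comma, is the delimiter.

```python
# Sketch: serialize a header mapping to a single string using CRLF,
# the same delimiter HTTP messages use, then parse it back.
CRLF = "\r\n"

def serialize_headers(headers: dict) -> str:
    """Join name/value pairs with CRLF, as HTTP messages do."""
    return CRLF.join(f"{name}: {value}" for name, value in headers.items())

def parse_headers(raw: str) -> dict:
    """Split on CRLF, then split each line on its first colon."""
    result = {}
    for line in raw.split(CRLF):
        name, _, value = line.partition(":")
        result[name.strip()] = value.strip()
    return result

headers = {"Accept": "text/html, application/json", "Host": "example.com"}
raw = serialize_headers(headers)
assert parse_headers(raw) == headers  # commas in Accept survive the round trip
```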

Naming convention behind HTTP headers

I'm building an API and I need to trace a message as it goes through my systems. For this purpose, I'm planning to use an X-Correlation-Id HTTP header, but here I have run into problems.
First, the X- prefix is deprecated and its use discouraged. That would leave me with Correlation-Id.
Second, I'm intending to use hyphens, but as I read https://www.rfc-editor.org/rfc/rfc7230 I realize that the hyphen isn't named anywhere explicitly. Section 3.2.6 mentions delimiter characters - are those the characters that are unusable in HTTP header names?
Third, since HTTP headers are case-insensitive, should I define all of my expected headers in lower case? I'm asking because the more research I do, the more variations on this I find. Some APIs have capitalized headers, some don't. Some use hyphens and some use underscores.
Are there unambiguous guidelines for this stuff? I'm telling my boss that using camelCase for HTTP headers is not optimal, but I can't find any guidelines to back that up.
As long as the use is private, the field name does not matter a lot. If in doubt, use your application name instead of "X".
Field names use the "token" ABNF, and that does include "-".
As they are case-insensitive, it doesn't matter how you "define" them. If in doubt, use the same convention as the HTTP spec. CamelCase in particular is not helpful, as HTTP/2 (and 3) lowercase everything on the wire.
RFC 2822 defines the production rules for Headers
A field name MUST be composed of printable US-ASCII characters (i.e., characters that have values between 33 and 126, inclusive), except colon.
You'll find an ABNF representation in the section that describes optional fields
optional-field = field-name ":" unstructured CRLF
field-name = 1*ftext
ftext = %d33-57 / ; Any character except
%d59-126 ; controls, SP, and
; ":".
In HTTP specifically, you'll want to be paying attention to the production rules defined in Appendix B (note that HYPHEN-MINUS is a permitted tchar)
field-name = token
header-field = field-name ":" OWS field-value OWS
tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
"^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
token = 1*tchar
If you review the IANA Message Headers Registry, you'll find quite a few headers that include hyphens in the spelling. Careful examination will show a number of standard HTTP headers that include hyphens (Accept-Language, Content-Type, etc).
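As a rough illustration, the token rule above can be checked with a small regular expression. This is a sketch only; a production parser should rely on a tested HTTP library.

```python
import re

# tchar per RFC 7230: visible ASCII minus the delimiters.
# Note that "-" (HYPHEN-MINUS) is explicitly included.
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")

def is_valid_field_name(name: str) -> bool:
    """True if name matches the RFC 7230 'token' production."""
    return TOKEN.fullmatch(name) is not None

assert is_valid_field_name("Correlation-Id")      # hyphen is a tchar
assert not is_valid_field_name("Correlation Id")  # space is not allowed
assert not is_valid_field_name("Correlation:Id")  # colon is the delimiter
```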
Third, since the HTTP headers are case-insensitive, should I define all of my expected headers in lower case?
For your specification, I would recommend spelling conventions consistent with the entries in the IANA registry; in your implementation, you would use a case insensitive match of the specified spelling.
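A sketch of what that case-insensitive match might look like, using the hypothetical Correlation-Id header from the question:

```python
# Sketch: specify one canonical spelling, but compare case-insensitively,
# since intermediaries (and HTTP/2 itself) may change the case.
def get_header(headers: dict, name: str):
    """Return the value for `name`, matched case-insensitively, or None."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None

received = {"correlation-id": "abc-123"}  # e.g. as lowercased by HTTP/2
assert get_header(received, "Correlation-Id") == "abc-123"
```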

Are newlines in MIME headers using encoded-words legal?

RFC 2047 defines the encoded-words mechanism for encoding non-ASCII characters in MIME documents. It specifies that whitespace characters (spaces and tabs) are not allowed inside an encoded-word.
However, RFC 5322 for parsing email MIME documents specifies that long header lines should be "folded". Should this folding take place before or after encoded-words decoding?
I recently received an email where encoded-text part of the header had a newline in it, like this:
Header: =?UTF-8?Q?=C3=A5
=C3=A4?=
Would this be valid?
Of course emails can be invalid in lots of exciting ways and the parser needs to handle that, but it's interesting to know the "correct" way. :)
I misread the question and answered as if it were about a different sort of whitespace. In this case the whitespace appears inside a single encoded-word, rather than between multiple encoded-words separated by whitespace.
This sort of thing is explicitly disallowed. From the introduction to the format in RFC2047:
2. Syntax of encoded-words
An 'encoded-word' is defined by the following ABNF grammar. The
notation of RFC 822 is used, with the exception that white space
characters MUST NOT appear between components of an 'encoded-word'.
And then later on in the same section:
IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
by an RFC 822 parser. As a consequence, unencoded white space
characters (such as SPACE and HTAB) are FORBIDDEN within an
'encoded-word'. For example, the character sequence
=?iso-8859-1?q?this is some text?=
would be parsed as four 'atom's, rather than as a single 'atom' (by
an RFC 822 parser) or 'encoded-word' (by a parser which understands
'encoded-words'). The correct way to encode the string "this is some
text" is to encode the SPACE characters as well, e.g.
=?iso-8859-1?q?this=20is=20some=20text?=
The characters which may appear in 'encoded-text' are further
restricted by the rules in section 5.
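Python's standard library follows this rule; a short check using the correctly encoded form from the quoted example (`email.header` is the relevant module):

```python
from email.header import decode_header, make_header

# The correct form: spaces inside the encoded-word are themselves
# encoded as =20, so the whole thing parses as one unit.
parts = decode_header("=?iso-8859-1?q?this=20is=20some=20text?=")
assert parts == [(b"this is some text", "iso-8859-1")]
assert str(make_header(parts)) == "this is some text"
```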
Earlier answer
This sort of thing is explicitly allowed. Headers with MIME words should be 76 characters or less and folded if needed. RFC822 folding indents the second and any additional lines. RFC2047 headers are supposed to indent only one space. The whitespace between the ?= on the first line and the =? on the next should be suppressed from the output.
See the example on the bottom of page 12 of the RFC:
encoded form displayed as
---------------------------------------------------------------------
(=?ISO-8859-1?Q?a?= (ab)
=?ISO-8859-1?Q?b?=)
Any amount of linear-space-white between 'encoded-word's,
even if it includes a CRLF followed by one or more SPACEs,
is ignored for the purposes of display.
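Python's `email.header` honours this rule as well; the whitespace between two adjacent encoded-words disappears on decoding. A small sketch based on the RFC's example:

```python
from email.header import decode_header, make_header

# Whitespace between adjacent encoded-words is ignored for display,
# so "a" and "b" join with no space between them.
raw = "(=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=)"
assert str(make_header(decode_header(raw))) == "(ab)"
```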

hcm cloud rest call from jersey with query parameter having space [duplicate]

w3fools claims that URLs can contain spaces: http://w3fools.com/#html_urlencode
Is this true? How can a URL contain an un-encoded space?
I'm under the impression the request line of an HTTP Request uses a space as a delimiter, being formatted as {the method}{space}{the path}{space}{the protocol}:
GET /index.html HTTP/1.1
Therefore how can a URL contain a space? If it can, where did the practice of replacing spaces with + come from?
A URL must not contain a literal space. It must either be encoded using the percent-encoding or a different encoding that uses URL-safe characters (like application/x-www-form-urlencoded that uses + instead of %20 for spaces).
But whether the statement is right or wrong depends on the interpretation: Syntactically, a URI must not contain a literal space and it must be encoded; semantically, a %20 is not a space (obviously) but it represents a space.
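The two encodings sit side by side in Python's `urllib.parse`, where `quote` produces percent-encoding and `quote_plus` the form-encoding variant:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# Percent-encoding vs. form-encoding of a space.
assert quote("my beautiful page") == "my%20beautiful%20page"
assert quote_plus("my beautiful page") == "my+beautiful+page"

# Each decodes back to the same string with the matching decoder.
assert unquote("my%20beautiful%20page") == "my beautiful page"
assert unquote_plus("my+beautiful+page") == "my beautiful page"
```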
They are indeed fools. If you look at RFC 3986 Appendix A, you will see that "space" is simply not mentioned anywhere in the grammar for defining a URL. Since it's not mentioned anywhere in the grammar, the only way to encode a space is with percent-encoding (%20).
In fact, the RFC even states that spaces are delimiters and should be ignored:
In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may
have to be added to break a long URI across lines. The whitespace
should be ignored when the URI is extracted.
and
For robustness, software that accepts user-typed URI should attempt
to recognize and strip both delimiters and embedded whitespace.
Curiously, the use of + as an encoding for space isn't mentioned in the RFC, although it is reserved as a sub-delimiter. I suspect that its use is either just convention or covered by a different RFC (possibly HTTP).
Spaces are simply replaced by "%20", like:
http://www.example.com/my%20beautiful%20page
The information there is, I think, partially correct:
That's not true. An URL can use spaces. Nothing defines that a space is replaced with a + sign.
As you noted, a URL can NOT use spaces; the HTTP request would get screwed over. I'm not sure where the + is defined, though %20 is standard.

Is it appropriate or necessary to use percent-encoding with HTTP Headers?

When I'm building RESTful client and servers, is it appropriate or necessary to use percent-encoding with HTTP Headers (request or response), or does this type of encoding just apply to URIs?
Basically No, but see below.
RFC2616 describes percent-encoding only for URIs (search for % or HEX HEX or percent) and it defines the field-value without mentioning percent-encoding.
However, RFC2616 allows arbitrary octets (except CTLs) in the header field value, and has a half-baked statement mentioning MIME encoding (RFC2047) for characters not in ISO-8859-1 (see the definition of TEXT in its Section 2.2). I call that statement "half-baked" because it does not explicitly state that ISO-8859-1 is the mandatory character set for interpreting the octets, yet despite that, it normatively requires the use of MIME encoding for characters outside of that character set. It seems that neither the use of ISO-8859-1 nor the MIME encoding of header field values is widely supported.
HTTPbis seems to have given up on this, and goes back to US-ASCII for header field values. See this answer for details.
My reading of this is:
For standard header fields (those defined in RFC2616), percent-encoding is not permitted.
For extension header fields, percent-encoding is not described in RFC2616, but there is room for applying all kinds of encodings, including percent-encoding, as long as the resulting characters are US-ASCII (if you want to be future-proof). Just don't think you have to use percent-encoding.
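A sketch of that option: percent-encode an extension header's value so that only US-ASCII reaches the wire. "App-Note" below is a made-up header name, and the encoding scheme is an application-level convention that both sides must agree on; HTTP itself does not define it.

```python
from urllib.parse import quote, unquote

# "App-Note" is a hypothetical extension header; percent-encoding its
# value is our own convention, not something HTTP mandates.
value = "smörgåsbord"
encoded = quote(value, safe="")
assert encoded == "sm%C3%B6rg%C3%A5sbord"
assert encoded.isascii()          # safe to put on the wire
assert unquote(encoded) == value  # the receiver reverses it
header = f"App-Note: {encoded}"
```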
Some more sources I found:
https://www.quora.com/Do-HTTP-headers-need-to-be-encoded confirms my understanding, although it is not specific about standard headers vs extension headers and does not quote a source.
https://support.ca.com/us/knowledge-base-articles.TEC1904612.html argues that the percent-encoding of extension headers in their product is a measure of protection against cross-site scripting (XSS) attacks.
TL;DR: Octet percent-encoding and base64 encoding are fine.
Indicating Character Encoding and Language for HTTP Header Field Parameters
https://www.rfc-editor.org/rfc/rfc8187
This document specifies an encoding suitable for use in HTTP header
fields...
Read the "3.2.3. Examples"
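Modeled on the examples in that section, an ext-value of the form charset'language'percent-encoded-text can be built with `urllib.parse.quote` (assuming UTF-8 and an empty language tag):

```python
from urllib.parse import quote

# Build an RFC 8187-style ext-value: <charset>'<language>'<pct-encoded>.
value = "£ rates"
ext_value = "utf-8''" + quote(value, safe="")
assert ext_value == "utf-8''%C2%A3%20rates"
```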
base64 encoding is fine too; see the HTTP Basic Authorization spec: https://www.rfc-editor.org/rfc/rfc7617
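The Basic scheme illustrates the base64 case; RFC 7617's own user-id/password example encodes as follows:

```python
import base64

# The user-id "Aladdin" and password "open sesame" from RFC 7617.
credentials = "Aladdin:open sesame"
token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
assert token == "QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
authorization = f"Basic {token}"
```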

Parsing of HTTP Headers Values: Quoting, RFC 5987, MIME, etc

What confuses me is decoding of HTTP header values.
Example Header:
Some-Header: "quoted string?"; *utf-8'en'Weirdness
Can header values be quoted? What about the encoding of a " itself? Is ' a valid quote character? What's the significance of a semicolon (;)? Could the value parser for an HTTP header be considered a MIME parser?
I am making a transparent proxy that needs to transparently handle and modify many in-the-wild header fields. That's why I need so much detail on the format.
Can header values be quoted?
If you mean does the RFC 5987 parameter production apply to the main part of the header value, then no.
Some-Header: "foo"; bar*=utf-8'en'bof
Here the main part of the header value would probably be "foo" including the quotes, but...
What's the significance of a semi-colon (;)?
The specific handling is defined for each named header separately. So semicolon is significant for, say, Content-Disposition, but not for Content-Length.
Obviously this is not a very satisfactory solution but that's what we're stuck with.
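For headers that do define semicolon-separated parameters, Python's `email.message.Message` can pick them apart. A sketch; note this is the email-style parser, which happens to match the common HTTP forms of these particular headers:

```python
from email.message import Message

# Semicolons matter for Content-Disposition, whose grammar defines
# parameters such as filename.
m = Message()
m["Content-Disposition"] = 'attachment; filename="report.pdf"'
assert m.get_filename() == "report.pdf"

# Content-Length's grammar defines no parameters; its value is just
# a number, so a semicolon there would simply be invalid.
```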
I am making a transparent proxy that needs to transparently handle and modify many in-the-wild header fields.
You can't handle these in a generic way, you have to know the form of each possible header. For anything you don't recognise, don't attempt to decompose the header value; and really, so little out there supports RFC 5987 at the moment, it's unlikely you'll be able to do much useful handling of it.
Status quo today is that non-ASCII characters in header values don't work well enough cross-browser to be used at all, either encoded or raw.
Luckily they are rarely needed. The only really common use case is non-ASCII filenames for Content-Disposition but that's easier to work around by putting the filename in a trailing URL path part instead.
Could the value parser for a HTTP header be considered a MIME parser?
No. HTTP borrows heavily from MIME and the RFC 822 family of standards in general, but it isn't part of the 822 family. It has its own low-level grammar for headers which looks like 822, but isn't quite compatible. Arbitrary MIME features can't be used in HTTP, there has to be a standardisation mechanism to drag them into HTTP explicitly—which is what RFC 5987 is, for (parts of) RFC 2231.
(See section 19.4 of RFC 2616 for discussion of some other differences.)
In theory, a multipart form submission is part of the 822 family and you should be able to use RFC 2231 encoding there. But the reality is browsers don't support that either.