I'm building an API and I need to trace a message as it goes through my systems. For this purpose, I'm planning to use the X-Correlation-Id HTTP header, but I have run into a few problems.
First, the X- prefix is deprecated and its use discouraged. That would leave me with Correlation-Id.
Second, I intend to use hyphens, but reading https://www.rfc-editor.org/rfc/rfc7230 I realize that the hyphen isn't named explicitly anywhere. Section 3.2.6 mentions excluded characters. Are those the delimiter characters that cannot be used in HTTP header names?
Third, since HTTP headers are case-insensitive, should I define all of my expected headers in lower case? I'm asking because the more research I do, the more variations on this I find. Some APIs have capitalized headers, some don't. Some use hyphens and some use underscores.
Are there unambiguous guidelines for this stuff? I'm telling my boss that using camelCase for HTTP headers is not optimal, but I can't find any guidelines for this.
As long as the use is private, the field name does not matter much. When in doubt, use the application name as the prefix instead of "X".
Field names use the "token" ABNF, and that does include "-".
As they are case-insensitive, it doesn't matter how you "define" them. When in doubt, use the same convention as the HTTP spec. CamelCase in particular is not helpful, as HTTP/2 (and 3) lowercase everything on the wire.
RFC 2822 defines the production rules for Headers
A field name MUST be composed of printable US-ASCII characters (i.e., characters that have values between 33 and 126, inclusive), except colon.
You'll find an ABNF representation in the section that describes optional fields
optional-field = field-name ":" unstructured CRLF
field-name = 1*ftext
ftext = %d33-57 /     ; Any character except
        %d59-126      ; controls, SP, and
                      ; ":".
In HTTP specifically, you'll want to be paying attention to the production rules defined in Appendix B (note that HYPHEN-MINUS is a permitted tchar)
field-name = token
header-field = field-name ":" OWS field-value OWS
tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
"^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
token = 1*tchar
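As a rough illustration of the token rule above, the following Python sketch checks a candidate field name against the tchar set. The names TOKEN_RE and is_valid_field_name are made up for this example and are not part of any HTTP library.

```python
import re

# Character class mirroring the tchar rule quoted above; token = 1*tchar.
TOKEN_RE = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def is_valid_field_name(name: str) -> bool:
    """Return True if name is a non-empty run of tchar characters."""
    return TOKEN_RE.fullmatch(name) is not None


for candidate in ("Correlation-Id", "correlation_id", "Correlation Id", "Correlation:Id"):
    print(candidate, is_valid_field_name(candidate))
```

The hyphen and underscore forms both pass; the forms containing a space or a colon do not, since neither character is a tchar.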
If you review the IANA Message Headers Registry, you'll find quite a few headers that include hyphens in the spelling. Careful examination will show a number of standard HTTP headers that include hyphens (Accept-Language, Content-Type, etc).
Third, since the HTTP headers are case-insensitive, should I define all of my expected headers in lower case?
For your specification, I would recommend spelling conventions consistent with the entries in the IANA registry; in your implementation, you would use a case insensitive match of the specified spelling.
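A case-insensitive match could look something like the sketch below. It assumes the incoming headers are available as a plain mapping of names to values; get_header and the sample data are hypothetical, and most web frameworks already provide an equivalent lookup.

```python
from typing import Mapping, Optional


def get_header(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Look up a header value, ignoring the case of the field name."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


# The specification can spell the name "Correlation-Id"; the wire form may differ in case.
incoming = {"correlation-id": "abc-123", "Content-Type": "application/json"}
print(get_header(incoming, "Correlation-Id"))
```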
w3fools claims that URLs can contain spaces: http://w3fools.com/#html_urlencode
Is this true? How can a URL contain an un-encoded space?
I'm under the impression the request line of an HTTP Request uses a space as a delimiter, being formatted as {the method}{space}{the path}{space}{the protocol}:
GET /index.html HTTP/1.1
Therefore how can a URL contain a space? If it can, where did the practice of replacing spaces with + come from?
A URL must not contain a literal space. It must either be encoded using the percent-encoding or a different encoding that uses URL-safe characters (like application/x-www-form-urlencoded that uses + instead of %20 for spaces).
But whether the statement is right or wrong depends on the interpretation: Syntactically, a URI must not contain a literal space and it must be encoded; semantically, a %20 is not a space (obviously) but it represents a space.
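For illustration, Python's standard urllib.parse module exposes both encodings mentioned above. This is only a sketch of the difference, with an arbitrary sample string:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

title = "my beautiful page"

percent_encoded = quote(title)     # generic percent-encoding: space becomes %20
form_encoded = quote_plus(title)   # form encoding: space becomes +

print(percent_encoded)
print(form_encoded)

# Decoding must match the encoding: unquote leaves "+" alone,
# while unquote_plus turns it back into a space.
print(unquote("a+b"), "|", unquote_plus("a+b"))
```

Which decoder applies depends on where the value sits: "+" only means a space in form-encoded data, not in a URL path.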
They are indeed fools. If you look at RFC 3986 Appendix A, you will see that "space" is simply not mentioned anywhere in the grammar for defining a URL. Since it's not mentioned anywhere in the grammar, the only way to encode a space is with percent-encoding (%20).
In fact, the RFC even states that spaces are delimiters and should be ignored:
In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may
have to be added to break a long URI across lines. The whitespace
should be ignored when the URI is extracted.
and
For robustness, software that accepts user-typed URI should attempt
to recognize and strip both delimiters and embedded whitespace.
Curiously, the use of + as an encoding for space isn't mentioned in the RFC, although it is reserved as a sub-delimiter. I suspect that its use is either just convention or covered by a different RFC (possibly HTTP).
Spaces are simply replaced by "%20", like this:
http://www.example.com/my%20beautiful%20page
The information there is I think partially correct:
That's not true. An URL can use spaces. Nothing defines that a space is replaced with a + sign.
As you noted, an URL can NOT use spaces. The HTTP request would get screwed over. I'm not sure where the + is defined, though %20 is standard.
I am trying to create multiple HTML files that are associated with an email address. But since the "@" cannot be used in filenames, and in order to avoid confusion, I am trying to replace it with a character that won't normally exist in an email address.
Does anything come to mind?
Thanks!
Commas and semicolons are not allowed in email addresses, but they are allowed in filenames on most file systems.
I believe '~' is used for this purpose.
According to the link here, almost all ASCII characters are allowed in email addresses so long as the special characters aren't at the beginning or the end.
What characters are allowed in an email address?
Any of , (comma) ; (semi-colon) <> (angle brackets) [] (square brackets) or " (double quote) should work for most cases.
Since these characters are allowed in quoted strings, you could replace the "@" with a sequence that would be invalid, such as three double quotes in a row.
According to the RFC
within a quoted string, any ASCII graphic or space is permitted without blackslash-quoting except double-quote and the backslash itself.
You could have an email abc."~~~".def@rst.xyz. But you could not have abc.""".def@rst.xyz; it would have to be abc."\"\"\"".def@rst.xyz, with each double quote backslash-quoted. So you could safely use """ as a substitute for @ in the filename.
However, the RFC also says
While the above definition for Local-part is relatively permissive,
for maximum interoperability, a host that expects to receive mail
SHOULD avoid defining mailboxes where the Local-part requires (or
uses) the Quoted-string form or where the Local-part is case-
sensitive.
With SHOULD meaning "...that
there may exist valid reasons in particular circumstances when the
particular behavior is acceptable or even useful, but the full
implications should be understood and the case carefully weighed
before implementing..." RFC2119
So, although """ will work, are the chances you will see an email with quotes worth the trouble of designing for it? If not, then use one of the single characters.
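A minimal sketch of the single-character approach follows. It assumes a comma as the substitute and a .html suffix; both function names are hypothetical. The guard covers the rare quoted-string address that already contains the substitute.

```python
def email_to_filename(address: str, substitute: str = ",") -> str:
    """Build an HTML filename from an address, swapping "@" for substitute."""
    if substitute in address:
        raise ValueError("substitute already occurs in the address")
    return address.replace("@", substitute) + ".html"


def filename_to_email(filename: str, substitute: str = ",") -> str:
    """Reverse email_to_filename."""
    if not filename.endswith(".html"):
        raise ValueError("expected a .html filename")
    return filename[: -len(".html")].replace(substitute, "@")


name = email_to_filename("user@example.com")
print(name)
print(filename_to_email(name))
```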
I'm really confused when it comes to the format of Content-Id headers in message parts.
It seems to me that only RFC 2045 covers the format of the header, however briefly:
In constructing a high-level user agent, it may be desirable to allow
one body to make reference to another. Accordingly, bodies may be
labelled using the "Content-ID" header field, which is syntactically
identical to the "Message-ID" header field:
id := "Content-ID" ":" msg-id
Like the Message-ID values, Content-ID values must be generated to be
world-unique.
RFC 2822 explains the format of a msg-id token like so:
The message identifier (msg-id) is similar in syntax to an angle-addr
construct without the internal CFWS.
message-id = "Message-ID:" msg-id CRLF
in-reply-to = "In-Reply-To:" 1*msg-id CRLF
references = "References:" 1*msg-id CRLF
msg-id = [CFWS] "<" id-left "@" id-right ">" [CFWS]
id-left = dot-atom-text / no-fold-quote / obs-id-left
id-right = dot-atom-text / no-fold-literal / obs-id-right
no-fold-quote = DQUOTE *(qtext / quoted-pair) DQUOTE
no-fold-literal = "[" *(dtext / quoted-pair) "]"
Long story short: it includes the at ('@') symbol, just like the Message-Id header of a message. However, almost all reader-friendly articles on the MIME format give examples of Content-Id without the at symbol (including not-really-global identifiers like myimagecid or inlineimage001 as well as randomly generated UUIDs without the at symbol). They would surely stress the importance of the '@' symbol if that were necessary, just like they do with the Message-Id header, right? Right?
I've run some tests on real-world email clients to see how they compose emails with embedded inline images:
Thunderbird generates identifiers with the at symbol. Example: part1.12345678.12345678@domain.example.com
Gmail generates identifiers without such symbol and with no domain part. Example: ii_abc1234x0_12345ab12abcdefa
I didn't test any more email clients (if someone did, it'd be great to complete the list above), but these two already show the striking difference. Google not obeying RFC standards? It sure looks smelly and I want to know whether that's because I missed something, or because the format isn't really that important after all (which in the long run feels rather disturbing). I'm also interested in checking how many popular email clients actually discard the 'at' symbol.
Go by what the spec says, not by what some mail clients do.
So yes, a Content-Id header should have a value that conforms to the specification and should therefore contain an '@' symbol.
The world of email is a broken hell hole of many different mail clients and servers doing their own thing and not respecting the standards.
As someone who has written mail software for the past 17 years, I can assure you, this is not the only place that Google deviates from the specs.
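For what it's worth, Python's standard library can generate a value in the msg-id shape. The sketch below uses email.utils.make_msgid; the domain example.com is a placeholder, not something taken from the question.

```python
from email.utils import make_msgid

# make_msgid returns a string of the form "<unique-part@domain>".
content_id = make_msgid(domain="example.com")
print(content_id)

# Inside the HTML body the image is referenced without the angle brackets.
html_reference = "cid:" + content_id[1:-1]
print(html_reference)
```

The unique part changes on every call, so the printed values will differ from run to run.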
I need to send some parameters to my web server that contain special characters such as * in the URL.
Ex: http://localhost/mypage/X123*12362issasa.
I will use this special character for some regexp.
When I try this, I get:
"403 Forbidden You don't have permission to access
/mypage/X123*12362issasa. on this server."
apache_error.log contains this line:
[error] [client 127.0.0.1] (20025)The given path contained wildcard
characters: access to /mypage/X123*12362issasa failed
My .htaccess contains these lines:
Options +ExecCGI
AddHandler cgi-script .cgi .pl .py .php
DirectoryIndex mypage.pl
<IfModule mod_charset.c>
CharsetRecodeMultipartForms off
</IfModule>
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ mypage.pl?oid=$1
</IfModule>
Could someone please help me properly configure the .htaccess file so that it accepts this kind of special character?
Any help will be appreciated.
Thanks.
Reasons you shouldn't use an Asterisk in your URL.
1 ) The asterisk is allowed in the URL unencoded according to the standard, but according to RFC 1738 Uniform Resource Locators (URL) it's a special character. So in this case it has a special use.
RFC 1738 Uniform Resource Locators (URL) December 1994
Reserved:
Many URL schemes reserve certain characters for a special meaning:
their appearance in the scheme-specific part of the URL has a
designated semantics. If the character corresponding to an octet is
reserved in a scheme, the octet must be encoded. The characters ";",
"/", "?", ":", "@", "=" and "&" are the characters which may be
reserved for special meaning within a scheme. No other characters may
be reserved within a scheme.
Usually a URL has the same interpretation when an octet is
represented by a character and when it encoded. However, this is not
true for reserved characters: encoding a character reserved for a
particular scheme may change the semantics of a URL.
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
On the other hand, characters that are not required to be encoded
(including alphanumerics) may be encoded within the scheme-specific
part of a URL, as long as they are not being used for a reserved
purpose.
Additionally in headers it's used for server-only declarations according to RFC 2068 HTTP 1.1.
9.2 OPTIONS
The OPTIONS method represents a request for information about the
communication options available on the request/response chain
identified by the Request-URI. This method allows the client to
determine the options and/or requirements associated with a resource,
or the capabilities of a server, without implying a resource action
or initiating a resource retrieval.
Unless the server's response is an error, the response MUST NOT
include entity information other than what can be considered as
communication options (e.g., Allow is appropriate, but Content-Type
is not). Responses to this method are not cachable.
If the Request-URI is an asterisk ("*"), the OPTIONS request is
intended to apply to the server as a whole. A 200 response SHOULD
include any header fields which indicate optional features
implemented by the server (e.g., Public), including any extensions
not defined by this specification, in addition to any applicable
general or response-header fields. As described in section 5.1.2, an
"OPTIONS *" request can be applied through a proxy by specifying the
destination server in the Request-URI without any path information.
If the Request-URI is not an asterisk, the OPTIONS request applies
only to the options that are available when communicating with that
resource. A 200 response SHOULD include any header fields which
indicate optional features implemented by the server and applicable
to that resource (e.g., Allow), including any extensions not defined
by this specification, in addition to any applicable general or
response-header fields. If the OPTIONS request passes through a
proxy, the proxy MUST edit the response to exclude those options
which apply to a proxy's capabilities and which are known to be
unavailable through that proxy.
2 ) It's a reserved character (sub delimiter) as of RFC 3986 URI Generic Syntax January 2005 (thanks to @Daxim for pointing this out).
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
The purpose of reserved characters is to provide a set of
delimiting characters that are distinguishable from other data
within a URI. URIs that differ in the replacement of a reserved
character with its corresponding percent-encoded octet are not
equivalent. Percent-encoding a reserved character, or decoding a
percent-encoded octet that corresponds to a reserved character,
will change how the URI is interpreted by most applications. Thus,
characters in the reserved set are protected from normalization and
are therefore safe to be used by scheme-specific and
producer-specific algorithms for delimiting data subcomponents
within a URI.
3 ) In some of the host operating systems it's used as a wildcard.
Either way you should avoid using unencoded asterisks in your request URI.
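On the client side, percent-encoding the asterisk before building the URL might look like the Python sketch below. It only shows the encoding step; whether the server then accepts the encoded form depends on its configuration and was not tested here.

```python
from urllib.parse import quote, unquote

raw_id = "X123*12362issasa."

# safe="" asks quote to encode every character outside its always-safe set,
# which includes the asterisk but not letters, digits, or the period.
encoded_id = quote(raw_id, safe="")
url = "http://localhost/mypage/" + encoded_id

print(url)
print(unquote(encoded_id) == raw_id)
```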
I need to take an arbitrary string and make it safe to use as part of an email address, replacing characters if necessary. This is to submit text to an API that requires values to be encoded into the address the email is being sent to. However, this is fraught with pitfalls and I'd like to make sure that whatever is input isn't going to break mail delivery. I'm already planning on just double-quoting the entire local part, which is going to mitigate a lot of the issues.
I've found a guide to what characters can be used in the local part of an email address, but it seems to make no distinction between 'It's forbidden by the RFCs' and 'It can be confusing so it's best to stay away' and 'it can only be used when escaped properly'. Does anyone have a reference to something clearer/faster to read than the appropriate RFCs themselves?
Edit: I have no control over the parsing on the receiving end nor can I change how the text gets submitted as something other than a straight ASCII string.
The RFCs themselves are perfectly clear. Ignoring obsolete and quoted forms, the local part of an address is:
atext = ALPHA / DIGIT / ; Any character except controls,
"!" / "#" / ; SP, and specials.
"$" / "%" / ; Used for atoms
"&" / "'" /
"*" / "+" /
"-" / "/" /
"=" / "?" /
"^" / "_" /
"`" / "{" /
"|" / "}" /
"~"
Assuming you control the API, why not Base64 encode the data?
Base64EncodeString("Hello")
Base64DecodeString("SGVsbG8=")
You should probably replace the padding character = with an email-safe character like - minus.
Edit
It seems = is a safe character, no need to replace it.
However, the resulting text may start with a number so pad it with a letter and have the receiver discard the padding.
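In Python, the Base64 round trip suggested above could be sketched as follows. The function names are hypothetical, and the sketch leaves out the optional letter padding. Two things to keep in mind: Base64 output is case-sensitive, and the local part of an address is limited to 64 octets, so long inputs will not fit.

```python
import base64


def encode_local_part(text: str) -> str:
    """Base64-encode text; the alphabet (A-Z, a-z, 0-9, +, /, =) is all atext."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_local_part(token: str) -> str:
    """Reverse encode_local_part on the receiving side."""
    return base64.b64decode(token).decode("utf-8")


token = encode_local_part("Hello")
print(token)
print(decode_local_part(token))
```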
Why don't you just confine the characters you use in the email address to the ones in the table you referenced that are marked "OK"?
That would include plus, minus, hyphen, period, upper and lower case letters, numeric digits, and the underscore.
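A rough sketch of that restriction in Python: every character outside the listed set is replaced with an underscore, and dots are tidied up because an unquoted local part cannot start or end with a period or contain two in a row. The function name and the choice of underscore are assumptions, and the mapping is lossy, so the original text cannot be recovered from the result.

```python
import re


def sanitize_local_part(text: str, replacement: str = "_") -> str:
    """Keep letters, digits, plus, hyphen, period, underscore; replace the rest."""
    cleaned = re.sub(r"[^A-Za-z0-9+\-_.]", replacement, text)
    # Collapse runs of dots and drop leading or trailing dots.
    cleaned = re.sub(r"\.{2,}", ".", cleaned).strip(".")
    if not cleaned:
        raise ValueError("nothing left after sanitizing")
    return cleaned


print(sanitize_local_part("hello world!"))
print(sanitize_local_part("a..b."))
```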