The Amazon documentation for "creating a signature" has some pretty specific requirements. In particular, it asks me to:
URL encode the parameter name and values according to the following rules:
Do not URL encode any of the unreserved characters that RFC 3986 defines. These unreserved characters are A-Z, a-z, 0-9, hyphen ( - ), underscore ( _ ), period ( . ), and tilde ( ~ ).
Percent encode all other characters with %XY, where X and Y are hex characters 0-9 and uppercase A-F.
Percent encode extended UTF-8 characters in the form %XY%ZA....
Percent encode the space character as %20 (and not +, as common encoding schemes do).
Does this encoding have a name?
I still don't know if the encoding has a name, but it is defined by RFC 3986. Once I knew that, finding a library was easy.
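For illustration, here is a minimal sketch of the rules quoted above, assuming Python's standard urllib.parse.quote is an acceptable stand-in for whichever library you end up using. The helper name rfc3986_encode is made up for this example and is not part of any Amazon SDK.

```python
from urllib.parse import quote


def rfc3986_encode(value: str) -> str:
    """Percent-encode everything except the RFC 3986 unreserved characters.

    Hypothetical helper name; it only wraps the standard library's quote().
    """
    # quote() never escapes letters, digits, "-", "_" and ".". The tilde is
    # listed explicitly because older Python versions escaped it by default.
    # "/" is left out of `safe`, so it is escaped as well.
    return quote(value, safe="~", encoding="utf-8")


print(rfc3986_encode("a b+c/é~"))  # a%20b%2Bc%2F%C3%A9~
```

The space comes out as %20 rather than +, the hex digits are uppercase, and the non-ASCII character is encoded as one %XY group per UTF-8 byte.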
As the title says: in UTF-8, is there any multi-byte character containing the byte \x27 / chr(39) / ' (the single-quote character)?
You may wonder why anyone would want to know that.
Well, when trying to bypass the function
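function quoteLinuxShellArgument(string $argument): string {
    // A null byte cannot be represented in a shell argument at all.
    if (false !== strpos($argument, "\x00")) {
        throw new \InvalidArgumentException('it is impossible to quote null bytes in shell arguments');
    }
    // Close the quote, emit an escaped single quote, then reopen the quote.
    return "'" . str_replace("'", "'\\''", $argument) . "'";
}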
among my first questions was the one in the title.. is there any?
In UTF-8, any Unicode codepoint that is outside of the ASCII range (U+0000 - U+007F) is required to be encoded using multiple bytes. All of those bytes will have their high bit set to 1.
So no, byte 0x27 (b00100111) will never appear in a multi-byte sequence. 0x27 can only ever be used to encode codepoint U+0027 APOSTROPHE as a single byte.
All of the multi-byte UTF-8 characters have the upper bit set, so there's no chance of colliding with a regular ASCII character. That includes your single quote.
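If you would rather verify this than take it on trust, a brute-force check is cheap. The sketch below is only an illustration (the function name is invented for it): it encodes every non-ASCII code point and counts how many encodings contain a given byte.

```python
def multibyte_sequences_containing(byte_value: int) -> int:
    """Count code points whose multi-byte UTF-8 encoding contains byte_value."""
    hits = 0
    for codepoint in range(0x80, 0x110000):
        if 0xD800 <= codepoint <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if byte_value in chr(codepoint).encode("utf-8"):
            hits += 1
    return hits


print(multibyte_sequences_containing(0x27))  # 0
```

The same holds for any byte below 0x80, because lead bytes and continuation bytes of a multi-byte sequence all have the high bit set.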
What is the correct and official way of using diacritics in URI?
I have 3 different ways shown below:
Here á = %E1, â = %E2, space = %20, comma = %2C, but this link doesn't work properly since the characters are mangled:
http://www.recordspreservation.org/cgi-bin/list_directory_1.cgi?directory=%2CBrasil%2CGoi%E1s%2CLuzi%E2nia%2CSanta%20Luzia%2CBatismos%201749-1753%2CImagens&image_name=_MG_5229.JPG
Here space = %20, comma = %2C and I don't do anything with the a's. This link works:
http://www.recordspreservation.org/cgi-bin/list_directory_1.cgi?directory=%2CBrasil%2CGoiás%2CLuziânia%2CSanta%20Luzia%2CBatismos%201749-1753%2CImagens&image_name=_MG_5229.JPG
Here space = +, comma = %2C and I don't do anything with the a's. This link works:
http://www.recordspreservation.org/cgi-bin/list_directory_1.cgi?directory=%2CBrasil%2CGoiás%2CLuziânia%2CSanta+Luzia%2CBatismos+1749-1753%2CImagens&image_name=_MG_5229.JPG
The characters in a URL string must be within a restricted subset of 7-bit ASCII, and no encoding is specified for wide characters
Some of that set are unreserved, and may be used literally anywhere the syntax allows
The remaining characters are reserved because they form part of the URL syntax; reserved characters must be percent-encoded if they are used outside their syntactical meaning
Eight-bit characters that are in neither the reserved nor the unreserved categories must always be percent-encoded
##Unreserved characters
0 to 9
A to Z
a to z
-
.
_
~
##Reserved characters
! - %21
# - %23
$ - %24
& - %26
' - %27
( - %28
) - %29
* - %2A
+ - %2B
, - %2C
/ - %2F
: - %3A
; - %3B
= - %3D
? - %3F
@ - %40
[ - %5B
] - %5D
This link doesn't work properly since the characters are mangled
That is a problem between the client and the server. It looks like you're sending ISO-8859-1 characters, in which scheme E1 and E2 correspond to a acute (á) and a circumflex (â). But if your server is expecting UTF-8 encoding then those should appear as the byte sequences C3 A1 and C3 A2
I can't tell what encoding is expected by your server, but it clearly isn't what you're sending. The current standard is to encode non-ASCII characters in UTF-8 and percent-encode the resulting bytes
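To make the mismatch concrete, here is a small sketch using Python's standard urllib.parse (chosen purely for illustration; the variable names are mine). It percent-encodes the same name under both encodings, then shows what a server that assumes UTF-8 makes of the ISO-8859-1 form.

```python
from urllib.parse import quote, unquote

name = "Goiás"

latin1_form = quote(name, encoding="iso-8859-1")
utf8_form = quote(name, encoding="utf-8")
print(latin1_form)  # Goi%E1s
print(utf8_form)    # Goi%C3%A1s

# A lone E1 byte is not valid UTF-8, so a UTF-8 decoder replaces it with
# U+FFFD, which is the kind of mangling described in the question.
mangled = unquote(latin1_form, encoding="utf-8")
print(mangled == "Goi\ufffds")  # True
```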
###Update
The best solution is to use the URI module, which will encode character strings as necessary
Take special note that, if you need to use UTF-8-encoded characters in your source code, as below, then you must have the line use utf8; at the top of your program. You also need to make sure that your editor is writing UTF-8 data to the program file.
use utf8;
use strict;
use warnings 'all';
use feature 'say';
use URI;
my $url = URI->new('http://www.recordspreservation.org/cgi-bin/list_directory_1.cgi?directory=,Brasil,Goiás,Luziânia,Santa Luzia,Batismos 1749-1753,Imagens&image_name=_MG_5229.JPG');
say $url;
###output
http://www.recordspreservation.org/cgi-bin/list_directory_1.cgi?directory=,Brasil,Goi%C3%A1s,Luzi%C3%A2nia,Santa%20Luzia,Batismos%201749-1753,Imagens&image_name=_MG_5229.JPG
How can I use # as a special character in a name field in a YANG file?
I am using the string type, which lets me accept every ASCII special character on the keyboard except #.
Is # some kind of keyword, or does it carry a special meaning in the YANG modeling language?
I'm assuming your issue happens during YANG modeling, not during instance document validation.
No, the # character does not have a special meaning in YANG modules. You are most likely attempting to use this character in a YANG identifier, which is not valid. YANG identifiers, such as the arguments to the container, leaf, leaf-list and list statements, have to follow this grammar:
;; An identifier MUST NOT start with (('X'|'x') ('M'|'m') ('L'|'l'))
identifier = (ALPHA / "_")
*(ALPHA / DIGIT / "_" / "-" / ".")
ALPHA = %x41-5A / %x61-7A
; A-Z / a-z
DIGIT = %x30-39
; 0-9
The first character must be an underscore or a letter, and may be followed by letters, digits, underscores, dots and hyphens. An identifier must also not start with xml regardless of letter case.
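As a quick way to test candidate names against that rule, the grammar can be transcribed into a regular expression. This is only a sketch in Python; the function name is invented here and this is not how any particular YANG tool implements the check.

```python
import re

# Direct transcription of the grammar above. The "must not start with xml"
# rule is expressed as a case-insensitive negative lookahead.
IDENTIFIER_RE = re.compile(r"(?![xX][mM][lL])[A-Za-z_][A-Za-z0-9_.-]*")


def is_valid_identifier(name: str) -> bool:
    """Return True if name matches the identifier grammar quoted above."""
    return IDENTIFIER_RE.fullmatch(name) is not None


for candidate in ["my-leaf", "_hidden.v2", "name#1", "9lives", "XmlThing"]:
    print(candidate, is_valid_identifier(candidate))
```

Only the first two candidates pass: name#1 contains a character outside the allowed set, 9lives starts with a digit, and XmlThing starts with xml.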
I have an email subject like this:
Subject: =?gbk?Q?=B3=F6=C3=C0=C1=E2=C7=BF=C1=A6=B3=E9=CA=AA=BB=FA=D2=BB=CC=A8?=
=?gbk?Q?=A3=AC=D6=E9=BA=A3=B9=E3=D6=DD=C9=FA=BB=EE=B1=D8=B1=B8?=
But I don't know what kind of encoding this is.
Could someone help? I'm new to email protocols.
This subject is encoded in GBK, an extension of the GB2312 character set for simplified Chinese characters, used in the People's Republic of China.
As defined in the RFC 1342 specification, to represent non-ASCII text in Internet message headers, you have to encode it with the MIME encoded-word syntax:
encoded-word = "=" "?" charset "?" encoding "?" encoded-text "?" "="
charset = token ; legal charsets defined by RFC 1341
encoding = token ; Either "B" or "Q"
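token = 1*<Any CHAR except SPACE, CTLs, and tspecials>
tspecials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "\" /
<"> / "/" / "[" / "]" / "?" / "." / "="
encoded-text = 1*<Any printable ASCII character other than "?" or
; SPACE> (but see "Use of encoded-words in message
; headers", below)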
The "B" encoding:
The "B" encoding is identical to the "BASE64" encoding defined by
RFC 1341.
The "Q" encoding:
The "Q" encoding is similar to the "Quoted-Printable" content-
transfer-encoding defined in RFC 1341. It is designed to allow text
containing mostly ASCII characters to be decipherable on an ASCII
terminal without decoding.
(1) Any 8-bit value may be represented by a "=" followed by two
hexadecimal digits. For example, if the character set in use
were ISO-8859-1, the "=" character would thus be encoded as
"=3D", and a SPACE by "=20". (Upper case should be used for
hexadecimal digits "A" through "F".)
(2) The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be
represented as "_" (underscore, ASCII 95.). (This character may
not pass through some internetwork mail gateways, but its use
will greatly enhance readability of "Q" encoded data with mail
readers that do not support this encoding.) Note that the "_"
always represents hexadecimal 20, even if the SPACE character
occupies a different code position in the character set in use.
(3) 8-bit values which correspond to printable ASCII characters other
than "=", "?", and "_" (underscore), MAY be represented as those
characters. (But see section 5 for restrictions.) In
particular, SPACE and TAB MUST NOT be represented as themselves
within encoded words.
In your subject:
Subject:
=?gbk?Q?=B3=F6=C3=C0=C1=E2=C7=BF=C1=A6=B3=E9=CA=AA=BB=FA=D2=BB=CC=A8?= =?gbk?Q?=A3=AC=D6=E9=BA=A3=B9=E3=D6=DD=C9=FA=BB=EE=B1=D8=B1=B8?=
We can see from the Q between the question marks that the "Q" (Quoted-Printable-like) encoding has been used, hence the presence of = as the escape character instead of %.
You can find an online encoder here, and an online MIME Headers Decoder here.
Finally, here is your decoded subject:
Subject: 出美菱强力抽湿机一台,珠海广州生活必备
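If you would rather decode it locally than rely on an online tool, most languages ship a MIME header decoder. As one possible sketch, the following uses Python's standard email.header.decode_header; the helper name decode_subject is made up for this example.

```python
from email.header import decode_header

raw_subject = (
    "=?gbk?Q?=B3=F6=C3=C0=C1=E2=C7=BF=C1=A6=B3=E9=CA=AA=BB=FA=D2=BB=CC=A8?= "
    "=?gbk?Q?=A3=AC=D6=E9=BA=A3=B9=E3=D6=DD=C9=FA=BB=EE=B1=D8=B1=B8?="
)


def decode_subject(value: str) -> str:
    """Decode every encoded-word in a header value and join the pieces."""
    pieces = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded words come back as raw bytes plus their declared charset.
            pieces.append(chunk.decode(charset or "ascii"))
        else:
            # A header without encoded words comes back as a plain string.
            pieces.append(chunk)
    return "".join(pieces)


subject = decode_subject(raw_subject)
print(subject)
```

The whitespace between the two encoded words is dropped during decoding, as the specification requires, so the two halves are joined into a single 19-character subject.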
Is a Base64 encoded string completely alphanumeric except for the "=" at the end?
No. As there are only 26 letters in the alphabet (in upper and lower case) and ten digits, you only have 26+26+10=62 distinct alphanumeric characters. As Base64 needs 64, two additional characters are required. These two are plus (+) and slash (/). Additionally, as you said, = is used as padding at the end of the message, if necessary.
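A short sketch in Python makes this easy to see. The two input bytes are chosen here only because they happen to map onto the last two symbols of the alphabet.

```python
import base64
import string

# 0xFB 0xFF splits into the 6-bit groups 62, 63 and 60, followed by padding.
encoded = base64.b64encode(bytes([0xFB, 0xFF])).decode("ascii")
print(encoded)  # +/8=

# The standard alphabet: 62 alphanumeric characters plus "+" and "/".
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
print(len(alphabet))  # 64
```

So a Base64 string can contain + and / anywhere in its body, not just = at the end.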