What is the proper way of denoting URI with diacritics (letters with accents)? - perl

What is the correct and official way of using diacritics in URI?
I have 3 different ways shown below:
Here á = %E1, â = %E2, space = %20, comma = %2C, but this link doesn't work properly since the characters are mangled:
http://www.recordspreservation.org/cgi-bin/list_directory_1.cgi?directory=%2CBrasil%2CGoi%E1s%2CLuzi%E2nia%2CSanta%20Luzia%2CBatismos%201749-1753%2CImagens&image_name=_MG_5229.JPG
Here space = %20, comma = %2C and I don't do anything with the a's. This link works:
http://www.recordspreservation.org/cgi-bin/list_directory_1.cgi?directory=%2CBrasil%2CGoiás%2CLuziânia%2CSanta%20Luzia%2CBatismos%201749-1753%2CImagens&image_name=_MG_5229.JPG
Here space = +, comma = %2C and I don't do anything with the a's. This link works:
http://www.recordspreservation.org/cgi-bin/list_directory_1.cgi?directory=%2CBrasil%2CGoiás%2CLuziânia%2CSanta+Luzia%2CBatismos+1749-1753%2CImagens&image_name=_MG_5229.JPG

The characters in a URL string must be within a restricted subset of 7-bit ASCII, and no encoding is specified for wide characters
Some of that set are unreserved, and may be used literally anywhere the syntax allows
The remaining characters are reserved because they form part of the URL syntax; reserved characters must be percent-encoded if they are used outside their syntactical meaning
Eight-bit characters that are in neither the reserved nor the unreserved categories must always be percent-encoded
##Unreserved characters
0 to 9
A to Z
a to z
-
.
_
~
##Reserved characters
! - %21
# - %23
$ - %24
& - %26
' - %27
( - %28
) - %29
* - %2A
+ - %2B
, - %2C
/ - %2F
: - %3A
; - %3B
= - %3D
? - %3F
@ - %40
[ - %5B
] - %5D
This link doesn't work properly since the characters are mangled
That is a problem between the client and the server. It looks like you're sending ISO-8859-1 characters, in which scheme E1 and E2 correspond to a-acute (á) and a-circumflex (â). But if your server is expecting UTF-8 encoding then those should appear as the byte sequences C3 A1 and C3 A2
I can't tell what encoding is expected by your server, but it clearly isn't what you're sending. The current standard is to encode non-ASCII characters in UTF-8 and percent-encode the resulting bytes
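As a rough illustration of the difference, the sketch below shows the bytes each encoding produces for á, and then percent-encodes a UTF-8 string by hand. The helper name `percent_encode_utf8` is made up for this example, and the characters are written as `\x{...}` escapes so the result does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 encode a character string, then
# percent-encode every byte outside the RFC 3986 unreserved set.
sub percent_encode_utf8 {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# "\x{E1}" is a-acute, "\x{E2}" is a-circumflex.
my $latin1 = uc unpack('H*', encode('ISO-8859-1', "\x{E1}"));   # E1
my $utf8   = uc unpack('H*', encode('UTF-8',      "\x{E1}"));   # C3A1
print "ISO-8859-1: $latin1, UTF-8: $utf8\n";

print percent_encode_utf8("Goi\x{E1}s,Luzi\x{E2}nia"), "\n";
# prints Goi%C3%A1s%2CLuzi%C3%A2nia
```

Note that this helper encodes every reserved character, so it is only suitable for a single component such as one query value, not for a whole URL.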
###Update
The best solution is to use the URI module, which will encode character strings as necessary
Take special note that, if you need to use UTF-8-encoded characters in your source code, as below, then you must have use utf8 at the top of your program. You also need to make sure that your editor is writing UTF-8 data to the program file.
use utf8;
use strict;
use warnings 'all';
use feature 'say';
use URI;
my $url = URI->new('http://www.recordspreservation.org/cgi-bin/list_directory_1.cgi?directory=,Brasil,Goiás,Luziânia,Santa Luzia,Batismos 1749-1753,Imagens&image_name=_MG_5229.JPG');
say $url;
###output
http://www.recordspreservation.org/cgi-bin/list_directory_1.cgi?directory=,Brasil,Goi%C3%A1s,Luzi%C3%A2nia,Santa%20Luzia,Batismos%201749-1753,Imagens&image_name=_MG_5229.JPG

Related

Mail subject decoding?

I have a email subject like this:
Subject: =?gbk?Q?=B3=F6=C3=C0=C1=E2=C7=BF=C1=A6=B3=E9=CA=AA=BB=FA=D2=BB=CC=A8?=
=?gbk?Q?=A3=AC=D6=E9=BA=A3=B9=E3=D6=DD=C9=FA=BB=EE=B1=D8=B1=B8?=
But I don't know what kind of encoding is this?
Could someone help? Newbie to email protocol.
This subject is encoded in GBK, an extension of the GB2312 character set for simplified Chinese characters, used in the People's Republic of China.
As defined in the RFC1342 specification, to represent non-ASCII text in Internet Message headers, you have to encode it with the MIME encoded-word syntax:
encoded-word = "=" "?" charset "?" encoding "?" encoded-text "?" "="
charset = token ; legal charsets defined by RFC 1341
encoding = token ; Either "B" or "Q"
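token = 1*<Any CHAR except SPACE, CTLs, and tspecials>
tspecials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "\" /
<"> / "/" / "[" / "]" / "?" / "." / "="
encoded-text = 1*<Any printable ASCII character other than "?" or
; SPACE> (but see "Use of encoded-words in message
; headers", below)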
The "B" encoding:
The "B" encoding is identical to the "BASE64" encoding defined by
RFC 1341.
The "Q" encoding:
The "Q" encoding is similar to the "Quoted-Printable" content-
transfer-encoding defined in RFC 1341. It is designed to allow text
containing mostly ASCII characters to be decipherable on an ASCII
terminal without decoding.
(1) Any 8-bit value may be represented by a "=" followed by two
hexadecimal digits. For example, if the character set in use
were ISO-8859-1, the "=" character would thus be encoded as
"=3D", and a SPACE by "=20". (Upper case should be used for
hexadecimal digits "A" through "F".)
(2) The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be
represented as "_" (underscore, ASCII 95.). (This character may
not pass through some internetwork mail gateways, but its use
will greatly enhance readability of "Q" encoded data with mail
readers that do not support this encoding.) Note that the "_"
always represents hexadecimal 20, even if the SPACE character
occupies a different code position in the character set in use.
(3) 8-bit values which correspond to printable ASCII characters other
than "=", "?", and "_" (underscore), MAY be represented as those
characters. (But see section 5 for restrictions.) In
particular, SPACE and TAB MUST NOT be represented as themselves
within encoded words.
In your subject:
Subject:
=?gbk?Q?=B3=F6=C3=C0=C1=E2=C7=BF=C1=A6=B3=E9=CA=AA=BB=FA=D2=BB=CC=A8?= =?gbk?Q?=A3=AC=D6=E9=BA=A3=B9=E3=D6=DD=C9=FA=BB=EE=B1=D8=B1=B8?=
We can see from the Q after the charset name that the "Q" encoding (similar to Quoted-Printable) has been used, hence the presence of = as the escape character instead of %.
You can find an online encoder here, and an online MIME Headers Decoder here.
Finally, here is your decoded subject:
Subject: 出美菱强力抽湿机一台,珠海广州生活必备
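If you would rather decode it yourself, the rules quoted above can be sketched in a few lines of Perl using the core Encode module. This is only a minimal sketch: the function names are invented for the example, it handles the "Q" encoding only (not "B"), and it assumes your Encode installation knows the charset named in the header.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode the text of one "Q" encoded-word.
sub decode_q_word {
    my ($charset, $text) = @_;
    $text =~ tr/_/ /;                               # "_" always means 0x20
    $text =~ s/=([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # "=XX" is one byte
    return decode($charset, $text);                 # bytes -> characters
}

# Hypothetical helper: decode every "Q" encoded-word in a header value.
sub decode_q_header {
    my ($header) = @_;
    # Whitespace between adjacent encoded-words is not part of the text.
    $header =~ s/\?=\s+=\?/?==?/g;
    $header =~ s/=\?([^?\s]+)\?[Qq]\?([^?\s]*)\?=/decode_q_word($1, $2)/ge;
    return $header;
}

my $subject = '=?gbk?Q?=B3=F6=C3=C0=C1=E2=C7=BF=C1=A6=B3=E9=CA=AA=BB=FA=D2=BB=CC=A8?= '
            . '=?gbk?Q?=A3=AC=D6=E9=BA=A3=B9=E3=D6=DD=C9=FA=BB=EE=B1=D8=B1=B8?=';
my $decoded = decode_q_header($subject);

binmode STDOUT, ':encoding(UTF-8)';
print "Subject: $decoded\n";
```

For real mail handling, a tested MIME library is a safer choice than a hand-rolled decoder like this one.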

Amazon Signature Encoding

The Amazon documentation for "creating a signature" has some pretty specific requirements. In particular, it asks me to:
URL encode the parameter name and values according to the following rules:
Do not URL encode any of the unreserved characters that RFC 3986 defines. These unreserved characters are A-Z, a-z, 0-9, hyphen ( - ), underscore ( _ ), period ( . ), and tilde ( ~ ).
Percent encode all other characters with %XY, where X and Y are hex characters 0-9 and uppercase A-F.
Percent encode extended UTF-8 characters in the form %XY%ZA....
Percent encode the space character as %20 (and not +, as common encoding schemes do).
Does this encoding have a name?
I still don't know if the encoding has a name, but it is defined by RFC 3986. Once I knew that, finding a library was easy.
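For example, the rules quoted above can be followed with `uri_escape_utf8` from URI::Escape (shipped with the URI distribution). The parameter names and values below are invented for illustration, and the set of safe characters is passed explicitly rather than relying on the module's default:

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape_utf8);

# Hypothetical request parameters, for illustration only.
my %params = (
    Action => 'DescribeThings',
    Name   => "Goi\x{E1}s Luzi\x{E2}nia",
    Range  => '1749-1753~a_b.c',
);

# Escape everything except the RFC 3986 unreserved characters.
my $keep = '^A-Za-z0-9\-._~';

my @pairs;
for my $name (sort keys %params) {
    push @pairs,
        uri_escape_utf8($name, $keep) . '=' . uri_escape_utf8($params{$name}, $keep);
}
print "$_\n" for @pairs;
# prints:
# Action=DescribeThings
# Name=Goi%C3%A1s%20Luzi%C3%A2nia
# Range=1749-1753~a_b.c
```

The space becomes %20 rather than +, and the hex digits come out in upper case, which matches the requirements listed in the question.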

CAM::PDF returning non ascii character instead of quotes

I am having trouble with non ascii characters being returned. I am not sure at which level the issue resides. It could be the actual PDF encoding, the decoding used by CAM::PDF (which is FlateDecode) or CAM::PDF itself. The following returns a string full of the commands used to create the PDF (Tm, Tj, etc).
use CAM::PDF;
my $filename = "sample.pdf";
my $cam_obj = CAM::PDF->new($filename) or die "$CAM::PDF::errstr\n";
my $tree = $cam_obj->getPageContentTree(1);
my $page_string = $tree->toString();
print $page_string;
You can download sample.pdf here
The text returned in the Tj often has one character which is non ASCII. In the PDF, the actual character is almost always a quote, single or double.
While reproducing this I found that the returned character is consistent within the PDF but varies amongst PDFs. I also noticed the PDF is using a specific font file. I'm now looking into font files to see if the same character can be mapped to varying binary values.
:edit:
Regarding Windows-1252. My PDF returns an "Õ" instead of apostrophes. The Õ character is hex 0xD5 in Windows-1252 and U+00D5 in Unicode (in UTF-8 it is the two bytes C3 95). If the idea is that the character is encoded with Windows-1252, then it should be a hex 0x91 or 0x92, which it is not. Which is why the following does nothing to the character:
use Encode qw(decode encode);
my $page_string = 'Õ';
my $characters = decode 'Windows-1252', $page_string;
my $octets = encode 'UTF-8', $characters;
open my $sts, '>', 'TEST.txt' or die "Cannot open TEST.txt: $!";
print {$sts} $octets . "\n";
close $sts;
I'm the author of CAM-PDF. Your PDF is non-compliant. From the PDF 1.7 specification, section 3.2.3 "String Objects":
"Within a literal string, the backslash (\) is used as an escape
character for various purposes, such as to include newline characters,
nonprinting ASCII characters, unbalanced parentheses, or the backslash
character itself in the string. [...] The \ddd escape sequence provides
a way to represent characters outside the printable ASCII character set."
If you have large quantities of non-ASCII characters, you can represent them using hexadecimal string notation.
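To make the two notations concrete, here is a small sketch that writes a byte string both as a literal string with \ddd escapes and as a hexadecimal string. The helper names are hypothetical and are not part of CAM::PDF.

```perl
use strict;
use warnings;

# Hypothetical helper: PDF literal string with non-printable and
# 8-bit bytes written as \ddd octal escapes.
sub pdf_literal_string {
    my ($bytes) = @_;
    $bytes =~ s/([\\()])/\\$1/g;                                # escape \ ( )
    $bytes =~ s/([^\x20-\x7E])/sprintf('\\%03o', ord($1))/ge;   # \ddd octal
    return "($bytes)";
}

# Hypothetical helper: the same bytes in hexadecimal string notation.
sub pdf_hex_string {
    my ($bytes) = @_;
    return '<' . uc(unpack('H*', $bytes)) . '>';
}

print pdf_literal_string("It\xD5s (ok)"), "\n";   # prints (It\325s \(ok\))
print pdf_hex_string("It\xD5s"), "\n";            # prints <4974D573>
```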
EDIT: Perhaps my interpretation of the spec is incorrect, given a_note's alternative answer. I'll have to revisit this... Certainly, the spec could be clearer in this area.
Sorry to intrude, and with all due respect, sir, but the file IS compliant. Section 3.2.3 further states:
[The \ddd] notation provides a way to specify characters outside the
7-bit ASCII character set by using ASCII characters only. However,
any 8-bit value may appear in a string.
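Since a raw 8-bit value is allowed, what the byte means depends on the encoding the font uses. One quick way to investigate is to decode the byte under a few candidate single-byte encodings and see which one yields a quote. This is only a guess at a diagnostic: nothing in the question confirms which encoding the font in sample.pdf actually uses.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\xD5";   # the byte that shows up in place of the apostrophe

# Candidate encodings are assumptions chosen for this experiment.
for my $encoding (qw(cp1252 iso-8859-1 MacRoman)) {
    my $char = decode($encoding, $byte);
    printf "%-10s 0x%02X -> U+%04X\n", $encoding, ord($byte), ord($char);
}
# prints:
# cp1252     0xD5 -> U+00D5
# iso-8859-1 0xD5 -> U+00D5
# MacRoman   0xD5 -> U+2019
```

U+2019 is the right single quotation mark, so a Mac Roman style font encoding would be consistent with an apostrophe appearing as Õ when the byte is read as Windows-1252.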
"receiving" - where? You get "Õ" instead of expected what? And doing exactly what? You know that windows command prompt uses dos code page, not windows-1252, right? (oops, new thread again... probably i should register here :-) )

Check characters inside string for their Unicode value

I would like to replace characters with certain Unicode values in a variable with dash. I have two ideas which might work, but I do not know how to check for the value of character:
1/ processing variable as string, checking every characters value and placing these characters in a new variable (replacing those characters which are invalid)
2/ use this magic :-)
$variable =~ s/[$char_range]/-/g;
char_range should be similar to [0-9] or [A-Z], but it should be values for utf-8 characters. I need range from 0x00 to 0x7F to be exact.
The following expression should replace anything that is not ASCII with a hyphen, which is (I think) what you want to do:
s/[\N{U+0080}-\N{U+FFFF}]/-/g
There's no such thing as UTF-8 characters. There are only characters that you encode into UTF-8. Even then, you don't want to make ranges outside of the magical ones that Perl knows about. You're likely to get more than you expect.
To get the ordinal value for a character, use ord:
use utf8;
use feature 'say'; # needed for say
my $code_number = ord '😸'; # U+1F638
say sprintf "%#x", $code_number;
However, I don't think that's what you need. It sounds like you want to replace characters in the ASCII range with a -. You can specify ranges of code numbers:
s/[\000-\177]/-/g; # in octal
s/[\x00-\x7f]/-/g; # in hexadecimal
You can specify wide character ordinal values in braces:
s/[\x80-\x{10ffff}]/-/g; # wide characters, replace non-ASCII in this case
When the characters have a common property, you can use that:
s/\p{ASCII}/-/g;
However, if you are replacing things character for character, you might want a transliteration:
$string =~ tr/\000-\177/-/;
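Putting these side by side on one sample string may help. The sketch below shows the substitution, the transliteration, and the character-by-character approach from the question; the sample text is made up and written with `\x{...}` escapes so no `use utf8` is needed.

```perl
use strict;
use warnings;

my $text = "caf\x{E9} na\x{EF}ve \x{1F638}";

# Replace everything outside 0x00-0x7F with a hyphen.
(my $no_non_ascii = $text) =~ s/[^\x00-\x7F]/-/g;

# Replace everything inside 0x00-0x7F with a hyphen.
(my $no_ascii = $text) =~ tr/\x00-\x7F/-/;

# Same as the first result, checking each character's ordinal value.
my $by_ord = join '', map { ord($_) > 0x7F ? '-' : $_ } split //, $text;

binmode STDOUT, ':encoding(UTF-8)';
print "$no_non_ascii\n";   # prints caf- na-ve -
print "$by_ord\n";         # prints caf- na-ve -
print "$no_ascii\n";
```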

Unable to encode to iso-8859-1 encoding for some chars using Perl Encode module

I have an HTML string in ISO-8859-1 encoding. I need to pass this string to HTML::Entities::decode_entities() to convert some of the HTML character entities to their respective characters. To do so I am using the module HTML::Parser::Entities 3.65, but after the decode_entities() operation my whole string changes to a UTF-8 string. This behavior seems consistent with the HTML::Parser documentation. As I need this string back in ISO-8859-1 for further processing, I have used Encode::encode("iso-8859-1",$str) to change the string back to ISO-8859-1 encoding.
My results are fine except for some characters, for which a question mark appears instead. One example is the right single quote (’).
Can anybody tell me whether this is a limitation of the Encode module? Any other pointers to solve the problem would also be helpful.
I am pasting the sample text having the char causing the issue:
my $str = "This is a test string to test the encoding of some chars like ’ “ ” etc these are failing to encode; some of them which encode correctly are é « etc.";
Thanks
There's a third argument to encode, which controls the checking it does. The default is to use a substitution character, but you can set it to FB_CROAK to get an error message.
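A small sketch of the difference, using a made-up sample string containing U+2019:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "it\x{2019}s";   # U+2019 RIGHT SINGLE QUOTATION MARK

# Default behaviour: unmappable characters become the substitution character.
my $lossy = encode('iso-8859-1', $text);
print "$lossy\n";           # prints it?s

# With FB_CROAK, encode dies instead. Work on a copy, because encode
# may modify its input when a CHECK value is given.
my $copy   = $text;
my $strict = eval { encode('iso-8859-1', $copy, Encode::FB_CROAK) };
print defined $strict ? "encoded cleanly\n" : "encode failed: $@";
```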
The fundamental problem is that the characters represented by ’, “, and ” do not exist in ISO-8859-1. You'll have to decide what it is that you want to do with them.
Some possibilities:
Use cp1252, Microsoft's "extended" version of ISO-8859-1, instead of the real thing. It does include those characters.
Re-encode the entities outside the ISO-8859-1 range (plus &), before converting from utf-8 to ISO-8859-1:
my $toEncode = do { no warnings 'utf8'; "&\x{0100}-\x{10FFFF}" };
$string = HTML::Entities::encode_entities($string, $toEncode);
(The no warnings bit is needed because U+10FFFF is a Unicode noncharacter, which Perl may warn about.)
There are other possibilities. It really depends on what you're trying to accomplish.
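As an illustration of the cp1252 option, the sketch below encodes the characters from the question (written as escapes) and prints the resulting bytes. Whether cp1252 is acceptable depends on what consumes the output, which the question doesn't say.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The characters from the question: right single quote, left and right
# double quotes, e-acute, and left guillemet.
my $text  = "\x{2019}\x{201C}\x{201D}\x{E9}\x{AB}";
my $bytes = encode('cp1252', $text);

print join(' ', map { sprintf '%02X', ord($_) } split //, $bytes), "\n";
# prints 92 93 94 E9 AB
```

The last two bytes are the same as in ISO-8859-1; only the quotes land in the 0x80-0x9F range that cp1252 adds.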