Is Encoding the same as Escaping? - encoding

I am interested in theory on whether Encoding is the same as Escaping? According to Wikipedia:
"an escape character is a character which invokes an alternative interpretation on subsequent characters in a character sequence."
My current thought is that they are different. Escaping is when you place an escape character in front of one or more metacharacters to mark them as behaving differently from how they normally would.
Encoding, on the other hand, is all about transforming data into another form, and upon wanting to read the original content it is decoded back to its original form.

Escaping is a subset of encoding: You only encode certain characters by prefixing a special character instead of transforming (typically all or many) characters to another representation.
Escaping examples:
In an SQL statement: ... WHERE name='O\' Reilly'
In the shell: ls Thirty\ Seconds\ *
Many programming languages: "\"Test\" string" (or """Test""")
Encoding examples:
Replacing < with &lt; when outputting user input in HTML
The character encoding, like UTF-8
Using sequences that do not include the desired character, like \u0061 for a
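A minimal Python sketch of the distinction drawn in the examples above. The helper names (`escape_double_quotes`, `encode_for_html`, `encode_as_utf8`) are made up for illustration and are not from the original answer; only `html.escape` and `str.encode` are real standard-library calls.

```python
import html


def escape_double_quotes(text: str) -> str:
    """Escaping: only the special characters get a backslash prefix."""
    # Escape backslashes first so the added backslashes are not doubled again.
    return text.replace("\\", "\\\\").replace('"', '\\"')


def encode_for_html(text: str) -> str:
    """Encoding: special characters are replaced by another representation."""
    return html.escape(text)


def encode_as_utf8(text: str) -> bytes:
    """Character encoding: every character is mapped to a byte sequence."""
    return text.encode("utf-8")


print(escape_double_quotes('"Test" string'))
print(encode_for_html("<b>"))
print(encode_as_utf8("é"))
print("\u0061" == "a")  # an escape sequence that avoids writing the character itself
```

In the first helper the original characters are still present and merely prefixed; in the other two they are replaced outright.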

They're different, and I think you're getting the distinction correctly.
Encoding is when you transform a logical representation of a text ("logical string", e.g. Unicode) into a well-defined sequence of binary digits ("physical string", e.g. ASCII, UTF-8, UTF-16). Escaping uses a special character (typically the backslash: '\') which initiates a different interpretation of the character(s) following it; escaping is necessary when you need to encode a larger number of symbols with a smaller number of distinct (and finite) bit sequences.

They are indeed different.
You pretty much got it right.

Related

Allowed characters in CSS 'content' property?

I've read that we must use Unicode values inside the content CSS property i.e. \ followed by the special character's hexadecimal number.
But what characters, other than alphanumerics, are actually allowed to be placed as is in the value of content property? (Google has no clue, hence the question.)
The rules for “escaping” characters are in the CSS 2.1 specification, clause 4.1.3 Characters and case. The special rules for quoted strings, as in content property value, are in clause 4.3.7 Strings. Within a quoted string, any character may appear as such, except for the character used to quote the string (" or '), a newline character, or a backslash character \.
The information that you must use \ escapes is thus wrong. You may use them, and may even need to use them if the character encoding of the document containing the style sheet does not let you enter all characters directly. But if the encoding is UTF-8, and is properly declared, then you can write content: '☺ Я Ω ⁴ ®'.
As far as I know, you can insert any Unicode character. (Here's a useful list of Unicode characters and their codes.)
To utilize these codes, you must escape them, like so:
U+27BA Becomes \27BA
Or, alternatively, I think you may just be able to escape the character itself:
content: '\➺';
Source: http://mathiasbynens.be/notes/css-escapes
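As a rough illustration of the code-point escape described above, here is a small Python sketch that builds the CSS form from a character. The function name `css_escape` is hypothetical. It appends a trailing space, which CSS treats as the terminator of a hexadecimal escape, so that a following hex digit is not swallowed.

```python
def css_escape(char: str) -> str:
    """Return a CSS hexadecimal escape for a single character.

    The trailing space terminates the escape so that a following
    character such as 'A' or '1' is not read as part of the code point.
    """
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    return "\\{:X} ".format(ord(char))


# U+27BA becomes \27BA (plus the terminating space)
print("content: '" + css_escape("\u27ba") + "';")
```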

Efficient way to ASCII encode UTF-8

I'm looking for a simple and efficient way to store UTF-8 strings in ASCII-7. By efficient I mean the following:
all ASCII alphanumeric chars in the input should stay the same ASCII alphanumeric chars in the output
the resulting string should be as short as possible
the operation needs to be reversible without any data loss
the resulting ASCII string should be case insensitive
there should be no restriction on the input length
the whole UTF-8 range should be allowed
My first idea was to use Punycode (IDNA) as it fits the first four requirements, but it fails at the last two.
Can anyone recommend an alternative encoding scheme? Even better if there's some code available to look at.
UTF-7, or, slightly less transparent but more widespread, quoted-printable.
all ASCII chars in the input should stay ASCII chars in the output
(Obviously not fully possible as you need at least one character to act as an escape.)
Since ASCII covers the full range of 7-bit values, an encoding scheme that preserves all ASCII characters, is 7-bits long, and encodes the full Unicode range is not possible.
Edited to add:
I think I understand your requirements now. You are looking for a way to encode UTF-8 strings in a seven-bit code, in which, if that encoded string were interpreted as ASCII text, then the case of the alphabetic characters may be arbitrarily modified, and yet the decoded string will be byte-for-byte identical to the original.
If that's the case, then your best bet would probably be just to encode the binary representation of the original as a string of hexadecimal digits. I know you are looking for a more compact representation, but that's a pretty tall order given the other constraints of the system, unless some custom encoding is devised.
Since the hexadecimal representation can encode any arbitrary binary values, it might be possible to shrink the string by compressing them before taking the hex values.
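A minimal sketch of the hex suggestion in Python, with optional compression before taking the hex values. The function names are made up for this example, and `zlib` is just one possible compressor. Note that this does not keep ASCII alphanumerics unchanged, so it trades the first requirement for case insensitivity and unlimited input.

```python
import zlib


def to_ascii_hex(text: str, compress: bool = False) -> str:
    """Encode text as UTF-8 bytes, optionally compress, then hex-encode."""
    raw = text.encode("utf-8")
    if compress:
        raw = zlib.compress(raw)
    return raw.hex()


def from_ascii_hex(encoded: str, compressed: bool = False) -> str:
    """Reverse to_ascii_hex; hex digits may be upper- or lowercase."""
    raw = bytes.fromhex(encoded)
    if compressed:
        raw = zlib.decompress(raw)
    return raw.decode("utf-8")


original = "Ω ☺ test"
encoded = to_ascii_hex(original)
# Changing the case of the encoded string does not affect decoding.
assert from_ascii_hex(encoded.upper()) == original
```

Compression only pays off for longer inputs; for short strings the zlib header makes the result longer than the plain hex form.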
If you're talking about non-standard schemes - MECE
URL encoding or numeric character references are two possible options.
It depends on the distribution of characters in your strings.
Quoted-printable is good for mostly-ASCII strings because there's no overhead except with '=' and control characters. However, non-ASCII characters take an inefficient 6-12 bytes each, so if you have a lot of those, you'll want to consider UTF-7 or Base64 instead.
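The overhead described above can be checked with Python's standard `quopri` module. This is only a sketch: the wrapper names `qp_encode` and `qp_decode` are invented here, and the input is assumed to be UTF-8 encoded before quoting.

```python
import quopri


def qp_encode(text: str) -> str:
    """UTF-8 encode the text, then apply quoted-printable."""
    return quopri.encodestring(text.encode("utf-8")).decode("ascii")


def qp_decode(encoded: str) -> str:
    """Reverse qp_encode."""
    return quopri.decodestring(encoded.encode("ascii")).decode("utf-8")


for sample in ["abc", "a=b", "é", "😀"]:
    encoded = qp_encode(sample)
    # Plain ASCII passes through; each non-ASCII byte costs three characters.
    print(repr(sample), "->", encoded, f"({len(encoded)} chars)")
```

A two-byte UTF-8 character therefore takes 6 characters and a four-byte one takes 12, which is where the 6-12 figure comes from.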
Punycode is used for IDNA, but you can use it outside the restrictions imposed by it
Per se, Punycode doesn't fail your last 2 requirements:
>>> import sys
>>> _ = ("\U0010FFFF"*10000).encode("punycode")
>>> all(chr(c).encode("punycode") for c in range(sys.maxunicode))
True
(for idna, python supplies another homonymous encoding)
obviously, if you don't nameprep the input, the encoded string isn't strictly case-insensitive anymore... but if you supply only lowercase (or if you don't care about the decoded case) you should be good to go

How do I replace literal \xNN with their character in Perl?

I have a Perl script that takes text values from a MySQL table and writes them to a text file. The problem is, when I open the text file for viewing I am getting a lot of hex characters like \x92 and \x93 which stand for single and double quotes, I guess.
I am using DBI->quote function to escape the special chars before writing the values to the text file. I have tried using Encode::Encoder, but with no luck. The character set on both the tables is latin1.
How do I get rid of those hex characters and get the character to show in the text file?
ISO Latin-1 does not define characters in the range 0x80 to 0x9f, so displaying these bytes in hex is expected. Most likely your data is actually encoded in Windows-1252, which is the same as Latin1 except that it defines additional characters (including left/right quotes) in this range.
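To illustrate the Windows-1252 interpretation, here is a short sketch that replaces literal \xNN sequences in a text with the character that byte represents. It is written in Python rather than Perl, the function name `replace_hex_escapes` is hypothetical, and it assumes the data really is Windows-1252 (`cp1252` in Python).

```python
import re


def replace_hex_escapes(text: str, encoding: str = "cp1252") -> str:
    """Replace literal \\xNN sequences with the character for that byte."""

    def decode_match(match: re.Match) -> str:
        byte = bytes([int(match.group(1), 16)])
        # A few bytes are undefined in cp1252; they become U+FFFD instead of raising.
        return byte.decode(encoding, errors="replace")

    return re.sub(r"\\x([0-9A-Fa-f]{2})", decode_match, text)


# \x92 is a right single quote, \x93 and \x94 are left and right double quotes.
print(replace_hex_escapes(r"It\x92s \x93quoted\x94"))
```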
\x92 and \x93 are empty characters in the latin1 character set (see here or here). If you are certain that you are indeed dealing with latin1, you can simply delete them.
It sounds like you need to change the character sets on the tables, or translate the non-latin-1 characters into latin-1 equivalents. I'd prefer the first solution. Get used to Unicode; you're going to have to learn it at some point. :)

Unicode URL decoding

The usual method of URL-encoding a unicode character is to split it into 2 %HH codes. (\u4161 => %41%61)
But, how is unicode distinguished when decoding? How do you know that %41%61 is \u4161 vs. \x41\x61 ("Aa")?
Are 8-bit characters, that require encoding, preceded by %00?
Or, is the point that unicode characters are supposed to be lost/split?
According to Wikipedia:
Current standard
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary data or as character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.
Non-standard implementations
There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The third edition of ECMA-262 still includes an escape(string) function that uses this syntax, but also an encodeURI(uri) function that converts to UTF-8 and percent-encodes each octet.
So, it looks like it's entirely up to the person writing the unencode method... Aren't standards fun?
What I've always done is first UTF-8 encode a Unicode string to make it a series of 8-bit characters before escaping any of those with %HH.
P.S. - I can only hope the non-standard implementations (%uxxxx) are few and far between.
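The "UTF-8 first, then %HH" approach can be sketched with Python's standard `urllib.parse`, whose `quote` and `unquote` use UTF-8 by default. The variable names are only for illustration.

```python
from urllib.parse import quote, unquote

# U+4161 is first encoded as the UTF-8 bytes E4 85 A1, then each byte is escaped.
encoded = quote("\u4161")
print(encoded)                    # %E4%85%A1

# Decoding reverses both steps: unescape to bytes, then decode as UTF-8.
print(unquote(encoded) == "\u4161")

# %41%61 is therefore unambiguous: two one-byte UTF-8 sequences, "A" and "a".
print(unquote("%41%61"))          # Aa
```

Because the bytes are UTF-8, the decoder never has to guess whether %41%61 is one character or two.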
Since URIs were introduced before Unicode was around, or at least in wide use, I imagine this is a very implementation-specific question. UTF-8 encoding your text, then escaping that as usual, sounds like the best idea, since that's completely backwards compatible with any ASCII/ANSI systems in place, though you might get the odd weird character or two.
On the other end, to decode, you'd unescape your text and get a UTF-8 string. If someone using an older system tries to send you some data in ASCII/ANSI, there's no harm done; that's (almost) UTF-8 encoded already.

How can I convert non-ASCII characters encoded in UTF8 to ASCII-equivalent in Perl?

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).
This usually works fine, but every 6 months or so one of the names contains cyrillic, greek or romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I have to follow-up with the customer and ask him for a "latin character version" of his name in order to issue a registration code.
So, is there any Perl module that can detect whether there are such characters and automatically translate them to their closest ASCII representation if necessary?
It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.
I believe you could use Text::Unidecode for this, it is precisely what it tries to do.
In the documentation for Text::Unidecode, under "Caveats", it appears that this phrase is incorrect:
Make sure that the input data really is a utf8 string.
UTF-8 is a variable-length encoding, whereas Text::Unidecode only accepts a fixed-length (two-byte) encoding for each character. So that sentence should read:
Make sure that the input data really is a string of two-byte Unicode characters.
This is also referred to as UCS-2.
If you want to convert strings which really are utf8, you would do it like so:
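use Text::Unidecode;
# utf8::decode converts the UTF-8 bytes to characters in place; it returns false on malformed input
my $decode_status = utf8::decode($input_to_be_converted);
my $converted_string = unidecode($input_to_be_converted);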
If you have to deal with UTF-8 data that are not in the ascii range, your best bet is to change your backend so it doesn't choke on utf-8. How would you go about transliterating kanji signs?
If you get Cyrillic text there is no "closest ASCII representation" for many characters.
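The point about Cyrillic can be seen with a simple stand-in technique: Unicode decomposition followed by dropping everything outside ASCII. This is not what Text::Unidecode does (it uses transliteration tables); it is a Python sketch using only the standard library, and the function names are made up for the example.

```python
import unicodedata


def needs_transliteration(text: str) -> bool:
    """Detect whether the text contains anything outside ASCII."""
    return not text.isascii()


def strip_diacritics(text: str) -> str:
    """Decompose accented letters and drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


for name in ["Ștefan Ionescu", "Подражанская"]:
    print(needs_transliteration(name), repr(strip_diacritics(name)))
```

Romanian letters with diacritics reduce to their base letters, but the Cyrillic name comes back empty, because those letters have no ASCII base form to fall back on.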