Sanitizing HTML - Get Some Unknown Encoding? - encoding

I am using HtmlSanitizer to parse input in .NET Core prevent XSS Injections. HtmlSanitizer implements AngleSharp - I have no idea what Angle Sharp does, but it encodes some characters, like so:
Input:
!##$%^&*()_+{}:"<>?~
Output:
!##$%^&*()_+{}:"<>?~
Note that <, >, and & got encoded as <, >, and &amp, respectively. I have two questions here:
What is this encoding?
(Optional) Is there a way to use AngleSharp, or some other library, to undo it?
Side note - all the harmful stuff gets stripped out as needed, this format change happens on "safe" html anyway, just to point out that I am not undoing any security features of the library so we don't have a long discussion on that.

These strings are HTML encoded. The purpose of html encoding is to prevent XSS, but since I am already stripping any potentially harmful code, it's just overkill in my case. More detail can be found in this answer (quote copied from there):
HTML.Encode() - What/How does it prevent scripting security problems in ASP .NET?
The less-than character (<) is converted to <.
The greater-than character (>) is converted to >.
The ampersand character (&) is converted to &.
The double-quote character (") is converted to ".
Any ASCII code character whose code is greater-than or equal to 0x80
is converted to &#<number>, where
is the ASCII character value.
You can html encode and decode strings in .NET Core using a built in tool, as described here.

Related

When using UTF-8, is it better to reference character for international use using decimal or hex... and why?

When using UTF-8, which character reference is better, or more widely supported worldwide on various browsers... using decimal references or hex references?
UPDATE
For instance, for replacing quotation marks...
" or "
which one is better to use, and why?
All HTML entities use only the ASCII subset, so the fact that you encode your document in UTF-8, as opposed to any other byte oriented encoding which extends ASCII, is unrelated.
Anyway:
When using UTF-8, you can just copy and paste the relevant characters into the document, without references at all. E.g. StackOverflow does not convert this ⫅ to an entity (see the source of this page).
If you prefer using entities, then I would use the hex references purely since this is the way Unicode codepoints are usually written in the charts. References are so widely supported that I do not think that you will head a compatibility problem with neither hex nor decimal references.
There is no functional difference between decimal references and hexadecimal references. Old browsers did not support the latter, but then we are talking about really old browsers like Netscape 4 and IE 4.
Hexadecimal references are usually more handy, because in character code standards and other reference works, characters are referred to by their code numbers in hexadecimal. Using them, you avoid the conversion from hexadecimal to decimal (and thereby may avoid some mistakes).
There is no reason to use either " or " in text. (In attribute values, they, or ", are needed in rare cases.)
This does not depend on the document encoding (UTF-8 or something else), except in the sense that when using UTF-8, you do not need the references (except for the markup-significant characters < and &). UTF-8 lets you enter any character as such, though you might still use references if you find that more comfortable than finding an editor that lets you enter the characters themselves.

Flex(lexer) support for unicode

I am wondering if the newest version of flex supports unicode?
If so, how can use patterns to match Chinese characters?
More:
Use regular expression to match ANY Chinese character in utf-8 encoding
At the moment, flex only generates 8-bit scanners which basically limits you to use UTF-8. So if you have a pattern:
肖晗 { printf ("xiaohan\n"); }
it will work as expected, as the sequence of bytes in the pattern and in the input will be the same. What's more difficult is character classes. If you want to match either the character 肖 or 晗, you can't write:
[肖晗] { printf ("xiaohan/2\n"); }
because this will match each of the six bytes 0xe8, 0x82, 0x96, 0xe6, 0x99 and 0x97, which in practice means that if you supply 肖晗 as the input, the pattern will match six times. So in this simple case, you have to rewrite the pattern to (肖|晗).
For ranges, Hans Aberg has written a tool in Haskell that transforms these into 8-bit patterns:
Unicode> urToRegU8 0 0xFFFF
[\0-\x7F]|[\xC2-\xDF][\x80-\xBF]|(\xE0[\xA0-\xBF]|[\xE1-\xEF][\x80-\xBF])[\x80-\xBF]
Unicode> urToRegU32 0x00010000 0x001FFFFF
\0[\x01-\x1F][\0-\xFF][\0-\xFF]
Unicode> urToRegU32L 0x00010000 0x001FFFFF
[\x01-\x1F][\0-\xFF][\0-\xFF]\0
This isn't pretty, but it should work.
Flex does not support Unicode. However, Flex supports "8 bit clean" binary input. Therefore you can write lexical patterns which match UTF-8. You can use these patterns in specific lexical areas of the input language, for instance identifiers, comments or string literals.
This will work for well for typical programming languages, where you may be able to assert to the users of your implementation that the source language is written in ASCII/UTF-8 (and no other encoding is supported, period).
This approach won't work if your scanner must process text that can be in any encoding. It also won't work (very well) if you need to express lexical rules specifically for Unicode elements. I.e. you need Unicode characters and Unicode regexes in the scanner itself.
The idea is that you can recognize a pattern which includes UTF-8 bytes using a lex rule, (and then perhaps take the yytext, and convert it out of UTF-8 or at least validate it.)
For a working example, see the source code of the TXR language, in particular this file: http://www.kylheku.com/cgit/txr/tree/parser.l
Scroll down to this section:
ASC [\x00-\x7f]
ASCN [\x00-\t\v-\x7f]
U [\x80-\xbf]
U2 [\xc2-\xdf]
U3 [\xe0-\xef]
U4 [\xf0-\xf4]
UANY {ASC}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UANYN {ASCN}|{U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
UONLY {U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
As you can see, we can define patterns to match ASCII characters as well as UTF-8 start and continuation bytes. UTF-8 is a lexical notation, and this is a lexical analyzer generator, so ... no problem!
Some explanations: The UANY means match any character, single-byte ASCII or multi-byte UTF-8. UANYN means like UANY but no not match the newline. This is useful for tokens that do not break across lines, like say a comment from # to the end of the line, containing international text. UONLY means match only a UTF-8 extended character, not an ASCII one. This is useful for writing a lex rule which needs to exclude certain specific ASCII characters (not just newline) but all extended characters are okay.
DISCLAIMER: Note that the scanner's rules use a function called utf8_dup_from to convert the yytext to wide character strings containing Unicode codepoints. That function is robust; it detects problems like overlong sequences and invalid bytes and properly handles them. I.e. this program is not relying on these lex rules to do the validation and conversion, just to do the basic lexical recognition. These rules will recognize an overlong form (like an ASCII code encoded using several bytes) as valid syntax, but the conversion function will treat them properly. In any case, I don't expect UTF-8 related security issues in the program source code, since you have to trust source code to be running it anyway (but data handled by the program may not be trusted!) If you're writing a scanner for untrusted UTF-8 data, take care!
I am wondering if the newest version of flex supports unicode?
If so, how can use patterns to match Chinese characters?
To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, you could use the RE/flex lexical analyzer for C++.
RE/flex safely supports the full Unicode standard and accepts UTF-8, UTF-16, and UTF-32 input files without requiring UTF-8 hacks (that can't even support UTF-16/32 input and handle UTF BOM.)
Also, UTF-8 hacks with Flex don't allow you to write Unicode regular expressions such as [肖晗] that are fully supported in RE/flex.
It works seamlessly with Bison to build lexers and parsers.
In fact, with RE/flex we can write any Unicode patterns as UTF-8-based regular expressions in lexer .l specifications, such as:
%option flex unicode
%%
[肖晗] { printf ("xiaohan/2\n"); }
%%
This generates a lexer that scans UTF-8, UTF-16, and UTF-32 files automatically. As per UTF standardization, for UTF-16/32 input a UTF BOM is expected in the input, while an UTF-8 BOM is optional.
We can use global %option unicode to enable Unicode and %option flex to specify Flex specifications. A local modifier (?u:) can be used to restrict Unicode to a single pattern (so everything else is still ASCII/8-bit as in Flex):
%option flex
%%
(?u:[肖晗]) { printf ("xiaohan/2\n"); }
(?u:\p{Han}) { printf ("Han character %s\n", yytext); }
. { printf ("8-bit character %d\n", yytext[0]); }
%%
Option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide char length), and echo(). RE/flex method calls are cleaner IMHO, and include wide char operations.

How did SourceForge maim this Unicode character?

A little encoding puzzle for you.
A comment on a SourceForge tracker item contains the character U+2014, EM DASH, which is rendered by the web interface as — like it should.
In the XML export, however, it shows up as:
—
Decoding the entities, that results in these code points:
U+00E2 U+20AC U+201D
I.e. the characters —. The XML should have been —, the decimal representation of 0x2014, so this is probably a bug in the SF.net exporter.
Now I'm looking to reverse the process, but I can't find a way to get the above output from this Unicode character, no matter what erroneous encoding/decoding sequence I try. Any idea what happened here and how to reverse the process?
The the XML output is incorrectly been encoded using CP1252. To revert this, convert — to bytes using CP1252 encoding and then convert those bytes back to string/char using UTF-8 encoding.
Java based evidence:
String s = "—";
System.out.println(new String(s.getBytes("CP1252"), "UTF-8")); // —
Note that this assumes that the stdout console uses by itself UTF-8 to display the character.
In .Net, Encoding.UTF8.GetString(Encoding.GetEncoding(1252).GetBytes("—")) returns —.
SourceForge converted it to UTF8, interpreted the each of the bytes as characters in CP1252, then saved the characters as three separate entities using the actual Unicode codepoints for those characters.

Is Encoding the same as Escaping?

I am interested in theory on whether Encoding is the same as Escaping? According to Wikipedia
an escape character is a character
which invokes an alternative
interpretation on subsequent
characters in a character sequence.
My current thought is that they are different. Escaping is when you place an escape charater in front of a metacharacter(s) to mark it/them as to behave differently than what they would have normally.
Encoding, on the other hand, is all about transforming data into another form, and upon wanting to read the original content it is decoded back to its original form.
Escaping is a subset of encoding: You only encode certain characters by prefixing a special character instead of transferring (typically all or many) characters to another representation.
Escaping examples:
In an SQL statement: ... WHERE name='O\' Reilly'
In the shell: ls Thirty\ Seconds\ *
Many programming languages: "\"Test\" string (or """Test""")
Encoding examples:
Replacing < with < when outputting user input in HTML
The character encoding, like UTF-8
Using sequences that do not include the desired character, like \u0061 for a
They're different, and I think you're getting the distinction correctly.
Encoding is when you transform between a logical representation of a text ("logical string", e.g. Unicode) into a well-defined sequence of binary digits ("physical string", e.g. ASCII, UTF-8, UTF-16). Escaping is a special character (typically the backslash: '\') which initiates a different interpretation of the character(s) following the escape character; escaping is necessary when you need to encode a larger number of symbols to a smaller number of distinct (and finite) bit sequences.
They are indeed different.
You pretty much got it right.

Unicode URL decoding

The usual method of URL-encoding a unicode character is to split it into 2 %HH codes. (\u4161 => %41%61)
But, how is unicode distinguished when decoding? How do you know that %41%61 is \u4161 vs. \x41\x61 ("Aa")?
Are 8-bit characters, that require encoding, preceded by %00?
Or, is the point that unicode characters are supposed to be lost/split?
According to Wikipedia:
Current standard
The generic URI syntax mandates that new URI schemes
that provide for the representation of
character data in a URI must, in
effect, represent characters from the
unreserved set without translation,
and should convert all other
characters to bytes according to
UTF-8, and then percent-encode those
values. This requirement was
introduced in January 2005 with the
publication of RFC 3986. URI schemes
introduced before this date are not
affected.
Not addressed by the current
specification is what to do with
encoded character data. For example,
in computers, character data manifests
in encoded form, at some level, and
thus could be treated as either binary
data or as character data when being
mapped to URI characters. Presumably,
it is up to the URI scheme
specifications to account for this
possibility and require one or the
other, but in practice, few, if any,
actually do.
Non-standard implementations
There exists a non-standard encoding
for Unicode characters: %uxxxx, where
xxxx is a Unicode value represented as
four hexadecimal digits. This behavior
is not specified by any RFC and has
been rejected by the W3C. The third
edition of ECMA-262 still includes an
escape(string) function that uses this
syntax, but also an encodeURI(uri)
function that converts to UTF-8 and
percent-encodes each octet.
So, it looks like its entirely up to the person writing the unencode method...Aren't standards fun?
What I've always done is first UTF-8 encode a Unicode string to make it a series of 8-bit characters before escaping any of those with %HH.
P.S. - I can only hope the non-standard implementations (%uxxxx) are few and far between.
Since URI's were introduced before unicode was around, or atleast in wide use, I imagine this is a very implementation specific question. UTF-8 encoding your text, then escaping that per normal sounds like the best idea, since that's completely backwards compatible with any ASCII/ANSI systems in place, though you might get the odd wierd character or two.
On the other end, to decode, you'd unescape your text, and get a UTF-8 string. If someone using an older system tries to send yours some data in ASCII/ANSI, there's no harm done, that's (almost) UTF-8 encoded already.