.Net XML Serialization and Escaped or Encoded characters - xml-serialization

I'm using XML Serialization heavily in a web service (the contracts pass complex types as params). Recently I noticed that the .Net XML Serialization engine is escaping some of the well known 5 reserved characters that must be escaped when included within an element (<, >, &, ' and "). My first reaction was "good old .Net, always looking out for me".
But then I started experimenting and noticed it is only escaping <, > and &, and for some reason not the apostrophe and double quote. For example, if I return this literal string in a field within a complex type from my service:
Bad:<>&'":Data
This is what is transferred over the wire (as seen from Fiddler):
Bad:&lt;&gt;&amp;'":Data
Has anyone run into this, or does anyone understand why it happens? Is the serializer simply overlooking them or is there a reason for this? As I understand it, ' and " are not valid within an XML element according to the spec.

According to the XML spec, for regular content and markup:
& always needs to be escaped as &amp; because it's the escape character
< always needs to be escaped as &lt; since it determines the start of an element. It even has to be escaped within attributes as a safety and to make writing parser error detection simpler.
> does not need to be escaped as &gt; but often is for symmetry with <
' needs to be escaped as &apos; only if in an attribute delimited by '
" needs to be escaped as &quot; only if in an attribute delimited by "
Inside of processing instructions, comments and CDATA sections, the rules change some, but the details are in the 2.4 Character Data and Markup portion of the spec.
Your serializer is trying to do you a favor by keeping the file somewhat human-readable.
(Each of the above may also be escaped using their numeric equivalents.)
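As a rough illustration of those rules (using Python's standard library rather than the .NET serializer, purely as an assumption for demonstration), element content only needs &, < and > escaped, while quotes matter only inside attribute values:

```python
from xml.sax.saxutils import escape, quoteattr

# The literal string from the question.
raw = "Bad:<>&'\":Data"

# Element content: only &, < and > are escaped; quotes are left alone.
element_text = escape(raw)

# Attribute values: quoteattr picks a delimiter and escapes only what clashes with it.
attr_apos_only = quoteattr("O'Brien")
attr_both = quoteattr("O'Brien said \"hi\"")

print(element_text)
print(attr_apos_only)
print(attr_both)
```

The element text comes out in the same form as the wire output shown in the question, with the apostrophe and double quote untouched.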

XMLSpy says you're wrong. The following is well-formed XML:
<root>
<data>'"</data>
</root>
Aside from "argument by reference to XMLSpy", a better argument is that the XML Serializer has been out in the wild for over seven years. In this time, I guarantee someone has tried to serialize "O'Brien" in a Name property. This bug would have been noticed by now.
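The same point can be checked without XMLSpy. The sketch below uses Python's xml.etree.ElementTree as a stand-in parser (an assumption; any conforming XML parser should behave the same way) to parse the document above and to serialize the string from the question:

```python
import xml.etree.ElementTree as ET

# The document from the answer: unescaped ' and " inside element content.
doc = "<root>\n<data>'\"</data>\n</root>"
root = ET.fromstring(doc)          # raises ParseError if the XML is not well-formed
parsed_text = root.find("data").text

# Serializing the question's string escapes only <, > and &.
field = ET.Element("data")
field.text = "Bad:<>&'\":Data"
serialized = ET.tostring(field, encoding="unicode")

print(repr(parsed_text))
print(serialized)
```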


change '#' key in freemarker templates

In order to use if statements in FreeMarker templates, the following syntax is used:
[#if numberCoupons <= 1]
[#assign couponsText = 'coupon']
[/#if]
Is there a way to replace the '#' character with something else? I am trying to integrate it with Drools (a Java-based rule engine), where the '#' character marks the start of a comment, so the formatting breaks.
There isn't anything for that out of the box (it uses a JavaCC-generated parser, which is static). But you can write a TemplateLoader that just delegates to another TemplateLoader, but replaces the Reader with a FilterReader that replaces [% and [/% and [%-- and --%] with [#, etc. Then you can use % instead of # in the FreeMarker tags. (It's somewhat confusing though, as error messages will still use #, etc.)
As @ddekany wrote, you can write code that transforms the template so it does not need the pound sign. But note that the replacement syntax can clash with HTML or XML (and similar) tags, at least from an editor's perspective.
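The substitution itself is simple. Below is a minimal sketch of just the text transformation, written in Python for brevity; it does not use FreeMarker's TemplateLoader or FilterReader API, and the function name percent_to_hash is hypothetical. A real implementation would apply the same replacements inside the Reader that the wrapping TemplateLoader returns.

```python
def percent_to_hash(template_text: str) -> str:
    """Rewrite '%'-style tags into FreeMarker square-bracket '#' tags."""
    # Order matters: handle the closing-tag and comment-end forms first,
    # then the generic "[%" prefix (which also covers "[%--").
    replacements = [
        ("[/%", "[/#"),   # closing tags
        ("--%]", "--]"),  # comment end
        ("[%", "[#"),     # opening tags and comment start
    ]
    for old, new in replacements:
        template_text = template_text.replace(old, new)
    return template_text


source = "[%if n <= 1][%assign t = 'coupon'][/%if][%-- note --%]"
converted = percent_to_hash(source)
print(converted)
```

A plain string replace like this would also rewrite a literal "[%" appearing in template output, so it only suits templates that never contain those sequences as text.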

Sanitizing HTML - Get Some Unknown Encoding?

I am using HtmlSanitizer to parse input in .NET Core to prevent XSS injections. HtmlSanitizer uses AngleSharp - I have no idea what AngleSharp does, but it encodes some characters, like so:
Input:
!##$%^&*()_+{}:"<>?~
Output:
!##$%^&amp;*()_+{}:"&lt;&gt;?~
Note that <, >, and & got encoded as &lt;, &gt;, and &amp;, respectively. I have two questions here:
What is this encoding?
(Optional) Is there a way to use AngleSharp, or some other library, to undo it?
Side note - all the harmful stuff gets stripped out as needed, this format change happens on "safe" html anyway, just to point out that I am not undoing any security features of the library so we don't have a long discussion on that.
These strings are HTML encoded. The purpose of html encoding is to prevent XSS, but since I am already stripping any potentially harmful code, it's just overkill in my case. More detail can be found in this answer (quote copied from there):
HTML.Encode() - What/How does it prevent scripting security problems in ASP .NET?
The less-than character (<) is converted to &lt;.
The greater-than character (>) is converted to &gt;.
The ampersand character (&) is converted to &amp;.
The double-quote character (") is converted to &quot;.
Any ASCII code character whose code is greater-than or equal to 0x80
is converted to &#<number>, where <number>
is the ASCII character value.
You can HTML-encode and decode strings in .NET Core with the built-in System.Net.WebUtility.HtmlEncode and WebUtility.HtmlDecode methods.
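To make the round trip concrete, here is a small sketch using Python's standard html module as a stand-in for the .NET helpers (an assumption for illustration only; the sanitizer itself is not involved). It reproduces the output from the question and then undoes it:

```python
import html

# The input string from the question.
raw = "!##$%^&*()_+{}:\"<>?~"

# quote=False leaves the double quote alone, matching the output in the question.
encoded = html.escape(raw, quote=False)

# Decoding restores the original characters.
decoded = html.unescape(encoded)

print(encoded)
print(decoded == raw)
```

Decoding is only appropriate when the result is treated as plain text; putting the decoded string back into a page as markup would need encoding again.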

Differentiate properly escaped HTML metacharacters from improperly escaped ones

I'm working on a replacement for a desktop Java app, a single page app written in Scala and Lift.
I have a situation where some of the data in the database uses HTML metacharacters properly, such as Unicode escape sequences for accented characters in non-English names. At the same time, other data uses HTML metacharacters improperly, such as bare ampersands in the names of organizations.
Good (don't escape): Universita\u0027
Bad (needs escape): Bob & Jim
How do I determine whether or not the data needs to be fixed before I send it to the client?
There are two ways to approach this. One is a function that takes a string and returns the index of any improperly escaped HTML metacharacters (which I can fix myself). Alternately it could be a function that takes a string and returns a string with the improperly escaped metacharacters fixed, and leaves the proper ones alone.
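Both approaches can be sketched with one regular expression. The example below is in Python rather than Scala and handles ampersands only; it assumes that any ampersand already followed by a well-formed named or numeric reference is intentional, and it does not check that a named reference actually exists. The function names are hypothetical.

```python
import re

# An ampersand that does NOT start something shaped like "&name;", "&#123;" or "&#x1F;".
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def bare_ampersand_indexes(text: str) -> list[int]:
    """Return the index of every ampersand that looks unescaped."""
    return [m.start() for m in _BARE_AMP.finditer(text)]


def escape_bare_ampersands(text: str) -> str:
    """Escape unescaped ampersands and leave existing references alone."""
    return _BARE_AMP.sub("&amp;", text)


print(bare_ampersand_indexes("Bob & Jim"))
print(escape_bare_ampersands("Bob & Jim"))
print(escape_bare_ampersands("Bob &amp; Jim"))
```

Stray < and > characters would need a separate decision, since a regular expression cannot reliably tell intended markup from unescaped text.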

Regular expression to prevent SQL injection

I know I have to escape single quotes, but I was just wondering if there's any other character, or text string I should guard against
I'm working with mysql and h2 database...
If you check the MySQL function mysql_real_escape_string, which the higher-level language wrappers rely on, you'll see that the list of special characters is longer than just the single quote:
\
'
"
NUL (ASCII 0)
\n
\r
Control+Z
The upper language wrappers like the PHP one may also protect the strings from malformed unicode characters which may end up as a quote.
The conclusion is: do not escape strings yourself, especially with hard-to-debug, hard-to-read, hard-to-understand regular expressions. Use the built-in functions or use parameterized SQL queries (where parameters cannot contain anything interpreted as SQL by the engine). This is also stated in the h2 documentation: h2 db sql injection protection.
A simple solution for the problem above is to use a prepared statement:
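The snippet that originally followed is not preserved here. As a stand-in, the sketch below shows the idea of a parameterized query using Python's built-in sqlite3 module instead of MySQL or h2 (an assumption made so the example is self-contained); the users table and find_user helper are hypothetical.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, name: str) -> list[tuple]:
    # The "?" placeholder sends the value separately from the SQL text,
    # so quotes in the input are never parsed as SQL.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("O'Brien",), ("alice",)])

print(find_user(conn, "O'Brien"))        # a name containing a quote works as-is
print(find_user(conn, "' OR '1'='1"))    # an injection attempt is just an unmatched name
```

In Java against MySQL or h2, the equivalent would be a PreparedStatement with ? placeholders.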
This will somewhat depend on what type of information you need to obtain from the user. If you are only looking for simple text, then you might as well ignore all special characters that a user might input (if it's not too much trouble)--why allow the user to input characters that don't make sense in your query?
Some languages have functions that will take care of this for you. For example, PHP has the mysql_real_escape_string() function (http://php.net/manual/en/function.mysql-real-escape-string.php).
You are correct that single quotes (') in user input are a no-no; but double quotes (") and backslashes (\) should also definitely be escaped (see the above link for which characters the PHP function escapes, since those are the most important and basic ones).
Hope this is at least a good start!

What is &amp; used for

Is there any difference in behaviour between the URLs below?
I don't know why the &amp; is inserted. Does it make any difference?
www.testurl.com/test?param1=test&amp;current=true
versus
www.testurl.com/test?param1=test&current=true
& is HTML for "Start of a character reference".
&amp; is the character reference for "An ampersand".
&current; is not a standard character reference and so is an error (browsers may try to perform error recovery but you should not depend on this).
If you used a character reference for a real character (e.g. &trade;) then it (™) would appear in the URL instead of the string you wanted.
(Note that depending on the version of HTML you use, you may have to end a character reference with a ;, which is why &trade= will be treated as ™. HTML 4 allows it to be omitted if the next character is a non-word character (such as =) but some browsers (Hello Internet Explorer) have issues with this).
HTML doesn't recognize a bare & but it will recognize &amp; because it is equal to & in HTML.
I looked over this post someone had made: http://www.webmasterworld.com/forum21/8851.htm
My Source: http://htmlhelp.com/tools/validator/problems.html#amp
Another common error occurs when including a URL which contains an
ampersand ("&"):
This is invalid:
<a href="foo.cgi?chapter=1&section=2&copy=3&lang=en">
Explanation:
This example generates an error for "unknown entity section" because
the "&" is assumed to begin an entity reference. Browsers often
recover safely from this kind of error, but real problems do occur in
some cases. In this example, many browsers correctly convert &copy=3
to ©=3, which may cause the link to fail. Since &lang; is the HTML
entity for the left-pointing angle bracket, some browsers also convert
&lang=en to 〈=en. And one old browser even finds the entity &sect;,
converting &section=2 to §ion=2.
So the goal here is to avoid problems when you are trying to validate your website. So you should be replacing your ampersands with &amp; when writing a URL in your markup.
Note that replacing & with &amp; is only done when writing the URL in
HTML, where "&" is a special character (along with "<" and ">"). When
writing the same URL in a plain text email message or in the location
bar of your browser, you would use "&" and not "&amp;". With HTML, the
browser translates "&amp;" to "&" so the Web server would only see "&"
and not "&amp;" in the query string of the request.
Hope this helps : )
That's a great example. When &current is parsed into a text node it is converted to ¤t. When parsed into an attribute value, it is parsed as &current.
If you want &current in a text node, you should write &amp;current in your markup.
The gory details are in the HTML5 parsing spec - Named Character Reference State
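The text-node behaviour can be reproduced with Python's html module, whose unescape function follows the HTML5 named character reference rules. This is only an approximation of a browser (it does not model the attribute-value exception mentioned above), and the URL is the one from the question:

```python
import html

url = "www.testurl.com/test?param1=test&current=true"

# What to write in markup: the ampersand becomes &amp;.
in_markup = html.escape(url)

# Decoding the escaped form gives back the intended URL.
round_trip = html.unescape(in_markup)

# Decoding the unescaped form: "&curren" matches a legacy entity (¤), so the name is eaten.
misparsed = html.unescape(url)

print(in_markup)
print(round_trip)
print(misparsed)
```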
If you are building the URL as a string in JavaScript, use a plain & (no HTML escaping is needed there):
let linkGoogle = 'https://www.google.com/maps/dir/?api=1';
let origin = '&origin=' + locations[0][1] + ',' + locations[0][2];
aNav.href = linkGoogle + origin;