Why do HTML forms convert "&amp;" into "&"? - forms

I've been trying to figure out why my carefully prepared "&amp;" phrases were being turned into plain "&" phrases. I knew it was happening, but I didn't know if it was happening when they were being submitted as part of an SQL query or somewhere else. This is quite tricky, since you have to View Source to see the difference!
I eventually discovered where it was happening - in the HTML form that was being submitted (method="post"). I had a <select> where one of the options contained the phrase:
<option value="sticks &amp; stones">sticks &amp; stones</option>
I found that when the form was submitted, the value had been changed to "sticks & stones", with the result that when the value was submitted in a database query, it failed to find any results.
I have further experimented and find this happens with text inputs and hidden inputs too.
My question is: WHY????? It seems a particularly silly thing to do.

Escape sequences have to be interpreted without knowing the author's intent. If I wanted a value like Foo " Bar, I couldn't say value="Foo " Bar" because the quotes wouldn't match. Instead, I'd have to use value="Foo &quot; Bar". But then what if I literally want Foo &quot; Bar? That's where &amp; comes in. But to avoid ambiguity, the system has to always translate escape sequences. So if you want a literal &amp;, you have to be explicit about it, like sticks &amp;amp; stones.
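To see the decoding happen outside a browser, here is a minimal sketch using Python's standard html.parser module. The class name OptionValueCollector and the helper option_values are made up for this illustration; the point is only that a parser hands back attribute values with character references already translated, so a literal &amp; has to be written as &amp;amp; in the markup.

```python
from html.parser import HTMLParser


class OptionValueCollector(HTMLParser):
    """Hypothetical helper: collects the decoded value of every <option>."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            # attrs arrive with character references already decoded
            self.values.extend(value for name, value in attrs if name == "value")


def option_values(markup):
    collector = OptionValueCollector()
    collector.feed(markup)
    collector.close()
    return collector.values


# One level of escaping yields a plain ampersand after parsing.
single = option_values('<option value="sticks &amp; stones">')
# Two levels of escaping are needed to end up with the text "&amp;".
double = option_values('<option value="sticks &amp;amp; stones">')
print(single)
print(double)
```

So if the database really stores the five characters &amp;, the form value has to be escaped once more when the page is generated.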

Related

String or string literal fails

I am trying something simple to follow exercises in a book. For example, typing “hello” at the prompt in the interactions window.
I get the following error:
“a”: unbound identifier in module in: “a”
I believe simple things like this worked before, so I want to know what to check to resolve this problem.
Your problem is the quotation marks, a very common problem. Look:
“a”
The quotation marks look italic.
They should be like this: "a".
Copy and paste this into your REPL and press Return (this time it will work!):
"hello"
This is written with the right quotation marks "" and not “” .
If you copy and paste from PDF books, sometimes these wrong quotation marks appear as a result (like Realm of Racket - I recently had that problem when copying from it). (Quotation marks from MS Word when using the Times New Roman font are also of this strange type, and in some programming blogs, too, the quotation marks are spoiled when you copy and paste out of them.)
How to avoid it?: Type the examples manually into the DrRacket editor. - problem solved! Plus you learn the things anyway much better if you type them yourself - ("the hard way" approach ;) ).
And you learn, that even copy pasting is a skill which one sometimes has to learn anew - welcome to programming (the long road of learning) :D .
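If you do paste code and want to check it before running it, a small script can point at the offending characters. This is only a sketch in Python (not Racket); the function name find_typographic_quotes and the set of characters it checks are assumptions chosen for this example.

```python
import unicodedata

# Curly quotes that commonly sneak in from PDFs and word processors.
TYPOGRAPHIC_QUOTES = "\u201c\u201d\u2018\u2019"


def find_typographic_quotes(source):
    """Return (line, column, character name) for each curly quote in source."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in TYPOGRAPHIC_QUOTES:
                hits.append((line_no, col_no, unicodedata.name(ch)))
    return hits


pasted = "\u201chello\u201d"  # what the PDF gave you
typed = '"hello"'             # what the REPL expects
print(find_typographic_quotes(pasted))
print(find_typographic_quotes(typed))
```

The pasted version reports two hits, the typed version none.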
Remember to enter the quotes " around the hello too.
"hello" is a string which contains the text hello
hello is the name of a variable (an identifier),
so if you haven't defined the name hello you get an
error saying that the identifier is undefined

What is &amp used for

Is there any difference in behaviour between the URLs below?
I don't know why the &amp; is inserted; does it make any difference?
www.testurl.com/test?param1=test&amp;current=true
versus
www.testurl.com/test?param1=test&current=true
& is HTML for "Start of a character reference".
&amp; is the character reference for "An ampersand".
&current; is not a standard character reference and so is an error (browsers may try to perform error recovery but you should not depend on this).
If you used a character reference for a real character (e.g. &trade;) then it (™) would appear in the URL instead of the string you wanted.
(Note that depending on the version of HTML you use, you may have to end a character reference with a ;, which is why &trade= will be treated as ™. HTML 4 allows it to be omitted if the next character is a non-word character (such as =) but some browsers (Hello Internet Explorer) have issues with this).
HTML doesn't recognize the bare & but it will recognize &amp; because it is equal to & in HTML
I looked over this post someone had made: http://www.webmasterworld.com/forum21/8851.htm
My Source: http://htmlhelp.com/tools/validator/problems.html#amp
Another common error occurs when including a URL which contains an
ampersand ("&"):
This is invalid:
<a href="foo.cgi?chapter=1&section=2&copy=3&lang=en">
Explanation:
This example generates an error for "unknown entity section" because
the "&" is assumed to begin an entity reference. Browsers often
recover safely from this kind of error, but real problems do occur in
some cases. In this example, many browsers correctly convert &copy=3
to ©=3, which may cause the link to fail. Since &lang; is the HTML
entity for the left-pointing angle bracket, some browsers also convert
&lang=en to 〈=en. And one old browser even finds the entity &sect;,
converting &section=2 to §ion=2.
So the goal here is to avoid problems when you are trying to validate your website. So you should be replacing your ampersands with &amp; when writing a URL in your markup.
Note that replacing & with &amp; is only done when writing the URL in
HTML, where "&" is a special character (along with "<" and ">"). When
writing the same URL in a plain text email message or in the location
bar of your browser, you would use "&" and not "&amp;". With HTML, the
browser translates "&amp;" to "&" so the Web server would only see "&"
and not "&amp;" in the query string of the request.
Hope this helps : )
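As a rough illustration of that last note, here is a sketch in Python using only the standard library. It escapes the example URL for use in an href attribute, then undoes the escaping the way a conforming HTML parser would, and shows that the query string the server ends up with contains plain & separators. The variable names are made up for the example.

```python
import html
from urllib.parse import parse_qs, urlsplit

url = "foo.cgi?chapter=1&section=2&copy=3&lang=en"

# Escape the URL before writing it into an HTML attribute value.
href_markup = '<a href="{}">link</a>'.format(html.escape(url, quote=True))
print(href_markup)

# A parser decodes &amp; back to &, so the request carries the original URL.
decoded_url = html.unescape(html.escape(url, quote=True))
params = parse_qs(urlsplit(decoded_url).query)
print(params)
```

All four parameters arrive intact, which is not guaranteed when the raw URL is pasted into the markup unescaped.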
That's a great example. When &current is parsed into a text node it is converted to ¤t. When parsed into an attribute value, it is parsed as &current.
If you want &current in a text node, you should write &amp;current in your markup.
The gory details are in the HTML5 parsing spec - Named Character Reference State
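The text-node behaviour can be reproduced with Python's html.unescape, which is documented as following the HTML5 rules for character references in text (it does not model the separate attribute-value rule). This is just a sketch; the variable names are invented for the example.

```python
import html

raw = "test?param1=test&current=true"

# Unescaped: "&curren" is read as the currency sign, leaving "t=true" behind.
misparsed = html.unescape(raw)
print(misparsed)

# Escaped first: "&amp;" decodes back to a plain ampersand and nothing is lost.
roundtrip = html.unescape(html.escape(raw))
print(roundtrip)
```

The first result has lost the parameter name; the second is identical to the original string.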
If you're building the URL as a string of characters, do it like this:
let linkGoogle = 'https://www.google.com/maps/dir/?api=1';
let origin = '&origin=' + locations[0][1] + ',' + locations[0][2];
aNav.href = linkGoogle + origin;

Diamonds with question marks

I'm getting these little diamonds with question marks in them in my HTML attributes when I present data from my database. I'm using EPiServer and a few custom properties.
This is the information I've gathered,
I save my data as an XML document, since I use custom EPiServer properties which need more than one defined value. This is saved as UTF-8.
It's only attributes in element tags which have this problem, such as align=left becomes align=�left�. There is no " character there, but I get the diamonds anyway.
If I use " outside an element, it works and shows correctly.
Any clues?
This is a problem with your character encoding scheme.
I would recommend reading this article, where (close to the bottom of it), he shows you why you get that little diamond with question marks.
Has the XML been touched by any of the Microsoft Office suite products?
These are notorious for switching vanilla quotes (") x'22' to smart quotes x'93' and x'94' (“”).
Also singlequote (') is often converted from x'27' to x'91' and x'92' pairs (‘’).
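A quick way to check whether this is what happened: bytes 0x93 and 0x94 are not valid on their own in UTF-8, so a UTF-8 decoder substitutes the replacement character (the diamond with a question mark), while decoding the same bytes as Windows-1252 gives the smart quotes. A minimal Python sketch, with a made-up byte string standing in for the stored attribute:

```python
# Hypothetical stored bytes: smart quotes saved in Windows-1252.
raw = b"align=\x93left\x94"

# Read as UTF-8, the stray bytes become U+FFFD, shown as a diamond.
as_utf8 = raw.decode("utf-8", errors="replace")
print(as_utf8)

# Read as Windows-1252, they are the curly quotes Office inserted.
as_cp1252 = raw.decode("cp1252")
print(as_cp1252)
```

If the second decoding looks right, the data was written in Windows-1252 and needs converting (or the quotes need replacing) before it is served as UTF-8.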

How do I protect against cross-site scripting?

I am using PHP and MySQL with Smarty, and I have places where users can put comments, etc. I've already escaped characters before inserting into the database to guard against SQL injection. What else do I need to do?
XSS is mostly about the HTML-escaping(*). Any time you take a string of plain text and put it into an HTML page, whether that text is from the database, directly from user input, from a file, or from somewhere else entirely, you need to escape it.
The minimal HTML escape is to convert all the & symbols to &amp; and all the < symbols to &lt;. When you're putting something into an attribute value you would also need to escape the quote character being used to delimit the attribute, usually " to &quot;. It does no harm to always escape both quotes (" and the single quote apostrophe '), and some people also escape > to &gt;, though this is only necessary for one corner case in XHTML.
Any good web-oriented language should provide a function to do this for you. For example in PHP it's htmlspecialchars():
<p> Hello, <?php echo htmlspecialchars($name); ?>! </p>
and in Smarty templates it's the escape modifier:
<p> Hello, {$name|escape:'html'}! </p>
Really, since HTML-escaping is what you want 95% of the time (it's relatively rare to want to allow raw HTML markup to be included), this should have been the default. Newer templating languages have learned that making HTML-escaping opt-in is a huge mistake that causes endless XSS holes, so they HTML-escape by default.
You can make Smarty behave like this by changing the default modifiers to html. (Don't use htmlall as they suggest there unless you really know what you're doing, or it'll likely screw up all your non-ASCII characters.)
Whatever you do, don't fall into the common PHP mistake of HTML-escaping or “sanitising” for HTML on the input, before it gets processed or put in the database. This is the wrong place to be performing an output-stage encoding and will give you all sort of problems. If you want to validate your input to make sure it's what the particular application expects, then fine, but weeding out or escaping “special” characters at this stage is inappropriate.
*: Other aspects of XSS are present when (a) you actually want to allow users to post HTML, in which case you have to whittle it down to acceptable elements and attributes, which is a complicated process usually done by a library like HTML Purifier, and even then there have been holes. Alternative, simpler markup schemes may help. And (b) when you allow users to upload files, which is something very difficult to make secure.
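The escape-at-output idea in the answer above is not specific to PHP. Here is the same thing as a small Python sketch using the standard html.escape; render_comment is a hypothetical helper, and the stored comment is kept exactly as the user typed it until the moment it is written into the page.

```python
import html


def render_comment(name, comment):
    """Hypothetical helper: escape plain text only when building the HTML."""
    return "<p><b>{}</b>: {}</p>".format(
        html.escape(name, quote=True), html.escape(comment, quote=True)
    )


# The raw text as it would sit in the database, unmodified.
stored_comment = '<script>alert("x")</script> sticks & stones'
rendered = render_comment("Tom & Jerry", stored_comment)
print(rendered)
```

The script tag comes out as inert text, and the ampersands display correctly because they were escaped exactly once.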
In regards to SQL Injection, escaping is not enough - you should use data access libraries where possible and parameterized queries.
For XSS (cross-site scripting), start by HTML-encoding the data you output. Again, anti-XSS libraries are your friend.
One current approach is to only allow a very limited number of tags in and sanitize those in the process (whitelist + cleanup).
You'll want to make sure people can't post JavaScript code or scary HTML in their comments. I suggest you disallow anything but very basic markup.
If comments are not supposed to contain any markup, doing a
echo htmlspecialchars($commentText);
should suffice, but it's very crude. Better would be to sanitize all input before even putting it in your database. The PHP strip_tags() function could get you started.
If you want to allow HTML comments, but be safe, you could give HTML Purifier a go.
You should not modify data that is entered by the user before putting it into the database. The modification should take place as you're outputting it to the website. You don't want to lose the original data.
As you're spitting it out to the website, you want to escape the special characters into HTML codes using something like htmlspecialchars("my output & stuff", ENT_QUOTES, 'UTF-8') -- make sure to specify the charset you are using. This string will be translated into my output &amp; stuff for the browser to read.
The best way to prevent SQL injection is simply not to use dynamic SQL that accepts user input. Instead, pass the input in as parameters; that way it will be strongly typed and can't inject code.
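For illustration, here is what passing input as parameters looks like, sketched in Python with the built-in sqlite3 module rather than PHP and MySQL. The table name phrases and the helper find_phrase are invented for the example; the ? placeholder is sqlite3's parameter syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phrases (id INTEGER PRIMARY KEY, phrase TEXT)")
# Store the text exactly as entered, with no escaping applied by hand.
conn.execute("INSERT INTO phrases (phrase) VALUES (?)", ("sticks & stones",))


def find_phrase(connection, phrase):
    """Look up a phrase, passing user input as a bound parameter."""
    cursor = connection.execute(
        "SELECT id, phrase FROM phrases WHERE phrase = ?", (phrase,)
    )
    return cursor.fetchall()


found = find_phrase(conn, "sticks & stones")
# A classic injection attempt is treated as an ordinary string to compare.
injected = find_phrase(conn, "x' OR '1'='1")
print(found)
print(injected)
```

The injection string simply matches nothing, because it never becomes part of the SQL text.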

What makes many coding websites convert the standard " into the non-standard ”?

This question is about standard double quote " and non-standard double quote “ & ”
Yesterday, when I searched for some sample Facebook serverfbml code, I came upon this:
http://mahmudahsan.wordpress.com/2008/11/22/facebook-fbml-rendering-in-iframe-application/
Okay, so it had what I wanted, so I copied the code to my project and ran it... bah... lots of errors
Why? Because the site turned the standard double quote " inside his script into “ or ” ,
or single quote from ' into ’
This is not the first time I've faced this problem when copying code from the Internet, and I believe many of the code writers didn't expect the site to turn their single/double quotes into strange ones.
Any explanation to this strange phenomenon ?
edited: I notice the title converted my " into “ & ” too... let me edit it... oh and I failed
At least in the title or in the text, it looks much better to have typographic double quotes (i.e. is more pleasant to the eye). Coding sites should not do this for actual code, i.e. in StackOverflow code that is indented by four spaces. If a double quote in text is converted to typographic, it's fine.
This gets much worse when you paste typographic quotes into a console that tries to display the character and falls back to a standard quote, because the console font does not have a typographic quote. Then it looks like a standard one, but it isn't. There's not much you can do about it, other than use a code display plugin on your website that does not change code.
The problem is in the underlying blog engine. Wordpress does that by default, and there is AFAIK no way to turn it off (Without changing the code). Given the fact that there are only relatively few really great blog engines, there may not always be a choice to switch to something "better".
Also in the same category: Fancy dashes, aka. turning - into –
The source shows that the quote character is sometimes ”.
That's the good-looking typographic quote, which will cause problems in a program.
I think either the WordPress text editor or its storage/retrieval converted the ordinary quote into that one.
You can use the replace function in your program editor to replace those characters.
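If your editor's replace function is tedious for a large paste, the same substitution can be scripted. A minimal Python sketch follows; the mapping REPLACEMENTS and the function straighten are names made up for this example, and the mapping also covers the en dash mentioned above.

```python
# Typographic characters mapped back to their plain ASCII equivalents.
REPLACEMENTS = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2013": "-",  # en dash
})


def straighten(code):
    """Replace typographic quotes and dashes in pasted code."""
    return code.translate(REPLACEMENTS)


pasted = "echo \u201chello\u201d . \u2018world\u2019;"
fixed = straighten(pasted)
print(fixed)
```

Note that this is blunt: it also changes curly quotes that were intentionally inside string literals, so review the result before running it.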