How to handle '&' in URL sent as HTML from iPhone Mail.app - iphone

Apologies if this has been answered already. There are similar topics but none that I could find pertaining to Cocoa & NSStrings...
I'm constructing a clickable URL to embed in an HTML email to be sent via the MFMailComposeViewController on the iPhone. i create the url then use stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding to polish up white space, etc. then add some surrounding HTML to get:
view
All's well so it's appended to emailBody. However once [mailComposer setMessageBody:emailBody isHTML:YES] all the & become & which isn't ideal within my URL.
can i control this? is there a better encoding algorithm? my HTML is a bit rusty perhaps I'm using the wrong encoding? I'm sure on the server I could parse the & back into & but looking for the Cocoa way...
Thanks!

Actually, & should always be encoded as & in HTML attributes. Including links. Including form value delimiters. So it's done exactly what you want, even though you didn't know you wanted it.
Look at it this way: in your URL, you have &age=53... That's interpreted first as a character entity, and only after that doesn't work is it interpreted as an ampersand followed by more character data.
The W3C spec is quite clear on this:
Authors should use "&" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&" in attribute values since character references are allowed within CDATA attribute values.
That should settle it: use & not &.

Are you calling MFMailComposeViewController's
setMessageBody:isHTML:
and what do you set isHTML to?
Depending on it's setting it might very well be that MFMailComposeViewController is trying to help you out be encoding the entire message body...
Either don't encode the body yourself or make the entire body HTML.

Related

Is sanitizing html by removing angle brackets safe?

I want to sanitize a simple text field with a person's name, to protect from XSS and such. Stackoverflow pretty much says I must whitelist. I don't understand this. If I simply remove all < and > from the input value, or replace them with > and &ls;, does not that rule out code injection? Or am I missing something? Perhaps you only need to whitelist in more complex scenarios where you have to put up with angular brackets?
Sorry if it's a silly question, it's important to get this right.
Whether to whitelist or encode depends on how you want to use the text.
If you intend to treat the input as plain text, then encoding special characters is enough, and any HTML code entered will display as text only as long as you are careful not to allow unencoded text to end up anywhere in your HTML output. (This includes making sure any other systems you interface with don’t inappropriately use the unencoded text.)
If you want to allow some markup in the input, such as text styling or links, then you must whitelist the tags that you allow and get rid of all others.
No, it's not sufficient because if you were to include the person's name in an html attribute, you would also need to escape any double-quotes contained therein.

What is &amp used for

Is there any difference in behaviour of below URL.
I don't know why the & is inserted, does it make any difference ?
www.testurl.com/test?param1=test&current=true
versus
www.testurl.com/test?param1=test&current=true
& is HTML for "Start of a character reference".
& is the character reference for "An ampersand".
&current; is not a standard character reference and so is an error (browsers may try to perform error recovery but you should not depend on this).
If you used a character reference for a real character (e.g. ™) then it (™) would appear in the URL instead of the string you wanted.
(Note that depending on the version of HTML you use, you may have to end a character reference with a ;, which is why &trade= will be treated as ™. HTML 4 allows it to be ommited if the next character is a non-word character (such as =) but some browsers (Hello Internet Explorer) have issues with this).
HTML doesn't recognize the & but it will recognize & because it is equal to & in HTML
I looked over this post someone had made: http://www.webmasterworld.com/forum21/8851.htm
My Source: http://htmlhelp.com/tools/validator/problems.html#amp
Another common error occurs when including a URL which contains an
ampersand ("&"):
This is invalid:
a href="foo.cgi?chapter=1&section=2&copy=3&lang=en"
Explanation:
This example generates an error for "unknown entity section" because
the "&" is assumed to begin an entity reference. Browsers often
recover safely from this kind of error, but real problems do occur in
some cases. In this example, many browsers correctly convert &copy=3
to ©=3, which may cause the link to fail. Since 〈 is the HTML
entity for the left-pointing angle bracket, some browsers also convert
&lang=en to 〈=en. And one old browser even finds the entity §,
converting &section=2 to §ion=2.
So the goal here is to avoid problems when you are trying to validate your website. So you should be replacing your ampersands with & when writing a URL in your markup.
Note that replacing & with & is only done when writing the URL in
HTML, where "&" is a special character (along with "<" and ">"). When
writing the same URL in a plain text email message or in the location
bar of your browser, you would use "&" and not "&". With HTML, the
browser translates "&" to "&" so the Web server would only see "&"
and not "&" in the query string of the request.
Hope this helps : )
That's a great example. When &current is parsed into a text node it is converted to ¤t. When parsed into an attribute value, it is parsed as &current.
If you want &current in a text node, you should write &current in your markup.
The gory details are in the HTML5 parsing spec - Named Character Reference State
if you're doing a string of characters.
make:
let linkGoogle = 'https://www.google.com/maps/dir/?api=1';
let origin = '&origin=' + locations[0][1] + ',' + locations[0][2];
aNav.href = linkGoogle + origin;

Wikipedia (MediaWiki) URI encoding scheme

How do Wikipedia (or MediaWiki in general) encode page titles in URIs? It's not normal URI encoding, since spaces are replaced with underscores and double quotes are not encoded and things like that.
http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_%28technical_restrictions%29 - here you've got some kind of description of what their engine enforces on article names.
They should have something like this in their LocalSettings.php:
$wgArticlePath = '/wiki/$1';
and proper server URI rewrites configuration - they seem to be using Apache (HTTP header), so it's probably mod_rewrite. http://www.mediawiki.org/wiki/Manual:Short_URL
You can also refer to the index.php file for an article on Wikipedia like this: http://en.wikipedia.org/w/index.php?title=Foo%20bar and get redirected by the engine to http://en.wikipedia.org/wiki/Foo_bar. Behind the scenes mod_rewrite translates it into /index.php?title=Foo_bar. For the MediaWiki engine it's the same as if you visited http://en.wikipedia.org/w/index.php?title=Foo_bar - this page doesn't redirect you.
The process is quite complex and isn't exactly pretty. You need to look at the Title class found in includes/Title.php. You should start with the newFromText method, but the bulk of the logic is in the secureAndSplit method.
Note that (as ever with MediaWiki) the code is not decoupled in the slightest. If you want to replicate it, you'll need to extract the logic rather than simply re-using the class.
The logic looks something like this:
Decode character references (e.g. é)
Convert spaces to underscores
Check whether the title is a reference to a namespace or interwiki
Remove hash fragments (e.g. Apple#Name
Remove forbidden characters
Forbid subdirectory links (e.g. ../directory/page)
Forbid triple tilde sequences (~~~) (for some reason)
Limit the size to 255 bytes
Capitalise the first letter
Furthermore, I believe I'm right in saying that quotation marks don't need to be encoded by the original user -- browsers can handle them transparently.
I hope that helps!

Apostrophe issue in RTF

I have a function within a custom CRM web application (old VB.Net circa 2003) that takes a set of fields from a database and merges them with palceholders in a set of RTF based template documents. These generate merged letters and documentation. The code essentially loops through each line of the RTF template file and replaces any instances of the placeholder values with text from a database record. The issue I'm having is that users have pasted a certain type of apostrophe into the web app (and therefore into the database) that is not rendering correctly in the resulting RTF file. It is rendering like this - ’.
I need a way to spot this invalid apostrophe in the code and replace it with a valid one. Unfortunately when I paste the invalid apostrophe into the Visual Studio editor it gets converted into the correct one. So I need another way to express this invalid apostrophe's value. Unfortunately I do not know a great deal about unicode and other encodings so I'm calling out for help with this.
Any ideas?
If you really just want to figure out what the character is you might want to try and paste it into a text editor like ultraedit. It has a hex mode that you can flip to to see the actual underlying bytes.
In order to do the replace once you've figured out the character you'd do something like this in Vb,
text.Replace(ChrW(2001), "'")
Note that you might not be able to figure it out easily using the text editor because it might also get mangled by paste from the clipboard. You might want to either print some debug of the ascii values from code. You can use the AscW function to do that.
I can't help but think that it may actually simply be a case of specifying the correct encoding to use when you write out the stream though. Assuming you're using a StreamWriter you can specify it on the constructor. I'm guessing you actually want ASCII given your requirement.
oWriter = New System.IO.StreamWriter(path, False, System.Text.Encoding.ASCII)
It looks like you probably want to encode characters out of the 8 bit range (>255).
You can do that using \uNNNN according to the wikipedia article.

What kind of text code is %62%69%73%68%6F%70?

On a specific webpage, when I hover over a link, I can see the text as "bishop" but when I copy-and-paste the link to TextPad, it shows up as "%62%69%73%68%6F%70". What kind of code is this, and how can I convert it into text?
Thanks!
URL encoding, I think.
You can decode it here: http://meyerweb.com/eric/tools/dencoder/
Most programming languages will have functions to urlencode/decode too.
This is URL encoding. It is designed to pass characters like < / or & through a URL using their ASCII values in hex after a %. However, you can also use this for characters that don't need encoding per se. Makes the URL harder to read, which is sometimes desirable.
URL encoding replaces characters outside the ascii set.
More info about URL encoding in the w3schools site.
As mentioned by others, this is simply an ASCII representation of the text so that it can be passed around the HTTP object easily. If you've ever noticed typing in a website URL that has a space in it, the browser will usually convert that to %20. That's the hexadecimal value for the "space" character in ASCII.
This used to be a way to trick old spam scrapers. One way spammers get email addresses is to scrape the source code of websites for strings matching the pattern "username#company.tld". By encoding just the username portion or the whole string as ASCII characters, the string would be readable by humans, but would require the scraper to convert it to a literal string before it could be used to send emails. Of course, modern-day spamming tools account for these sort of strings.