Should I use accented characters in URLs? - unicode

When one creates web content in languages different than English the problem of search engine optimized and user friendly URLs emerge.
I'm wondering whether it is the best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or it is better to stick to the usage of non-english characters where appropriate sacrificing the readability of those URLs in less advanced environments (e.g. MSIE, view source).
"Exotic" letters could appear anywhere: in titles of documents, in tags, in user names, etc, so they're not always under the complete supervision of the maintainer of the website.
A possible approach of course would be setting up alternate -- unaccented -- URLs as well which would point to the original destination, but I would like to learn your opinions about using accented URLs as primary document identifiers.

There's no ambiguity here: RFC3986 says no, that is, URIs cannot contain unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI, for example some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appears to be unicode chars in URLs is really only 'visual sugar' on the part of the browser: if you use a browser that doesn't support IDN or unicode, the encoded version won't work because the underlying definition of URLs simply doesn't support it, so for it to work consistently, you need to % encode.

When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like
http://www.mysite.com/myresume.html
And a rewriting+character translating function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.

Considering URLs with accents often tend to end up looking like this :
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Though, things should get better, as accented URLs are now accepted by web browsers, it seems.
The firefox 3.5 I'm currently using displays the URL the nice way, and not with %stuff, btw ; this seems to be "new" since firefox 3.0 (see Firefox 3: UTF-8 support in location bar) ; so, not probably not supported in IE 6, at least -- and there are still quite too many people using this one :-(
Maybe URL with no accent are not looking the best that could be ; but, still, people are used to them, and seem to generally understand them quite well.

You should avoid non-ASCII characters in URLs that may be entered in browser manually by users. It's ok for embedded links pre-encoded by server.
We found out that browser can encode the URL in different ways and it's very hard to figure out what encoding it uses. See my question on this issue,
Handling Character Encoding in URI on Tomcat

There are several areas in a full URL, and each one might has different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (International Domain Names) rules, and can contain (most) of the Unicode characters.
The path (after the first /), the user name and the password can again be everything. They are escaped (as %XX), but those are just bytes. What is the encoding of these bytes is difficult to know (is interpreted by the http server).
The parameters part (after the first ?) is passed "as is" (after %XX unescapeing) to some server-side application thing (php, asp, jsp, cgi), and how that interprets the bytes is another story).
It is recommended that the path/user/password/arguments are utf-8, but not mandatory, and not everyone respects that.
So you should definitely allow for non-ASCII (we are not in the 80s anymore), but exactly what you do with that might be tricky. Try to use Unicode and stay away from legacy code pages, tag your content with the proper encoding/charset if you can (using meta in html, language directives for asp/jsp, etc.)

Related

Hyphen, underscore, or camelCase as word delimiter in URIs?

I'm designing an HTTP-based API for an intranet app. I realize it's a pretty small concern in the grand scheme of things, but: should I use hyphens, underscores, or camelCase to delimit words in the URIs?
Here are my initial thoughts:
camelCase
possible issues if server is case-insensitive
seems to have fairly widespread use in query string keys (http://api.example.com?**searchQuery**=...), but not in other URI parts
Hyphen
more aesthetically pleasing than the other alternatives
seems to be widely used in the path portion of the URI
never seen hyphenated query string key in the wild
possibly better for SEO (this may be a myth)
Underscore
potentially easier for programming languages to handle
several popular APIs (Facebook, Netflix, StackExchange, etc.) are using underscores in all parts of the URI.
I'm leaning towards underscores for everything. The fact that most of the big players are using them is compelling (see https://stackoverflow.com/a/608458/360570).
You should use hyphens in a crawlable web application URL. Why? Because the hyphen separates words (so that a search engine can index the individual words), and a hyphen is not a word character. Underscore is a word character, meaning it should be considered part of a word.
Double-click this in Chrome: camelCase
Double-click this in Chrome: under_score
Double-click this in Chrome: hyphen-ated
See how Chrome (I hear Google makes a search engine too) only thinks one of those is two words?
camelCase and underscore also require the user to use the shift key, whereas hyphenated does not.
So if you should use hyphens in a crawlable web application, why would you bother doing something different in an intranet application? One less thing to remember.
The standard best practice for REST APIs is to have a hyphen, not camelcase or underscores.
This comes from Mark Masse's "REST API Design Rulebook" from Oreilly.
In addition, note that Stack Overflow itself uses hyphens in the URL: .../hyphen-underscore-or-camelcase-as-word-delimiter-in-uris
As does WordPress: http://inventwithpython.com/blog/2012/03/18/how-much-math-do-i-need-to-know-to-program-not-that-much-actually
Short Answer:
lower-cased words with a hyphen as separator
Long Answer:
What is the purpose of a URL?
If pointing to an address is the answer, then a shortened URL is also doing a good job. If we don't make it easy to read and maintain, it won't help developers and maintainers alike. They represent an entity on the server, so they must be named logically.
Google recommends using hyphens
Consider using punctuation in your URLs. The URL http://www.example.com/green-dress.html is much more useful to us than http://www.example.com/greendress.html. We recommend that you use hyphens (-) instead of underscores (_) in your URLs.
Coming from a programming background, camelCase is a popular choice for naming joint words.
But RFC 3986 defines URLs as case-sensitive for different parts of the URL.
Since URLs are case sensitive, keeping it low-key (lower cased) is always safe and considered a good standard. Now that takes a camel case out of the window.
Source: https://metamug.com/article/rest-api-naming-best-practices.html#word-delimiters
Whilst I recommend hyphens, I shall also postulate an answer that isn't on your list:
Nothing At All
My company's API has URIs like /quotationrequests/, /purchaseorders/ and so on.
Despite you saying it was an intranet app, you listed SEO as a benefit. Google does match the pattern /foobar/ in a URL for a query of ?q=foo+bar
I really hope you do not consider executing a PHP call to any arbitrary string the user passes in to the address bar, as #ServAce85 suggests!
In general, it's not going to have enough of an impact to worry about, particularly since it's an intranet app and not a general-use Internet app. In particular, since it's intranet, SEO isn't a concern, since your intranet shouldn't be accessible to search engines. (and if it is, it isn't an intranet app).
And any framework worth it's salt either already has a default way to do this, or is fairly easy to change how it deals with multi-word URL components, so I wouldn't worry about it too much.
That said, here's how I see the various options:
Hyphen
The biggest danger for hyphens is that the same character (typically) is also used for subtraction and numerical negation (ie. minus or negative).
Hyphens feel awkward in URL components. They seem to only make sense at the end of a URL to separate words in the title of an article. Or, for example, the title of a Stack Overflow question that is added to the end of a URL for SEO and user-clarity purposes.
Underscore
Again, they feel wrong in URL components. They break up the flow (and beauty/simplicity) of a URL, since they essentially add a big, heavy apparent space in the middle of a clean, flowing URL.
They tend to blend in with underlines. If you expect your users to copy-paste your URLs into MS Word or other similar text-editing programs, or anywhere else that might pick up on a URL and style it with an underline (like links traditionally are), then you might want to avoid underscores as word separators. Particularly when printed, an underlined URL with underscores tends to look like it has spaces in it instead of underscores.
CamelCase
By far my favorite, since it makes the URLs seem to flow better and doesn't have any of the faults that the previous two options do.
Can be slightly harder to read for people that have a hard time differentiating upper-case from lower-case, but this shouldn't be much of an issue in a URL, because most "words" should be URL components and separated by a / anyways. If you find that you have a URL component that is more than 2 "words" long, you should probably try to find a better name for that concept.
It does have a possible issue with case sensitivity, but most platforms can be adjusted to be either case-sensitive or case-insensitive. Any it's only really an issue for 2 cases: a.) humans typing the URL in, and b.) Programmers (since we are not human) typing the URL in. Typos are always a problem, regardless of case sensitivity, so this is no different that all one case.
It is recommended to use the spinal-case (which is highlighted by
RFC3986), this case is used by Google, PayPal, and other big
companies.
source:- https://blog.restcase.com/5-basic-rest-api-design-guidelines/
EDIT: Although the highlight on the RFC is nowhere to be found, the recommendation on spinal case is still valid (as already noted in other answers)
We should use hyphens in web page URLs to convince search engines to index each keyword in the URL separately.
here's the best of both worlds.
I also "like" underscores, besides all your positive points about them, there is also a certain old-school style to them.
So what I do is use underscores and simply add a small rewrite rule to your Apache's .htaccess file to re-write all underscores to hyphens.
https://yoast.com/apache-rewrite-dash-underscore/

How are non-roman Unicode characters encoded in a domain name?

Say I have a URL such as:
http://ほっけがおいしい.com
If I put this in any browser, I insto-magically get:
http://xn--n8jaqhy3b1euj.com/
What is the algorithm to transcode the Unicode characters into mere latin ones? This seems like it should be easily Google-able but I really can't seem to find anything.
I want to reverse it -- given the latter, I want to get the former.
The use case is that I want to pass some information on the iPhone between apps using URL handlers, but I can't guarantee that the content will be Latin characters.
I'm not sure if this covers it all, I've not read through all the RFCs but it might be a good place to start: RFCs Related to IDN

URL Shortening: What's the best encoding to use?

I'm adding a feature to my project where we are generating links to internal stuff of our website, and we want these links to be as short as possible, so we'll be making our own "URL Shortener".
I'm wondering what's the best encoding / alphabet to use for the generated short URLs.
This is largely a subjective question, I'd like to know what your opinions are regarding the best approach / trade-off.
Several options I've thought of:
- Digits, uppercase + lowercase (base 62)
- Digits, only lowercase (base 36)
- Base 32 (http://www.crockford.com/wrmg/base32.html)
- linkpot.net (using common short english words)
Of course, the second two are better for uses other than clicking, and the first two are better for Twitter.
Also, if I'm going with "clickable-only" URLs, I'd like to make the alphabet as large as possible, adding other symbols.
What symbols can I use in URLs that won't get URL encoded?
What symbols should I use? Could some of these prove problematic? I'm thinking slash and dot, for example.
What do you think?
NOTE: The main target for these URLs is Twitter. Keeping this in mind, we should probably have the largest alphabet possible, since most people will be clicking. However, I'm interested in your experience with people using short URLs in other ways (over the phone, in printed paper, etc). How likely is it this could happen?
NOTE 2: I'm not making "yet another URL shortener", please don't condemn me with downvotes. We are generating short URLs for internal stuff in our site, not allowing anyone to shorten any URL. Imagine Google Maps giving you short URLs when you generate a link to a specific coordinate.
I would go with Base-62, it's the shortest. Shortened URL is not meant for someone to manually enter anyway so don't worry about case-sensitivity.
If these are "clickable only URLS" I'd probably go with a base-64 encoding. MIME's base-64 uses a couple of characters you shouldn't use, but there are enough unreserved safe characters in URLs that you can just swap them out. (Also, you don't need the padding that MIME's base-64 uses, since you know when your URL ends.)
Here's a page that discusses one way to do this.
You can look at RFC2396 to figure out exactly what characters are safe in URIs if you want to double check.
I'd be curious to know a little more about the implementation. How will these URLs be "unshortened", or will the internal pages being accessed be saved as shortened URLs? In either case, even if you went with the encoding set of [A-Z] you'd be able to reference 26 * 26 * 26 = 17,576 pages with only 3 characters; how many internal web pages are you talking about?
In general I would lean on what your use case requirements are for picking the right encoding set. Are you planning on having these links available for "uses other than clicking"? What would those uses be, and how do you suspect they'll alter the encoding? (For example, using parts of the URL as a file name on a case-insensitive file system reduces the available character set.)
Here's an informative page on the character set you have available to you when writing a URL.

What is the proper way to URL encode Unicode characters?

I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C.
Some interesting examples:
The heart character.
If I type this into my browser:
http://www.google.com/search?q=♥
Then copy and paste it, I see this URL
http://www.google.com/search?q=%E2%99%A5
which makes it seem like Firefox (or Safari) is doing this.
urllib.quote_plus(x.encode("latin-1"))
'%E2%99%A5'
which makes sense, except for things that can't be encoded in Latin-1, like the triple dot character.
…
If I type the URL
http://www.google.com/search?q=…
into my browser then copy and paste, I get
http://www.google.com/search?q=%E2%80%A6
back. Which seems to be the result of doing
urllib.quote_plus(x.encode("utf-8"))
which makes sense since … can't be encoded with Latin-1.
But then its not clear to me how the browser knows whether to decode with UTF-8 or Latin-1.
Since this seems to be ambiguous:
In [67]: u"…".encode('utf-8').decode('latin-1')
Out[67]: u'\xc3\xa2\xc2\x80\xc2\xa6'
works, so I don't know how the browser figures out whether to decode that with UTF-8 or Latin-1.
What's the right thing to be doing with the special characters I need to deal with?
I would always encode in UTF-8. From the Wikipedia page on percent encoding:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
It seems like because there were other accepted ways of doing URL encoding in the past, browsers attempt several methods of decoding a URI, but if you're the one doing the encoding you should use UTF-8.
The general rule seems to be that browsers encode form responses according to the content-type of the page the form was served from. This is a guess that if the server sends us "text/xml; charset=iso-8859-1", then they expect responses back in the same format.
If you're just entering a URL in the URL bar, then the browser doesn't have a base page to work on and therefore just has to guess. So in this case it seems to be doing utf-8 all the time (since both your inputs produced three-octet form values).
The sad truth is that AFAIK there's no standard for what character set the values in a query string, or indeed any characters in the URL, should be interpreted as. At least in the case of values in the query string, there's no reason to suppose that they necessarily do correspond to characters.
It's a known problem that you have to tell your server framework which character set you expect the query string to be encoded as--- for instance, in Tomcat, you have to call request.setEncoding() (or some similar method) before you call any of the request.getParameter() methods. The dearth of documentation on this subject probably reflects the lack of awareness of the problem amongst many developers. (I regularly ask Java interviewees what the difference between a Reader and an InputStream is, and regularly get blank looks)
IRI (RFC 3987) is the latest standard that replaces the URI/URL (RFC 3986 and older) standards. URI/URL do not natively support Unicode (well, RFC 3986 adds provisions for future URI/URL-based protocols to support it, but does not update past RFCs). The "%uXXXX" scheme is a non-standard extension to allow Unicode in some situations, but is not universally implemented by everyone. IRI, on the other hand, fully supports Unicode, and requires that text be encoded as UTF-8 before then being percent-encoded.
IRIs do not replace URIs, because only URIs (effectively, ASCII) are permissible in some contexts -- including HTTP.
Instead, you specify an IRI and it gets transformed into a URI when going out on the wire.
The first question is what are your needs? UTF-8 encoding is a pretty good compromise between taking text created with a cheap editor and support for a wide variety of languages. In regards to the browser identifying the encoding, the response (from the web server) should tell the browser the encoding. Still most browsers will attempt to guess, because this is either missing or wrong in so many cases. They guess by reading some amount of the result stream to see if there is a character that does not fit in the default encoding. Currently all browser(? I did not check this, but it is pretty close to true) use utf-8 as the default.
So use utf-8 unless you have a compelling reason to use one of the many other encoding schemes.

How do you troubleshoot character encoding problems?

If all you see is the ugly no-char boxes, what tools or strategies do you use to figure out what went wrong?
(The specific scenario I'm facing is no-char boxes within a <select> when it should be showing Japanese chars.)
Firstly, "ugly no-char boxes" might not be an encoding problem, they might just be a sign you don't have a font installed that can display the glyphs in the page.
Most character encoding problems happen when strings are being passed from one system to another. For webapps, this is usually between the browser and the application, between the application and the filesystem and between the application and the database.
So you need to check where the mis-encoded data is coming from, what character encoding it has at the source, and what encoding it is being received as. The best way is to send through characters you know the system is having problems with, and examine them at each level of the app. What do they look like inside the app? In the database? When you get them back from the database? When they're displayed in the browser?
Sorry to be so general, but the question doesn't give much more to work with.
If the data you send to the browser becomes mangled (moji-bake) you will get trash characters. Also, if you specify the wrong character set in your META headers, your browser will render the page incorrectly, causing moji-bake again, sometimes in random places on the page.
When handling CJK character sets, you must be sure to use UTF8 character encoding throughout the lifetime of your program (data storage, retrieval, data manipulation in your code, displaying in the browsser etc...)
What is UTF8?
UTF8 handles binary streams of data, not strings. This means the bit combinations can have variable length. ASCII characters have a fixed length of 8 bits representing 1 byte, however UTF8 characters can be composed of 6bits, 8bits, 12bits, etc... As such, UTF8 is prone to what Japanese call "mojibake".
As a coder, from database to codebase to browser, you should try and use UTF8 completely. For email you can use UTF8, but you will probably find most mail servers and clients are still old and use a mishmash of different character sets (e.g. ISO9022X).
Database Settings
If you are a mysql user, then make sure you have to ensure all connections to the DB use UTF8, and that all tables/fields use UTF8. By default mysql uses Latin (Swedish) character sets. Those kooky swedes love their sense of humour!!
Checking your Codebase
In my experience editors like Notepad++, Notepad2, UltraEdit, e, etc... all have UTF8 support problems. They mostly work, but since their developers don't use CJK languages themselves, they are not perfected. Issues like turning off BOM (Byte Order Mark), mangled tabs, poor character set conversion, etc ... all present problems.
I highly recommend using a proven UTF8 editor like Maruo. This is made by a Japanese company, but there is an English version (and a trial version) at http://www.hidemaru.interlink.or.jp/software/
Lastly, you may need to convert your source files into UTF8. Especially if the codebase itself has CJK language strings contained therein.
Manipulating Strings
Any string function need to multibyte safe. Notice I didn't say double-byte. UTF8 is not a double byte but multibyte, depending on the total number of bits used to represent a character. In PHP you need to call the MB string functions specifically. Ruby and other languages have more transparent support, but you need to check the docs for your flavour of application server!
META Tags
Check out google.co.jp or yahoo.co.jp for their META headers. These are sites that know how to to it properly. Basically include the following META tag the doucment <HEAD>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
It is usually safe to mix English HTML document type attributes with the above character too. So adding the META tag above seems to work in a HTML document that has:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
Email
This is a wholly different can of worms. UTF8 works a lot, but many older Japanese clients use ISO2022X more. This is not worth covering here.
Debugging UTF8 Issues
Once you have a reliable UTF8 editor like Maruo, you can create static pages and resolve your issues.
Hope that helps
Redirect the data to disk and use a Hex Editor. Most text editors / viewers do their own conversions behind the scenes, so it is difficult to be sure you are seeing the data in it's true form.