Wikipedia (MediaWiki) URI encoding scheme - encoding

How does Wikipedia (or MediaWiki in general) encode page titles in URIs? It's not plain percent-encoding: spaces are replaced with underscores, double quotes are left unencoded, and so on.

http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_%28technical_restrictions%29 - this page describes the restrictions the engine enforces on article names.
They should have something like this in their LocalSettings.php:
$wgArticlePath = '/wiki/$1';
and a matching URL-rewrite configuration on the server - judging by the HTTP response headers they seem to be using Apache, so it's probably mod_rewrite. See http://www.mediawiki.org/wiki/Manual:Short_URL
You can also request an article through index.php directly, e.g. http://en.wikipedia.org/w/index.php?title=Foo%20bar, and the engine will redirect you to http://en.wikipedia.org/wiki/Foo_bar. Behind the scenes mod_rewrite translates that pretty URL back into /w/index.php?title=Foo_bar, so for the MediaWiki engine it's the same as if you had visited http://en.wikipedia.org/w/index.php?title=Foo_bar directly - that form doesn't redirect you.

The process is quite complex and isn't exactly pretty. You need to look at the Title class found in includes/Title.php. You should start with the newFromText method, but the bulk of the logic is in the secureAndSplit method.
Note that (as ever with MediaWiki) the code is not decoupled in the slightest. If you want to replicate it, you'll need to extract the logic rather than simply re-using the class.
The logic looks something like this:
Decode HTML character references (e.g. &eacute; becomes é)
Convert spaces to underscores
Check whether the title is a reference to a namespace or interwiki
Remove hash fragments (e.g. Apple#Name)
Remove forbidden characters
Forbid subdirectory links (e.g. ../directory/page)
Forbid triple-tilde sequences (~~~), presumably because tilde runs are expanded into signatures in wikitext
Limit the size to 255 bytes
Capitalise the first letter
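A rough, much-simplified sketch of those steps in Python follows; the real logic lives in Title::secureAndSplit(), namespace/interwiki handling is omitted, and the forbidden-character set below is an assumption based on the technical-restrictions page, not the actual $wgLegalTitleChars value.
import html
import re

def normalize_title(raw, max_bytes=255):
    # Decode HTML character references (e.g. "&eacute;" -> "é")
    title = html.unescape(raw)
    # Convert spaces to underscores and collapse runs of them
    title = re.sub(r"[ _]+", "_", title).strip("_")
    # Drop any hash fragment (e.g. "Apple#Name" -> "Apple")
    title = title.split("#", 1)[0]
    # Reject forbidden characters and patterns (assumed set, not exhaustive)
    if re.search(r"[\[\]{}<>|]", title):
        raise ValueError("forbidden character in title")
    if title.startswith("../") or "/../" in title:
        raise ValueError("relative path in title")
    if "~~~" in title:
        raise ValueError("tilde sequence in title")
    # Enforce the 255-byte limit (bytes, not characters)
    if len(title.encode("utf-8")) > max_bytes:
        raise ValueError("title too long")
    # Capitalise the first letter
    return title[:1].upper() + title[1:]

print(normalize_title("foo  bar#History"))   # -> "Foo_bar"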
Furthermore, I believe I'm right in saying that quotation marks don't need to be encoded by the original user -- browsers can handle them transparently.
I hope that helps!

Related

Hyphen, underscore, or camelCase as word delimiter in URIs?

I'm designing an HTTP-based API for an intranet app. I realize it's a pretty small concern in the grand scheme of things, but: should I use hyphens, underscores, or camelCase to delimit words in the URIs?
Here are my initial thoughts:
camelCase
possible issues if server is case-insensitive
seems to have fairly widespread use in query string keys (http://api.example.com?searchQuery=...), but not in other URI parts
Hyphen
more aesthetically pleasing than the other alternatives
seems to be widely used in the path portion of the URI
never seen a hyphenated query string key in the wild
possibly better for SEO (this may be a myth)
Underscore
potentially easier for programming languages to handle
several popular APIs (Facebook, Netflix, StackExchange, etc.) are using underscores in all parts of the URI.
I'm leaning towards underscores for everything. The fact that most of the big players are using them is compelling (see https://stackoverflow.com/a/608458/360570).
You should use hyphens in a crawlable web application URL. Why? Because the hyphen separates words (so that a search engine can index the individual words), and a hyphen is not a word character. Underscore is a word character, meaning it should be considered part of a word.
Double-click this in Chrome: camelCase
Double-click this in Chrome: under_score
Double-click this in Chrome: hyphen-ated
See how Chrome (I hear Google makes a search engine too) only thinks one of those is two words?
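The same distinction shows up in regular expressions: the \w "word character" class includes the underscore but not the hyphen. A quick illustration in Python:
import re
for s in ("camelCase", "under_score", "hyphen-ated"):
    print(s, re.findall(r"\w+", s))
# camelCase   ['camelCase']
# under_score ['under_score']
# hyphen-ated ['hyphen', 'ated']
Only the hyphenated form splits into two words, which is exactly what a word-boundary-aware indexer sees.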
camelCase and underscore also require the user to use the shift key, whereas hyphenated does not.
So if you should use hyphens in a crawlable web application, why would you bother doing something different in an intranet application? One less thing to remember.
The standard best practice for REST APIs is to use hyphens, not camelCase or underscores.
This comes from Mark Masse's "REST API Design Rulebook" from O'Reilly.
In addition, note that Stack Overflow itself uses hyphens in the URL: .../hyphen-underscore-or-camelcase-as-word-delimiter-in-uris
As does WordPress: http://inventwithpython.com/blog/2012/03/18/how-much-math-do-i-need-to-know-to-program-not-that-much-actually
Short Answer:
lower-cased words with a hyphen as separator
Long Answer:
What is the purpose of a URL?
If merely pointing to an address were the answer, a shortened URL would do the job just as well. If we don't make URLs easy to read and maintain, they won't help developers and maintainers. They represent an entity on the server, so they should be named logically.
Google recommends using hyphens
Consider using punctuation in your URLs. The URL http://www.example.com/green-dress.html is much more useful to us than http://www.example.com/greendress.html. We recommend that you use hyphens (-) instead of underscores (_) in your URLs.
For people coming from a programming background, camelCase is a popular choice for naming compound words.
But RFC 3986 defines URLs as case-sensitive in all parts except the scheme and host.
Since URLs are case-sensitive, keeping them low-key (lowercased) is always safe and considered a good standard. That takes camelCase out of the window.
Source: https://metamug.com/article/rest-api-naming-best-practices.html#word-delimiters
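As an illustration, a minimal slug generator along those lines (lowercase words joined by hyphens) could look like this in Python; the function name and rules here are just an example, not taken from any particular framework:
import re

def slugify(title):
    # Lowercase, then replace every run of non-alphanumerics with a single hyphen
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

print(slugify("Hyphen, underscore, or camelCase as word delimiter in URIs?"))
# hyphen-underscore-or-camelcase-as-word-delimiter-in-uris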
Whilst I recommend hyphens, I shall also postulate an answer that isn't on your list:
Nothing At All
My company's API has URIs like /quotationrequests/, /purchaseorders/ and so on.
Even though you said it was an intranet app, you listed SEO as a benefit. Google does match the pattern /foobar/ in a URL against a query of ?q=foo+bar
I really hope you do not consider executing a PHP call on any arbitrary string the user passes into the address bar, as @ServAce85 suggests!
In general, it's not going to have enough of an impact to worry about, particularly since it's an intranet app and not a general-use Internet app. In particular, since it's intranet, SEO isn't a concern, since your intranet shouldn't be accessible to search engines (and if it is, it isn't really an intranet app).
And any framework worth its salt either already has a default way to do this, or makes it fairly easy to change how it deals with multi-word URL components, so I wouldn't worry about it too much.
That said, here's how I see the various options:
Hyphen
The biggest danger with hyphens is that the same character (typically) is also used for subtraction and numerical negation (i.e. minus or negative).
Hyphens feel awkward in URL components. They seem to only make sense at the end of a URL to separate words in the title of an article. Or, for example, the title of a Stack Overflow question that is added to the end of a URL for SEO and user-clarity purposes.
Underscore
Again, they feel wrong in URL components. They break up the flow (and beauty/simplicity) of a URL, since they essentially add a big, heavy apparent space in the middle of a clean, flowing URL.
They tend to blend in with underlines. If you expect your users to copy-paste your URLs into MS Word or other similar text-editing programs, or anywhere else that might pick up on a URL and style it with an underline (like links traditionally are), then you might want to avoid underscores as word separators. Particularly when printed, an underlined URL with underscores tends to look like it has spaces in it instead of underscores.
CamelCase
By far my favorite, since it makes the URLs seem to flow better and doesn't have any of the faults that the previous two options do.
Can be slightly harder to read for people that have a hard time differentiating upper-case from lower-case, but this shouldn't be much of an issue in a URL, because most "words" should be URL components and separated by a / anyways. If you find that you have a URL component that is more than 2 "words" long, you should probably try to find a better name for that concept.
It does have a possible issue with case sensitivity, but most platforms can be adjusted to be either case-sensitive or case-insensitive. Anyway, it's only really an issue for two cases: a) humans typing the URL in, and b) programmers (since we are not human) typing the URL in. Typos are always a problem, regardless of case sensitivity, so this is no different from using all one case.
It is recommended to use spinal-case (which is highlighted in RFC 3986); this case is used by Google, PayPal, and other big companies.
Source: https://blog.restcase.com/5-basic-rest-api-design-guidelines/
EDIT: Although the claimed highlight in the RFC is nowhere to be found, the recommendation for spinal-case is still valid (as already noted in other answers).
We should use hyphens in web page URLs to convince search engines to index each keyword in the URL separately.
Here's the best of both worlds.
I also "like" underscores; besides all your positive points about them, there is also a certain old-school style to them.
So what I do is use underscores, and simply add a small rewrite rule to Apache's .htaccess file to rewrite all underscores to hyphens.
https://yoast.com/apache-rewrite-dash-underscore/
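If you are not fronted by Apache, the same idea can live in application code instead. Here is a hedged sketch of the equivalent behaviour as WSGI middleware in Python -- not the rewrite rule from the linked article, just an alternative expression of it: 301-redirect any path containing underscores to its hyphenated twin.
def underscores_to_hyphens(app):
    # WSGI middleware: redirect /foo_bar -> /foo-bar with a 301
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if "_" in path:
            location = path.replace("_", "-")
            query = environ.get("QUERY_STRING", "")
            if query:
                location += "?" + query
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return app(environ, start_response)
    return middleware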

Encoding special chars in XSLT output

I have built a set of scripts, part of which transform XML documents from one vocabulary to a subset of the document in another vocabulary.
For reasons that are opaque to me, but apparently non-negotiable, the target platform (Java-based) requires the output document to have 'encoding="UTF-8"' in the XML declaration, but some special characters within text nodes must be encoded with their hex Unicode value - e.g. '”' must be replaced with '&#x201D;' and so forth. I have not been able to acquire a definitive list of which chars must be encoded, but it does not appear to be as simple as "all non-ASCII".
Currently, I have a horrid mess of VBScript using ADODB to directly check each line of the output file after processing, and replace characters where necessary. This is painfully slow, and unsurprisingly some characters get missed (and are consequently nuked by the target platform).
While I could waste time "refining" the VBScript, the long-term aim is to get rid of that entirely, and I'm sure there must be a faster and more accurate way of achieving this, ideally within the XSLT stage itself.
Can anyone suggest any fruitful avenues of investigation?
(edit: I'm not convinced that character maps are the answer - I've looked at them before, and unless I'm mistaken, since my input could conceivably contain any unicode character, I would need to have a map containing all of them except the ones I don't want encoded...)
<xsl:output encoding="us-ascii"/>
Tells the serialiser that it has to produce ASCII-compatible output. That should force it to produce character references for all non-ASCII characters in text content and attribute values. (Should there be non-ASCII in other places like tag or attribute names, serialisation will fail.)
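If you ever need to produce such references yourself outside the XSLT stage (say, to sanity-check the serialiser's output), the substitution is only a couple of lines; a sketch in Python, purely as an illustration of what the output should look like:
def hex_char_refs(text):
    # Replace every non-ASCII character with a hexadecimal character reference
    return "".join(c if ord(c) < 128 else "&#x%X;" % ord(c) for c in text)

print(hex_char_refs("a \u201cquoted\u201d string"))
# a &#x201C;quoted&#x201D; string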
Well, with XSLT 2.0 (which you have tagged your post with) you can use a character map; see http://www.w3.org/TR/xslt20/#character-maps.

URL Shortening: What's the best encoding to use?

I'm adding a feature to my project where we are generating links to internal stuff of our website, and we want these links to be as short as possible, so we'll be making our own "URL Shortener".
I'm wondering what's the best encoding / alphabet to use for the generated short URLs.
This is largely a subjective question, I'd like to know what your opinions are regarding the best approach / trade-off.
Several options I've thought of:
- Digits, uppercase + lowercase (base 62)
- Digits, only lowercase (base 36)
- Base 32 (http://www.crockford.com/wrmg/base32.html)
- linkpot.net (using common short english words)
Of course, the last two are better for uses other than clicking (they are easier to read out or type), and the first two produce shorter URLs, which matters for Twitter.
Also, if I'm going with "clickable-only" URLs, I'd like to make the alphabet as large as possible, adding other symbols.
What symbols can I use in URLs that won't get URL encoded?
What symbols should I use? Could some of these prove problematic? I'm thinking slash and dot, for example.
What do you think?
NOTE: The main target for these URLs is Twitter. Keeping this in mind, we should probably have the largest alphabet possible, since most people will be clicking. However, I'm interested in your experience with people using short URLs in other ways (over the phone, in printed paper, etc). How likely is it this could happen?
NOTE 2: I'm not making "yet another URL shortener", please don't condemn me with downvotes. We are generating short URLs for internal stuff in our site, not allowing anyone to shorten any URL. Imagine Google Maps giving you short URLs when you generate a link to a specific coordinate.
I would go with base-62; it's the shortest. A shortened URL is not meant to be entered manually anyway, so don't worry about case sensitivity.
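For what it's worth, turning a numeric ID into a base-62 string takes only a few lines; a sketch in Python (the alphabet order here is arbitrary):
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n):
    # Encode a non-negative integer ID as a base-62 string
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

print(base62(123456789))   # -> "8m0Kx"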
If these are "clickable-only" URLs, I'd probably go with a base-64 encoding. MIME's base-64 uses a couple of characters you shouldn't use in URLs, but there are enough unreserved safe characters that you can just swap them out. (Also, you don't need the padding that MIME's base-64 uses, since you know when your URL ends.)
Here's a page that discusses one way to do this.
You can look at RFC 2396 (since superseded by RFC 3986) to figure out exactly which characters are safe in URIs if you want to double-check.
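A quick sketch of that swap-and-strip idea in Python, using the standard library's URL-safe base-64 variant (which substitutes '-' and '_' for '+' and '/') and dropping the '=' padding:
import base64, os

# 6 random bytes -> 8 URL-safe characters; rstrip removes any '=' padding
token = base64.urlsafe_b64encode(os.urandom(6)).rstrip(b"=").decode("ascii")
print(token)   # e.g. "qL3v0xZ-"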
I'd be curious to know a little more about the implementation. How will these URLs be "unshortened", or will the internal pages being accessed be saved as shortened URLs? In either case, even if you went with the encoding set of [A-Z] you'd be able to reference 26 * 26 * 26 = 17,576 pages with only 3 characters; how many internal web pages are you talking about?
In general I would lean on what your use case requirements are for picking the right encoding set. Are you planning on having these links available for "uses other than clicking"? What would those uses be, and how do you suspect they'll alter the encoding? (For example, using parts of the URL as a file name on a case-insensitive file system reduces the available character set.)
Here's an informative page on the character set you have available to you when writing a URL.

What is the proper way to URL encode Unicode characters?

I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C.
Some interesting examples:
The heart character.
If I type this into my browser:
http://www.google.com/search?q=♥
Then copy and paste it, I see this URL
http://www.google.com/search?q=%E2%99%A5
which makes it seem like Firefox (or Safari) is doing this.
urllib.quote_plus(x.encode("latin-1"))
'%E2%99%A5'
which makes sense, except for things that can't be encoded in Latin-1, like the triple dot character.
…
If I type the URL
http://www.google.com/search?q=…
into my browser then copy and paste, I get
http://www.google.com/search?q=%E2%80%A6
back. Which seems to be the result of doing
urllib.quote_plus(x.encode("utf-8"))
which makes sense since … can't be encoded with Latin-1.
But then it's not clear to me how the browser knows whether to decode with UTF-8 or Latin-1.
Since this seems to be ambiguous:
In [67]: u"…".encode('utf-8').decode('latin-1')
Out[67]: u'\xe2\x80\xa6'
works, so I don't know how the browser figures out whether to decode that with UTF-8 or Latin-1.
What's the right thing to be doing with the special characters I need to deal with?
I would always encode in UTF-8. From the Wikipedia page on percent encoding:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
It seems like because there were other accepted ways of doing URL encoding in the past, browsers attempt several methods of decoding a URI, but if you're the one doing the encoding you should use UTF-8.
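For example, in Python 3 urllib.parse.quote percent-encodes exactly this way, using UTF-8 by default:
from urllib.parse import quote

print(quote("♥"))   # %E2%99%A5
print(quote("…"))   # %E2%80%A6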
The general rule seems to be that browsers encode form responses according to the content-type of the page the form was served from. The reasoning is that if the server sends "text/xml; charset=iso-8859-1", it expects responses back in the same encoding.
If you're just entering a URL in the URL bar, then the browser doesn't have a base page to work on and therefore just has to guess. So in this case it seems to be doing utf-8 all the time (since both your inputs produced three-octet form values).
The sad truth is that AFAIK there's no standard for what character set the values in a query string, or indeed any characters in the URL, should be interpreted as. At least in the case of values in the query string, there's no reason to suppose that they necessarily do correspond to characters.
It's a known problem that you have to tell your server framework which character set you expect the query string to be encoded as -- for instance, in Tomcat, you have to call request.setCharacterEncoding() (or some similar method) before you call any of the request.getParameter() methods. The dearth of documentation on this subject probably reflects the lack of awareness of the problem amongst many developers. (I regularly ask Java interviewees what the difference between a Reader and an InputStream is, and regularly get blank looks.)
IRI (RFC 3987) is the latest standard that replaces the URI/URL (RFC 3986 and older) standards. URI/URL do not natively support Unicode (well, RFC 3986 adds provisions for future URI/URL-based protocols to support it, but does not update past RFCs). The "%uXXXX" scheme is a non-standard extension to allow Unicode in some situations, but is not universally implemented by everyone. IRI, on the other hand, fully supports Unicode, and requires that text be encoded as UTF-8 before then being percent-encoded.
IRIs do not replace URIs, because only URIs (effectively, ASCII) are permissible in some contexts -- including HTTP.
Instead, you specify an IRI and it gets transformed into a URI when going out on the wire.
The first question is: what are your needs? UTF-8 encoding is a pretty good compromise between taking text created with a cheap editor and supporting a wide variety of languages. As for the browser identifying the encoding, the response (from the web server) should tell the browser the encoding. Still, most browsers will attempt to guess, because this information is either missing or wrong in so many cases. They guess by reading some amount of the result stream to see if there is a character that does not fit the default encoding. Currently all browsers (I did not check this, but it is pretty close to true) use UTF-8 as the default.
So use utf-8 unless you have a compelling reason to use one of the many other encoding schemes.

Should I use accented characters in URLs?

When one creates web content in languages other than English, the problem of search-engine-optimized and user-friendly URLs emerges.
I'm wondering whether it is best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or whether it is better to stick with non-English characters where appropriate, sacrificing the readability of those URLs in less capable environments (e.g. MSIE, view source).
"Exotic" letters could appear anywhere: in titles of documents, in tags, in user names, etc, so they're not always under the complete supervision of the maintainer of the website.
A possible approach of course would be setting up alternate -- unaccented -- URLs as well which would point to the original destination, but I would like to learn your opinions about using accented URLs as primary document identifiers.
There's no ambiguity here: RFC3986 says no, that is, URIs cannot contain unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI; for example, some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appear to be Unicode chars in URLs is really only 'visual sugar' on the part of the browser: if you use a browser that doesn't support IDN or Unicode, the Unicode form won't work, because the underlying definition of URLs simply doesn't support it; so for it to work consistently, you need to percent-encode.
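Both mechanisms are easy to see from Python: the host labels go through IDNA (punycode), while path and query characters become percent-encoded UTF-8:
from urllib.parse import quote

print("café.com".encode("idna").decode("ascii"))   # xn--caf-dma.com
print(quote("Éléphant"))                            # %C3%89l%C3%A9phant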
When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like
http://www.mysite.com/myresume.html
And a rewriting+character translating function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
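A common way to build the unaccented alias is to decompose the string and drop the combining marks; a sketch in Python (my illustration, not the rewriting function referred to above):
import unicodedata

def strip_accents(text):
    # Decompose characters (é -> e + combining acute accent), then drop the marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents("myresumé.html"))   # myresume.html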
Considering URLs with accents often tend to end up looking like this:
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Things should get better, though, as accented URLs now seem to be accepted by web browsers.
The Firefox 3.5 I'm currently using displays the URL the nice way, not with %-escapes, by the way; this seems to be "new" since Firefox 3.0 (see Firefox 3: UTF-8 support in location bar); so it's probably not supported in IE 6, at least -- and there are still far too many people using that one :-(
Maybe URLs with no accents don't look the best they could; but, still, people are used to them, and seem to generally understand them quite well.
You should avoid non-ASCII characters in URLs that may be entered manually in the browser by users. It's OK for embedded links pre-encoded by the server.
We found out that browsers can encode the URL in different ways, and it's very hard to figure out which encoding they use. See my question on this issue:
Handling Character Encoding in URI on Tomcat
There are several areas in a full URL, and each one may have different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (Internationalized Domain Name) rules, and can contain most Unicode characters.
The path (after the first /), the user name, and the password can again contain almost anything. They are escaped (as %XX), but those are just bytes; what encoding those bytes are in is hard to know (it is up to the HTTP server to interpret them).
The parameters part (after the first ?) is passed "as is" (after %XX unescaping) to some server-side application (PHP, ASP, JSP, CGI), and how that interprets the bytes is another story.
It is recommended that the path/user/password/arguments be UTF-8, but that is not mandatory, and not everyone respects it.
So you should definitely allow for non-ASCII (we are not in the 80s anymore), but exactly what you do with that might be tricky. Try to use Unicode and stay away from legacy code pages, tag your content with the proper encoding/charset if you can (using meta in html, language directives for asp/jsp, etc.)