Which Unicode characters are allowed in IDN host labels?

Which Unicode characters are allowed in IDN host labels? - unicode

I’m currently working on a “proper” URI validator, and currently it all comes down to hostname validation; the rest isn’t that tricky.
I’m stuck on IDN hostname labels (i.e., containing Unicode; possible punycode encoded strings have been decoded at this point).
My first idea was basically one regex for TLDs which don’t support IDNs and one for those which do. This could perhaps be based on Mozilla’s list of IDN-enabled TLDs. Respectively,
^[a-zA-Z0-9\-]+$ and ^[a-zA-Z0-9\-\p{L}]+$. However, this is not an ideal situation, since every IDN registrar can decide which characters to allow.
What I’m looking for is a proper, consistent, up to date data table of the Unicode characters allowed in various TLDs. It’s beginning to look like I have to find all the data myself at Russian and Chinese registry sites (which is quite difficult).
So before I go trying to gather all this data myself, I wondered whether such a list already exists. Or are there better approaches, best/common practices, etc.? (I want the validation to be as strict as possible.)

IANA maintains a list of all of the codepoints and their status at https://www.iana.org/assignments/idna-tables-6.3.0/idna-tables-6.3.0.xhtml#idna-tables-properties
All of the ones marked PVALID are safe to use. The ones marked CONTEXTO or CONTEXTJ have more rules to follow. Read RFC5892 (IDNA) and RFC6452 (changing the status of a couple of characters) for all of the gory details.

Can't you convert all Unicode domains to punycode and validate that? Since DNS doesn't support real UTF-8 chars anyways, this might be the best solution.

Related

Is there a list of Unicode character categories included/excluded in IDNs?

I want to make a regex that would match an IDN using Unicode categories (.NET engine). The spoofing prevention is not essential for my goals, so confusing characters don't have to be excluded.
I found some lists of individual characters (e.g. https://www.icann.org/en/system/files/files/idna-protocol-2003-2008.txt), however I want character categories so I wouldn't have to update when a new Unicode version comes out.

All the characters mentioned herein "The Unicode Code Points and Internationalized Domain Names for Applications (IDNA)" with the character status "PVALID". Also, the characters with status as "CONTEXTJ" and "CONTEXTO" are valid in certain contexual conditions.
If one wants to dive deeper into the topic, go through the documentation released by Universal Acceptance Steering Group are worth looking at.

How are non-roman Unicode characters encoded in a domain name?

Say I have a URL such as:
http://ほっけがおいしい.com
If I put this in any browser, I insto-magically get:
http://xn--n8jaqhy3b1euj.com/
What is the algorithm to transcode the Unicode characters into mere latin ones? This seems like it should be easily Google-able but I really can't seem to find anything.
I want to reverse it -- given the latter, I want to get the former.
The use case is that I want to pass some information on the iPhone between apps using URL handlers, but I can't guarantee that the content will be Latin characters.

I'm not sure if this covers it all, I've not read through all the RFCs but it might be a good place to start: RFCs Related to IDN

What restrictions should I impose on usernames [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
What restrictions should I impose on usernames? why?
What restrictions should I not impose on usernames? why?
P.S. db is via best-practice PDO so no risk of sql injection
Thanks

OK, so let's assume you're doing all your string-encoding tasks right. You've not got any SQL injections, HTML injections, or places where you're not URL-encoding something you should. So we don't need to worry about characters like "<&%\ being magic in some contexts. And you're using UTF-8 for everything so all of Unicode is in play. What other reasons are there to limit usernames?
To start with, all control characters, for sanity. There is no reason to have characters U+0000 to U+001F or U+007F to U+009F in a username.
Next, deny or normalise unexpected whitespace. You may want to allow a space in a username, but you almost certainly don't want to allow leading spaces, trailing spaces, or more than one space in a row. They may render the same in HTML, but are probably a user error that will confuse.
If you intend to allow that username to be used to login through HTTP Basic Authentication, you must disallow the : character, because the Basic Auth scheme encodes a ‘username:password’ pair with no escaping if there's a colon in the username or password. So at least one of the username and password must have the colon excluded, and it's better that that's the username because restricting people's choice of passwords is a much worse thing than usernames.
For Basic Authentication you may also want to disable all non-ASCII characters, as they are handled differently by different browsers. IE encodes them using the system codepage; Firefox encodes them using ISO-8859-1; Opera encodes them using UTF-8. Users should at least be warned before choosing non-ASCII names if HTTP Auth is going to be available, as actually using them will be very unreliable.
Next consider other Unicode control sequences, things like the bidi overrides and other characters listed there are unsuitable for use in markup. Probably you are going to end up putting them in markup and you don't want someone with an RLO in their name to turn a load of the text in your page backwards.
Also, if you allow Unicode do normalisation on the strings you get. Otherwise someone may have a username with a composed o-umlaut character ö, and wonder why they can't log in on a Mac, which by default would use a separate o character followed by combining umlaut. It's usual to normalise to the composed form NFC on the web. You may also want to do compatibility decompositions by using the form NFKC; this would allow a user Chris to log in from a Japanese keyboard in fullwidth romaji mode typing Ｃｈｒｉｓ. These are general issues it is good to solve for all your webapp's input, but for identifiers like usernames it can be more critical to get right.
Finally, make sure the length is OK to fit in the database without a silent truncation changing the name, especially if you are storing as UTF-8 bytes which you don't want to get snipped halfway through a byte sequence. Username truncations can also be a security issue in general.
If you are using usernames as a unique means of identification, you have much more to worry about: the already-mentioned problem of lookalikes such as Сhris (with a Cyrillic Es С). There are too many of these for you to handle reasonably; either restrict to ASCII or have an additional means of identifying users. (Or don't care, like SO doesn't; when I can easily call myself Chris anyway I have no need to call myself С-hris.)

Depends on many things, for instance, if the users are going to have their own URL, you want to be careful that someone who creates the username "%41llan" doesn't clash with the user called "Allan", while allowing forward-slash may cause problems. Look out for those sorts of constraints.

I've never seen the point in adding restrictions to usernames. If your code is resistant to sql injection attacks then let them put in anything they want.
The only restriction I'd add is a max length one so that it can be stored in a DB table
Let them use any Unicode character in their username.
Adding restrictions on the allowed characters will probably just annoy people using a non-ascii language.

SQL injection protection is a must, but that should probably be in your code, not in username restrictions. Certain characters should definitely be escaped, like \, %, etc.
It will on what kind of site you're running, but I think some obscene word restrictions would make your site look more professional no matter what. If someone sees that people are allowed to go around with "EXPLETIVE" as they're username, your site will look childish. Its like allowing teenagers to run rampid in your book store IMHO. You probably don't need to get much more picky than that, although its completely up to you.
This is slightly off topic, but as another piece of username advice, a great feature of any website is allowing users to change they're username over time. You can just have a number as a primary key, and allowing them to do this can save a lot of whining and people creating new accounts because they wanted to change their username. :D

Why use Unicode if your program is English only?

So I've read Joel's article, and looked through SO, and it seems the only reason to switch from ASCII to Unicode is for internationalization. The company I work for, as a policy, will only release software in English, even though we have customers throughout the world. Since all of our customers are scientists, they have functional enough English to use our software as a non-native speaker. Or so the logic goes. Because of this policy, there is no pressing need to switch to Unicode to support other languages.
However, I'm starting a new project and wanted to use Unicode (because that is what a responsible programmer is supposed to do, right?). In order to do so, we would have to start converting all of the libraries we've written into Unicode. This is no small task.
If internationalization of the programs themselves is not considered a valid reason, how would one justify all the time spent recoding libraries and programs to make the switch to Unicode?

This obviously depends on what your app actually does, but just because you only have an english version in no way means that internationalization is not an issue.
What if I want to store a customer name which uses non-english characters? Or the name of a place in another country?
As an added bonus (since you say you're targeting scientists) is that all sorts of scientific symbols and notiations are supported as part of Unicode.
Ultimately, I find it much easier to be consistent. Unicode behaves the same no matter whose computer you run the app on. Non-unicode means that you use some locale-dependant character set or codepage by default, and so text that looks fine on your computer may be full of garbage characters on someone else's.
Apart from that, you probably don't need to translate all your libraries to Unicode in one go. Write wrappers as needed to convert between Unicode and whichever encoding you use otherwise.
If you use UTF-8 for your Unicode text, you even get the ability to read plain ASCII strings, which should save you some conversion headaches.

They say they will always put it in English now, but you admit you have worldwide clients. A client comes in and says internationalization is a deal breaker, will they really turn them down?
To clarify the point I'm trying to make you say that they will not accept this reasoning, but it is sound.
Always better to be safe than sorry, IMO.

The extended Scientific, Technical and Mathematical character set rules.
Where else can you say ⟦∀c∣c∈Unicode⟧ and similar technical stuff.

Characters beyond the 7-bit ASCII range are useful in English as well. Does anyone using your software even need to write the € sign? Or £? How about distinguishing "résumé" from "resume"?You say it's used by scientists around the world, who may have names like "Jörg" or "Guðmundsdóttir". In a scientific setting, it is useful to talk about wavelengths like λ, units like Å, or angles as Θ, even in English.
Some of these characters, like "ö", "£", and "€" may be available in 8-bit encodings like ISO-8859-1 or Windows-1252, so it may seem like you could just use those encodings and be done with it. The problem is that there are characters outside of those ranges that many people use very frequently, and so lots of existing data is encoded in UTF-8. If your software doesn't understand that when importing data, it may interpret the "£" character in UTF-8 as a sequence of 2 Windows-1252 characters, and render it as "Â£". If this sort of error goes undetected for long enough, you can start to get your data seriously garbled, as multiple passes of misinterpretation alter your data more and more until it becomes unrecoverable.
And it's good to think about these issues early on in the design of your program. Since strings tend to be very low-level concept that are threaded throughout your entire program, with lots of assumptions about how they work implicit in how they are used, it can be very difficult and expensive to add Unicode support to a program later on if you have never even thought about the issue to begin with.
My recommendation is to always use Unicode capable string types and libraries wherever possible, and make sure any tests you have (whether they be unit, integration, regression, or any other sort of tests) that deal with strings try passing some Unicode strings through your system to ensure that they work and come through unscathed.
If you don't handle Unicode, then I would recommend ensuring that all data accepted by the system is 7-bit clean (that is, there are no characters beyond the 7-bit US-ASCII range). This will help avoid problems with incompatibilities between 8-bit legacy encodings like the ISO-8859 family and UTF-8.

Suppose your program allows me to put my name in it, on a form, a dialog, whatever, and my name can't be written with ascii characters... Even though your program is in English, the data may be in other language...

It doesn't matter that your software is not translated, if your users use international characters then you need to support unicode to be able to do correct capitalization, sorting, etc.

If you have no business need to switch to unicode, then don't do it. I'm basing this on the fact that you thought you'd need to change code unrelated to component you already need to change to make it all work with Unicode. If you can make the component/feature you're working on "Unicode ready" without spreading code churn to lots of other components (especially other components without good test coverage) then go ahead and make it unicode ready. But don't go churn your whole codebase without business need.
If the business need arises later, address it then. Otherwise, you aren't going to need it.
People in this thread may suppose scenarios where it becomes a business requirement. Run those scenarios by your product managers before considering them scenarios worth addressing. Make sure they know the cost of addressing them when you ask.

Well for one, your users might know and understand english, but they can still have 'local' names. If you allow your users to do any kind of input to your application, they might want to use characters that are not part of ascii. If you don't support unicode, you will have no way of allowing these names. You'd be forcing your users to adopt a more simple name just because the application isn't smart enough to handle special characters.
Another thing is, even if the standard right now is that the app will only be released in English, you are also blocking the possibility of internationalization with ASCII, adding to the work that needs to be done when the company policy decides that translations are a good thing. Company policy is good, but has also been known to change.

I'd say this attitude expressed naïveté, but I wouldn't be able to spell naïveté in ASCII-only.
ASCII still works for some computer-only codes, but is no good for the façade between machine and user.
Even without the New Yorker's old-fashioned style of coöperation, how would some poor woman called Zoë cope if her employers used such a system?
Alas, she wouldn't even seek other employment, as updating her résumé would be impossible, and she'd have to resume instead. How's she going to explain that to her fiancée?

The company I work for, **as a policy**, will only release software in English, even though we have customers throughout the world.
1 reason only: Policies change, and when they change, they will break your existing code. Period.
Design for evil, and you have a chance of not breaking your code so soon. In this case, use Unicode. Happened to me on a brazilian specific stock-market legacy system.

Many languages (Java [and thus most JVM-based language implementations], C# [and thus most .NET-based language implementatons], Objective C, Python 3, ...) support Unicode strings by preference or even (nearly) exclusively (you have to go out of your way to work with "strings" of bytes rather than of Unicode characters).
If the company you work for ever intends to use any of these languages and platforms, it would therefore be quite advisable to start planning a Unicode-support strategy; a pilot project in particular might not be a bad idea.

That's a really good question. The only reason I can think of that has nothing to do with I18n or non-English text is that Unicode is particularly suited to being what might be called a hub character set. If you think of your system as a hub with its external dependencies as spokes, you want to isolate character encoding conversions to the spokes, so that your hub system works consistently with your chosen encoding. What makes Unicode a ideal character set for the hub of your system is that it acknowledges the existence of other character sets, it defines equivalences between its own characters and characters in those external character sets, and there's an ongoing process where it extends itself to keep up with the innovation and evolution of external character sets. There are all sorts of weird encodings out there: even when the documentation assures you that the external system or library is using plain ASCII it often turns out to be some variant like IBM775 or HPRoman8, and the nice thing about Unicode is that no matter what encoding is thrown at you, there's a good chance that there's a table on unicode.org that defines exactly how to convert that data into Unicode and back out again without losing information. Then again, equivalents of a-z are fairly well-defined in every character set, so if your data really is restricted to the standard English alphabet, ASCII may do just as well as a hub character set.
A decision on encoding is a decision on two things - what set of characters are permitted and how those characters are represented. Unicode permits you to use pretty much any character ever invented, but you may have your own reasons not to want or need such a wide choice. You might still restrict usernames, for example, to combinations of a-z and underscore, maybe because you have to put them into an external LDAP system whose own character set is restricted, maybe because you need to print them out using a font that doesn't cover all of Unicode, maybe because it closes off the security problems opened up by lookalike characters. If you're using something like ASCII or ISO8859-1, the storage/transmission layer implements a lot of those restrictions; with Unicode the storage layer doesn't restrict anything so you might have to implement your own rules at the application layer. This is more work - more programming, more testing, more possible system states. The tradeoff for that extra work is more flexibility, application-level rules being easier to change than system encodings.

The reason to use unicode is to respect proper abstractions in your design.
Just get used to treating the concept of text properly. It is not hard. There's no reason to create a broken design even if your users are English.

Just think of a customer wanting to use names like Schrödingers Cat for files he saved using your software. Or imagine some localized Windows with a translation of My Documents that uses non-ASCII characters. That would be internationalization that has, though you don't support internationalization at all, have effects on your software.
Also, having the option of supporting internationalization later is always a good thing.

Internationalization is so much more than just text in different languages. I bet it's the niche of the future in the IT-world. Heck, it already is. A lot has already been said, just thought I would add a small thing. Even though your customers right now are satisfied with english, that might change in the future. And the longer you wait, the harder it will be to convert your code base. They might even today have problems with e.g. file names or other types of data you save/load in your application.

Unicode is like cooties. Once it "infects" one area, it's usually hard to contain it given interconnectedness of dependencies. Sooner or later, you'll probably have to tie in a library that is unicode compliant and thus will use wchar_t's or the like. Instead of marshaling between character types, it's nice to have consistent strings throughout.
Thus, it's nice to be consistent. Otherwise you'll end up with something similar to the Windows API that has a "A" version and a "W" version for most APIs since they weren't consistent to start with. (And in some cases, Microsoft has abandoned creating "A" versions altogether.)

You haven't said what language you're using. In some languages, changing from ASCII to Unicode may be pretty easy, whereas in others (which don't support Unicode) it might be pretty darn hard.
That said, maybe in your situation you shouldn't support Unicode: you can't think of a compelling reason why you should, and there are some reasons (i.e. your cost to change your existing libraries) which argue against. I mean, perhaps 'ideally' you should but in practice there might be some other, more important or more urgent, thing to spend your time and effort on at the moment.

If program takes text input from the user, it should use unicode; you never know what language the user is going to use.

When using Unicode, it leaves the door open for internationalization if requirements ever change and you are required to use text in other languages than English.
Also, in your new project you could always just write wrappers for the libraries that internally convert between ASCII and Unicode and vice-versa.

Your potential client may already be running a non-unicode application in a language other than English and won't be able to run your program without swichting the windows unicode locale back and forth, which will be a big pain.

Because the internet is overwhelmingly using Unicode. Web pages use unicode. Text files including your customer's documents, and the data on their clipboards, is Unicode.
Secondly Windows, is natively Unicode, and the ANSI APIs are a legacy.
Modern applications should use Unicode where applicable, which is almost everywhere.

Should I use accented characters in URLs?

When one creates web content in languages different than English the problem of search engine optimized and user friendly URLs emerge.
I'm wondering whether it is the best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or it is better to stick to the usage of non-english characters where appropriate sacrificing the readability of those URLs in less advanced environments (e.g. MSIE, view source).
"Exotic" letters could appear anywhere: in titles of documents, in tags, in user names, etc, so they're not always under the complete supervision of the maintainer of the website.
A possible approach of course would be setting up alternate -- unaccented -- URLs as well which would point to the original destination, but I would like to learn your opinions about using accented URLs as primary document identifiers.

There's no ambiguity here: RFC3986 says no, that is, URIs cannot contain unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI, for example some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appears to be unicode chars in URLs is really only 'visual sugar' on the part of the browser: if you use a browser that doesn't support IDN or unicode, the encoded version won't work because the underlying definition of URLs simply doesn't support it, so for it to work consistently, you need to % encode.

When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like
http://www.mysite.com/myresume.html
And a rewriting+character translating function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.

Considering URLs with accents often tend to end up looking like this :
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Though, things should get better, as accented URLs are now accepted by web browsers, it seems.
The firefox 3.5 I'm currently using displays the URL the nice way, and not with %stuff, btw ; this seems to be "new" since firefox 3.0 (see Firefox 3: UTF-8 support in location bar) ; so, not probably not supported in IE 6, at least -- and there are still quite too many people using this one :-(
Maybe URL with no accent are not looking the best that could be ; but, still, people are used to them, and seem to generally understand them quite well.

You should avoid non-ASCII characters in URLs that may be entered in browser manually by users. It's ok for embedded links pre-encoded by server.
We found out that browser can encode the URL in different ways and it's very hard to figure out what encoding it uses. See my question on this issue,
Handling Character Encoding in URI on Tomcat

There are several areas in a full URL, and each one might has different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (International Domain Names) rules, and can contain (most) of the Unicode characters.
The path (after the first /), the user name and the password can again be everything. They are escaped (as %XX), but those are just bytes. What is the encoding of these bytes is difficult to know (is interpreted by the http server).
The parameters part (after the first ?) is passed "as is" (after %XX unescapeing) to some server-side application thing (php, asp, jsp, cgi), and how that interprets the bytes is another story).
It is recommended that the path/user/password/arguments are utf-8, but not mandatory, and not everyone respects that.
So you should definitely allow for non-ASCII (we are not in the 80s anymore), but exactly what you do with that might be tricky. Try to use Unicode and stay away from legacy code pages, tag your content with the proper encoding/charset if you can (using meta in html, language directives for asp/jsp, etc.)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse