Are IDN domain names case-sensitive? - unicode

Some people will reply that domain names are not case-sensitive. In the new Unicode world this is no longer true.
(Source)
I thought one of the steps in the Unicode > Punycode conversion was a "normalisation", which rendered domain names lower case.

For old-fashioned ASCII-based domain names, yes: they have been and continue to be case-insensitive.
To quote RFC 1035, DOMAIN NAMES - IMPLEMENTATION AND SPECIFICATION:
Note that while upper and lower case letters are allowed in domain names, no significance is attached to the case. That is, two names with the same spelling but different case are to be treated as if identical.
For example, all of these represent the same domain:
example.com
Example.com
EXAMPLE.COM
EXampLE.com
In modern DNS, we now have Internationalized Domain Names (IDN), which allow Unicode characters. The problem is that defining upper- and lowercase can be tricky in some languages and character sets beyond ASCII (Unicode is a superset of US-ASCII).
The intent of domain names is to be case-insensitive, but there may be complications with particular characters in particular scripts of particular human languages. So there is no simple YES or NO answer to your question.
If using non-ASCII domain names, you should read:
Internationalized domain name on Wikipedia
Domain Name System (DNS) Case Insensitivity Clarification Official spec (IETF RFC 4343)
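As a rough illustration of the normalisation step the question mentions, here is a minimal Python sketch. It assumes Python's built-in idna codec, which implements the older IDNA 2003 rules (RFC 3490) and case-folds non-ASCII labels before the Punycode step. The helper names to_ascii and same_domain and the sample domain are made up for this example.

```python
def to_ascii(domain: str) -> str:
    """Convert a domain to its ASCII (Punycode) form, lower-cased."""
    # Assumption: the stdlib "idna" codec (IDNA 2003) is good enough here.
    # It leaves pure-ASCII labels untouched, so lower-case the result to
    # apply the classic ASCII case-insensitivity of DNS as well.
    return domain.encode("idna").decode("ascii").lower()


def same_domain(a: str, b: str) -> bool:
    """Compare two domain names the way a case-insensitive lookup would."""
    return to_ascii(a) == to_ascii(b)


print(to_ascii("Bücher.example"))
print(same_domain("Bücher.example", "BÜCHER.EXAMPLE"))
```

Newer IDNA 2008 implementations leave case mapping to the application, so treat this as a demonstration of the idea rather than a rule for every script.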

WRONG: URLs are still case insensitive, even for IDN.
CORRECTION:
The question was about IDN: "Are IDN domain names case-sensitive?"
My initial answer is wrong, and does not clearly answer the question.
It brings URLs into the mix.
The domain name part (IDN) of a URL is case-insensitive.
The other elements might be case-insensitive or not. It depends on many things, and in general is not predictable.
For instance, the path part would normally depend on the OS, or even the file system, hosting the site (on macOS you can format the drive as case-insensitive or not).
But these days some of these paths can be "hooked" to answer RESTful APIs, so it depends on how the "hook" is done.
The same goes for the other elements (user, password, parameters, parameter values).
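A small sketch of what this means in practice, assuming Python's urllib.parse: only the scheme and the host are lower-cased, while user info, path and query are left alone because their case may matter. The function name normalize_url_case and the sample URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url_case(url: str) -> str:
    """Lower-case only the parts of a URL that are case-insensitive."""
    parts = urlsplit(url)
    # Keep any "user:password@" prefix untouched; lower-case host and port.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )


print(normalize_url_case("HTTP://User@Example.COM/Some/Path?Key=Value"))
```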

Related

Underscores vs dashes in HTTP parameter names

I'm familiar with the convention of using hyphens to separate words in URL paths. What about parameter names, such as within a <form>:
<form>
<input name="my_special_field">
</form>
is that better or my-special-field? I've seen Google use underscores in analytics with utm_campaign and other parameter names. Underscores read a little better and allow for the occasional hyphen within the name (field_for_5-16-17). But hyphens are certainly the convention for URL paths.
What's the convention for separating words in an HTTP parameter name?
Well, I think there is none. I hear hyphens perform a bit better SEO-wise. But as long as you are compliant with RFC 3986 (especially section 3.4), everything is okay.
If it really interests you, part of the dilemma is that the query string has never been formalized. There is only a consensus on which characters are supposed to be allowed in it.
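To see that both separators are equally acceptable at the URL level, here is a short sketch using Python's urllib.parse. The parameter names are taken from the question; the values are made up.

```python
from urllib.parse import parse_qs, urlencode

# Hyphens and underscores are both "unreserved" characters in RFC 3986,
# so neither needs percent-encoding in a query string.
params = {"my_special_field": "a", "my-special-field": "b"}
query = urlencode(params)
print(query)

# Both names survive a round trip unchanged.
print(parse_qs(query))
```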

Hyphen, underscore, or camelCase as word delimiter in URIs?

I'm designing an HTTP-based API for an intranet app. I realize it's a pretty small concern in the grand scheme of things, but: should I use hyphens, underscores, or camelCase to delimit words in the URIs?
Here are my initial thoughts:
camelCase
possible issues if server is case-insensitive
seems to have fairly widespread use in query string keys (http://api.example.com?searchQuery=...), but not in other URI parts
Hyphen
more aesthetically pleasing than the other alternatives
seems to be widely used in the path portion of the URI
never seen hyphenated query string key in the wild
possibly better for SEO (this may be a myth)
Underscore
potentially easier for programming languages to handle
several popular APIs (Facebook, Netflix, StackExchange, etc.) are using underscores in all parts of the URI.
I'm leaning towards underscores for everything. The fact that most of the big players are using them is compelling (see https://stackoverflow.com/a/608458/360570).
You should use hyphens in a crawlable web application URL. Why? Because the hyphen separates words (so that a search engine can index the individual words), and a hyphen is not a word character. Underscore is a word character, meaning it should be considered part of a word.
Double-click this in Chrome: camelCase
Double-click this in Chrome: under_score
Double-click this in Chrome: hyphen-ated
See how Chrome (I hear Google makes a search engine too) only thinks one of those is two words?
camelCase and underscore also require the user to use the shift key, whereas hyphenated does not.
So if you should use hyphens in a crawlable web application, why would you bother doing something different in an intranet application? One less thing to remember.
The standard best practice for REST APIs is to have a hyphen, not camelcase or underscores.
This comes from Mark Masse's "REST API Design Rulebook" from Oreilly.
In addition, note that Stack Overflow itself uses hyphens in the URL: .../hyphen-underscore-or-camelcase-as-word-delimiter-in-uris
As does WordPress: http://inventwithpython.com/blog/2012/03/18/how-much-math-do-i-need-to-know-to-program-not-that-much-actually
Short Answer:
lower-cased words with a hyphen as separator
Long Answer:
What is the purpose of a URL?
If pointing to an address is the answer, then a shortened URL is also doing a good job. If we don't make it easy to read and maintain, it won't help developers and maintainers alike. They represent an entity on the server, so they must be named logically.
Google recommends using hyphens
Consider using punctuation in your URLs. The URL http://www.example.com/green-dress.html is much more useful to us than http://www.example.com/greendress.html. We recommend that you use hyphens (-) instead of underscores (_) in your URLs.
Coming from a programming background, camelCase is a popular choice for naming joint words.
But RFC 3986 treats only the scheme and host as case-insensitive; the other parts of a URL are case-sensitive unless the scheme says otherwise.
Since most of a URL is case-sensitive, keeping it low-key (lower-cased) is always safe and considered a good standard. That takes camelCase out of the running.
Source: https://metamug.com/article/rest-api-naming-best-practices.html#word-delimiters
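If you want to enforce the "lower-cased words with a hyphen" rule mechanically, a minimal sketch could look like this. The function name to_spinal_case is made up, and the splitting rules (camelCase boundaries, underscores, whitespace) are an assumption about what counts as a word break.

```python
import re


def to_spinal_case(name: str) -> str:
    """Turn camelCase, snake_case or spaced words into lower-case-hyphenated."""
    # Put a hyphen between a lower-case letter or digit and an upper-case one.
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1-\2", name)
    # Treat runs of underscores and whitespace as a single separator.
    s = re.sub(r"[_\s]+", "-", s)
    return s.lower()


print(to_spinal_case("purchaseOrders"))
print(to_spinal_case("quotation_requests"))
```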
Whilst I recommend hyphens, I shall also postulate an answer that isn't on your list:
Nothing At All
My company's API has URIs like /quotationrequests/, /purchaseorders/ and so on.
Despite you saying it was an intranet app, you listed SEO as a benefit. Google does match the pattern /foobar/ in a URL for a query of ?q=foo+bar
I really hope you do not consider executing a PHP call to any arbitrary string the user passes in to the address bar, as @ServAce85 suggests!
In general, it's not going to have enough of an impact to worry about, particularly since it's an intranet app and not a general-use Internet app. In particular, since it's intranet, SEO isn't a concern, since your intranet shouldn't be accessible to search engines. (and if it is, it isn't an intranet app).
And any framework worth its salt either already has a default way to do this, or is fairly easy to change how it deals with multi-word URL components, so I wouldn't worry about it too much.
That said, here's how I see the various options:
Hyphen
The biggest danger for hyphens is that the same character (typically) is also used for subtraction and numerical negation (ie. minus or negative).
Hyphens feel awkward in URL components. They seem to only make sense at the end of a URL to separate words in the title of an article. Or, for example, the title of a Stack Overflow question that is added to the end of a URL for SEO and user-clarity purposes.
Underscore
Again, they feel wrong in URL components. They break up the flow (and beauty/simplicity) of a URL, since they essentially add a big, heavy apparent space in the middle of a clean, flowing URL.
They tend to blend in with underlines. If you expect your users to copy-paste your URLs into MS Word or other similar text-editing programs, or anywhere else that might pick up on a URL and style it with an underline (like links traditionally are), then you might want to avoid underscores as word separators. Particularly when printed, an underlined URL with underscores tends to look like it has spaces in it instead of underscores.
CamelCase
By far my favorite, since it makes the URLs seem to flow better and doesn't have any of the faults that the previous two options do.
Can be slightly harder to read for people that have a hard time differentiating upper-case from lower-case, but this shouldn't be much of an issue in a URL, because most "words" should be URL components and separated by a / anyways. If you find that you have a URL component that is more than 2 "words" long, you should probably try to find a better name for that concept.
It does have a possible issue with case sensitivity, but most platforms can be adjusted to be either case-sensitive or case-insensitive. And it's only really an issue for 2 cases: a.) humans typing the URL in, and b.) programmers (since we are not human) typing the URL in. Typos are always a problem, regardless of case sensitivity, so this is no different than all one case.
It is recommended to use the spinal-case (which is highlighted by
RFC3986), this case is used by Google, PayPal, and other big
companies.
source:- https://blog.restcase.com/5-basic-rest-api-design-guidelines/
EDIT: Although the highlight on the RFC is nowhere to be found, the recommendation on spinal case is still valid (as already noted in other answers)
We should use hyphens in web page URLs to convince search engines to index each keyword in the URL separately.
here's the best of both worlds.
I also "like" underscores, besides all your positive points about them, there is also a certain old-school style to them.
So what I do is use underscores and simply add a small rewrite rule to your Apache's .htaccess file to re-write all underscores to hyphens.
https://yoast.com/apache-rewrite-dash-underscore/

Is there a term for the string in an email address before the @ character?

When describing a URL, there are well-defined terms for each part of the URL, like protocol, hostname, path and query.
When describing an email address, the part after the @ character is the domain. Is there a universally accepted term for the part before the @ character?
I'd generally refer to it as the mailbox, or the username. I'm not aware of a single canonical descriptor that is used in daily language, but I don't talk about the parts of email addresses much.
Probably the most correct formal name for it is the "local-part", as referenced in the rfc:
https://www.rfc-editor.org/rfc/rfc5321#section-2.3.11
I've also heard the term "stem" used, but might be confusing that with some other terminology.

Are email addresses allowed to contain non-alphanumeric characters?

I'm building a website using Django. The website could have a significant number of users from non-English speaking countries.
I just want to know if there are any technical restrictions on what types of characters an email address could contain.
Are email addresses only allowed to contain English letters, numbers, _, @ and .?
Are they allowed to contain non-English alphabets like é or ü?
Are they allowed to contain Chinese or Japanese or other Unicode characters?
An email address consists of two parts: the local part before the @ and the domain after it.
The rules for these parts are different:
For local part you can use ASCII:
Latin letters A - Z a - z
digits 0 - 9
special characters !#$%&'*+-/=?^_`{|}~
dot ., provided that it is not first or last and does not appear twice in a row
space and "(),:;<>@[] characters are allowed with restrictions (they are only allowed inside a quoted string, and a backslash or double-quote must be preceded by a backslash)
Plus since 2012 you can use international characters above U+007F, encoded as UTF-8.
Domain part is more restricted:
Latin letters A - Z a - z
digits 0 - 9
hyphen -, provided that it is not first or last; multiple hyphens in sequence are allowed.
Regex to validate
^(([^<>()\[\]\.,;:\s@\"]+(\.[^<>()\[\]\.,;:\s@\"]+)*)|(\".+\"))@(([^<>()[\]\.,;:\s@\"]+\.)+[^<>()[\]\.,;:\s@\"]{2,})
Hope this saves you some time.
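For readers who find the regex hard to follow, the unquoted local-part rules listed above can be sketched without one. This is only an illustration of those bullet points, not a full RFC validator: the function name is made up, quoted strings are not handled, and accepting everything above U+007F is a simplifying assumption.

```python
import string

# Letters, digits and the special characters from the list above.
ALLOWED = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")


def is_valid_unquoted_local_part(local: str) -> bool:
    """Check the part before the @ against the unquoted-character rules."""
    # A dot may not be first or last, and may not appear twice in a row.
    if not local or local.startswith(".") or local.endswith(".") or ".." in local:
        return False
    # Assumption: any character above U+007F counts as "international".
    return all(ch in ALLOWED or ch == "." or ord(ch) > 0x7F for ch in local)


print(is_valid_unquoted_local_part("john.doe"))
print(is_valid_unquoted_local_part("john..doe"))
```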
Well, yes. Read (at least) this article from Wikipedia.
I live in Argentina, and addresses like ñoñó1234@server.com are allowed here.
The allowed syntax in an email address is described in [RFC 3696][1], and is pretty involved.
The exact rule [for local part; the part before the '@'] is that any ASCII character, including control
characters, may appear quoted, or in a quoted string. When quoting
is needed, the backslash character is used to quote the following
character
[...]
Without quotes, local-parts may consist of any combination of
alphabetic characters, digits, or any of the special characters
! # $ % & ' * + - / = ? ^ _ ` . { | } ~
[...]
Any characters, or combination of bits (as octets), are permitted in
DNS names. However, there is a preferred form that is required by
most applications...
...and so on, in some depth.
[1]: https://www.rfc-editor.org/rfc/rfc3696
Instead of worrying about what email addresses can and can't contain, which you really don't care about, test whether your setup can send them email or not—this is what you really care about! This means actually sending a verification email.
Otherwise, you can't catch a much more common case of accidental typos that stay within any character set you devise. (Quick: is random@mydomain.com a valid address for me to use at your site, or not?) It also avoids unnecessarily and gratuitously alienating any users when you tell them their perfectly valid and correct address is wrong. You still may not be able to process some addresses (this is necessary alienation), as the other answers say: email address processing isn't trivial; but that's something they need to find out if they want to provide you with an email address!
All you should check is that the user supplies some text before an @, some text after it, and the address isn't outrageously long (say 1000 characters). If you want to provide a warning ("this looks like trouble! is there a typo? double-check before continuing"), that's fine, but it shouldn't block the add-email-address process.
Of course, if you don't care to ever send email to them, then just take whatever they enter. For example, the address might solely be used for Gravatar, but Gravatar verifies all email addresses anyway.
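The minimal check described in this answer (some text before an @, some text after it, not outrageously long) fits in a few lines. This is a sketch only: the function name looks_like_email is made up, and the 1000-character limit is the answer's own rough figure, not a standard.

```python
MAX_LENGTH = 1000  # "outrageously long" threshold suggested above


def looks_like_email(address: str) -> bool:
    """Cheap sanity check; real validation is the verification email."""
    if len(address) > MAX_LENGTH:
        return False
    # Split on the last @ so a quoted local part containing @ still passes.
    local, at, domain = address.rpartition("@")
    return bool(local and at and domain)


print(looks_like_email("random@mydomain.com"))
print(looks_like_email("no-at-sign"))
```

Anything that passes would still get a verification email; anything that fails can be shown as a warning rather than a hard block.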
Non-ASCII characters are possible in the domain part of an address, as described by this RFC on internationalized domain names: https://www.rfc-editor.org/rfc/rfc3490. I think this has not been set up for all countries, and from what I understand only one language code will be allowed for each country. There is also a way to turn such a name into ASCII, but that is not a trivial issue.
I have encountered email addresses with single quotes, and not infrequently either. We reject whitespace (though strictly speaking it is allowed), more than one '@' sign and address strings shorter than five characters in total. I believe this solves more problems than it creates, and so far over ten years and several hundred thousand addresses it's worked to reject many garbage addresses. Also there is a trigger to downcase all email addresses on insert or update.
That being said it is impossible to validate an email without a round trip to the owner, but at least we can reject data that is extremely suspect.
I took a look at the regex in pooh17's answer and noticed it allows the local part to be greater than 64 characters if separated by periods (it just checked the bit before the first period is less than 64 characters). You can make use of positive lookahead to improve this, here's my suggestion if you're really wanting a regex for this
^(((?=.{1,64}@)[^<>()[\].,;:\s@"]+(\.[^<>()[\].,;:\s@"]+)*)|((?=.{1,66}@)".+"))@(?=.{1,255}$)(\[(IPv6:)?[\dA-Fa-f:.]+]|(?!.*?\.\.)(([^\s!"#$%&'()*+,./:;<=>?@[\]^_`{|}~]+\.?)+[^\s!"#$%&'()*+,./:;<=>?@[\]^_`{|}~]{2,}))$
Building on @Matas Vaitkevicius' answer: I've fixed up the regex some more in Python, to have it match valid email addresses as defined on this page and this page of wikipedia, using that awesome regex101 website: https://regex101.com/r/uP2oL7/26
^(([^<>()\[\]\.,;:\s@\"]{1,64}(\.[^<>()\[\]\.,;:\s@\"]+)*)|(\".+\"))@\[*(?!.*?\.\.)(([^<>()[\]\.,;\s@\"]+\.?)+[^<>()[\]\.,;\s@\"]{2,})\]?
Hope this helps someone!:)

How do I upper case an email address?

I expect this should be a pretty easy question. It is in two parts:
Are email addresses case sensitive? (i.e. is foo@bar.com different from Foo@bar.com?)
If so, what is the correct locale to use for capitalising an email address? (i.e. capitalising the email tim@foo.com would be different in the US and Turkish locales)
Judging from the specs the first part can be case sensitive, but normally it's not.
Since it's all ASCII you should be safe using a "naive" uppercase function.
Check out the RFC spec part of the Wikipedia article on e-mail addresses
If you're in for some heavier reading RFC5322 and RFC5321 should be useful too.
The local-part of the email address (i.e. before the @) is case-sensitive in general. From the Wikipedia entry on E-mail address:
The local-part is case sensitive, so
"jsmith@example.com" and
"JSmith@example.com" may be delivered
to different people. This practice is,
however, discouraged by RFC 5321.
However, only the authoritative
mail-servers for a domain may make
that decision.
For the detailed specifications, you may wish to consult the following RFCs:
RFC 5321: Simple Mail Transfer Protocol
RFC 5322: Internet Message Format
RFC 3696: Application Techniques for Checking and Transformation of Names
Domain names are case-insensitive,
so foo@BAR.COM is the same email address as foo@bar.com.
For user names, it depends on the mail server. In the Outlook server my company uses, they are also case-insensitive.
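If you need a canonical form for comparing or storing addresses, one cautious approach is to lower-case only the domain and leave the local part exactly as the user typed it. A minimal sketch, with the function name normalize_email made up for the example:

```python
def normalize_email(address: str) -> str:
    """Lower-case the domain; keep the local part's case as given."""
    # Split on the last @: everything after it is the domain.
    local, at, domain = address.rpartition("@")
    return local + at + domain.lower()


print(normalize_email("Foo@BAR.COM"))
```

Python's str.lower() and str.upper() do not depend on the process locale, so the Turkish dotted/dotless i problem from the question does not arise with these calls.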
Email addresses are not case sensitive.
The local-part of the e-mail address
may use any of these ASCII characters:
Uppercase and lowercase English
letters (a-z, A-Z)
Digits 0 through 9
Characters ! # $ % & ' * + - / = ? ^
_ ` { | } ~
Character . provided that it is not
the first nor last character, nor
may it appear two or more times
consecutively.
Source: Wikipedia