We are using the following validation code to check for valid email address formatting on a web form driven by Lotus Notes:
@If((@ValidateInternetAddress([Address821]; @ThisValue)!=""
| @Contains(@ThisValue; "\"") | @Contains(@ThisValue; "'")
| @Contains(@ThisValue; " ")); "Please include a valid email address."; "");
Currently, if a user enters any of the following inputs, the verification throws the error message:
empty field
" ", ', or / character
the domain portion of the email: "test.com"
only @
However, if a user enters test@test, the form validates this as a valid email address format.
Is this format considered to be a valid "Address821" format? Or is the form validating an incorrect format as a valid email address?
Yes, it technically is valid address syntax, both by past and current standards.
The language in the RFCs has evolved over time:
RFC-821: 3.7. DOMAINS
Domains are a recently introduced concept in the ARPA Internet mail
system. The use of domains changes the address space from a flat
global space of simple character string host names to a hierarchically
structured rooted tree of global addresses. The host name is replaced
by a domain and host designator which is a sequence of domain element
strings separated by periods with the understanding that the domain
elements are ordered from the most specific to the most general.
This isn't very precise. It doesn't explicitly say that there must be more than one element in the domain name, but it doesn't explicitly prohibit it either. But this was obsoleted by:
RFC-2821: 2.3.5 Domain
A domain (or domain name) consists of one or more dot-separated
components.
...
The domain name, as described in this document and in [22], is the entire, fully-qualified name (often referred to as an "FQDN"). A domain name that is not in FQDN form is no more than a local alias. Local aliases MUST NOT appear in any SMTP transaction.
This seems to be saying that it's illegal, but actually it isn't saying that. I'll explain below, but first let's have a look at the draft standard that is intended to obsolete 2821, and which clarifies things a great deal:
RFC-5321 2.3.5 Domain Names
A domain name (or often just a "domain") consists of one or more components, separated by dots if more than one appears. In the case of a top-level domain used by itself in an email address, a single string is used without any dots. This makes the requirement, described in more detail below, that only fully-qualified domain names appear in SMTP transactions on the public Internet, particularly important where top-level domains are involved.
...
The domain name, as described in this document and in RFC 1035 [2], is the entire, fully-qualified name (often referred to as an "FQDN"). A domain name that is not in FQDN form is no more than a local alias. Local aliases MUST NOT appear in any SMTP transaction.
What this makes clear is that no dot is required in a domain name, as long as it is a top level domain.
@ValidateInternetAddress cannot reasonably know whether "test" is a valid top level domain. Even if IBM programmed in the list of approved public TLDs (which IMHO would be a bad idea since it can and does change), you can in fact set up a private TLD called "test" in your own DNS. That's not the same thing as a "local alias", which the standard does prohibit. There's no rule against actual TLDs.
And for that matter, it could even be a public TLD. Theoretically, the owner of a TLD could set up a mail server for the TLD, e.g. President@US or Queen@UK. That's possible but not likely in those cases, but with all the new TLDs coming online, I wouldn't be surprised if some of the registrars are using info@domain.
I guess theoretically @ValidateInternetAddress could make the DNS call to check whether it can resolve "test" as a TLD, but the doc for that function only says that it checks the syntax of the address, and the existence of the TLD is a semantic issue, not a syntax issue.
Related
So would username@gtld be a valid email? As a practical example, Google is purchasing the gTLD "gmail". Obviously they can associate A records with that, permitting you to just type http://gmail/ to access the site. But are there any specs that prohibit them from associating MX records with that as well, allowing folks to give out an alternative address username@gmail?
I ask because I want to make sure our email validator is future proof and technically correct.
I think I answered my own question. Section 3.4.1 of RFC 5322, which defines a valid email address, states:
addr-spec = local-part "@" domain
[...]
domain = dot-atom / domain-literal / obs-domain
[...]
The domain portion identifies the point to which the mail is delivered. In the dot-atom form, this is interpreted as an Internet domain name (either a host name or a mail exchanger name) as described in [RFC1034], [RFC1035], and [RFC1123]. In the domain-literal form, the domain is interpreted as the literal Internet address of the particular host.
"gmail" would be a valid domain and host name, and thus someone@gtld is a valid email address.
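To make the grammar above concrete, here is a minimal Python sketch of a syntax-only check for the dot-atom form. It is an illustration, not a full RFC 5322 parser: the function name `is_dot_atom_addr_spec` is made up for this example, and quoted local parts, domain literals, comments and the obsolete forms are deliberately not handled.

```python
import re

# atext characters as listed in RFC 5322 section 3.2.3
_ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# dot-atom-text: one or more atext runs separated by single dots
_DOT_ATOM = re.compile(rf"{_ATEXT}+(?:\.{_ATEXT}+)*")


def is_dot_atom_addr_spec(address: str) -> bool:
    """Return True if address is local-part "@" domain with both parts in dot-atom form.

    Hypothetical helper: quoted strings, domain literals and comments
    are out of scope and will be rejected.
    """
    local, sep, domain = address.rpartition("@")
    if not sep:
        return False  # no "@" at all
    return bool(_DOT_ATOM.fullmatch(local)) and bool(_DOT_ATOM.fullmatch(domain))
```

Under this sketch, a dotless domain such as the one in someone@gmail passes, because the grammar only requires one or more dot-separated atoms. Whether the domain actually exists is a separate, semantic question.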
There's a pub in my town whereby, if you sign up to their newsletter using their website and provide a "unique" email address, you get a free drink. On a whim, I decided to sign up a second time using myemail+one@gmail.com. It let me. I'm now sitting on a nice comfy pile of free drink vouchers.
This got me thinking about a system we have here, where the email address is considered the unique identifier. Checking the code, sure enough, if we were offering vouchers in our business, someone else would be sitting pretty.
The basic, stab-in-the-dark fix is to check for the "+" character and ignore everything after it (up to the @), and compare using that. But I am unsure if this was the intent for the + character. Would that work?
Secondly, are there any other caveats that would allow a user to sign up multiple times with a seemingly different email address, but which actually would always end up in the same mailbox?
This question is language-agnostic.
While using a plus sign as an e-mail address alias is a known feature of Gmail, other mailers either do not allow it or use a minus sign instead. '+' is a legitimate character to be used as part of an email address according to the RFC.
The use of '.' is also a gray area. john.doe@gmail.com and johndoe@gmail.com look different, but both deliver to the same mailbox.
In order to validate the uniqueness of an email address you will have to prepare a rule base for your application, keep it up to date and still expect surprises...
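As a rough illustration of such a rule base, here is a Python sketch that reduces an address to a comparison key. Everything in it is an assumption for the example: the name `canonical_key`, the two rule sets, and the decision to lowercase the local part. Only the Gmail behaviour described above is encoded; any other provider would need its own entry.

```python
# Hypothetical rule base: domains where "+tag" is an alias suffix and
# domains where dots in the local part are ignored. Only the Gmail
# behaviour mentioned in this thread is listed; extend as needed.
PLUS_TAG_DOMAINS = {"gmail.com"}
DOT_INSENSITIVE_DOMAINS = {"gmail.com"}


def canonical_key(address: str) -> str:
    """Return a key for duplicate detection (not an address to send mail to)."""
    local, _, domain = address.strip().rpartition("@")
    domain = domain.lower()  # domain names are case-insensitive
    local = local.lower()    # assumption: the provider ignores case in the local part
    if domain in PLUS_TAG_DOMAINS:
        local = local.split("+", 1)[0]  # drop everything from the first "+"
    if domain in DOT_INSENSITIVE_DOMAINS:
        local = local.replace(".", "")
    return f"{local}@{domain}"
```

Store the key next to the address the user typed and put the uniqueness constraint on the key. Mail should still go to the original address, since the rules are guesses about provider behaviour.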
I'm trying to test a new email validation function I've written, based on this one, but with some minor adjustments.
From a large set of valid and invalid entries, the function finds just one false negative - an address which has an IPv6 address instead of a domain.
user@[IPv6:2001:db8:1ff::a0b:dbd0]
The source is this Wikipedia page: Email Addresses
However, System.Net.IPAddress fails to parse IPv6:2001:db8:1ff::a0b:dbd0, and I can't find any reference in RFC 4291 to an "IPv6:" prefix.
Obviously, IPv6:2001:db8:1ff::a0b:dbd0 is not a valid IPv6 address, but is it valid in an email address? Or is Wikipedia wrong?
Should the actual email be user@[2001:db8:1ff::a0b:dbd0]? Anyone know?
You are right to look at RFC4291 for the IPv6 address format. However, for SMTP (and thus for any other email software handling addresses) you should also look at Address Literals in RFC5321.
The one you want is probably "IPv6-address-literal".
For those still looking for this, the IPv6: prefix tag is required.
https://www.rfc-editor.org/rfc/rfc5321#section-4.1.3
For IPv6 and other forms of addressing that might eventually be standardized, the form consists of a standardized "tag" that identifies the address syntax, a colon, and the address itself ...
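In practice this means the tag has to be stripped before the rest is handed to a general-purpose IP address parser. The question used System.Net.IPAddress; the sketch below shows the same idea in Python with the standard `ipaddress` module. The function name `parse_address_literal` is made up for this example, and the module's notion of a valid address may differ from the RFC 5321 grammar in corner cases.

```python
import ipaddress

_IPV6_TAG = "IPv6:"


def parse_address_literal(domain: str):
    """Parse a bracketed address literal such as "[IPv6:2001:db8::1]".

    Returns an IPv4Address or IPv6Address, or None if the text is not a
    recognised literal. Hypothetical helper for illustration only.
    """
    if not (domain.startswith("[") and domain.endswith("]")):
        return None
    body = domain[1:-1]
    try:
        # ABNF string literals are case-insensitive, so compare the tag loosely.
        if body[: len(_IPV6_TAG)].lower() == _IPV6_TAG.lower():
            return ipaddress.IPv6Address(body[len(_IPV6_TAG):])
        # No tag: only the untagged IPv4 form is accepted.
        return ipaddress.IPv4Address(body)
    except ipaddress.AddressValueError:
        return None
```

With this approach an untagged IPv6 address inside the brackets is rejected, which matches the statement above that the prefix is required.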
If the user enters, say, google, then I need to complete it to http://www.google.com, i.e. add the missing parts.
The user may enter anything: google.in, www.google, and so on.
The goal is to complete the rest of the URL, because we then check the URL using a regex like this:
NSString *urlRegEx = @"(http|https)://((\\w)*|([0-9]*)|([-|_])*)+([\\.|/]((\\w)*|([0-9]*)|([-|_])*))+";
to decide whether the given URL is valid or not.
Don't forget the third level domain names with suffixes like .co.uk, .co.us, .com.co, etc.
A fully qualified domain name must have at least one dot. If it doesn't have at least one dot, then you might add .com to the end.
If it does have at least one dot, then it gets more complicated. .google could be a top level domain in the future, though it isn't now. Perhaps you want to keep a white list of all "valid" first and second level domain names. You evaluate the entered domain name from the right until it stops matching domains from your list. The remainder is the "registered" domain name and any sub domains. If you don't find any matches, then add .com.
Alternatively, rather than parsing the domain name, you could just try to resolve it, and if it doesn't resolve, then add .com and try again.
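The "try to resolve it, then add .com" idea can be sketched as follows. The question is about Objective-C, so take this Python version only as an outline of the logic. The names `resolves` and `complete_url` are made up for the example, and the default scheme is an assumption. Note that a dotless name may resolve through a local DNS search domain, so the result can depend on the machine.

```python
import socket
from typing import Callable


def resolves(host: str) -> bool:
    """Return True if the system resolver finds an address for host."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def complete_url(text: str,
                 can_resolve: Callable[[str], bool] = resolves,
                 scheme: str = "http") -> str:
    """Add a scheme, and ".com" if the host only resolves with it."""
    text = text.strip()
    if "://" in text:
        return text  # already has a scheme; leave untouched
    host, slash, rest = text.partition("/")
    # Only append ".com" when the name fails as typed but works with the suffix.
    if not can_resolve(host) and can_resolve(host + ".com"):
        host += ".com"
    return f"{scheme}://{host}{slash}{rest}"
```

Passing the resolver in as a parameter keeps the function testable without network access, for example `complete_url("google", lambda h: h == "google.com")`.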
I think I know what you're asking. However, other than the regex, how are you actually validating that the URL is valid? It seems as though you're making an incorrect assumption that all URLs follow a common syntax. As an example, http://www.www.extra-www.org/ is a valid URL, so if you apply your regex (as I understand your intent) the user would get to http://www.extra-www.org, which may not be the same site as the one the user wanted (even though in this case it is, because it forwards). Another example is http://www.www.com... if the user enters "www", your regex will kill it. A final example is a site that doesn't have its DNS registered WITH the "www": your regex will incorrectly add the "www" piece.
EDIT: what happens if the user needs https as opposed to http and only enters "google"?
Haven't tried this one myself yet, but an NSDataDetector for the type NSTextCheckingTypeLink should be able to do you a decent job.
I am working on a more complete email validator in Java and came across an interesting ability to embed comments within an email address, in both the "username" and "address" portions.
The following snippet from http://www.dominicsayers.com/isemail/ has this to say about comments within an email.
Comments are text surrounded by parentheses (like these). These are OK but don't form part of the address. In other words mail sent to first.last@example.com will go to the same place as first(a).last(b)@example(c).com(d). Strange but true.
Do emails like this really exist?
Is it possible to form hosts like this?
I tried entering a URL such as "google(ignore).com", but Firefox and some other browsers failed. Is this because it is wrong, or because they don't know about host name comments?
That syntax -- comments within an addr-spec -- was indeed permissible by the original email RFC, RFC 822. However, the placement of comments like you'd like to use them was deprecated when that RFC was revised by RFC 2822... 10 years ago. It's still marked as obsolete in the current version, RFC 5322. There's no good excuse for emitting anything using that syntax.
Address parsers are supposed to be backwards-compatible in order to cover all conceivable cases, including 10-years-deprecated bits like the one you're trying to take advantage of here. But I'll bet that many, many receiving mail agents will fail to properly parse out those comments. So even though you may have technically found a loophole via the "obsolete addressing" section of the RFC, it's not likely to do you much good in practice.
As for HTTP, the syntax rules aren't the same as email syntax rules. As you're seeing, the comment section from RFC 822 isn't applicable.
Just because you can do it in the spec, doesn't mean that you should. For example, Gmail will not accept that comment format for address.
Second, (to your last point), paren-comments being allowed in email addresses doesn't mean that they work for URLs.
Finally, my advice: I'd tailor the completeness of your validator to your requirements. If you're writing a new MTA (mail transfer agent), you'll probably have to do it all. If you're writing a validator for a user input, keep it simple:
look for one @,
make sure you have stuff before (username) and after (domain name),
make sure you have a "dot" in the hostname string,
[extra credit] do a DNS lookup of the hostname to make sure it resolves.
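The first three checks in that list fit in a few lines. This is a minimal Python sketch under stated assumptions: the name `looks_like_email` is made up, and rejecting empty labels such as "user@.com" is an extra rule that the list does not spell out. The DNS lookup is left out because it needs network access.

```python
def looks_like_email(address: str) -> bool:
    """Pragmatic form-input check: one "@", text on both sides, a dot in the domain."""
    if address.count("@") != 1:
        return False
    local, domain = address.split("@")
    if not local or not domain:
        return False
    labels = domain.split(".")
    # At least one dot, and no empty label (extra rule, see lead-in).
    return len(labels) >= 2 and all(labels)
```

A check like this rejects a dotless domain such as test@test, which is syntactically valid according to the RFCs but is almost always a typo on a public web form.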