Lua pattern matching for email address - email

I have the following code:
if not (email:match("[A-Za-z0-9%.]+#[%a%d]+%.[%a%d]+")) then
    print(false)
end
It doesn't currently catch
"test#yahoo,ca" or "test#test1.test2,com"
as an error.
I thought that by limiting the input to %a (letters) and %d (digits), I would by default catch any punctuation, including commas.
But I guess I'm wrong. Or there's something else that I'm just not seeing.
A second pair of eyes would be appreciated.

In the example of "test#test1.test2,com", the pattern matches the substring test#test1.test2 and stops at the following comma. It's not lying: it does match, just not what you expected, because an unanchored pattern may match any part of the string. To fix this, use anchors:
^[A-Za-z0-9%.]+#[%a%d]+%.[%a%d]+$
You can further simplify it to:
^[%w.]+#%w+%.%w+$
in which %w matches an alphanumeric character.
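The effect of the anchors can be checked with a small sketch. It uses Python's re module rather than Lua, so the patterns are hand translations (Lua's %w becomes [A-Za-z0-9], %. becomes \., and ^ and $ become \A and \Z), and the names UNANCHORED, ANCHORED, and is_valid are made up for illustration. The test string is the one from the question.

```python
import re

# Hand translation of the Lua patterns above into Python regex syntax.
# Assumption: %w is treated as ASCII letters and digits only.
UNANCHORED = re.compile(r"[A-Za-z0-9.]+#[A-Za-z0-9]+\.[A-Za-z0-9]+")
ANCHORED = re.compile(r"\A[A-Za-z0-9.]+#[A-Za-z0-9]+\.[A-Za-z0-9]+\Z")


def is_valid(address: str) -> bool:
    """Return True only if the whole string fits the anchored pattern."""
    return ANCHORED.search(address) is not None


if __name__ == "__main__":
    sample = "test#test1.test2,com"
    # Without anchors, search() is satisfied by a matching substring.
    print(UNANCHORED.search(sample).group(0))
    # With anchors, the trailing ",com" makes the whole string fail.
    print(is_valid(sample))
    print(is_valid("test#yahoo.ca"))
```

The unanchored search still succeeds on the bad input because it only needs some substring to fit, which is the behaviour described in the answer.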

I had a hard time finding a true email validation function for Lua.
I couldn't find any that would allow some of the special cases that email addresses allow. Things like + or quotes are actually acceptable in emails.
I wrote my own Lua function that could pass all the tests that are outlined in the spec for email addresses.
http://ohdoylerules.com/snippets/validate-email-with-lua
I also added a bunch of comments, so if there is some strange validation that you want to ignore, just remove the if statement for that particular check.

Related

What characters are allowed in the name of a rule in Drools?

I haven't been able to find in Drools documentation, which characters (beyond alphabet letters) are allowed/disallowed in a rule name in Drools - does anyone know or have a reference?
The only relevant section of Drools doc I've found so far does not specify:
Each rule must have a unique name within the rule package. If you use the same rule name more than once in any DRL file in the package, the rules fail to compile. Always enclose rule names with double quotation marks (rule "rule name") to prevent possible compilation errors, especially if you use spaces in rule names.
I think I have discovered, anecdotally, that some "grouping" characters do not work in rule names (rules named with them seem not to be found or included), or at least not in extension rules (the extended rule seems to work with grouping chars, but not its extension; example below). The grouping chars include parentheses "()", square brackets "[]", and curly braces "{}". Less-than and greater-than signs "<>" do work, though, so for now I'm replacing the former with the latter.
Or are there escape chars for the problematic grouping chars?
Example:
rule "(grouping chars, and commas, work here)"
when
    // conditions LHS
then
end
// removing parentheses, or replacing with < >,
// from below line works
rule "(grouping chars DON'T work here)"
extends "(grouping chars, and commas, work here)"
when
then
    // consequences RHS
end
I haven't discovered either way yet with all other characters (for example, other punctuation; except I have discovered commas "," work). But it would be nice to know ahead of time what characters are allowed.
Theoretically every identifier inside a string should work, but you might have empirically found some combination that is breaking the grammar somehow.
Thanks for the investigation. I've filed a Jira issue, please take a look at it.

Looking for a character that is allowed in Filenames but not allowed in email addresses... Any clue?

I am trying to create multiple html files that are associated with an email address. But since the "#" cannot be used in filenames, and in order to avoid confusion, I am trying to replace it with a character that won't normally exist in an email address.
Does anything come to mind?
Thanks!
Commas and semicolons are not allowed in email addresses, but they are allowed in filenames on most file systems.
I believe '~' is used for this purpose.
According to the link here, almost all ASCII characters are allowed in email addresses so long as the special characters aren't at the beginning or the end.
What characters are allowed in an email address?
Any of , (comma) ; (semi-colon) <> (angle brackets) [] (square brackets) or " (double quote) should work for most cases.
Since these characters are allowed in quoted strings, you could replace the "#" with a sequence that would be invalid such as three double quotes in a row.
According to the RFC
within a quoted string, any ASCII graphic or space is permitted without blackslash-quoting except double-quote and the backslash itself.
You could have an email abc."~~~".def#rst.xyz. But you could not have abc.""".def#rst.xyz; it would have to be abc."\"".def#rst.xyz, with the inner double quote backslash-quoted. So you could safely use """ as a substitute for # in the filename.
However, the RFC also says
While the above definition for Local-part is relatively permissive,
for maximum interoperability, a host that expects to receive mail
SHOULD avoid defining mailboxes where the Local-part requires (or
uses) the Quoted-string form or where the Local-part is case-
sensitive.
With SHOULD meaning "...that
there may exist valid reasons in particular circumstances when the
particular behavior is acceptable or even useful, but the full
implications should be understood and the case carefully weighed
before implementing..." RFC2119
So, although """ will work, are the chances you will see an email with quotes worth the trouble of designing for it? If not, then use one of the single characters.

Exchange Server Transport Rule Failing Emails From .mil

I am using Exchange Server 2013 and have many transport rules set up to filter out emails from most countries outside of the US.
We recently received an email from a military address ending in .mil.
The email was blocked by my transport rules but does not match any of the extensions I have listed. Except for possibly one! I have an extension to block '.il$'. So this should block ALL emails that end with ".il". However, if the transport rules use true regular expression rules, the "." would be a wildcard that matches any single character, including a literal "." itself. Is this the cause of my issue? I do not have a .mil email account to test with or I could check myself. I have added a character escape to my transport rule, making it '\.il$', hoping that it will fix this.
I read everything I can find about the regex rules for Exchange's Transport Rules, and I cannot find anything that mentions you must escape the dot. Maybe this is just a rare issue and they didn't foresee it occurring?
One of the documents I have read: https://technet.microsoft.com/en-us/library/aa997187(v=exchg.141).aspx
Long story short: YES, the dot (.) must be escaped with a backslash (\.). Otherwise it is a wildcard that matches any single character (A-Z, a-z, 0-9, ".", ",", "/", etc.), just as in regular expressions. I assume that Microsoft applies every regular expression rule to transport rules, but do not quote me on that.
This cannot be found in any documentation that I have researched, and every example I have looked at on the web seems to get it wrong as well. The examples always say that ".com$" will block all emails from a sender ending in .com. This is true, because the wildcard dot also matches a literal dot. But it will also block any emails that end in "ecom", for example, which may become an issue if such an extension is ever released.
Sorry for answering my own question, but I want this to be here for future reference since it can't seem to be found anywhere else.
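The difference between the two patterns can be seen in a short sketch. Python's re module stands in for the transport-rule engine here, which is an assumption (nothing above confirms that Exchange behaves identically), and the sender domains and the blocked_by helper are hypothetical.

```python
import re

# Python's re module stands in for the transport-rule engine here;
# this is an assumption, not a statement about Exchange internals.
UNESCAPED = re.compile(r".il$")   # "." matches any single character
ESCAPED = re.compile(r"\.il$")    # "\." matches only a literal dot


def blocked_by(pattern: re.Pattern, sender_domain: str) -> bool:
    """Return True if the sender domain matches the blocking pattern."""
    return pattern.search(sender_domain) is not None


if __name__ == "__main__":
    # Hypothetical sender domains, one ending in .mil and one in .il.
    for domain in ("example.mil", "example.co.il"):
        print(domain, blocked_by(UNESCAPED, domain), blocked_by(ESCAPED, domain))
```

With the unescaped pattern the "m" in ".mil" is consumed by the wildcard, so both domains are blocked; with the escaped pattern only the domain that really ends in ".il" is.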

Can actions in Lex access individual regex groups?

Can actions in Lex access individual regex groups?
(NOTE: I'm guessing not, since the group characters - parentheses - are according to the documentation used to change precedence. But if so, do you recommend an alternative C/C++ scanner generator that can do this? I'm not really hot on writing my own lexical analyzer.)
Example:
Let's say I have this input: foo [tagName attribute="value"] bar and I want to extract the tag using Lex/Flex. I could certainly write this rule:
\[[a-z]+[[:space:]]+[a-z]+=\"[a-z]+\"\] printf("matched %s", yytext);
But let's say I would want to access certain parts of the string, e.g. the attribute but without having to parse yytext again (as the string has already been scanned it doesn't really make sense to scan part of it again). So something like this would be preferable (regex groups):
\[[a-z]+[[:space:]]+[a-z]+=\"([a-z]+)\"\] printf("matched attribute %s", $1);
You can split the rule into start conditions. Something like this:
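%x VALUEPARSE ENDSTATE
%{
#include <string.h>
char string_buf[100];  /* holds the most recent attribute value */
%}
%%
<INITIAL>\[[a-z]+[[:space:]]+[a-z]+=\" { BEGIN(VALUEPARSE); }
<VALUEPARSE>[a-z]+ { /* copy the value text, always leaving room for the terminator */
    strncpy(string_buf, yytext, sizeof(string_buf) - 1);
    string_buf[sizeof(string_buf) - 1] = '\0';
    BEGIN(ENDSTATE); }
<ENDSTATE>\"\] { BEGIN(INITIAL); }
%%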
As for an alternative C/C++ scanner generator: I use Qt's QRegularExpression class for the same kind of task; it makes it very easy to get a regex group after a match.
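For comparison, this is roughly what reading back a group looks like in a regex library that supports capture groups. The sketch uses Python's re module purely as an illustration (it is neither Lex nor Qt), and TAG and attribute_value are hypothetical names. The tag-name class is widened to [A-Za-z] so that the sample input's "tagName" matches.

```python
import re

# Same shape as the Lex rule in the question, with parentheses around
# the attribute value so it can be read back as group 1.
# The tag-name class is [A-Za-z] so that "tagName" in the sample matches.
TAG = re.compile(r'\[[A-Za-z]+\s+[a-z]+="([a-z]+)"\]')


def attribute_value(text: str):
    """Return the attribute value of the first tag in text, or None."""
    match = TAG.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(attribute_value('foo [tagName attribute="value"] bar'))
```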
Certainly at least some forms of them do.
But the default lex/flex downloadable from sourceforge.org do not seem to list it in their documentation, and this example leaves the full string in yytext.
From IBM's LEX documentation for AIX:
(Expression)
Matches the expression in the parentheses.
The () (parentheses) operator is used for grouping and causes the expression within parentheses to be read into the yytext array. A group in parentheses can be used in place of any single character in any other pattern.
Example: (ab|cd+)?(ef)* matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.
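The grouping behaviour in that example can be checked quickly. The sketch below uses Python's re module, not Lex, on the assumption that the two agree for this simple expression; matches_whole is a hypothetical helper.

```python
import re

# The example expression from the IBM documentation quoted above.
GROUPED = re.compile(r"(ab|cd+)?(ef)*")

SHOULD_MATCH = ["abefef", "efefef", "cdef", "cddd"]
SHOULD_NOT_MATCH = ["abc", "abcd", "abcdef"]


def matches_whole(text: str) -> bool:
    """Return True if the entire string is matched by the expression."""
    return GROUPED.fullmatch(text) is not None


if __name__ == "__main__":
    for text in SHOULD_MATCH + SHOULD_NOT_MATCH:
        print(text, matches_whole(text))
```

Here the parentheses only group alternatives and repetition; nothing in the check relies on reading the groups back.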

Are email addresses allowed to contain non-alphanumeric characters?

I'm building a website using Django. The website could have a significant number of users from non-English speaking countries.
I just want to know if there are any technical restrictions on what types of characters an email address could contain.
Are email addresses only allowed to contain English letters, numbers, _, # and .?
Are they allowed to contain non-English alphabets like é or ü?
Are they allowed to contain Chinese or Japanese or other Unicode characters?
An email address consists of two parts: the local part before the # and the domain part after it.
The rules for these parts are different.
For the local part you can use ASCII:
Latin letters A - Z a - z
digits 0 - 9
special characters !#$%&'*+-/=?^_`{|}~
dot ., provided that it is not first or last and does not appear twice in a row
space and "(),:;<>#[] characters are allowed with restrictions (they are only allowed inside a quoted string, a backslash or double-quote must be preceded by a backslash)
Plus since 2012 you can use international characters above U+007F, encoded as UTF-8.
Domain part is more restricted:
Latin letters A - Z a - z
digits 0 - 9
hyphen -, that is not first or last, multiple hyphens in sequence are allowed.
Regex to validate
^(([^<>()\[\]\.,;:\s#\"]+(\.[^<>()\[\]\.,;:\s#\"]+)*)|(\".+\"))#(([^<>()[\]\.,;:\s#\"]+\.)+[^<>()[\]\.,;:\s#\"]{2,})
Hope this saves you some time.
Well, yes. Read (at least) this article from Wikipedia.
I live in Argentina, and emails like ñoñó1234#server.com are allowed here.
The allowed syntax in an email address is described in [RFC 3696][1], and is pretty involved.
The exact rule [for local part; the part before the '#'] is that any ASCII character, including control
characters, may appear quoted, or in a quoted string. When quoting
is needed, the backslash character is used to quote the following
character
[...]
Without quotes, local-parts may consist of any combination of
alphabetic characters, digits, or any of the special characters
! # $ % & ' * + - / = ? ^ _ ` . { | } ~
[...]
Any characters, or combination of bits (as octets), are permitted in
DNS names. However, there is a preferred form that is required by
most applications...
...and so on, in some depth.
[1]: https://www.rfc-editor.org/rfc/rfc3696
Instead of worrying about what email addresses can and can't contain, which you really don't care about, test whether your setup can send them email or not—this is what you really care about! This means actually sending a verification email.
Otherwise, you can't catch a much more common case of accidental typos that stay within any character set you devise. (Quick: is random#mydomain.com a valid address for me to use at your site, or not?) It also avoids unnecessarily and gratuitously alienating any users when you tell them their perfectly valid and correct address is wrong. You still may not be able to process some addresses (this is necessary alienation), as the other answers say: email address processing isn't trivial; but that's something they need to find out if they want to provide you with an email address!
All you should check is that the user supplies some text before an #, some text after it, and the address isn't outrageously long (say 1000 characters). If you want to provide a warning ("this looks like trouble! is there a typo? double-check before continuing"), that's fine, but it shouldn't block the add-email-address process.
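A minimal sketch of that loose check might look like the following. The function name plausible_address and the constants are hypothetical; the 1000-character cutoff is the figure suggested above, and the separator is the character as it is written in this thread's examples.

```python
MAX_LENGTH = 1000  # the "outrageously long" cutoff suggested above
SEPARATOR = "#"    # separator character as written in this thread's examples


def plausible_address(address: str) -> bool:
    """Loose pre-check: text before the separator, text after it, sane length."""
    if len(address) > MAX_LENGTH:
        return False
    # Split on the last separator so the check stays as permissive as possible.
    local, sep, domain = address.rpartition(SEPARATOR)
    return bool(sep) and bool(local) and bool(domain)


if __name__ == "__main__":
    for candidate in ("random#mydomain.com", "no-separator", "#missing-local"):
        print(candidate, plausible_address(candidate))
```

This deliberately accepts almost anything; the verification email remains the real test.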
Of course, if you don't care to ever send email to them, then just take whatever they enter. For example, the address might solely be used for Gravatar, but Gravatar verifies all email addresses anyway.
It is possible to have non-ASCII email addresses, as shown by this RFC: https://www.rfc-editor.org/rfc/rfc3490. However, I think this has not been set up for all countries, and from what I understand only one language code will be allowed for each country. There is also a way to turn such addresses into ASCII, but that won't be trivial.
I have encountered email addresses with single quotes, and not infrequently either. We reject whitespace (though strictly speaking it is allowed), more than one '#' sign and address strings shorter than five characters in total. I believe this solves more problems than it creates, and so far over ten years and several hundred thousand addresses it's worked to reject many garbage addresses. Also there is a trigger to downcase all email addresses on insert or update.
That being said it is impossible to validate an email without a round trip to the owner, but at least we can reject data that is extremely suspect.
I took a look at the regex in pooh17's answer and noticed that it allows the local part to be longer than 64 characters if it is separated by periods (it only checks that the bit before the first period is at most 64 characters). You can use a positive lookahead to improve this. Here's my suggestion if you really want a regex for this:
^(((?=.{1,64}#)[^<>()[\].,;:\s#"]+(\.[^<>()[\].,;:\s#"]+)*)|((?=.{1,66}#)".+"))#(?=.{1,255}$)(\[(IPv6:)?[\dA-Fa-f:.]+]|(?!.*?\.\.)(([^\s!"#$%&'()*+,./:;<=>?#[\]^_`{|}~]+\.?)+[^\s!"#$%&'()*+,./:;<=>?#[\]^_`{|}~]{2,}))$
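To isolate the lookahead idea, here is a much smaller sketch in Python. It is a simplified stand-in, not the full regex above: it only handles unquoted, dot-separated local parts, and LOCAL_LIMITED and local_part_ok are hypothetical names.

```python
import re

# Simplified stand-in for the full regex above: the lookahead caps the
# local part at 64 characters, then the body matches dot-separated atoms.
# Assumption: "#" is the separator, as written in this thread's examples.
LOCAL_LIMITED = re.compile(r"^(?=.{1,64}#)[^\s#.]+(\.[^\s#.]+)*#[^\s#]+$")


def local_part_ok(address: str) -> bool:
    """Return True if the address fits the simplified pattern."""
    return LOCAL_LIMITED.search(address) is not None


if __name__ == "__main__":
    long_dotted = "a" * 40 + "." + "b" * 40 + "#example.com"
    print(local_part_ok("a" * 64 + "#example.com"))
    print(local_part_ok(long_dotted))
```

Because the lookahead counts every character up to the separator, periods no longer let an overlong local part slip through.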
Building on #Matas Vaitkevicius' answer: I've fixed up the regex some more in Python, to have it match valid email addresses as defined on this page and this page of wikipedia, using that awesome regex101 website: https://regex101.com/r/uP2oL7/26
^(([^<>()\[\]\.,;:\s#\"]{1,64}(\.[^<>()\[\]\.,;:\s#\"]+)*)|(\".+\"))#\[*(?!.*?\.\.)(([^<>()[\]\.,;\s#\"]+\.?)+[^<>()[\]\.,;\s#\"]{2,})\]?
Hope this helps someone!:)