perl gethostbyname when given IP - perl

What happens if a malformed IP address is passed to the gethostbyname function in Perl? One of our scripts was behaving strangely when given a malformed IP (say 1.1.1). While debugging, I found that gethostbyname returns a value when given 1.1.1, for example. Any thoughts on this? In my opinion, gethostbyname should return undef, right?

In the beginning of IPv4, before CIDR, addresses were considered to be composed of a network part and a host part. The parts could be written sort of independently in dotted decimal form, and didn't need to be fully decomposed into bytes. So 1.1 is host 1 on network 1, equivalent to 1.0.0.1 or you can also write it as one big 32-bit number: 16777217. There was a time when people used URLs like http://16777127/ to show how clever they were. That was ruined when spammers started doing it to fool filters.
Somehow, when I ping 1.1.1, it goes to 1.1.0.1. I would have guessed 1.0.1.1. I'm not sure what the rule is to decide how it's broken up exactly.
These old forms are not widely supported (or even understood) anymore, but they haven't been completely rooted out from all the tools and libraries.
P.S. on my first attempt to submit this answer, stackoverflow said:
Your post contains a link to the invalid domain '16777127'.
Please correct it by specifying a full domain or wrapping it in a code block.
Which is sort of what I meant by "not widely supported".

Numeric IPv4 addresses can be written as 1, 2, 3 or 4 numeric components. Each non-final component represents 8 bits (1 octet), and the final component represents as many bits as are required to complete the full 32-bit address. Thus, the following all represent the local loopback address:
2130706433
127.1
127.0.1
127.0.0.1
Each component itself may be written in decimal, hex or octal; thus the following all also encode the same address:
0x7f000001
127.0x01
0177.0.1
0x7f.0.0.1
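The rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the code of any particular resolver library; the function names `parse_legacy_ipv4` and `_parse_component` are made up for this example, and real implementations may differ in edge cases.

```python
def _parse_component(part):
    """Parse one component written in decimal, hex (0x...) or octal (leading 0)."""
    if not part or not (part.isascii() and part.isalnum()):
        raise ValueError(f"bad component: {part!r}")
    if part[:2].lower() == "0x":
        return int(part[2:], 16)
    if part[0] == "0" and len(part) > 1:
        return int(part, 8)
    return int(part, 10)


def parse_legacy_ipv4(text):
    """Expand a 1- to 4-component numeric address into dotted-quad form.

    Hypothetical helper: each non-final component is one octet, and the
    final component fills all of the remaining bits.
    """
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"expected 1 to 4 components: {text!r}")
    values = [_parse_component(p) for p in parts]
    *leading, last = values
    if any(v > 0xFF for v in leading):
        raise ValueError(f"non-final component exceeds 8 bits: {text!r}")
    remaining_bits = 32 - 8 * len(leading)
    if last >= 1 << remaining_bits:
        raise ValueError(f"final component exceeds {remaining_bits} bits: {text!r}")
    address = 0
    for v in leading:
        address = (address << 8) | v
    address = (address << remaining_bits) | last
    return ".".join(str((address >> shift) & 0xFF) for shift in (24, 16, 8, 0))


if __name__ == "__main__":
    for sample in ("1.1.1", "1.1", "127.1", "0x7f000001", "0177.0.1"):
        print(sample, "->", parse_legacy_ipv4(sample))
```

Under this rule, 1.1.1 has two leading octets (1 and 1) and a final component of 1 spread over the remaining 16 bits, which is why it lands on 1.1.0.1 rather than 1.0.1.1.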

Related

Determining UTF encoding STRICTLY based off input and output

I have written a small "post processor" for our CNC machines: given a handful of inputs, it generates g-code that is more efficient than the machines' built-in functions.
The problem is that different machines in our shop use different encodings, so anything above 127 (that is, outside the ASCII range) comes out garbled.
I would love to output the correct encoding for each machine... But I have no (real) way to find out what that is; my best option is "guess and check". For example... If I have a comment in my code that says:
(Wall: 5°)
On machine A that becomes (Wall: 5°).
On machine B that becomes (Wall: 5B0).
I can feed these machines any character, and I can see how they interpret it.
How, if at all, can I use that information (and only that information) to determine the encoding scheme?
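One way to approach the "guess and check" described above is to automate the guessing: encode the text you sent under each candidate encoding, decode the resulting bytes under each candidate encoding, and see which pair reproduces what the machine shows. The sketch below assumes the machine simply decodes the bytes with some standard codec; the function name `matching_encoding_pairs` and the candidate list are hypothetical, and a machine that prints hex codes or uses a proprietary character table will not match anything.

```python
def matching_encoding_pairs(sent_text, observed_text, encodings):
    """Return (sender_encoding, display_encoding) pairs that explain the garbling.

    Hypothetical helper: assumes the file was written with one codec and the
    machine displays the raw bytes using another.
    """
    matches = []
    for sent_enc in encodings:
        try:
            raw = sent_text.encode(sent_enc)
        except UnicodeEncodeError:
            continue  # this codec cannot represent the text at all
        for shown_enc in encodings:
            try:
                shown = raw.decode(shown_enc)
            except UnicodeDecodeError:
                continue  # these bytes are invalid in the display codec
            if shown == observed_text:
                matches.append((sent_enc, shown_enc))
    return matches


if __name__ == "__main__":
    candidates = ["utf-8", "latin-1", "cp437"]
    # Illustrative observation only: a degree sign that shows up as two characters.
    print(matching_encoding_pairs("5\u00b0", "5\u00c2\u00b0", candidates))
```

Feeding several different non-ASCII characters through the same check narrows the candidates further, since many single-byte code pages agree on some characters and differ on others.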
Thanks for any tips, advice, links, or reading you can provide (yes, I've read this).

What really is the maximum length of email address local part?

According to Wikipedia (https://en.wikipedia.org/wiki/Email_address) and http://isemail.info/about the maximal length of the local part of an email address is 64 characters.
However, I just received email from this address:
reply+0032ff332e028331fad75f7549ee52d90483c7aa70138a3192cf00000001123b88e492a169ce06aab82c@reply.github.com
Its local part is 90 characters and it is deemed invalid by isemail.info, however, it's a perfectly valid email address. I can send email to it and it is received by the other party.
So what gives: is the maximal length of the local part of an email address 64 characters or not? If not, what is the maximal length then?
The maximum length is 64 octets.
Yet as MSalters says in comments, just because something is done doesn't mean it's legal.
Some systems accept longer local parts; others don't. In this case, GitHub says that you should send an e-mail to them at that address. It's bad practice on GitHub's part: their servers might accept a longer e-mail address, but they forget that the client might be more pedantic and refuse to send (or worse, truncate the e-mail address).
They probably consider reply as the real local part and use +0032ff33... as an identifier, but all in all, as you point out, it makes their local part much (too?) long.
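As a rough illustration of the 64-octet limit, the following Python sketch measures the local part of an address in octets. The helper names are invented for this example, and counting UTF-8 bytes is an assumption that only matters for non-ASCII local parts.

```python
MAX_LOCAL_PART_OCTETS = 64  # the limit quoted in the answer above


def local_part_octets(address):
    """Return the length of the local part in octets (hypothetical helper)."""
    local, sep, _domain = address.rpartition("@")
    if not sep:
        raise ValueError(f"no @ in address: {address!r}")
    return len(local.encode("utf-8"))


def local_part_within_limit(address):
    """True if the local part fits within the 64-octet limit."""
    return local_part_octets(address) <= MAX_LOCAL_PART_OCTETS


if __name__ == "__main__":
    github = (
        "reply+0032ff332e028331fad75f7549ee52d90483c7aa70138a3192cf"
        "00000001123b88e492a169ce06aab82c@reply.github.com"
    )
    print(local_part_octets(github), local_part_within_limit(github))
```

A strict client applying this check would refuse the GitHub reply address from the question, which is exactly the interoperability risk described above.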

Why does MSDN advise use of unicode functions over ansi functions for winsock calls?

MSDN advises:
The getaddrinfo function is the ANSI version of a function that provides protocol-independent translation from host name to address. The Unicode version of this function is GetAddrInfoW. Developers are encouraged to use the GetAddrInfoW Unicode function rather than the getaddrinfo ANSI function.
Encouragement is fine and all, but is there any reason to do this? I mean, can hostnames contain non-ansi characters? If so, is this a feature exclusive to IPv6, or can IPv4 hostnames also contain unicode characters?
Microsoft is just trying to get everyone away from ANSI in general, that's all. They recommend using Unicode for everything, especially since Windows itself is based on Unicode (and has been for a long, long time). But yes, as Jason said, hostnames can contain Unicode characters via Punycode encoding, which is backwards compatible with the existing ASCII-based DNS system.
DNS supports what are known as "internationalized domain names" via an encoding scheme called Punycode. So yes, hostnames can contain Unicode characters. It has nothing to do with IPv4 or IPv6, since those are network-layer protocols, separate from DNS naming.
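To make the Punycode point concrete, here is a small Python sketch that converts a Unicode hostname to its ASCII form and back. It uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules, so results may differ from newer IDNA 2008 tooling for some names; the wrapper function names and the example hostname are made up for illustration.

```python
def to_ascii_hostname(hostname):
    """Convert a Unicode hostname to its ASCII (Punycode-based) form."""
    return hostname.encode("idna").decode("ascii")


def to_unicode_hostname(ascii_hostname):
    """Convert an ASCII hostname with xn-- labels back to Unicode."""
    return ascii_hostname.encode("ascii").decode("idna")


if __name__ == "__main__":
    ascii_name = to_ascii_hostname("bücher.example")
    print(ascii_name)
    print(to_unicode_hostname(ascii_name))
```

The ASCII form is what actually travels through DNS, which is why the same name works regardless of whether the resolved address is IPv4 or IPv6.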

Are email addresses allowed to contain non-alphanumeric characters?

I'm building a website using Django. The website could have a significant number of users from non-English speaking countries.
I just want to know if there are any technical restrictions on what types of characters an email address could contain.
Are email addresses only allowed to contain English letters, numbers, _, @ and .?
Are they allowed to contain non-English alphabets like é or ü?
Are they allowed to contain Chinese or Japanese or other Unicode characters?
An email address consists of two parts: the local part before the @ and the domain part after it.
The rules for these parts are different:
For local part you can use ASCII:
Latin letters A - Z a - z
digits 0 - 9
special characters !#$%&'*+-/=?^_`{|}~
dot ., provided that it is not first or last and does not appear twice in a row
space and "(),:;<>@[] characters are allowed with restrictions (they are only allowed inside a quoted string, and a backslash or double-quote must be preceded by a backslash)
Plus since 2012 you can use international characters above U+007F, encoded as UTF-8.
Domain part is more restricted:
Latin letters A - Z a - z
digits 0 - 9
hyphen -, that is not first or last, multiple hyphens in sequence are allowed.
Regex to validate
^(([^<>()\[\]\.,;:\s@\"]+(\.[^<>()\[\]\.,;:\s@\"]+)*)|(\".+\"))@(([^<>()[\]\.,;:\s@\"]+\.)+[^<>()[\]\.,;:\s@\"]{2,})
Hope this saves you some time.
Well, yes. Read (at least) this article from Wikipedia.
I live in Argentina, and addresses like ñoñó1234@server.com are allowed here.
The allowed syntax in an email address is described in [RFC 3696][1], and is pretty involved.
The exact rule [for local part; the part before the '@'] is that any ASCII character, including control
characters, may appear quoted, or in a quoted string. When quoting
is needed, the backslash character is used to quote the following
character
[...]
Without quotes, local-parts may consist of any combination of
alphabetic characters, digits, or any of the special characters
! # $ % & ' * + - / = ? ^ _ ` . { | } ~
[...]
Any characters, or combination of bits (as octets), are permitted in
DNS names. However, there is a preferred form that is required by
most applications...
...and so on, in some depth.
[1]: https://www.rfc-editor.org/rfc/rfc3696
Instead of worrying about what email addresses can and can't contain, which you really don't care about, test whether your setup can send them email or not—this is what you really care about! This means actually sending a verification email.
Otherwise, you can't catch a much more common case of accidental typos that stay within any character set you devise. (Quick: is random@mydomain.com a valid address for me to use at your site, or not?) It also avoids unnecessarily and gratuitously alienating any users when you tell them their perfectly valid and correct address is wrong. You still may not be able to process some addresses (this is necessary alienation), as the other answers say: email address processing isn't trivial; but that's something they need to find out if they want to provide you with an email address!
All you should check is that the user supplies some text before an @, some text after it, and the address isn't outrageously long (say 1000 characters). If you want to provide a warning ("this looks like trouble! is there a typo? double-check before continuing"), that's fine, but it shouldn't block the add-email-address process.
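The minimal check described above might look like the following Python sketch. The function name `plausible_email` and the `max_length` parameter are illustrative assumptions; the 1000-character cap is the figure suggested in the answer, and the real validation remains the verification email.

```python
def plausible_email(address, max_length=1000):
    """Cheap sanity check: some text, an @, some more text, not absurdly long.

    Hypothetical helper: this deliberately does not validate the address;
    only a round-trip verification email can do that.
    """
    if len(address) > max_length:
        return False
    # Split on the last @ so quoted local parts containing @ are not rejected.
    local, sep, domain = address.rpartition("@")
    return bool(sep and local and domain)


if __name__ == "__main__":
    for candidate in ("random@mydomain.com", "no-at-sign", "@nolocal", "nodomain@"):
        print(candidate, plausible_email(candidate))
```

Splitting on the last @ rather than the first is a design choice: it stays permissive toward unusual but legal local parts instead of rejecting them up front.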
Of course, if you don't care to ever send email to them, then just take whatever they enter. For example, the address might solely be used for Gravatar, but Gravatar verifies all email addresses anyway.
Non-ASCII email addresses are possible, as shown by this RFC: https://www.rfc-editor.org/rfc/rfc3490. However, I think this has not been set up for all countries, and from what I understand only one language code will be allowed for each country. There is also a way to turn such addresses into ASCII, but that won't be trivial.
I have encountered email addresses with single quotes, and not infrequently either. We reject whitespace (though strictly speaking it is allowed), more than one '@' sign and address strings shorter than five characters in total. I believe this solves more problems than it creates, and so far, over ten years and several hundred thousand addresses, it has worked to reject many garbage addresses. There is also a trigger to downcase all email addresses on insert or update.
That being said it is impossible to validate an email without a round trip to the owner, but at least we can reject data that is extremely suspect.
I took a look at the regex in pooh17's answer and noticed it allows the local part to be greater than 64 characters if separated by periods (it just checked the bit before the first period is less than 64 characters). You can make use of positive lookahead to improve this, here's my suggestion if you're really wanting a regex for this
^(((?=.{1,64}@)[^<>()[\].,;:\s@"]+(\.[^<>()[\].,;:\s@"]+)*)|((?=.{1,66}@)".+"))@(?=.{1,255}$)(\[(IPv6:)?[\dA-Fa-f:.]+]|(?!.*?\.\.)(([^\s!"#$%&'()*+,./:;<=>?@[\]^_`{|}~]+\.?)+[^\s!"#$%&'()*+,./:;<=>?@[\]^_`{|}~]{2,}))$
Building on @Matas Vaitkevicius' answer: I've fixed up the regex some more in Python, to have it match valid email addresses as defined on this page and this page of Wikipedia, using that awesome regex101 website: https://regex101.com/r/uP2oL7/26
^(([^<>()\[\]\.,;:\s@\"]{1,64}(\.[^<>()\[\]\.,;:\s@\"]+)*)|(\".+\"))@\[*(?!.*?\.\.)(([^<>()[\]\.,;\s@\"]+\.?)+[^<>()[\]\.,;\s@\"]{2,})\]?
Hope this helps someone!:)

How can I compare international phone numbers in Perl?

Are there any modules that can help me compare phone numbers for equality?
For example, the following three numbers are equivalent (when dialling from the UK)
+44 (0)181 1234123
00441811234123
0181 1234123
Is there a perl module that can tell me this?
The closest I can see on CPAN is Number::Phone, which is an active project and supports UK phone numbers. It should work for the specific example you give. A few other countries are supported as well.
If you've got phone numbers for other countries things could get more difficult due to local formatting idiosyncrasies.
Supposing that the code you need doesn't exist, and you have to write it yourself, there are two basic operations that you need to do:
Apply context. This is where you take the location of the dialing phone into account. If the call isn't international, you supply the country code; if the call isn't long-distance, you provide an area code, etc. This requires some rules per-locale, of course.
Normalize. Remove meaningless spaces and punctuation, convert the international dialing prefix ("011" in NANPA, "00" in most of the rest of the world, but occasionally many weirder things) to the standard "+".
After completing those two steps properly, all inputs that are actually equivalent numbers should give identical output strings.
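The two steps above can be sketched as follows. This is a Python illustration rather than Perl, and it is only a sketch under stated assumptions: the function name `normalize_phone` and its parameters are hypothetical, the defaults describe a caller in the UK (country code 44, international prefix 00, trunk prefix 0), and area-code context for purely local numbers is not handled.

```python
import re


def normalize_phone(raw, country_code="44", intl_prefix="00", trunk_prefix="0"):
    """Normalize a dialled number to '+<country code><national number>'.

    Hypothetical helper: the defaults are assumptions for a UK caller;
    other locales need their own prefixes.
    """
    text = raw.strip()
    # Drop a parenthesised trunk prefix such as "(0)" written after a country code.
    text = re.sub(r"\(\s*" + re.escape(trunk_prefix) + r"\s*\)", "", text)
    has_plus = text.startswith("+")
    # Normalize: remove spaces and punctuation, keeping digits only.
    digits = re.sub(r"\D", "", text)
    if has_plus:
        return "+" + digits
    # Apply context: interpret the prefixes as dialled from the caller's locale.
    if digits.startswith(intl_prefix):
        return "+" + digits[len(intl_prefix):]
    if trunk_prefix and digits.startswith(trunk_prefix):
        return "+" + country_code + digits[len(trunk_prefix):]
    return "+" + country_code + digits


if __name__ == "__main__":
    for number in ("+44 (0)181 1234123", "00441811234123", "0181 1234123"):
        print(number, "->", normalize_phone(number))
```

With the UK defaults, the three example numbers from the question reduce to the same string, so a plain string comparison is enough to test them for equality.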