I need to process phone numbers into canonical form for a particular application that doesn't have any built-in parsing for free-form entry. The only case I am stuck on is 00 800 international freephone numbers, which by definition have no Country Code. Is there a canonical form for this?
(I appreciate this is not a programming problem per se, but I have failed to find an answer through Googling and am hoping some of you have handled this one.)
I came across this cute little symbol today:
🔮
I couldn't figure out what it was, so I searched for reverse lookup services and character maps that might be able to reveal a name to no avail. I know, however, that Windows' character map program knows the names of symbols:
How does Windows accomplish this? How might I, but a lowly programmer, divine this same knowledge? What encoding system does Unicode use to tie a symbol to its description?
This information comes from the Unicode Character Database.
Specifically, the code points and their names (and other info like the category of a code point) are defined in UnicodeData.txt.
Many programming languages expose this information in their standard library, e.g. the unicodedata module in Python.
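As a minimal sketch of the Python route mentioned above: unicodedata.name and unicodedata.lookup are real standard-library functions, while the describe helper is just a name made up for this example.

```python
import unicodedata


def describe(char: str) -> str:
    """Return the code point and Unicode name of a single character."""
    # unicodedata.name() takes a fallback for code points that have no name.
    name = unicodedata.name(char, "<unnamed>")
    return f"U+{ord(char):04X} {name}"


# The symbol from the question, written as an escape so the source stays ASCII.
print(describe("\U0001F52E"))  # U+1F52E CRYSTAL BALL

# The lookup also works in the other direction, from name to character.
assert unicodedata.lookup("CRYSTAL BALL") == "\U0001F52E"
```

The names come from the copy of the Unicode Character Database bundled with the interpreter, so very recent characters may be missing on older Python versions.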
If you just want to know the glyph name, head on over to CodePoints (or Graphemica or probably any one of a dozen other sites) and do a search on it. I'm not sure which lookup services you used "to no avail" but those two have no issues in locating it.
Doing so with 🔮 will lead you to code point U+1F52E, which has the descriptive name "CRYSTAL BALL", along with all sorts of other useful information about it.
I have to write some code that translates numbers from English to French (from 1 to 999) using the DCG formalism of Prolog. Do I have to write two separate sets of grammar rules (one for English and one for French) or not?
Can this piece of code found on the internet help me?
https://groups.google.com/forum/?fromgroups=#!topic/comp.lang.prolog/ZF8p5cs4q0U
Please Help.
You can do it in two steps (English to number, number to French) or you can try doing it directly from English to French. The two-step option is more generic (i.e. it would let you convert both ways, and you can easily extend it to support more languages), and you already have working code available (the one in the linked topic), so I suggest following this route.
Just remember that, the same way a DCG rule allows you to parse some text, it allows you to generate it too. As the linked topic shows:
?- phrase(number(N), [one, hundred, and, twenty, seven]).
N = 127
?- phrase(number(127), L).
L = [one, hundred, and, twenty, seven]
If you replace the second query with phrase(number_fr(127), L), where number_fr is the French grammar you implement yourself, you get the number you parsed earlier expressed in French.
In my app before I send a string off I need to work out if the text entered in the textbox is a UK Postcode. I don't have the regex ability to work that out for myself and after searching around I can't seem to work it out! Just wondered if anyone has done a similar thing in the past?
Or if anyone can point me in the right direction I would be most appreciative!
Tom
Wikipedia has a good section about this. Basically the answer depends on what sort of pathological cases you want to handle. For example:
An alternative short regular expression from BS7666 Schema is:
[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}
The above expressions fail to exclude many non-existent area codes (such as A, AA, Z and ZY).
Basically, read that section of Wikipedia thoroughly and decide what you need.
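As a rough sketch of how the BS7666 expression quoted above could be applied, here it is in Python (the original question doesn't say which language the app uses, so that is an assumption, and looks_like_postcode is a made-up helper name). It only checks the shape of the string, not whether the postcode actually exists.

```python
import re

# The short BS7666 expression quoted above, unchanged.
BS7666 = re.compile(r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}")


def looks_like_postcode(text: str) -> bool:
    """Return True if text has the shape of a UK postcode per BS7666."""
    # Upper-case the input and collapse runs of whitespace to one space,
    # because the pattern expects capitals and exactly one space.
    candidate = " ".join(text.upper().split())
    # fullmatch() anchors at both ends, so trailing junk is rejected.
    return BS7666.fullmatch(candidate) is not None


print(looks_like_postcode("se1 9qz"))    # True
print(looks_like_postcode("ZZ00 00ZZ"))  # False
print(looks_like_postcode("SE19QZ"))     # False, the pattern requires the space
```

If you also want to accept input typed without the space, you would have to either loosen the pattern or re-insert the space before the last three characters first.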
For post codes written with a space (e.g. SE1 9QZ) I use: (it's not failed me yet ;-) )
^(([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))$
If the space is optional (e.g. SE19QZ or SE1 9QZ), then:
^(([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) {0,1}[0-9][A-Za-z]{2}))$
You can match most post codes with this regex:
/[A-Z]{1,2}[0-9]{1,2}\s?[0-9]{1,2}[A-Z]{1,2}/i
Which means: A-Z one or two times ({1,2}), followed by 0-9 one or two times, followed by an optional whitespace character (\s?), followed by 0-9 one or two times, followed by A-Z one or two times. The trailing i makes the match case-insensitive.
This will match some false positives, as I can make up post codes like ZZ00 00ZZ, but to accurately match all post codes, the only way is to buy post code data from the post office - which is quite expensive. You could also download free post code databases, but they do not have 100% coverage.
Hope this helps.
Wikipedia has some regexes for UK Postcodes: http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation
Are there any modules that can help me compare phone numbers for equality?
For example, the following three numbers are equivalent (when dialling from the UK)
+44 (0)181 1234123
00441811234123
0181 1234123
Is there a Perl module that can tell me this?
The closest I can see on CPAN is Number::Phone which is an active project, and supports UK Phone numbers. It should work for the specific example you give. A few countries are supported.
If you've got phone numbers for other countries things could get more difficult due to local formatting idiosyncrasies.
Supposing that the code you need doesn't exist, and you have to write it yourself, there are two basic operations that you need to do:
Apply context. This is where you take the location of the dialing phone into account. If the call isn't international, you supply the country code; if the call isn't long-distance, you provide an area code, etc. This requires some rules per-locale, of course.
Normalize. Remove meaningless spaces and punctuation, and convert the international dialing prefix ("011" in the NANP, "00" in most of the rest of the world, but occasionally much weirder things) to the standard "+".
After completing those two steps properly, all inputs that are actually equivalent numbers should give identical output strings.
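A minimal sketch of those two operations, hard-coded for a caller in the UK and written in Python rather than Perl purely for illustration. The function name normalize_uk and the handling of the "(0)" convention are assumptions made for this example; a real implementation needs per-country rules, which is what a module like Number::Phone is for.

```python
import re


def normalize_uk(number: str) -> str:
    """Reduce a number as dialled from the UK to +<country code><digits>."""
    # "(0)" marks a trunk prefix that is dropped when the country code is used.
    s = number.replace("(0)", "")
    # Normalize: keep only digits and the plus sign.
    s = re.sub(r"[^\d+]", "", s)
    # Apply context: "00" is the UK international dialing prefix.
    if s.startswith("00"):
        return "+" + s[2:]
    # A single leading 0 is the national trunk prefix, so supply +44.
    if s.startswith("0"):
        return "+44" + s[1:]
    # Anything else (already "+...", or a bare local number) is left as-is.
    return s


examples = ["+44 (0)181 1234123", "00441811234123", "0181 1234123"]
print({normalize_uk(n) for n in examples})  # {'+441811234123'}
```

Two numbers are then considered equal when their normalized strings are equal. Local numbers dialled without an area code are not handled here, because that needs the caller's area code as extra context.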
I have a device with some documentation on how to send it text. It uses 0x00-0x7F to send 'special' characters like accented characters, euro signs, ...
I am guessing they copied an existing code page and made some changes, but I have no idea how to figure out what code page is closest to the one in my documentation.
In theory, this should be easy to do. For example, they map Á to 0x41, so if I could find some way to go through all code pages and find the ones that have this character on that position, it would be a piece of cake.
However, all I can find on the internet are links to code page dumps just like the one I'm looking at, or software that uses heuristics to read text and guess the most likely code page. Surely someone out there has made it possible to look up which code page one is looking at?
If it uses 0x00 to 0x7F for the "special" characters, how does it encode the regular ASCII characters?
In most of the charsets that support the character Á, its codepoint is 193 (0xC1). If you subtract 128 from that, you get 65 (0x41). Maybe your "codepage" is just the upper half of one of the standard charsets like ISO-8859-1 or windows-1252, with the high-order bit set to zero instead of one (that is, subtracting 128 from each one).
If that's the case, I would expect to find a flag you can set to tell it whether the next bunch of codepoints should be converted using the "upper" or "lower" encoding. I don't know of any system that uses that scheme, but it's the most sensible explanation I can come up with for the situation you describe.
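To test that guess against more than one character, you could loop over a list of candidate charsets and keep those that have the documented character at the documented position plus 128. A rough Python sketch; the codecs_matching helper and the list of candidates are made up for this example, and the "high bit cleared" mapping is only the hypothesis from above, not something the device documentation confirms.

```python
def codecs_matching(char, device_byte, candidates):
    """Return the candidate codecs that have char at device_byte + 0x80."""
    matches = []
    for name in candidates:
        try:
            # Set the high-order bit again, then decode with the candidate.
            decoded = bytes([device_byte | 0x80]).decode(name)
        except UnicodeDecodeError:
            continue  # this position is undefined in that charset
        if decoded == char:
            matches.append(name)
    return matches


# Arbitrary selection of codecs that ship with Python.
CANDIDATES = ["latin-1", "cp1252", "iso8859-15", "cp437", "cp850"]

# The one data point from the question: the device maps 0x41 to Á.
print(codecs_matching("\u00c1", 0x41, CANDIDATES))
```

Running the same check for every character in the device's table and intersecting the results should narrow the list down quickly, or show that no standard charset fits.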
There is no way to auto-detect the codepage without additional information. Below the display layer it’s just bytes and all bytes are created equal. There’s no way to say “I’m a 0x41 from this and that codepage”, there’s only “I’m 0x41. Display me!”
What endian is the system? Perhaps you're flipping bit orders?
In most codepages, 0x41 is just the normal "A"; I don't think any standard codepage has "Á" in that position. The device could expect a control character somewhere before the A that adds the accent, or it could use a non-standard codepage.
I don't see any use in knowing the "closest codepage". You just need to use the docs you got with the device.
Your last sentence is puzzling. What do you mean by "possible to look up what code page one is looking at"?
If you include your whole codepage, people here on SO could be more helpful and give you more insight into this issue. A single data point (0x41 = Á) doesn't help much.
Somewhat random idea, but if you can replicate a significant amount of the text off the device, you could try running it through something like the detect function in http://chardet.feedparser.org/.