how to validate special characters from utf8 character set in zend php - zend-framework

I am using multilingual characters, use utf8 encoding. Now i need to validate it and avoid special characters while entering it.Is there any way to identify special characters while using multilingual character inputs? what i mean is we can validate special chars like !#%%^&.. while using english.I am looking for the same type of validation.
anybody please help me...
I am using zend php.
thanks in advance.

There's only one "!", "#", "%", "^" and "&" character respectively. There's not an "English !" and a "Spanish !" and an "Indian !" and a "Korean !". They're all the same character. If they're encoded in UTF-8, they're even the same byte as ASCII encoded characters. You can look for them and replace them as before.
There may be characters that look similar, like "!" (fullwidth exclamation mark), but that's not the same character as "!" and hence does not have a special meaning if "!" has a special meaning.

Use the Zend_Validate_Alnum validator, it's unicode ready.

Related

Perl regex presumably removing non ASCII characters

I found a code with regex where it is claimed that it strips the text of any non-ASCII characters.
The code is written in Perl and the part of code that does it is:
$sentence =~ tr/\000-\011\013-\014\016-\037\041-\055\173-\377//d;
I want to understand how this regex works and in order to do this I have used regexr. I found out that \000, \011, \013, \014, \016, \037, \041, \055, \173, \377 mean separate characters as NULL, TAB, VERTICAL TAB ... But I still do not get why "-" symbols are used in the regex. Do they really mean "dash symbol" as shown in regexr or something else? Is this regex really suited for deleting non-ASCII characters?
This isn't really a regex. The dash indicates a character range, like inside a regex character class [a-z].
The expression deletes some ASCII characters, too (mainly whitespace) and spares a range of characters which are not ASCII; the full ASCII range would simply be \000-\177.
To be explicit, the d flag says to delete any characters not between the first pair of slashes. See further the documentation.

Unicode characters xn--ls8h

I want to know how to write the unicode Emoji characters in this form "xn--ls8h" <-- that is the pile of poo emoji unicode character. I had never seen this form, always something like &#5623*; (no asterisk) or something like that... What is this "xn--" form and how do I convert to it? Thanks!
xn-- is the prefix used in the ASCII representation of an Internationalized Domain Name, and ls8h is the Punycode representation of the character.
In Python, Punycode is one of the standard character encodings:
>>> b'ls8h'.decode('punycode')
'\U0001f4a9'

to extract characters of a particular language

how can i extract only the characters in a particular language from a file containing language characters, alphanumeric character english alphabets
This depends on a few factors:
Is the string encoded with UTF-8?
Do you want all non-English characters, including things like symbols and punctuation marks, or only non-symbol characters from written languages?
Do you want to capture characters that are non-English or non-Latin? What I mean is, would you want characters like é and ç or would you only want characters outside of Romantic and Germanic alphabets?
and finally,
What programming language are you wanting to do this in?
Assuming that you are using UTF-8, you don't want basic punctuation but are okay with other symbols, and that you don't want any standard Latin characters but would be okay with accented characters and the like, you could use a string regular expression function in whatever language you are using that searches for all non-Ascii characters. This would elimnate most of what you probably are trying to weed out.
In php it would be:
$string2 = preg_replace('/[^(\x00-\x7F)]*/','', $string1);
However, this would remove line endings, which you may or may not want.

Is there a name / set for characters that can be typed using a standard english keyboard?

Is there a name / set for characters that can be typed using a standard english keyboard?
The phrase I think you are looking for is the Latin alphabet, or the ASCII character set.
Check out ASCII printable characters
(you can also use the term Graphic character)

How can I convert non-ASCII characters encoded in UTF8 to ASCII-equivalent in Perl?

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).
This usually works fine, but every 6 months or so one of the names contains cyrillic, greek or romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I have to follow-up with the customer and ask him for a "latin character version" of his name in order to issue a registration code.
So, is there any Perl module that can detect whether there are such characters and automatically translates them to their closest ASCII representation if necessary?
It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.
I believe you could use Text::Unidecode for this, it is precisely what it tries to do.
In the documentation for Text::Unicode, under "Caveats", it appears that this phrase is incorrect:
Make sure that the input data really is a utf8 string.
UTF-8 is a variable-length encoding, whereas Text::Unidecode only accepts a fixed-length (two-byte) encoding for each character. So that sentence should read:
Make sure that the input data really is a string of two-byte Unicode characters.
This is also referred to as UCS-2.
If you want to convert strings which really are utf8, you would do it like so:
my $decode_status = utf8::decode($input_to_be_converted);
my $converted_string = unidecode ($input_to_be_converted);
If you have to deal with UTF-8 data that are not in the ascii range, your best bet is to change your backend so it doesn't choke on utf-8. How would you go about transliterating kanji signs?
If you get cyrilic text there is no "closest ASCII representation" for many characters.