remove all non-ASCII from string - unicode

My question is general: I want to ask whether there are any special modules in programming languages, or a ready-made program, that would allow me to accomplish my task.
Is there any convenient way (other than writing my own function with multiple replace statements) to automatically substitute all national characters with their corresponding Latin letters? For example, I want to substitute æ with ae, ä with a, ę with e, and so on.
If it's impossible to prepare a universal function, is there any ready-made function in currently used programming languages that will remove such characters, simply by limiting the allowed characters to those from the standard Latin alphabet?

There is unidecode, which is available for several languages (Perl, Python, Java). I've previously written about it in this answer.
>>> from unidecode import unidecode
>>> unidecode(u"İstanbul")
'Istanbul'
>>> unidecode(u"\u5317\u4EB0")
'Bei Jing '

Transliteration is the word you're looking for :)
In PHP, that is achieved through iconv:
http://php.net/manual/en/function.iconv.php
As others have said, it's probably best to keep everything in Unicode (UTF-8 or UTF-16) if possible.

I do not know what language you are using, but in PHP you can do
$text = preg_replace("/[^a-zA-Z0-9]+/", "", $text);
You can change the regexp to allow more or fewer characters.
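For comparison, here is the same whitelist idea as a minimal Python sketch (strip_non_alnum is just an illustrative name); note that it removes national characters rather than transliterating them:
import re

def strip_non_alnum(text):
    # Keep only ASCII letters and digits; drop everything else.
    return re.sub(r"[^a-zA-Z0-9]+", "", text)

print(strip_non_alnum("æble ä ę"))  # -> 'ble', not 'aeble a e'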

In PHP, you can scan the files in a directory:
<?php
$dir = '';
if ($handle = opendir($dir)) {
    while (false !== ($file = readdir($handle))) {
        if ($file[0] == '.' || is_dir($dir.'/'.$file)) {
            continue;
        }
        // functions here
    }
    closedir($handle);
}
?>
Then rename them all with this regex, using preg_replace (the older ereg_replace is deprecated and was removed in PHP 7):
$newname = preg_replace("/[^A-Za-z0-9]/", "", $oldname);
You would set $oldname to the filename of each file in the directory and put the renaming logic where // functions here is, so the loop goes through each file in the directory and renames it according to the regex.
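A rough Python equivalent of the same loop, in case you are not tied to PHP (sanitize_dir is a made-up name, and unlike the regex above it keeps the dot so file extensions survive):
import os
import re

def sanitize_dir(dirpath):
    for name in os.listdir(dirpath):
        full = os.path.join(dirpath, name)
        if name.startswith('.') or os.path.isdir(full):
            continue  # skip dotfiles and subdirectories, like the PHP loop
        newname = re.sub(r'[^A-Za-z0-9.]', '', name)
        if newname and newname != name:
            os.rename(full, os.path.join(dirpath, newname))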

If your input is Unicode, you can apply the Unicode normalization form NFKD to approximate what you want. Python has this built in. After normalization, you can strip the accents, which will have been separated from the letters they belong to.
>>> import unicodedata
>>> s = u"äçéì" # u"" makes a Unicode string in Python 2.x
>>> unicodedata.normalize("NFKD", s).encode("ascii", "ignore")
'acei'
This won't work for æ, though.
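One workaround, sketched below, is to substitute such ligature-like characters by hand before normalizing; the LIGATURES table here is illustrative, not exhaustive:
import unicodedata

LIGATURES = {'æ': 'ae', 'Æ': 'AE', 'ø': 'o', 'Ø': 'O', 'ß': 'ss'}

def to_ascii(s):
    s = ''.join(LIGATURES.get(ch, ch) for ch in s)  # handle chars NFKD leaves alone
    s = unicodedata.normalize('NFKD', s)            # split letters from combining accents
    return s.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('æçéì'))  # -> 'aecei'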

How can I get powershell to write þ (lowercase thorn) to a file as 0xfe?

I am attempting to write a PS script that builds and executes a script file for Rocket Software's SBClient. The scripting language uses two different delimiters, þ (lowercase thorn) (0xFE) and ü (u with umlaut) (0xFC).
Each of these gets written to the file as two characters. þ is written as Ã¾ (A with tilde, then ¾) (0xC3 0xBE). ü gets written as Ã¼ (A with tilde, then ¼) (0xC3 0xBC).
I have tried multiple different methods to write the file and it comes up the same way every time. I'm sure this is because these are extended ASCII characters.
Is there a way to write these to a text file with their proper two-character hex codes without converting the string to hex and writing a binary file? If not, what is the best way to convert the string to hex for this? I have seen a few different examples in other languages, but nothing really solid in PS.
It looks like I could convert the string to an array of bytes and then use io.file::WriteAllBytes() to write the file. I was just hoping there was a better way to do this.
Here is the pertinent code...
$ScriptFileContent = 'TUSCRIPTþþþ[Company Name] Logon Please:þ{enter}üPST{enter}þ2þ'
$ScriptFilePath = ([Environment]::GetFolderPath("ApplicationData")).ToString() + "\Rocket Software\SBClient\tuscript\NT"
out-file -filepath $ScriptFilePath -inputobject $ScriptFileContent -encoding ascii
Solution
$enc = [System.Text.Encoding]::GetEncoding("iso-8859-1")  # Latin1 maps þ to 0xFE and ü to 0xFC
$ScriptFileContent = 'TUSCRIPTþþþ[Company Name] Logon Please:þ{enter}üPST{enter}þ2þ'
$ScriptFileContent = $enc.GetBytes($ScriptFileContent)  # string -> Latin1 byte array
$ScriptFilePath = ([Environment]::GetFolderPath("ApplicationData")).ToString() + "\Rocket Software\SBClient\tuscript\NT"
[io.file]::WriteAllBytes($ScriptFilePath, $ScriptFileContent)  # write the bytes verbatim, no re-encoding
Thanks for your help!
What you're seeing is your characters outside ASCII being encoded as UTF-8. You have two choices here:
either you use [System.Text.Encoding]::GetEncoding("iso-8859-1") to write your file as Latin1,
or you use the FileStream.WriteByte() method on the result of io.file::Open to write the 0xFE and 0xFC bytes directly yourself (arguably less overkill, but that depends on how you write the rest of the data).
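For what it's worth, the Latin1 approach is a one-liner in other languages too; here is a Python sketch (the file name 'NT' is assumed from the question):
content = 'TUSCRIPTþþþ[Company Name] Logon Please:þ{enter}üPST{enter}þ2þ'
with open('NT', 'w', encoding='latin-1') as f:
    f.write(content)  # þ comes out as the single byte 0xFE, ü as 0xFC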

preg_match a keyword variable against a list of latin and non-latin chars keywords in a local UTF-8 encoded file

I have a bad words filter that uses a list of keywords saved in a local UTF-8 encoded file. This file includes both Latin and non-Latin chars (mostly English and Arabic). Everything works as expected with Latin keywords, but when the variable includes non-Latin chars, the matching does not seem to recognize these existing keywords.
How do I go about matching both Latin and non-Latin keywords?
The badwords.txt file includes one word per line as in this example
bad
nasty
racist
سفالة
وساخة
جنس
Code used for matching:
$badwords = file_get_contents("badwords.txt");
$badtemp = explode("\n", $badwords);
$badwords = array_unique($badtemp);
$hasBadword = 0;
$query = strtolower($query);
foreach ($badwords as $key => $val) {
    if (!empty($val)) {
        $val = trim($val);
        $regexp = "/\b" . $val . "\b/i";
        if (preg_match($regexp, $query))
            $hasBadword = 1;
        if ($hasBadword == 1) {
            // Bad word detected, die...
        }
    }
}
I've read that iconv, multibyte functions (mbstring), and the /u modifier might help with this, and I tried a few things but do not seem to get it right. Any help would be much appreciated in resolving this and having it match both Latin and non-Latin keywords.
The problem seems to relate to recognizing word boundaries; the \b construct is apparently not “Unicode aware.” This is what the answers to question php regex word boundary matching in utf-8 seem to suggest. I was able to reproduce the problem even with text containing Latin letters like “é” when \b was used. And the problem seems to disappear (i.e., Arabic words get correctly recognized) when I set
$wstart = '(^|[^\p{L}])';
$wend = '([^\p{L}]|$)';
and modify the regexp as follows:
$regexp = "/" . $wstart . $val . $wend . "/iu";
Some string functions in PHP cannot be used on UTF-8 strings; this is supposedly going to be fixed in PHP 6, but for now you need to be careful what you do with a string.
It looks like strtolower() is one of them: you need to use mb_strtolower($query, 'UTF-8'). If that doesn't fix it, you'll need to read through the code, find every point where you process $query or badwords.txt, and check the documentation for UTF-8 bugs.
As far as I know, preg_match() is ok with UTF-8 strings, but there are some features disabled by default to improve performance. I don't think you need any of them.
Please also double check that badwords.txt is a UTF-8 file and that $query contains a valid UTF-8 string (if it's coming from the browser, you set it with a <meta> tag).
If you're trying to debug UTF-8 text, remember most web browsers do not default to the UTF-8 text encoding, so any PHP variable you print out for debugging will not be displayed correctly by the browser, unless you select UTF-8 (in my browser, with View -> Encoding -> Unicode).
You shouldn't need to use iconv or any of the other conversion APIs; most of them will simply replace all of the non-Latin characters with Latin ones, which is obviously not what you want.

Filtering microsoft 1252 characters out of an ASCII text file opened in utf8 mode in Perl

I have a reasonably sized flat-file database of text documents, mostly saved in 8859 format, which have been collected through a web form (using Perl scripts). Up until recently I was negotiating the common 1252 characters (curly quotes, apostrophes, etc.) with a simple set of regexes:
$line=~s/\x91/\&\#8216\;/g; # smart apostrophe left
$line=~s/\x92/\&\#8217\;/g; # smart apostrophe right
... etc.
However, since I decided I ought to be going Unicode and have converted all my scripts to read in and output UTF-8 (which works a treat for all new material), the regexes for these (existing) 1252 characters no longer work, and my Perl HTML output emits the four characters '\x92', '\x93', etc. literally. At least, that's how it appears in a browser in UTF-8 mode; downloading (FTP, not HTTP) and opening in a text editor (TextPad) is different: a single undefined character remains. And opening the output file in Firefox's default (no Content-Type header) 8859 mode renders the correct character.
The new utf8 pragmas at the start of the script are:
use CGI qw(-utf8);
use open IO => ':utf8';
I understand this is due to UTF-8 mode making the characters double-byte instead of single-byte, and that it applies to chars in the 0x80 to 0xff range, having read the Wikibooks article on this; however, I was none the wiser as to how to filter them. Ideally, I know I ought to resave all the documents in UTF-8 (since the flat-file database now contains a mixture of 8859 and UTF-8), but I will need some kind of filter in the first place if I'm going to do that anyway.
And I could be wrong about the two-byte internal storage, since the article did seem to imply that Perl handles things very differently depending on the circumstances.
If anybody could provide me with a regex solution, I would be very grateful. Or some other method. I have been tearing my hair out for weeks on this with various attempts and failed hacking. There are only about six 1252 characters that commonly need replacing, and with a filter method I could resave the whole flippin' lot in UTF-8 and forget there ever was a 1252...
Encoding::FixLatin was specifically written to help fix data broken in the same manner as yours.
Ikegami already mentioned the Encoding::FixLatin module.
Another way to do it, if you know that each string will be either UTF-8 or CP1252, but not a mixture of both, is to read it as a binary string and do:
unless ( utf8::decode($string) ) {
    require Encode;
    $string = Encode::decode(cp1252 => $string);
}
Compared to Encoding::FixLatin, this has two small advantages: a slightly lower chance of misinterpreting CP1252 text as UTF-8 (because the entire string must be valid UTF-8) and the possibility of replacing CP1252 with some other fallback encoding. A corresponding disadvantage is that this code could fall back to CP1252 on strings that are not entirely valid UTF-8 for some other reason, such as because they were truncated in the middle of a multi-byte character.
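The same try-UTF-8-then-fall-back strategy as a hedged Python sketch (fix_latin is a made-up name here, not the Encoding::FixLatin API):
def fix_latin(raw):
    try:
        return raw.decode('utf-8')  # succeeds only if the whole string is valid UTF-8
    except UnicodeDecodeError:
        return raw.decode('cp1252', errors='replace')  # fall back to Windows-1252

print(fix_latin(b'\x91 smart quotes \x92'))  # prints: ‘ smart quotes ’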
You could also use Encode.pm's support for fallback.
use Encode qw[decode];
my $octets = "\x91 Foo \xE2\x98\xBA \x92";
my $string = decode('UTF-8', $octets, sub {
    my ($ordinal) = @_;
    return decode('Windows-1252', pack 'C', $ordinal);
});
printf "<%s>\n",
join ' ', map { sprintf 'U+%.4X', ord $_ } split //, $string;
Output:
<U+2018 U+0020 U+0046 U+006F U+006F U+0020 U+263A U+0020 U+2019>
Did you recode the data files? If not, opening them as UTF-8 won't work. You can simply open them as
open $filehandle, '<:encoding(cp1252)', $filename or die ...;
and everything (tm) should work.
If you did recode, something seems to have gone wrong, and you need to analyze what it is and fix it. I recommend using hexdump to find out what is actually in the file. Text consoles and editors sometimes lie to you; hexdump never lies.

Allowed characters in filename

Where can I find a list of allowed characters in filenames, depending on the operating system?
(e.g. on Linux, the character : is allowed in filenames, but not on Windows)
You should start with the Wikipedia Filename page. It has a decent-sized table (Comparison of filename limitations), listing the reserved characters for quite a lot of file systems.
It also has a plethora of other information about each file system, including reserved file names such as CON under MS-DOS. I mention that only because I was bitten by that once when I shortened an include file from const.h to con.h and spent half an hour figuring out why the compiler hung.
Turns out DOS ignored extensions for devices so that con.h was exactly the same as con, the input console (meaning, of course, the compiler was waiting for me to type in the header file before it would continue).
OK, so looking at the Comparison of file systems, if you only care about the major file systems:
Windows (FAT32, NTFS): Any Unicode except NUL, \, /, :, *, ?, ", <, >, |. Also, no space character at the start or end, and no period at the end.
Mac(HFS, HFS+): Any valid Unicode except : or /
Linux(ext[2-4]): Any byte except NUL or /
So: any byte except NUL, \, /, :, *, ?, ", <, >, |; you can't have files/folders called . or ..; and no control characters (of course).
On Windows, create a file and give it an invalid character like \ in the filename. As a result, you will get a popup listing all the invalid characters in a filename.
To be more precise about Mac OS X (now called macOS): / in the Finder is mapped to : in the Unix file system.
This was done for backward compatibility when Apple moved from Classic Mac OS.
It is legitimate to use a / in a file name in the Finder; looking at the same file in the Terminal, it will show up with a :.
And it works the other way around too: you can't use a / in a file name in the Terminal, but a : is OK and will show up as a / in the Finder.
Some applications may be more restrictive and prohibit both characters to avoid confusion or because they kept logic from previous Classic Mac OS or for name compatibility between platforms.
Rather than trying to identify all the characters that are unwanted,
you could just look for anything except the acceptable characters. Note that Python's re module does not support POSIX classes like [:alnum:], so spell the set out explicitly:
import re

cleaned_name = re.sub(r'[^A-Za-z0-9._-]', '', name)
For "English locale" file names, this works nicely. I'm using this for sanitizing uploaded file names. The file name is not meant to be linked to anything on disk, it's for when the file is being downloaded hence there are no path checks.
$file_name = preg_replace('/[^\x20-\x7E]+|[\\\\\/:*?"<>|]+/', '_', $client_specified_file_name);
Basically it strips all non-printable and Windows-reserved characters. Note that preg_replace replaces globally by default, so there is no JavaScript-style /g modifier in PHP; adding one is an error. You can easily extend the pattern to support other locales and functionalities.
I took a different approach. Instead of checking whether the string contains only valid characters, I look for invalid/illegal characters instead.
NOTE: I needed to validate a path string, not a filename. But if you need to check a filename, simply add / to the set.
def check_path_validity(path: str) -> bool:
    # Check for invalid characters (note: the backslash must be escaped in the set)
    for char in set('\\?%*:|"<>'):
        if char in path:
            print(f"Illegal character {char} found in path")
            return False
    return True
Here is code to clean a file name in Python:
import re
import unicodedata

def clean_name(name, replace_space_with=None):
    """
    Remove invalid file name chars from the specified name
    :param name: the file name
    :param replace_space_with: if not None, replace spaces with this string
    :return: a valid name for Win/Mac/Linux
    """
    # ref: https://en.wikipedia.org/wiki/Filename
    # ref: https://stackoverflow.com/questions/4814040/allowed-characters-in-filename
    # No control chars, no: /, \, ?, %, *, :, |, ", <, >
    # remove control chars
    name = ''.join(ch for ch in name if unicodedata.category(ch)[0] != 'C')
    cleaned_name = re.sub(r'[/\\?%*:|"<>]', '', name)
    if replace_space_with is not None:
        return cleaned_name.replace(' ', replace_space_with)
    return cleaned_name
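A couple of illustrative calls, with the outputs shown in comments:
print(clean_name('my:file*name?.txt'))                    # -> myfilename.txt
print(clean_name('my file.txt', replace_space_with='_'))  # -> my_file.txt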

How can I create a Unicode character from its bytes when they are stored in different variables in Perl?

I am trying to convert hex representations of Unicode characters to the characters they represent. The following example works fine:
#!/usr/bin/perl
use Encode qw( encode decode );
binmode(STDOUT, ':encoding(utf-8)');
my $encoded = encode('utf8', "\x{e382}\x{af}");
eval { $encoded = decode('utf8', $encoded, Encode::FB_CROAK); 1 }
or print("coaked\n");
print "$encoded\n";
However, the hex digits are stored in three variables.
So if I replace the encode line with this:
my $encoded = encode('utf8', "\x{${byte1}${byte2}}\x{${byte3}}");
where
my $byte1 = "e3"; my $byte2 = "82"; my $byte3 = "af";
It fails because Perl tries to evaluate the \x escape immediately and sees the $ sign and { as literal characters.
Does anyone know how to get around this?
Instead of
my $encoded = encode('utf8', "\x{${byte1}${byte2}}\x{${byte3}}");
You can use
my $encoded = encode('utf8', chr(hex($byte1 . $byte2)) . chr(hex($byte3)));
hex() converts from hexadecimal to a number, and chr() returns the Unicode character for a given code point.
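For reference, the equivalent in Python, with variable names mirroring the question:
byte1, byte2, byte3 = 'e3', '82', 'af'
s = chr(int(byte1 + byte2, 16)) + chr(int(byte3, 16))  # U+E382 then U+00AF
print(s.encode('utf-8'))  # b'\xee\x8e\x82\xc2\xaf'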
[Edit:]
Not related to your question, but I noticed you mix utf-8 and utf8 in your program. I don't know if this is a typo, but you should be aware that these are not the same thing in Perl:
utf-8 (with hyphen, case-insensitive) is what the UTF-8 standard says, whereas utf8 (no hyphen, also case-insensitive) is Perl's internal encoding, which is more loosely defined (it allows code points that are not valid Unicode code points). In general, you should stick to utf-8 (perlunifaq has the details).
trendel's answer seems pretty good, but Encode::Escape offers an alternative solution:
use Encode qw(encode decode);
use Encode::Escape::Unicode;

my $hex = '263a';
my $escaped = "\\x{" . $hex . "}\n";
print encode 'utf8', decode 'unicode-escape', $escaped;
First off, think hard about why you ended up with three variables, $byte1, $byte2, $byte3, each holding one byte's worth of data, as a two-character string, in hex. This part of your program seems hard because of a poor design decision further up. Fix that bad decision, and this part of the code will fall out naturally.
That being said, what you want to do, I think, is this:
my $byte1 = "e3"; my $byte2 = "82"; my $byte3 = "af";
my $str = chr(hex($byte1 . $byte2)) . chr(hex($byte3));
The encoding stuff is a red herring; you shouldn't be worrying about encodings in the middle of your program, only when you do IO.
I'm assuming in the above that you want to get out a two-character string, U+E382 followed by U+AF. That's what you actually asked for. However, since there is no U+E382 (it's in the middle of the Private Use Area), that's probably not what you actually wanted. Please try to reword the question: perhaps ask a more basic question, and describe what you are trying to achieve rather than how you are going about trying to do it.