Convert double-byte numbers and spaces in filenames to ASCII - unicode

Given a directory of filenames consisting of double-byte/full-width numbers and spaces (along with some half-width numbers and underscores), how can I convert all of the numbers and spaces to single-byte characters?
For example, this filename consists of a double-byte number, followed by a double-byte space, followed by some single-byte characters:
２　2_3.ext
and I'd like to change it to all single-byte like so:
2 2_3.ext
I've tried convmv to convert from utf8 to ascii, but the following message appears for all files:
"ascii doesn't cover all needed characters for: filename"

You need either (1) normalization from Java 1.6 (java.text.Normalizer), (2) ICU, or (3, unlikely) a product sold by the place I work.

What tools do you have available? There are Unicode normalisation functions in several scripting languages, for example in Python:
import os
import unicodedata

for child in os.listdir(u'.'):
    normal = unicodedata.normalize('NFKC', child)
    if normal != child:
        os.rename(child, normal)
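The same NFKC normalisation is also available in Perl's core Unicode::Normalize module. A minimal one-liner sketch along the same lines, assuming the existing filenames are UTF-8 encoded (as the convmv attempt above suggests):
perl -MUnicode::Normalize -MEncode=decode,encode -e '
    for my $name (glob "*") {
        # decode the byte name, NFKC-fold full-width digits and spaces, re-encode
        my $fixed = encode("UTF-8", NFKC(decode("UTF-8", $name)));
        rename $name, $fixed if $fixed ne $name;
    }'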

Thanks for your quick replies, bmargulies and bobince. I found a Perl module, Unicode::Japanese, that helped get the job done. Here is a bash script I made (with help from this example) to convert filenames in the current directory from full-width to half-width characters:
#!/bin/bash
for file in *; do
    newfile=$(echo "$file" | perl -MUnicode::Japanese -e 'print Unicode::Japanese->new(<>)->z2h->get;')
    test "$file" != "$newfile" && mv -- "$file" "$newfile"
done

Related

Issue matching Chinese characters in Perl one liner using \p{script=Han}

I'm really stumped by trying to match Chinese characters using a Perl one-liner in zsh. I cannot get \p{script=Han} to match Chinese characters, but \P{script=Han} does.
Task:
I need to change this:
一
<lb/> 二
to this:
<tag ref="一二">一
<lb/> 二</tag>
There could be a variable number of tags, newlines, whitespace, tabs, alphanumeric characters, digits, etc. between the two Chinese characters. I believe the most efficient and robust way to do this would be to look for something that is not a Chinese character.
My attempted solution:
perl -0777 -pi -e 's/(一)(\P{script=Han}*?)(二)/<tag ref="$1$3">$2<\/tag>/g'
This has the desired effect when applied to the example above.
Problem:
The issue I am having is that \P{script=Han} (or \p{^script=Han}) matches Chinese characters as well.
When I try to match \p{script=Han}, the regex matches nothing despite it being a file full of Chinese characters. When trying to match \P{script=Han}, the regex matches every character in the file.
I don't know why.
This is a problem because, for input like the following, the output is not as desired:
一
<lb/> 三二
becomes
<tag ref="一二">一
<lb/> 三二</tag>
I don't want this to be matched at all - only instances where 一 and 二 are separated solely by characters that are not Chinese characters.
Can anyone tell me what I'm doing wrong? Or suggest a workaround? Thanks!
When I try to match \p{script=Han}, the regex matches nothing despite it being a file full of Chinese characters.
The problem is that both your script and your input file are UTF-8 encoded, but you do not tell perl that. If you do not tell perl, it will assume that they are ASCII encoded.
To say that your script is UTF-8 encoded, use the utf8 pragma. To tell perl that all files you open are UTF-8 encoded, use the -CD command line option. So the following oneliner should solve your problem:
perl -Mutf8 -CD -0777 -pi -e 's/(一)(\P{script=Han}*?)(二)/<tag ref="$1$3">$2<\/tag>/g' file
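A quick way to see this in action (a hedged check, reusing the file placeholder from the one-liner above) is to count what \p{script=Han} matches with and without those flags:
# without decoding, the pattern is applied to raw bytes and finds no Han characters
perl -0777 -ne 'my $n = () = /\p{script=Han}/g; print "$n\n"' file
# with -Mutf8 -CD the input is decoded, so every Han character is counted
perl -Mutf8 -CD -0777 -ne 'my $n = () = /\p{script=Han}/g; print "$n\n"' file
The first command prints 0, which matches the "matches nothing" symptom described above.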

Use crunch to generate all the possible IATA code combinations

I don't really know how to formulate this, but I have a bunch of IATA codes, and I want to generate all the possible combinations, e.g. JFK/LAX, BOS/JFK, etc., with the two codes separated by a character such as "/" or "|".
Here we assume your IATA codes are stored in the file file; one code per line.
crunch has the -q option which generates permutations of lines from a file. However, in this mode crunch ignores most of the other options like <max-len>, which would be important here to print only pairs of codes.
Therefore, it would be easier and faster to …
Use something different than crunch
For instance, try the following, which joins file with itself on the (empty) second field to form the full cross product and then filters out the pairs where both codes are equal:
join -j2 -t/ -o 1.1 2.1 file file | awk -F/ '$1!=$2'
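If you would rather avoid join/awk, the same cross product can also be written as a short Perl sketch (file and the "/" separator as above; the script name pairs.pl is just a placeholder):
#!/usr/bin/perl
# pairs.pl - print every ordered pair of distinct codes, "/"-separated
use strict;
use warnings;

chomp(my @codes = <>);                 # one IATA code per line
for my $x (@codes) {
    for my $y (@codes) {
        print "$x/$y\n" if $x ne $y;   # skip self-pairs like JFK/JFK
    }
}
Run it as perl pairs.pl file.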
If you really, really, really want, you can …
Translate the input into something crunch can work with
We translate each line from file to a unique single character, supply that list of characters to crunch, and then translate the result back.
crunch supports Unicode characters, so files with more than 255 lines are totally fine. Here we enumerate the lines in file by characters in Unicode's Supplementary Private Use Area-A. Therefore, file may have at most 65'534 lines.
If you need more lines, you could combine multiple Unicode planes, but at some point you might run into ARG_MAX issues. Also, with 65'534 lines you would already generate (a bit less than) 65'534^2 = 4'294'705'156 pairs, occupying more than 34 GB when translated into pairs of IATA codes.
I suspect the back-translation is a huge slowdown, so the above alternative seems better in every respect (efficiency, brevity, maintainability, …).
# This assumes your locale is using any Unicode encoding,
# e.g. UTF-8, UTF-16, … (doesn't matter which one).
file=...
((offset=0xF0000))
charset=$(
    echo -en "$(bc <<< "obase=16;
        max=$offset+$(wc -l < "$file");
        for(i=$offset;i<max;i++) {\"\U\"; i}" |
        tr -d \\n
    )"
)
crunch 2 2 "$charset" -d 1# --stdout |
iconv -t UTF-32 |
od -j4 -tu4 -An -w12 -v |
awk -v o="$offset" 'NR==FNR{a[o+NR-1]=$0;next} {print a[$1]"/"a[$2]}' "$file" -

How do I find a 4 digit unicode character using this perl one liner?

I have a file with this Unicode character ỗ
The file is saved in Notepad as UTF-8.
I tried this line
C:\blah>perl -wln -e "/\x{1ed7}/ and print;" blah.txt
But it's not picking it up. If the file has a letter like 'a' (Unicode hex 61), then \x{61} picks it up. But for a 4-digit Unicode character, I have an issue picking up the character.
You had the right idea with using /\x{1ed7}/. The problem is that your regex wants to match characters but you're giving it bytes. You'll need to tell Perl to decode the bytes from UTF-8 when it reads them and then encode them to UTF-8 when it writes them:
perl -CiO -ne "/\x{1ed7}/ and print" blah.txt
The -C option controls how Unicode semantics are applied to input and output filehandles. So for example -CO (capital 'o' for 'output') is equivalent to adding this before the start of your script:
binmode(STDOUT, ":utf8")
Similarly, -CI is equivalent to:
binmode(STDIN, ":utf8")
But in your case, you're not using STDIN. Instead, the -n is wrapping a loop around your code that opens each file listed on the command-line. So you can instead use -Ci to add the ':utf8' I/O layer to each file Perl opens for input. You can combine the -Ci and the -CO as: -CiO
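To see the byte/character distinction directly, here is a hedged pair of one-liners against the same blah.txt: the first matches the character U+1ED7 on decoded input, while the second matches its raw UTF-8 byte sequence (0xE1 0xBB 0x97) without any decoding:
perl -CiO -nE "say 'found character' if /\x{1ed7}/" blah.txt
perl -nE "say 'found bytes' if /\xE1\xBB\x97/" blah.txt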
Your script works fine. The problem is the Unicode value you're using for searching. Since your file is UTF-8, your search parameters need to be E1, BB, or 97. Check the file encodings below and see how the encoding changes the search criteria.
UTF-8 Encoding: 0xE1 0xBB 0x97
UTF-16 Encoding: 0x1ED7
UTF-32 Encoding: 0x00001ED7
Resource https://www.compart.com/en/unicode/U+1ED7

How to sed replace UTF-8 characters with HTML entities in another file?

I'm running Cygwin under Windows 10.
I have a dictionary file (1-dictionary.txt) that looks like this:
labelling labeling
flavour flavor
colour color
organisations organizations
végétales v&#x00E9;g&#x00E9;tales
contrôlée contr&#x00F4;l&#x00E9;e
" &#x0022;
The separators between the columns are TABs (\t).
The dictionary file is encoded as UTF-8.
I want to replace words and symbols in the first column with the words and HTML entities in the second column.
My source file (2-source.txt) has the target UTF-8 and ASCII symbols. The source file also is encoded as UTF-8.
Sample text looks like this:
Cultivar was coined by Bailey and it is generally regarded as a portmanteau of "cultivated" and "variety" ... The International Union for the Protection of New Varieties of Plants (UPOV - French: Union internationale pour la protection des obtentions végétales) offers legal protection of plant cultivars ...Terroir is the basis of the French wine appellation d'origine contrôlée (AOC) system
I run the following sed one-liner in a shell script (./3-script.sh):
sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' 1-dictionary.txt) 2-source.txt > 3-translation.txt
The substitution of English (en-GB) words with American (en-US) words in 3-translation.txt is successful.
However, the substitution of ASCII symbols (such as the quote symbol) and UTF-8 words produces this result:
vvégétales#x00E9;gvégétales#x00E9;tales)
contrcontrôlée#x00F4;lcontrôlée#x00E9;e (AOC)
If I use only the specific symbol (not the full word) I get results like this:
vé#x00E9;gé#x00E9;tales
"#x0022cultivated"#x0022
contrô#x00F4;lé#x00E9;e
The ASCII quote symbol just has the entity appended to it - it is not replaced.
Similarly, the UTF-8 symbol is appended with its HTML entity - not replaced with the HTML entity.
The expected output would look like this:
v&#x00E9;g&#x00E9;tales
&#x0022;cultivated&#x0022;
contr&#x00F4;l&#x00E9;e
How to modify the sed script so that target ASCII and UTF-8 symbols are replaced with their HTML entity equivalent as defined in the dictionary file?
I tried it; just replacing every & with \& in your 1-dictionary.txt will solve your problem.
sed's substitute command uses a regex as the from part, so watch for regex metacharacters in that column and add a \ in front of them to escape them.
The to part has special characters too, mainly \ and &, which need an extra \ in front of them as well.
The above refers to GNU sed's documentation; for other sed versions, you can also check man sed.
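If you would rather not hand-edit the dictionary, you could also escape it programmatically before generating the sed script. A minimal Perl sketch of that idea (the script name build-sed-script.pl and the output name fix.sed are just placeholders):
#!/usr/bin/perl
# build-sed-script.pl - turn 1-dictionary.txt into a sed script, escaping
# characters that are special in a sed pattern (left column) or in a sed
# replacement (right column), so entities like &#x00E9; survive intact.
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    my ($from, $to) = split /\t/, $line, 2;
    next unless defined $to;
    $from =~ s{([\\/.^\$*\[\]])}{\\$1}g;   # escape BRE metacharacters and the / delimiter
    $to   =~ s{([\\/&])}{\\$1}g;           # escape \, / and & in the replacement
    print "s/$from/$to/g\n";
}
Run it as: perl build-sed-script.pl 1-dictionary.txt > fix.sed && sed -f fix.sed 2-source.txt > 3-translation.txt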

Forcing a mixed ISO-8859-1 and UTF-8 multi-line string into UTF-8 in Perl

Consider the following problem:
A multi-line string $junk contains some lines which are encoded in UTF-8 and some in ISO-8859-1. I don't know a priori which lines are in which encoding, so heuristics will be needed.
I want to turn $junk into pure UTF-8 with proper re-encoding of the ISO-8859-1 lines. Also, in the event of errors in the processing I want to provide a "best effort result" rather than throwing an error.
My current attempt looks like this:
$junk = force_utf8($junk);
sub force_utf8 {
    my $input = shift;
    my $output = '';
    foreach my $line (split(/\n/, $input)) {
        if (utf8::valid($line)) {
            utf8::decode($line);
        }
        $output .= "$line\n";
    }
    return $output;
}
Obviously the conversion will never be perfect since we're lacking information about the original encoding of each line. But is this the "best effort result" we can get?
How would you improve the heuristics/functionality of the force_utf8(...) sub?
I have no useful advice to offer except that I would have tried using Encode::Guess first.
You might be able to fix it up using a bit of domain knowledge. For example, Ã© is not a likely character combination in ISO-8859-1; it is much more likely to be UTF-8 é.
If your input is limited to a restricted pool of characters, you can also use a heuristic such as assuming à will never occur in your input stream.
Without this kind of domain knowledge, your problem is in general intractable.
Just by looking at a character it will be hard to tell if it is ISO-8859-1 or UTF-8 encoded. The problem is that both are 8-bit encodings, so simply looking at the MSb is not sufficient. For every line, then, I would transcode the line assuming it is UTF-8. When an invalid UTF-8 encoding is found re-transcode the line assuming that the line is really ISO-8859-1. The problem with this heuristic is that you might transcode ISO-8859-1 lines that are also well-formed UTF-8 lines; however without external information about $junk there is no way to tell which is appropriate.
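A minimal sketch of that per-line fallback, using the core Encode module (FB_CROAK makes decode die on malformed input, LEAVE_SRC keeps the source string intact); the result is a decoded Perl character string, so encode it back to UTF-8 when writing it out:
use strict;
use warnings;
use Encode qw(decode);

sub force_utf8 {
    my ($input) = @_;
    my @lines;
    for my $line (split /\n/, $input, -1) {
        # try strict UTF-8 first ...
        my $chars = eval { decode('UTF-8', $line, Encode::FB_CROAK | Encode::LEAVE_SRC) };
        # ... and fall back to ISO-8859-1, which accepts any byte sequence
        $chars = decode('ISO-8859-1', $line) unless defined $chars;
        push @lines, $chars;
    }
    return join "\n", @lines;
}
As noted, this will still mis-classify ISO-8859-1 lines that happen to be well-formed UTF-8; without more context there is no way around that.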
Take a look at this article. UTF-8 is optimised to represent Western language characters in 8 bits but it's not limited to 8-bits-per-character. The multibyte characters use common bit patterns to indicate if they are multibyte, and how many bytes the character uses. If you can safely assume only the two encodings in your string, the rest should be simple.
In short, I opted to solve my problem with "file -bi" and "iconv -f ISO-8859-1 -t UTF-8".
I recently ran across a similar problem in trying to normalize the encoding of file names. I had a mixture of ISO-8859-1, UTF-8, and ASCII. As I realized while processing the files, there were added complications caused by directory names having an encoding different from that of the files they contained.
I originally tried to use Perl but it could not properly differentiate between UTF-8 and ISO-8859-1 resulting in garbled UTF-8.
In my case it was a one-time conversion on a reasonable number of files, so I opted for a slow method that I knew about and that worked with no errors for me (mostly because only 1-2 non-adjacent characters per line used special ISO-8859-1 codes).
Option #1 converts ISO-8859-1 to UTF-8
cat mixed_text.txt |
while IFS= read -r i; do
    type=$(echo "$i" | file -bi -)
    type=${type#*=}
    if [[ $type == 'iso-8859-1' ]]; then
        echo "$i" | iconv -f ISO-8859-1 -t UTF-8
    else
        echo "$i"
    fi
done > utf8_text.txt
Option #2 converts ISO-8859-1 to ASCII
cat mixed_text.txt |
while IFS= read -r i; do
    type=$(echo "$i" | file -bi -)
    type=${type#*=}
    if [[ $type == 'iso-8859-1' ]]; then
        echo "$i" | iconv -f ISO-8859-1 -t ASCII//TRANSLIT
    else
        echo "$i"
    fi
done > utf8_text.txt