I am trying to find and replace single characters in a number of text files in a directory. Apologies for the possible duplication but I have not been able to find an answer in other sed threads.
I used Homebrew to install gnu-sed, and I'm using the command:
find . -name "*.txt" -exec gsed -i -e 's/ñ/–/g' '{}' \;
I have a 'test' file in the directory containing the characters I need replaced, and these are all found and replaced correctly. But characters in other text files are not, e.g. 'Weíre to Denmark ñ all' (the ñ also isn't found/replaced).
Why might this be? How can I fix it? Thank you!
Edit - Output of
$ od -c filethatworks.txt | head -2
0000000 – ** ** \n – ** ** \n “ ** ** \n “ ** ** \n
0000020 — ** ** \n — ** ** \n - \n “ ** ** \n “ **
$ od -c filethatdoesnot.txt | head -2
0000000 T h o s e b l e s s e d d a
0000020 y s o f s u m m e r a r e
For a file that works, the file command returns
test.txt: UTF-8 Unicode text
and for one that does not:
ca001_mci_17071971.txt: Non-ISO extended-ASCII text, with very long lines, with CRLF line terminators
Characters are human concepts. When characters are to be represented in computer files, they need to be encoded. Encoding associates each character with an integer called a code point.
For example, take the character "ă" (that's lower case "a" with a breve on top, used in Romanian spelling for the vowel /ə/); in the old days of MS-DOS, we quite often used an encoding called "code page 852", where "ă" has the code point 199. Then Windows came, and on Windows we often used an encoding called "code page 1250", where "ă" has the code point 227. Then came Unicode, and in Unicode "ă" has the code point 259.
Since Unicode code points can have values larger than 255, there must be a way to represent them using bytes with values between 0 and 255. Those methods are called "Unicode Transformation Formats" (UTF), of which the most widely used are UTF-8 (very popular in Linux) and UTF-16 (of two kinds, little and big endian, and very popular on Windows). In UTF-8, "ă" is represented as two bytes, with the values 196 and 131 (by the rules of UTF-8, those two bytes together represent code point 259); in little endian UTF-16, "ă" is represented by two bytes, with the values 3 and 1 (by the rules of little endian UTF-16, those two bytes together represent code point 259).
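As a quick check in a UTF-8 terminal you can inspect those bytes yourself (od's exact column spacing may differ on your system):
$ printf 'ă' | od -An -tu1
 196 131
$ printf 'ă' | iconv -f utf-8 -t utf-16le | od -An -tu1
   3   1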
The point is that in order to make sense of a text file you need to know (1) what encoding is used, and (2) in the case of Unicode, what transformation format is used. Now, on Linux and on the Web we are very close to a consensus that all text is represented in UTF-8; nevertheless, old files still exist, and occasionally new files come from Windows, so there is a very nice program called iconv (available both on Linux and on Windows) which is used to translate text files from one encoding into another.
For example, assuming that your problematic file is encoded in Windows-1252 (also called ANSI by the Windows documentation, although the American National Standards Institute had nothing to do with it), you could say
iconv -f windows-1252 -t utf-8 ca001_mci_17071971.txt | gsed -e 's/ñ/–/g'
Sadly, there is no way to use sed -i in a pipeline like this; you must write to a temporary output file, then rename the temporary file on top of the source file, of course after checking that everything went well.
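A minimal sketch of that workflow (the .tmp name is just a placeholder; set -o pipefail makes the first command report failure if either iconv or gsed fails):
$ set -o pipefail
$ iconv -f windows-1252 -t utf-8 ca001_mci_17071971.txt | gsed -e 's/ñ/–/g' > ca001_mci_17071971.tmp
$ mv ca001_mci_17071971.tmp ca001_mci_17071971.txt    # only if the previous step succeeded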
I’m trying to remove the accented characters (CAFÉ -> CAFE) while keeping all the Chinese characters by using a command. Currently, I’m using iconv to remove the accented characters. It turns out that all the Chinese characters are encoded as “?????”. I can’t figure out how to keep the Chinese characters in an ASCII-encoded file at the same time.
How can I do so?
iconv -f utf-8 -t ascii//TRANSLIT//IGNORE -o converted.bin test.bin
There is no way to keep Chinese characters in a file whose encoding is ASCII; this encoding only encodes the code points between NUL (0x00) and DEL (0x7F), which basically means the basic control characters plus basic English alphabetics and punctuation. (Look at the ASCII chart for an enumeration.)
What you appear to be asking is how to remove accents from European alphabetics while keeping any Chinese characters intact in a file whose encoding is UTF-8. I believe there is no straightforward way to do this with iconv, but it should be comfortably easy to come up with a one-liner in a language with decent Unicode support, like perhaps Perl.
bash$ python -c 'print("\u4effCaf\u00e9\u9f00")' >unizh.txt
bash$ cat unizh.txt
仿Café鼀
bash$ perl -CSD -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt
仿Cafe鼀
Maybe add the -i option to modify the file in-place; this simple demo just writes out the result to standard output.
This has the potentially undesired side effect of normalizing each character to its NFKD form.
Code inspired by Remove accents from accented characters and Chinese characters to test with gleaned from What's the complete range for Chinese characters in Unicode? (the ones on the boundary of the range are not particularly good test cases so I just guessed a bit).
The iconv tool is meant to convert the way characters are encoded (i.e. saved to a file as bytes). By converting to ASCII (a very limited character set that contains the numbers, some punctuation, and the basic alphabet in upper and lower case), you can save only the characters that can reasonably be matched to that set. So an accented letter like É gets converted to E because that's a reasonably similar ASCII character, but a Chinese character like 公 is so far away from the ASCII character set that only question marks are possible.
The answer by tripleee is probably what you need. But if the conversion to NFKD form is a problem for you, an alternative is using a direct list of characters you want to replace:
sed 'y/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/' <test.bin >converted.bin
where you need to list the original characters and their replacements in the same order. Obviously it is more work, so do this only if you need full control over what changes you make.
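For example (this relies on GNU sed in a UTF-8 locale; y/// on multibyte characters is not portable to every sed):
$ printf 'CAFÉ café\n' | sed 'y/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/'
CAFE cafe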
I am trying to grep across a list of tokens that include several non-ASCII characters. I want to match only emojis; other characters such as ð or ñ are fine. The Unicode range for emojis appears to be U+1F600-U+1F1FF, but when I search for it using grep this happens:
grep -P "[\x1F6-\x1F1]" contact_names.tokens
grep: range out of order in character class
https://unicode.org/emoji/charts/full-emoji-list.html#1f3f4_e0067_e0062_e0077_e006c_e0073_e007f
You need to specify the code points with full value (not 1F6 but 1F600) and wrap them with curly braces. In addition, the first value must be smaller than the last value.
So the regex should be "[\x{1F1FF}-\x{1F600}]".
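With those fixes the command from the question runs without the "range out of order" error (run it in a UTF-8 locale so grep -P treats the pattern as Unicode; whether this particular range is what you actually want is a separate question, addressed below):
grep -P "[\x{1F1FF}-\x{1F600}]" contact_names.tokens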
The Unicode range for emojis is, however, more complex than you assumed. The page you referred to does not sort characters by code point, and emojis are placed in many blocks. If you want to cover almost all of the emoji:
grep -P "[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]" contact_names.tokens
(The range is borrowed from Suhail Gupta's answer on a similar question)
If you need to allow/disallow specific emoji blocks, see the sequence data on unicode.org. The List of emoji on Wikipedia also shows characters in ordered tables, but it might not list the latest ones.
You could use ugrep as a drop-in replacement for grep to do this:
ugrep "[\x{1F1FF}-\x{1F600}]" contact_names.tokens
ugrep matches Unicode patterns by default (disabled with option -U). The regular expression syntax is POSIX ERE compliant, extended with Unicode character classes, lazy quantifiers, and negative patterns to skip unwanted pattern matches and produce more precise results.
ugrep searches UTF-encoded input when a UTF BOM (byte order mark) is present, and ASCII and UTF-8 when no UTF BOM is present. Option --encoding permits many other file formats to be searched, such as ISO-8859-1, EBCDIC, and code pages 437, 850, 858, 1250 to 1258.
ugrep searches text and binary files and produces hexdumps for binary matches.
The Unicode ranges for emojis are larger than the range U+1F1FF to U+1F600. See the official Unicode 12 publication: https://unicode.org/emoji/charts-12.0/full-emoji-list.html
I receive a .txt file with a lot of <96> characters which should be spaces instead.
In vi, I have done:
:%s/<96>//g
or
:%s/\<96>\//g
but it is still there. I did dos2unix, but it still doesn't remove it. Is it Unicode? If yes, how can I remove it? Thank you!
There's a good chance those aren't the four literal characters <, 9, 6 and >. Instead, they're probably the single character formed by the byte 0x96, which Vim renders as <96>.
You can see that by executing (from bash):
printf '123\x96abc\x96def' > file.txt ; vi file.txt
and you should see:
123<96>abc<96>def
To get rid of them, you can just use sed with something like (assuming your sed has in-place replacement):
sed -i.save 's/\x96//g' file.txt
You can also do this within vim itself; you just have to realise that you can enter arbitrary characters with CTRL-V (or CTRL-Q if CTRL-V is set up for paste). See here for details, paraphrased and shortened below to keep this answer self-contained:
It is possible to enter any character which can be displayed in your current encoding, if you know the character value, as follows (^V means CTRL-V, or CTRL-Q if you use CTRL-V to paste):
Decimal: ^Vnnn, 000..255.
Octal: ^Vonnn, 000..377.
Hex: ^Vxnn, 00..ff.
Hex, BMP Unicode: ^Vunnnn, 0000..FFFF.
Hex, any Unicode: ^VUnnnnnnnn, 00000000..7FFFFFFF.
In all cases, initial zeros may be omitted if the next character typed is not a digit in the given base (except, of course, that the value zero must be entered as at least one zero).
Hex digits A-F, when used, can be typed in upper or lower case, or even in any mixture of them.
The key sequence you therefore want (assuming you want them replaced with spaces) is:
:%s/<CTRL-V>x96/ /g
I use the iconv library to interface from a modern input source that uses UTF-8 to a legacy system that uses Latin1, aka CP1252 (superset of ISO-8859-1).
The interface recently failed to convert the French string "Éducation", where the "É" was encoded as hex 45 CC 81. Note that the destination encoding does have an "É" character, encoded as C9.
Why does iconv fail converting that "É"? I checked that the iconv command-line tool that's available with MacOS X 10.7.3 says it cannot convert, and that the PERL iconv module fails too.
This is all the more puzzling that the precomposed form of the "É" character (encoded as C3 89) converts just fine.
Is this a bug with iconv or did I miss something?
Note that I also have the same issue if I try to convert from UTF-16 (where "É" is encoded as 00 C9 composed or 00 45 03 01 decomposed).
Unfortunately iconv indeed doesn't deal with the decomposed characters in UTF-8, except the version installed on Mac OS X.
When dealing with Mac file names, you can use iconv with the "utf8-mac" character set option. It also takes into account a few idiosyncrasies of the Mac decomposed form.
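For example, to recompose Mac-style decomposed text into ordinary UTF-8 (decomposed.txt and composed.txt are placeholder names; the charset may be listed as UTF-8-MAC or utf8-mac depending on the build, and iconv -l shows what your copy accepts):
$ iconv -f UTF-8-MAC -t UTF-8 decomposed.txt > composed.txt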
However, non-mac versions of iconv or libiconv don't support this, and I could not find the sources used on Mac which provide this support.
I agree with you that iconv should be able to deal with both NFC and NFD forms of UTF8, but until someone patches the sources we have to detect this manually and deal with it before passing stuff to iconv.
Faced with this annoying problem, I used Perl's Unicode::Normalize module as suggested by Jukka.
#!/usr/bin/perl
# Read UTF-8 text on stdin, normalize each line to Normalization Form C (NFC),
# and write UTF-8 text to stdout.
use Encode qw/decode_utf8 encode_utf8/;
use Unicode::Normalize;

while (<>) {
    print encode_utf8( NFC(decode_utf8 $_) );
}
Use a normalizer (in this case, to Normalization Form C) before calling iconv.
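For example, assuming the script above is saved as nfc.pl (the script and file names here are just placeholders):
$ perl nfc.pl < input-utf8.txt | iconv -f UTF-8 -t CP1252 > output-cp1252.txt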
A program that deals with character encodings (different representations of characters or, more exactly, code points, as sequences of bytes) and converts between them should be expected to treat precomposed and decomposed forms as distinct. The decomposed É is two code points, and as such distinct from the precomposed É, which is one code point.
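You can see the difference at the byte level (this assumes a UTF-8 terminal and a printf that understands \u escapes, such as the bash builtin; od's spacing may differ):
$ printf '\u00c9' | od -An -tx1
 c3 89
$ printf 'E\u0301' | od -An -tx1
 45 cc 81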
I need the list of ranges of Unicode characters with the property Alphabetic as defined in http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Alphabetic. However, I cannot find them in the Unicode Character Database no matter how I search for them. Can somebody provide a list of them or just a search facility for characters with specified Unicode properties?
The Unicode Character Database comprises all the text files in the distribution. It is not just a single file as it once was long ago.
The Alphabetic property is a derived property.
You really do not want to use code point ranges for this. You want to use the property properly. That’s because there are just too many of them. Using the unichars script, we learn that there are more than ten thousand just in the Basic Multilingual Plane alone not counting Han or Hangul:
$ unichars '\p{Alphabetic}' | wc -l
10052
If we include the other 16 astral planes, now we’re at fourteen thousand:
$ unichars -a '\p{Alphabetic}' | wc -l
14736
And if we include Han and Hangul, which in fact the Alphabetic property does, we blow well past a hundred thousand code points:
$ unichars -ua '\p{Alphabetic}' | wc -l
101539
I hope you can see that you do not want to specifically enumerate these using code point ranges. Down that road lies madness.
By the way, if you find the unichars script useful, you might also like the uniprops script and perhaps the uninames script.
Derived Core Properties can be calculated from the other properties.
The Alphabetic property is defined as: Generated from: Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic
So, if you take all the characters in Lu, Ll, Lt, Lm, Lo, Nl, and all the characters with the Other_Alphabetic property, you will have the Alphabetic characters.
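As a quick check of either formulation with Perl's built-in Unicode property support (U+0103, ă, is just an arbitrary test character; the property names assume a reasonably recent Perl):
$ perl -le 'my $c = chr 0x0103; print $c =~ /\p{Alphabetic}/ ? "Alphabetic" : "no"; print $c =~ /[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}\p{Other_Alphabetic}]/ ? "in the union" : "no"'
Alphabetic
in the union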
Citation from your source: Generated from: Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic
These abbreviations seem to be explained here.
I found the UniView web application which provides a nice search interface. Searching for the Letter property (with Local unchecked) gives 14723 results...