Search for text in a selected coding system in a file hierarchy - emacs

I want to search for text in a specified coding system (cp1251/UTF-8/UTF-16-LE/ISO-8859-4, etc.) in a file hierarchy.
For example, I have source code in cp1251 encoding and I run Debian with UTF-8 as the system encoding. grep and Midnight Commander perform searches in UTF-8, so I cannot find Russian words.
A preferred solution would use standard POSIX or GNU command-line utilities (like grep).
An MC or Emacs solution is also appreciated.
I tried:
$ grep `echo Привет | iconv -f cp1251 -t utf-8` *
but this command sometimes shows no results.

The command you proposed echoes the string Привет, pipes that output through iconv, and then uses the result of iconv as the search pattern for grep. Since the echoed string is already in your system encoding (UTF-8), converting it from cp1251 to UTF-8 produces garbage rather than the bytes that are actually in your files. That is not what you want. What you want is this:
find . -type f -printf "iconv -f cp1251 -t utf-8 '%p' | grep --label '%p' -H 'Привет'\n" | sh
This applies iconv, followed by grep, to every file below the current directory.
But note that this assumes that all of your files are in CP1251. It will fail if only some of them are. In that case you'd first have to write a program that detects the encoding of a file and then applies iconv only if necessary.
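A minimal sketch of that idea, assuming GNU file and iconv are available and that every file not reported as UTF-8 is in fact CP1251 (an assumption that may not hold for your tree):
find . -type f -print | while IFS= read -r f; do
    enc=$(file -b --mime-encoding "$f")           # e.g. utf-8, us-ascii, unknown-8bit
    if [ "$enc" = utf-8 ] || [ "$enc" = us-ascii ]; then
        grep -H 'Привет' "$f"                     # already searchable as-is
    else
        iconv -f cp1251 -t utf-8 "$f" | grep --label "$f" -H 'Привет'
    fi
done
(File names containing newlines would break the read loop; for those you would need find -print0 and a shell that supports read -d ''.)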

From the command line:
LANG=ru_RU.cp1251 grep Привет *

Related

Formatting a file changes its encoding on a Red Hat system

I have a bash script which extracts data from an Oracle database. I use spool to extract the data. After extraction I format the file by removing and replacing some characters. My problem is that after formatting, the files are in ANSI encoding instead of UTF-8.
Extraction with spool: the file is UTF-8.
Formatting with cat and the tr command, redirected into another file: this file is ANSI.
The same process works fine on an AIX system. I tried iconv but it doesn't work. Do you have an idea why the encoding changes from UTF-8 to ANSI, and how to correct it?
You should consistently use either ISO-8859-1 or UTF-8. In the latter case, don't use tr, as it doesn't (yet?) support multi-byte characters; use sed instead (e.g. sed 's/deletethis//g').
ISO-8859-1:
export LC_CTYPE=fr_FR.ISO-8859-1
export NLS_LANG=French_France.WE8ISO8859P1
# fetch data from Oracle, emulated by the following line
echo 'âêîôû' >test.latin1 # 5 bytes (+lineend)
# perform formatting, eg:
sed 's/ê/[e-circumflex]/g' test.latin1
# or the same with hex-codes:
sed $'s/\xea/[e-circumflex]/g' test.latin1
UTF-8:
export LC_CTYPE=fr_FR.UTF-8
export NLS_LANG=French_France.AL32UTF8
# fetch data from Oracle, emulated by the following line
echo 'âêîôû' >test.utf8 # 10 bytes (+lineend)
# perform formatting, eg:
sed 's/ê/[e-circumflex]/g' test.utf8
# or the same with hex-codes:
sed $'s/\xc3\xaa/[e-circumflex]/g' test.utf8
Note: no conversion (iconv, recode, etc.) is required; just make sure NLS_LANG and LC_CTYPE are compatible. (Also, your terminal emulator should be set accordingly; for PuTTY it is Configuration/Category/Window/Translation/Remote character set.)
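To confirm which locale is active and which of the two byte counts above you are actually getting, a quick check (just a sketch):
locale                       # shows the active LC_CTYPE among other settings
printf 'âêîôû' | wc -c       # 5 in a Latin-1 terminal, 10 in a UTF-8 terminal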
Original answer:
I cannot tell what's wrong with the formatting you perform, but here is one way to damage UTF-8-encoded text:
$ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | xxd
00000000: c381 5256 c38d 5a54 c5b0 52c5 9020 54c3 ..RV..ZT..R.. T.
00000010: 9c4b c396 5246 c39a 52c3 9347 c389 500a .K..RF..R..G..P.
$ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | tr -d $'\200-\237' | xxd
00000000: c352 56c3 5a54 c5b0 52c5 2054 c34b c352 .RV.ZT..R. T.K.R
00000010: 46c3 52c3 47c3 500a F.R.G.P.
Here the tr -d $'\200-\237' part deleted the second byte of several UTF-8 sequences (c381 became c3, c590 became c5), rendering the text unusable.
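If you want to check whether a formatting step has broken the encoding, one simple test (assuming GNU iconv; formatted.txt stands for whatever file your script produces) is to round-trip the output through iconv, which fails on invalid UTF-8:
iconv -f utf-8 -t utf-8 formatted.txt >/dev/null && echo 'valid UTF-8' || echo 'broken UTF-8'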

Converting only non-UTF-8 files to UTF-8

I have a set of .md files; some of them are UTF-8 encoded, and others are not (Windows-1256, actually).
I want to convert only the non-UTF-8 files to UTF-8.
The following script can partly do the job:
for file in *.md;
do
iconv -f windows-1256 -t utf-8 "$file" -o "${file%.md}.🆕.md";
done
I still need to exclude the original UTF-8 files from this process (maybe using the file command?). Try the following command to understand what I mean:
file --mime-encoding *
Notice that although the file command isn't smart enough to detect the right character set of non-UTF-8 files, it's enough in this case that it can distinguish between UTF-8 and non-UTF-8 files.
Thanks in advance for your help.
You can use, for example, an if statement:
if file --mime-encoding "$file" | grep -v -q utf-8 ; then
iconv -f windows-1256 -t utf-8 "$file" -o "${file%.md}.🆕.md";
fi
If grep doesn't find a match, it returns a status code indicating failure; the if statement tests that status code. Because of -v, the match succeeds exactly when the encoding reported by file is not utf-8, i.e. when the file needs converting.
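Putting that together with the original loop (a sketch; the 🆕 filename suffix is kept from the question, and files reported as us-ascii will also be run through iconv, which is harmless because the ASCII range is identical in Windows-1256 and UTF-8):
for file in *.md; do
    if file --mime-encoding "$file" | grep -v -q utf-8; then
        # not reported as utf-8, so convert from windows-1256
        iconv -f windows-1256 -t utf-8 "$file" -o "${file%.md}.🆕.md"
    fi
done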

Replace all Windows-1252 characters with the respective UTF-8 ones

I am looking for a way to recursively replace all incompatible Windows-1252 characters with the respective UTF-8 ones.
I tried iconv, without success.
I also found the following command:
grep -rl oldstring . |xargs sed -i -e 's/oldstring/newstring/'
But I would not like to run this command by hand for every character.
Is there a way, or a piece of software, that can do that?
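For reference, one possible approach (only a sketch: it assumes the offending files really are Windows-1252 while everything else is already UTF-8 or plain ASCII, which may not match your tree, and the encoding labels printed by file can vary between versions) is to detect the encoding per file and convert in place only where needed:
find . -type f -print | while IFS= read -r f; do
    enc=$(file -b --mime-encoding "$f")
    case "$enc" in
        utf-8|us-ascii)                               # already fine, leave alone
            ;;
        iso-8859-1|unknown-8bit|windows-1252)         # likely Windows-1252 text
            iconv -f WINDOWS-1252 -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
            ;;
        *)                                            # binary or something else, skip
            ;;
    esac
done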

How to convert Unicode to ASCII?

I must remove Unicode characters from many files (many .cpp files!) and I'm looking for a script or something to remove them. The files are in many folders!
If you have it, you should be able to use iconv (the command-line tool, not the C function). Something like this:
$ for a in $(find . -name '*.cpp') ; do iconv -f utf-8 -t ascii -c "$a" > "$a.ascii" ; done
The -c option to iconv causes it to drop characters it can't convert. Then you'd verify the result, and go over them again, renaming the ".ascii" files to the plain filenames, overwriting the Unicode input files:
$ for a in $(find . -name '*.ascii') ; do mv "$a" "${a%.ascii}" ; done
Note that both of these commands are untested; verify by adding echo after the do in each to make sure they seem sane.
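A variant that copes with spaces in file names and does the conversion and rename in one pass (a sketch under the same assumptions; it overwrites the originals directly, so try it on a copy first):
find . -name '*.cpp' -exec sh -c '
    for f in "$@"; do
        iconv -f utf-8 -t ascii -c "$f" > "$f.ascii" && mv "$f.ascii" "$f"
    done' sh {} +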
Open the .srt file in Gaupol, click File, then Save As, open the character-encoding drop-down menu, select UTF-8, and save the file.

How to search and replace in text files only?

I have a directory containing a bunch of files, some text, some binary, with no consistent naming. I want to search and replace a string in text files only. So I went with:
perl -i -pne 's#/some/text/to/replace#/replacement/text#' *
Remove the -i option and you will see that binary files get caught. How do I modify this one-liner to skip binary files?
ack -n --text --sort -f . | xargs perl -i -pne 's…'
Abusing ack like this is much quicker than writing your own solution around -T.
Well, this is all based on what your definition of a text file is. Perl 5 has the -T filetest operator that will tell you if a filename or filehandle is a text file (using Perl 5's definition):
perl -i -pne 'BEGIN{@ARGV=grep -T, @ARGV}s#regex#replacement#' *
The BEGIN block will filter out any files that don't pass the -T test, so they won't even be read (except for their first block because that is what -T uses to determine if they are text).
From perldoc -f -X
The -T and -B switches work as follows. The first block or so of the file is examined for odd characters such as strange control codes or characters with the high bit set. If too many strange characters (>30%) are found, it's a -B file; otherwise it's a -T file. Also, any file containing a zero byte in the first block is considered a binary file. If -T or -B is used on a filehandle, the current IO buffer is examined rather than the first block. Both -T and -B return true on an empty file, or a file at EOF when testing a filehandle. Because you have to read a file to do the -T test, on most occasions you want to use a -f against the file first, as in next unless -f $file && -T $file .
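Following that advice and adding the -f check to the one-liner above (a sketch; regex and replacement are placeholders you would substitute):
perl -i -pne 'BEGIN{ @ARGV = grep { -f && -T } @ARGV } s#regex#replacement#' *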