How to convert Unicode to ASCII?

I need to remove Unicode characters from many files (many .cpp files!) and I'm looking for a script or something to strip them out. The files are spread across many folders!

If you have it, you should be able to use iconv (the command-line tool, not the C function). Something like this:
$ for a in $(find . -name '*.cpp') ; do iconv -f utf-8 -t ascii -c "$a" > "$a.ascii" ; done
The -c option to iconv causes it to drop characters it can't convert. Then you'd verify the result, and go over them again, renaming the ".ascii" files to the plain filenames, overwriting the Unicode input files:
$ for a in $(find . -name '*.ascii') ; do mv "$a" "${a%.ascii}" ; done
Note that both of these commands are untested; verify by adding echo after the do in each to make sure they seem sane.
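Note too that the $(find …) loops will mangle any paths containing whitespace. An untested sketch of a more robust variant for the first step, using find -exec to hand each path to iconv directly:
$ find . -name '*.cpp' -exec sh -c 'iconv -f utf-8 -t ascii -c "$1" > "$1.ascii"' _ {} \;
The rename step can be handled the same way, with mv "$1" "${1%.ascii}" in place of the iconv call.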

Open the .srt file in Gaupol, click File, click Save As, open the character-encoding drop-down menu, select UTF-8, and save the file.

Related

Converting only non-utf-8 files to utf-8

I have a set of .md files; some of them are utf-8 encoded, and others are not (windows-1256, actually).
I want to convert only the non-utf-8 files to utf-8.
The following script can partly do the job:
for file in *.md; do
    iconv -f windows-1256 -t utf-8 "$file" -o "${file%.md}.🆕.md"
done
I still need to exclude the original utf-8 files from this process (maybe using the file command?). Try the following command to understand what I mean:
file --mime-encoding *
Notice that although the file command isn't smart enough to detect the right character set of non-utf-8 files, it's enough in this case that it can distinguish between utf-8 and non-utf-8 files.
Thanks in advance for help.
You can use, for example, an if statement:
if file --mime-encoding "$file" | grep -v -q utf-8; then
    iconv -f windows-1256 -t utf-8 "$file" -o "${file%.md}.🆕.md"
fi
Because of -v, grep only matches when the reported encoding is not utf-8; if it finds no such line, it returns a status code indicating failure. The if statement tests that status code.
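Putting the test into the loop, the whole script might look something like this (an untested sketch; like the question, it assumes every non-utf-8 file really is windows-1256):
for file in *.md; do
    # convert only the files whose detected encoding is not utf-8
    if file --mime-encoding "$file" | grep -v -q utf-8; then
        iconv -f windows-1256 -t utf-8 "$file" -o "${file%.md}.🆕.md"
    fi
done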

Perl one-liner to convert from shiftjis to utf8

I am trying the following one-liner to convert a file from shiftjis encoding to utf-8, and it's not working. Any helpful smart people available?
perl -i.bak -e 'use utf8; use Encode qw(decode encode); my $ustr = Encode::decode("shiftjis",$_); my $val = Encode::encode("utf-8",$ustr); print "$val";' filename
I am pretty new to code pages, and the web seems rife with all sorts of complexities on the subject. I just want a one-liner. The input file and the output file appear to be the same.
You forgot the -n switch, which will iterate over each line of input, loading one line at a time into $_ and executing the code provided in the -e argument.
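With just that switch added (the use utf8 pragma does nothing useful here, since the one-liner's source contains no non-ASCII), the original approach should work:
perl -i.bak -ne 'use Encode qw(decode encode); print encode("utf-8", decode("shiftjis", $_));' filename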
More concisely, you could write your program like
perl -MEncode -pi.bak -e '$_=encode("utf-8",decode("shiftjis",$_))' filename
Perl is an odd choice for this, given that there's already a standard utility for doing it:
iconv -f shift-jis -t utf-8 filename
Of course, that doesn't easily let you edit a file in-place, but there's also recode which is likewise installed on my system somehow :)...
recode shift-jis..utf-8 filename
Or use moreutils:
iconv -f shift-jis -t utf-8 filename | sponge filename
Hmm. Seems like TMTOWTDI.

Batch file that removes blank lines and sorts the file (case insensitive) for all encodings

I would like to make a batch file that removes all blank lines and sorts the lines in the file with a regular case-insensitive sort.
So far I got this:
@echo off
IF [%1]==[] goto BAR_PAR
IF EXIST %1 (
    egrep -v "^[[:space:]]*$" %1 | sort > xxx
    mv -f xxx %1
) else (
    echo File doesn't exist
)
goto END
:BAR_PAR
echo No Parameter Passed
:END
But this screws up my files that have encoding UCS-2 Little Endian.
Is there a way to handle all encoding blindly?
If not, what should I do to make this UCS-2 Little Endian Compatible?
UPDATE
Forgot to mention that I am using Windows, but with Cygwin, so I have the usual Linux shell commands like grep, sed, etc.
Cygwin's sort -f will sort the file case-insensitively, folding lower-case characters to upper-case when comparing.
Cygwin iconv converts from one character set to another.
grep -e '[[:graph:]]' foo.txt | sort -f
In short, this command looks for any line that has at least one visible character. Therefore, lines with only spaces and tabs are excluded.
For some reason, the file I was working with didn't respond to any combination I could think of using '^' and '$'.
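The likely reason is that UCS-2 encodes every ASCII character with a trailing NUL byte, which confuses byte-oriented tools like grep and sort. One way around it is to normalize to UTF-8, filter, and convert back; a rough, untested sketch of that pipeline for the %1 argument from the batch file above (iconv treats UCS-2 LE as UTF-16LE):
iconv -f utf-16le -t utf-8 %1 | egrep -v "^[[:space:]]*$" | sort -f | iconv -f utf-8 -t utf-16le > xxx
mv -f xxx %1
If the file starts with a byte-order mark, converting with -f utf-16 instead lets iconv detect the byte order from it.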

Search for text in a selected coding system in a file hierarchy

I want to search for text in a specified coding system (cp1251/UTF-8/UTF-16-le/iso-8859-4, etc) in a file hierarchy.
For example, I have source code in cp1251 encoding, while I run Debian with the system encoding set to UTF-8. grep and Midnight Commander both search in UTF-8, so I cannot find Russian words.
Preferred solutions will use standard POSIX or GNU command line utilities (like grep).
MC or Emacs solution also appreciated.
I tried:
$ grep `echo Привет | iconv -f cp1251 -t utf-8` *
but this command sometimes shows no results.
The command you proposed echoes the string Привет, converts it with iconv, and then uses the converted string as the grep pattern. That is not what you want. What you want is this:
find . -type f -printf "iconv -f cp1251 -t utf-8 '%p' | grep --label '%p' -H 'Привет'\n" | sh
This applies iconv, followed by grep, to every file below the current directory.
But note that this assumes that all of your files are in CP1251. It will fail if only some of them are. In that case you'd first have to write a program that detects the encoding of a file and then applies iconv only if necessary.
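One rough sketch of that, using file --mime-encoding to decide per file (untested; it assumes every non-utf-8 file is cp1251):
find . -type f | while IFS= read -r f; do
    if file --mime-encoding "$f" | grep -q 'utf-8$'; then
        grep -H 'Привет' "$f"
    else
        iconv -f cp1251 -t utf-8 "$f" | grep --label "$f" -H 'Привет'
    fi
done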
From the command line:
LANG=ru_RU.cp1251 grep Привет *
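Alternatively, assuming a UTF-8 terminal, you can convert the pattern itself once (note this reverses the iconv direction from the command in the question) and search for the raw cp1251 bytes; -a tells GNU grep to treat the files as text even though they look binary to it:
grep -ra "$(printf %s 'Привет' | iconv -f utf-8 -t cp1251)" .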

How to search and replace in text files only?

I have a directory containing a bunch of files, some text, some binary, with no consistent naming. I want to search and replace a string in the text files only. So I went with:
perl -i -pne 's#/some/text/to/replace#/replacement/text#' *
Remove the -i option and you will see that binary files get caught. How do I modify this one-liner to skip binary files?
ack -n --text --sort -f . | xargs perl -i -pne 's…'
Abusing ack this way is much quicker than writing your own solution around -T.
Well, this is all based on what your definition of a text file is. Perl 5 has the -T filetest operator that will tell you if a filename or filehandle is a text file (using Perl 5's definition):
perl -i -pne 'BEGIN{@ARGV = grep -T, @ARGV} s#regex#replacement#' *
The BEGIN block will filter out any files that don't pass the -T test, so they won't even be read (except for their first block because that is what -T uses to determine if they are text).
From perldoc -f -X
The -T and -B switches work as follows. The first block or so of the file is examined for odd characters such as strange control codes or characters with the high bit set. If too many strange characters (>30%) are found, it's a -B file; otherwise it's a -T file. Also, any file containing a zero byte in the first block is considered a binary file. If -T or -B is used on a filehandle, the current IO buffer is examined rather than the first block. Both -T and -B return true on an empty file, or a file at EOF when testing a filehandle. Because you have to read a file to do the -T test, on most occasions you want to use a -f against the file first, as in next unless -f $file && -T $file.
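Following that advice, a variant of the one-liner above that also checks -f first (an untested sketch):
perl -i -pne 'BEGIN{@ARGV = grep { -f && -T } @ARGV} s#regex#replacement#' *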