Terminal: screen in xterm on the latest Ubuntu LiveCD.
While I'm trying to ls the directory, I see this:
$ ls
??? ???????.avi
ls -la gives me the same, and Midnight Commander shows me this:
��� �������.avi
$ env | grep -i LANG
LANG=en_US.UTF-8
$ export | grep -i LANG
declare -x LANG="en_US.UTF-8"
Looks like this is a UTF-16 surrogate, am I right? (en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Surrogates)
When I try to work around it through python3, I get this exception:
import os
for i in os.listdir('.'):
    print(i)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc4' in
position 0: surrogates not allowed
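The \udcc4 comes from Python's surrogateescape error handler: the byte 0xc4 (Д in CP1251) is not valid UTF-8, so os.listdir smuggles it into the string as the lone surrogate U+DCC4, which print then refuses to encode back. As a sketch, assuming the names really are CP1251, you can recover the raw bytes with os.fsencode and decode them explicitly:

$ python3 - <<'EOF'
import os
for name in os.listdir('.'):
    raw = os.fsencode(name)                 # undo surrogateescape: back to the on-disk bytes
    print(raw.decode('cp1251', 'replace'))  # decoding as CP1251 is an assumption
EOF

For the file above this should print Дух времени.avi.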
I've uploaded a file with an empty body, just the title - 4.0K: https://mega.co.nz/#!roYUyQaB!AwOMDznj9DC_wSpAeWqjVj_Oqu2z8Kfk5VsSmFs0ybA
$ echo $'\xc4\xf3\xf5 \xe2\xf0\xe5\xec\xe5\xed\xed' | chardet
<stdin>: MacCyrillic (confidence: 0.92)
$ echo $'\xc4\xf3\xf5 \xe2\xf0\xe5\xec\xe5\xed\xed' | enca -L ru
MS-Windows code page 1251
LF line terminators
$ echo $'\xc4\xf3\xf5 \xe2\xf0\xe5\xec\xe5\xed\xe8' | iconv -f 'Windows-1251'
Дух времени
So the filename is in Windows-1251 (CP1251): you need to set your terminal to Windows-1251 to see it, or convert the filename itself to UTF-8.
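Rather than switching the terminal, you can rename the file into UTF-8. A minimal sketch, assuming the convmv tool is installed (it converts filename encodings in place):

$ convmv -f cp1251 -t utf-8 --notest *.avi

Run it without --notest first: convmv then only prints what it would rename, which is a safe way to preview the result.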
Related
I have a bash script which extracts data from an Oracle database. I use spool to extract the data, and after extraction I format the file by removing and replacing some characters. My problem is that after formatting the files are in ANSI encoding instead of UTF-8.
Extraction with spool: the file is UTF-8.
Formatting with cat and tr, redirected into another file: this file is ANSI.
The same process works fine on an AIX system. I tried iconv but it doesn't work. Do you have an idea why the encoding changes from UTF-8 to ANSI, and how to correct it?
You should consistently use either ISO-8859-1 or UTF-8. In the latter case, don't use tr, as it doesn't (yet?) support multi-byte characters; use sed instead (e.g. sed 's/deletethis//g').
ISO-8859-1:
export LC_CTYPE=fr_FR.ISO-8859-1
export NLS_LANG=French_France.WE8ISO8859P1
# fetch data from Oracle, emulated by the following line
echo 'âêîôû' >test.latin1 # 5 bytes (+lineend)
# perform formatting, eg:
sed 's/ê/[e-circumflex]/g' test.latin1
# or the same with hex-codes:
sed $'s/\xea/[e-circumflex]/g' test.latin1
UTF-8:
export LC_CTYPE=fr_FR.UTF-8
export NLS_LANG=French_France.AL32UTF8
# fetch data from Oracle, emulated by the following line
echo 'âêîôû' >test.utf8 # 10 bytes (+lineend)
# perform formatting, eg:
sed 's/ê/[e-circumflex]/g' test.utf8
# or the same with hex-codes:
sed $'s/\xc3\xaa/[e-circumflex]/g' test.utf8
Note: no conversion (iconv, recode, etc.) is required, just make sure NLS_LANG and LC_CTYPE are compatible. (Also, your terminal emulator should be set accordingly; for PuTTY it is Configuration/Category/Window/Translation/Remote-character-set.)
Original answer:
I cannot tell what's wrong with the formatting you perform, but here is a demonstration of how tr can damage UTF-8-encoded text:
$ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | xxd
00000000: c381 5256 c38d 5a54 c5b0 52c5 9020 54c3 ..RV..ZT..R.. T.
00000010: 9c4b c396 5246 c39a 52c3 9347 c389 500a .K..RF..R..G..P.
$ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | tr -d $'\200-\237' | xxd
00000000: c352 56c3 5a54 c5b0 52c5 2054 c34b c352 .RV.ZT..R. T.K.R
00000010: 46c3 52c3 47c3 500a F.R.G.P.
Here the tr -d $'\200-\237' part deleted the continuation byte of many of the two-byte UTF-8 sequences (c381 became c3, c590 became c5), rendering the text unusable.
I'm trying to migrate this working command
docker-compose $(find docker-compose* | sed -e "s/^/-f /") up -d --remove-orphans
from bash to fish. The intention of the command is to produce this:
docker-compose -f docker-compose.backups.yml ... -f docker-compose.wiki.yml up -d --remove-orphans
My naive attempt
docker-compose (find docker-compose* | sed -e "s/^/-f /") up -d --remove-orphans
is not working, though. The error is:
ERROR: .FileNotFoundError: [Errno 2] No such file or directory: './ docker-compose.backups.yml'
What is the correct translation?
The difference in behavior is due to the fact that fish, sanely, splits the output of a command substitution only on line boundaries, whereas POSIX shells like bash split it on whitespace by default. That is, POSIX shells split the output of $(...) on the value of $IFS, which is space, tab, and newline by default.
There are several ways to rewrite that command so it works in fish. The one that requires the smallest change is to change the sed to insert a newline between the -f and the filename:
docker-compose (find docker-compose* | sed -e "s/^/-f\n/") up -d --remove-orphans
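If you would rather not bake the flag into the sed output, another option (a sketch in plain fish, assuming the same docker-compose* glob matches your files) is to build the argument list in a loop:

set -l args
for f in docker-compose*
    set args $args -f $f
end
docker-compose $args up -d --remove-orphans

This keeps each -f and each filename as separate list elements, which is exactly what fish passes to the command.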
I have a log file encoded in GBK, so I have to read the data like this:
tail -n 2000 nohup.out | iconv -f gbk -t utf-8
but when I use tail -f, it prints nothing:
tail -f nohup.out | iconv -f gbk -t utf-8
In a similar situation I use a script that reads each line and converts it. In your case:
tail -f nohup.out | iconv.sh
#!/bin/bash
# iconv.sh - convert GBK input to UTF-8 one line at a time
while IFS= read -r line
do
    echo "$line" | iconv -f gbk -t utf-8
done < "${1:-/dev/stdin}"
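The same technique works inline, without a separate script:

tail -f nohup.out | while IFS= read -r line; do echo "$line" | iconv -f gbk -t utf-8; done

Converting one line at a time keeps the output flowing; the original pipeline stalls because iconv reads and buffers large chunks of input before writing anything, and tail -f never delivers enough to fill the buffer.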
The terminal transcript speaks for itself:
iMac:~$ echo -n a | md5
0cc175b9c0f1b6a831c399e269772661
iMac:~$ perl -e 'system "echo -n a | md5"'
c3392e9373ccca33629d82b17699420f
Note that the MD5 hash of a is 0cc175b9c0f1b6a831c399e269772661, the first result. Why does it turn out to be different when the same command is called by perl?
By the way, perl is perl 5, version 12, subversion 4 (v5.12.4) built for darwin-thread-multi-2level, and the system is Mac OS 10.8, Darwin 12.0.
In the /bin/sh shell on the Mac, echo -n does not suppress the trailing newline like it does in /bin/bash; instead it prints -n literally. You can see this if you drop into /bin/sh and run echo -n a; your output should look like this:
sh-3.2$ echo -n a
-n a
so you're literally getting -n a (plus a newline) instead of the desired a. As Perl's system runs /bin/sh to evaluate your command, -n a is being passed into md5 instead of your desired a.
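A portable fix (just a sketch) is to use printf, which behaves the same under both shells:

$ perl -e 'system "printf %s a | md5"'
0cc175b9c0f1b6a831c399e269772661

printf %s a writes the single byte a with no trailing newline, so md5 sees identical input whether /bin/sh or /bin/bash runs the pipeline.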
The specific question has already been answered, but I want to point out that od is useful to help understand exactly what any command outputs or what a file contains. It is especially useful for showing otherwise non-printing characters.
$ echo -n a | od -tc
0000000 a
0000001
$ perl -e 'system "echo -n a | od -tc";'
0000000 - n a \n
0000005
I want to search for text in a specified encoding (cp1251/UTF-8/UTF-16-le/iso-8859-4, etc.) in a file hierarchy.
For example, I have source code in cp1251 encoding and I run Debian with the system encoding UTF-8. grep and Midnight Commander perform searches in UTF-8, so I cannot find Russian words.
Preferred solutions will use standard POSIX or GNU command line utilities (like grep).
An MC or Emacs solution would also be appreciated.
I tried:
$ grep `echo Привет | iconv -f cp1251 -t utf-8` *
but this command sometimes shows no results.
The command you proposed converts the pattern, not the files: Привет, which your UTF-8 terminal already delivers as UTF-8 bytes, is pushed through iconv and the (garbled) result is handed to grep, while the files stay in cp1251. That is not what you want. What you want is this:
find . -type f -printf "iconv -f cp1251 -t utf-8 '%p' | grep --label '%p' -H 'Привет'\n" | sh
This applies iconv, followed by grep, to every file below the current directory.
But note that this assumes that all of your files are in CP1251. It will fail if only some of them are. In that case you'd first have to write a program that detects the encoding of a file and then applies iconv only if necessary.
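A rough sketch of that idea, using file's heuristic --mime-encoding detection (which cannot positively identify CP1251 and may mislabel short files), is to convert only what is not already UTF-8 or plain ASCII:

find . -type f | while IFS= read -r f; do
    enc=$(file -b --mime-encoding "$f")   # heuristic guess, e.g. utf-8, iso-8859-1
    if [ "$enc" = utf-8 ] || [ "$enc" = us-ascii ]; then
        cat "$f"
    else
        iconv -f cp1251 -t utf-8 "$f"     # assumes everything non-UTF-8 is CP1251
    fi | grep --label "$f" -H 'Привет'
done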
From the command line:
LANG=ru_RU.cp1251 grep Привет *
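A related trick (a sketch; note the pattern is converted rather than typed raw) is to translate the pattern to cp1251 and search byte-wise:

$ LC_ALL=C grep -r "$(echo Привет | iconv -f utf-8 -t cp1251)" .

LC_ALL=C makes grep byte-oriented, so the cp1251 bytes of the pattern match the cp1251 bytes in the files; matched lines will still display garbled in a UTF-8 terminal.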