Text encoding translation in Clojure

I'd like to write a clojure function that takes a string in one encoding and converts it to another. The iconv library does this.
For example, let's look at the character "è". In ISO-8859-1 (http://www.ascii-code.com/), that's e8 as hex. In UTF-8 (http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%A8&mode=char), it's c3 a8.
So let's say we have iso.txt, which contains our letter and EOL:
$ hexdump iso.txt
0000000 e8 0a
0000002
Now we can convert it to UTF-8 like this:
$ iconv -f ISO-8859-1 -t UTF-8 iso.txt | hexdump
0000000 c3 a8 0a
0000003
How should I write something equivalent in clojure? I'm happy to use any external libraries, but I don't know where I'd go to find them. Looking around I couldn't figure out how to use libiconv itself on the JVM, but there's probably an alternative?
Edit
After reading Alex's link in the comment, this is so simple and so cool:
user> (new String (byte-array 2 (map unchecked-byte [0xc3 0xa8])) "UTF-8")
"è"
user> (new String (byte-array 1 [(unchecked-byte 0xe8)]) "ISO-8859-1")
"è"

If you want a simple whole-file conversion to UTF-8, slurp allows for specifying the file encoding with the :encoding option and spit will output UTF-8 by default. This method will read the entire file into memory, so large files might require a different approach.
$ printf "\xe8\n" > iso.txt
$ hexdump iso.txt
0000000 e8 0a
0000002
(spit "/Users/path/iso2.txt"
(slurp "/Users/path/iso.txt" :encoding "ISO-8859-1"))
$ hexdump iso2.txt
0000000 c3 a8 0a
0000003
Note: slurp will assume UTF-8 if you do not specify an encoding.
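For large files, a streaming variant (a sketch, not part of the answer above) can use clojure.java.io's reader and writer, which also accept an :encoding option, together with io/copy so the whole file never has to fit in memory:

(require '[clojure.java.io :as io])

;; Stream from an ISO-8859-1 reader into a UTF-8 writer, one buffer at a time.
(with-open [in  (io/reader "/Users/path/iso.txt" :encoding "ISO-8859-1")
            out (io/writer "/Users/path/iso2.txt" :encoding "UTF-8")]
  (io/copy in out))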

Related

Linux SED replace HEX in file instead of insert? \x2A

I have these bytes:
E6 2A 1B EF 11 00 00 00 00 00 00 4E 43 DB E8
I need to replace them with these:
64 08 1A EF 11 00 00 00 00 00 DA D8 26 04
When I started to experiment, I noticed one strange thing.
sed -e 's/\xE6/\x64/g'
This replaces E6 with 64 fine.
However, when I try to change more bytes (2A), it causes a problem.
sed -e 's/\xE6\x2A/\x64\x08/g'
As I understand it, 2A inserts the same code. How do I avoid it? I just need to replace 2A with 08. Thanks in advance :)
UPDATED
Now I'm stuck on \x26. sed -e 's/\xDB/\x26/g' refuses to replace DB with 26, but when I run s/\xDB/\xFF it works. Any ideas? So something is wrong with 26. I have tried [\x26], which didn't help here.
OK, s/\xDB/\&/g seems to be working :)
\x2a is *, which is special in regex:
$ sed 's/a*/b/' <<<'aaa'
b
$ sed 's/a\x2a/b/' <<<'aaa'
b
You may use a bracket expression in the regex to cancel the special meaning of characters, but I see it doesn't work well with all characters in my GNU sed:
$ sed 's/\xE6[\x2A]/OK/' <<<$'I am \xE6\x2A'
I am OK
$ sed 's/[\xE6][\x2A]/OK/' <<<$'I am \xE6\x2A'
I am �*
That's because \xE6 inside the bracket expression is probably an invalid UTF-8 sequence in the current locale. Remember to use the C locale:
$ LC_ALL=C sed 's/[\xE6][\x2A]/OK/' <<<$'I am \xE6\x2A'
I am OK
Remember that \1, \2, etc. and & are special in the replacement part too. Read Escape a string for a sed replace pattern, but you will need to escape the \xXX sequences instead of each character (or convert the characters to actual bytes first; why work with all the \xXX sequences?). Or pick another tool.
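Putting those points together, a hedged sketch of the two problematic replacements might look like this (file.bin and file.new are placeholder names; assumes GNU sed). The * byte (\x2A) is neutralised with a bracket expression in the pattern, and the literal & byte (0x26) is written as \& in the replacement:

LC_ALL=C sed -e 's/\xE6[\x2A]/\x64\x08/g' -e 's/\xDB/\&/g' file.bin > file.new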

Remove invalid UNICODE characters from XML file in UNIX?

I have a shell script that I use to remotely clean an XML file produced by another system that contains invalid UNICODE characters. I am currently using this command in the script to remove the invalid characters:
perl -CSDA -i -pe's/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;' file.xml
and this has worked so far, but now the file has a new error which, as far as I can tell, is 'xA0'; my perl command reaches that error in the file and erases the rest of the file. I modified my command to include xA0, but it doesn't work:
perl -CSDA -i -pe's/[^\x9\xA0\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;' file.xml
I have also tried using:
iconv -f UTF-8 -t UTF-8 -c file.xml > file2.xml
but that doesn't do anything. It produces an identical file with the same errors.
Is there a unix command that I can use that will completely remove all invalid UNICODE characters?
EDIT:
some hex output (note the 1A and A0 bytes):
3E 1A 1A 33 30 34 39 37 1A 1A 3C 2F 70
6D 62 65 72 3E A0 39 34 32 39 38 3C 2F
You may use the following one-liner:
perl -i -MEncode -0777ne'print encode("UTF-8",decode("UTF-8",$_,sub{""}))' file.xml
You may also extend it with warnings:
perl -i -MEncode -0777ne'print encode("UTF-8",decode("UTF-8",$_,sub{warn "Bad byte: @_";""}))' file.xml
A0 is not a valid UTF-8 sequence. The errors you were encountering before were XML encoding errors, while this one is a character encoding error.
A0 (U+00A0) is the Unicode code point for a non-breaking space. It is also the iso-8859-1 and cp1252 encoding of that code point.
I would recommend fixing the problem at its source. But if that's not possible, I would recommend using Encoding::FixLatin to fix this new type of error (perhaps via the bundled fix_latin script). It will correctly replace A0 with C2 A0 (the UTF-8 encoding of a non-breaking space).
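If you end up using the bundled fix_latin script, a standalone invocation might look like this (a sketch; the script reads standard input and writes standard output):

fix_latin < file.xml > file2.xml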
Combined with your existing script:
perl -i -MEncoding::FixLatin=fix_latin -0777pe'
$_ = fix_latin($_);
utf8::decode($_);
s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
utf8::encode($_);
' file.xml

bsd sed replace hex values in file

Using GNU sed I'm able to replace some hex values using the following command:
gsed 's/.*\xFF\xD8/\xFF\xD8/g' myfile
I'm on OS X, so the default sed is the BSD one. Unfortunately the previous command does not work with BSD sed.
Any idea why, and how to do what I'm looking for: removing everything before the FF D8 bytes in my file?
The simplest way to deal with that problem is to use bash's 'ANSI-C Quoting' mechanism:
sed $'s/.*\xFF\xD8/\xFF\xD8/g' myfile
Note that \xFF\xD8 is not valid UTF-8, so you may have problems with the characters, but the basic mechanism works:
$ echo sed $'s/.*\xFF\xD8/\xFF\xD8/g' | odx
0x0000: 73 65 64 20 73 2F 2E 2A FF D8 2F FF D8 2F 67 0A sed s/.*../../g.
0x0010:
$
odx is a hex dump program.
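If odx is not at hand, a roughly equivalent dump (a sketch using the standard od utility) is:

$ echo sed $'s/.*\xFF\xD8/\xFF\xD8/g' | od -A x -t x1z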

echo "string" > file in Windows PowerShell appends non-printable character to the file

In Windows PowerShell:
echo "string" > file.txt
In Cygwin:
$ cat file.txt
:::s t r i n g
$ dos2unix file.txt
dos2unix: Skipping binary file file.txt
I want a simple "string" in the file. How do I do it? I.e., when I say cat file.txt I need only "string" as output. I am echoing from Windows PowerShell and that cannot be changed.
Try echo "string" | out-file -encoding ASCII file.txt to get a simple ASCII-encoded txt file.
Comparison of the files produced:
echo "string" | out-file -encoding ASCII file.txt
will produce a file with the following contents:
73 74 72 69 6E 67 0D 0A (string..)
however
echo "string" > file.txt
will produce a file with the following contents:
FF FE 73 00 74 00 72 00 69 00 6E 00 67 00 0D 00 0A 00 (ÿþs.t.r.i.n.g.....)
(The byte order mark FF FE indicates the file is UTF-16 (LE). The signature for UTF-16 (LE) is the 2 bytes 0xFF 0xFE, followed by 2-byte pairs: xx 00 xx 00 xx 00 for normal 0-127 ASCII chars.)
These two commands are equivalent in that they both use UTF-16 encoding by default:
echo "string" > file.txt
echo "string" | out-file file.txt
You can add an explicit encoding parameter to the latter form (as indicated by jon Z) to produce plain ASCII:
echo "string" | out-file -encoding ASCII file.txt
Alternately, you could use set-content, which uses ASCII encoding by default:
echo "string" | set-content file.txt
Corollary 1:
Want to convert a unicode file to ASCII in one line?
Just use this:
get-content your_unicode_file | set-content your_ascii_file
which can be abbreviated to:
gc your_unicode_file | sc your_ascii_file
Corollary 2:
Want to get a hex dump so you can really see what is unicode and what is ASCII?
Use the clean and simple Get-HexDump function available on PowerShell.com.
With that in place you can examine your generated files with just:
Get-HexDump file.txt
For anything non-trivial, you can specify how many columns wide you want the output and how many bytes of the file to process with something like this:
Get-HexDump file.txt -width 15 -bytes 150
PowerShell creates Unicode UTF-16 files with a Byte Order Mark (BOM).
Dos2unix 6.0 and higher can read UTF-16 files and convert them to UTF-8 (the default Cygwin encoding) and remove the BOM. Versions prior to 6.0 will see UTF-16 files as binary and skip them, as in your example.

Using awk to remove the Byte-order mark

What would an awk script (presumably a one-liner) for removing a BOM look like?
Specification:
print every line after the first (NR > 1)
for the first line: If it starts with #FE #FF or #FF #FE, remove those and print the rest
Using GNU sed (on Linux or Cygwin):
# Removing BOM from all text files in current directory:
sed -i '1 s/^\xef\xbb\xbf//' *.txt
On FreeBSD:
sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt
Advantage of using GNU or FreeBSD sed: the -i parameter means "in place", and will update files without the need for redirections or weird tricks.
On Mac:
The awk solution in another answer works, but the sed command above does not. At least on Mac (Sierra), the sed documentation does not mention support for hexadecimal escapes like \xef.
A similar trick can be achieved with any program by piping to the sponge tool from moreutils:
awk '…' INFILE | sponge INFILE
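Alternatively, on macOS you can try letting bash expand the BOM bytes itself with ANSI-C quoting, the same trick as in the BSD sed question above (an untested sketch; BSD sed's -i requires an explicit backup-suffix argument, here empty):

LC_ALL=C sed -i '' $'1s/^\xef\xbb\xbf//' *.txt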
Try this:
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE
On the first record (line), remove the BOM characters. Print every record.
Or slightly shorter, using the knowledge that the default action in awk is to print the record:
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE
1 is the shortest condition that always evaluates to true, so each record is printed.
Enjoy!
-- ADDENDUM --
The Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:
Bytes | Encoding Form
--------------------------------------
00 00 FE FF | UTF-32, big-endian
FF FE 00 00 | UTF-32, little-endian
FE FF | UTF-16, big-endian
FF FE | UTF-16, little-endian
EF BB BF | UTF-8
Thus, you can see how \xef\xbb\xbf corresponds to EF BB BF UTF-8 BOM bytes from the above table.
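Based on that table, here is a small sketch (file is a placeholder name; assumes xxd is available) that reports which BOM, if any, a file starts with:

case "$(head -c4 file | xxd -p)" in
  0000feff*) echo "UTF-32 BE BOM" ;;
  fffe0000*) echo "UTF-32 LE BOM" ;;
  efbbbf*)   echo "UTF-8 BOM" ;;
  feff*)     echo "UTF-16 BE BOM" ;;
  fffe*)     echo "UTF-16 LE BOM" ;;
  *)         echo "no BOM" ;;
esac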
Not awk, but simpler:
tail -c +4 UTF8 > UTF8.nobom
To check for BOM:
hd -n 3 UTF8
If BOM is present you'll see: 00000000 ef bb bf ...
In addition to converting CRLF line endings to LF, dos2unix also removes BOMs:
dos2unix *.txt
dos2unix also converts UTF-16 files with a BOM (but not UTF-16 files without a BOM) to UTF-8 without a BOM:
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16be>bom-utf16be
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16le>bom-utf16le
$ printf '\ufeffä\n'>bom-utf8
$ printf 'ä\n'|iconv -f utf-8 -t utf-16be>utf16be
$ printf 'ä\n'|iconv -f utf-8 -t utf-16le>utf16le
$ printf 'ä\n'>utf8
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be feff00e4000a
bom-utf16le fffee4000a00
bom-utf8 efbbbfc3a40a
utf16be 00e4000a
utf16le e4000a00
utf8 c3a40a
$ dos2unix -q *
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be c3a40a
bom-utf16le c3a40a
bom-utf8 c3a40a
utf16be 00e4000a
utf16le e4000a00
utf8 c3a40a
I know the question was directed at Unix/Linux, but I thought it would be worth mentioning a good option for the Unix-challenged (on Windows, with a UI).
I ran into the same issue on a WordPress project (the BOM was causing problems with the RSS feed and page validation) and I had to look into all the files in a fairly big directory tree to find the ones that had a BOM. I found an application called Replace Pioneer, and in it:
Batch Runner -> Search (to find all the files in the subfolders) -> Replace Template -> Binary remove BOM (there is a ready-made search and replace template for this).
It was not the most elegant solution and it did require installing a program, which is a downside. But once I found out what was going on, it worked like a charm (and found 3 files out of about 2300 that had a BOM).