I have these bytes:
E6 2A 1B EF 11 00 00 00 00 00 00 4E 43 DB E8
I need to replace them with these:
64 08 1A EF 11 00 00 00 00 00 DA D8 26 04
When I started to experiment, I noticed one strange thing.
sed -e 's/\xE6/\x64/g'
This command replaces the first E6 with 64 just fine.
However, when I try to change more bytes (2A), it causes a problem.
sed -e 's/\xE6\x2A/\x64\x08/g'
As I understand it, 2A is being interpreted as something special. How do I avoid that? I just need to replace 2A with 08. Thanks in advance :)
UPDATED
Now I'm stuck on \x26. sed -e 's/\xDB/\x26/g' refuses to replace DB with 26, but when I run s/\xDB/\xFF/ it works. Any ideas? It looks like something is wrong with 26. I have tried [\x26], but that didn't help here.
OK, s/\xDB/\&/g seems to be working :)
\x2a is *, which is special in a regex:
$ sed 's/a*/b/' <<<'aaa'
b
$ sed 's/a\x2a/b/' <<<'aaa'
b
You may use a bracket expression in the regex to cancel the special meaning of a character, but I see it doesn't work well with all characters in my GNU sed:
$ sed 's/\xE6[\x2A]/OK/' <<<$'I am \xE6\x2A'
I am OK
$ sed 's/[\xE6][\x2A]/OK/' <<<$'I am \xE6\x2A'
I am �*
That's probably because \xE6] forms an invalid UTF-8 sequence (\xE6 is a multibyte lead byte), so the bracket expression isn't parsed the way you expect. Remember to use the C locale:
$ LC_ALL=C sed 's/[\xE6][\x2A]/OK/' <<<$'I am \xE6\x2A'
I am OK
Remember that \1, \2, etc. and & are special in the replacement part too. Read Escape a string for a sed replace pattern, but you will need to escape the \xXX sequences rather than individual characters (or convert the characters to actual bytes first; why work with all the \xXX sequences at all?). Or pick another tool.
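Putting those pieces together for the bytes in the question, a minimal sketch (GNU sed with its \xHH extension assumed; file.bin and patched.bin are placeholder names): neutralize the 0x2A (*) in the pattern with a bracket expression, and write the 0x26 (&) in the replacement as \& rather than \x26:
# 0x2A is '*', special in the pattern     -> wrap it in a bracket expression
# 0x26 is '&', special in the replacement -> write it as \&
LC_ALL=C sed -e 's/\xE6[\x2A]/\x64\x08/g' \
             -e 's/\xDB/\&/g' file.bin > patched.bin
Keep in mind sed is line-oriented: the input is split on 0x0A bytes, so a pattern that would span one of them will not match.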
I have a shell script that I use to remotely clean an XML file, produced by another system, which contains invalid Unicode characters. I am currently using this command in the script to remove the invalid characters:
perl -CSDA -i -pe's/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;' file.xml
and this has worked so far, but now the file has a new kind of error, as far as I can tell an 'xA0', and what happens is that my Perl command reaches that byte and erases the rest of the file. I modified my command to include xA0, but it doesn't work:
perl -CSDA -i -pe's/[^\x9\xA0\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;' file.xml
I have also tried using:
iconv -f UTF-8 -t UTF-8 -c file.xml > file2.xml
but that doesn't do anything. It produces an identical file with the same errors.
Is there a Unix command that I can use that will completely remove all invalid Unicode characters?
EDIT:
some HEX output (note the 1A's and A0's):
3E 1A 1A 33 30 34 39 37 1A 1A 3C 2F 70
6D 62 65 72 3E A0 39 34 32 39 38 3C 2F
You may use the following one-liner:
perl -i -MEncode -0777ne'print encode("UTF-8",decode("UTF-8",$_,sub{""}))' file.xml
You may also extend it with warnings:
perl -i -MEncode -0777ne'print encode("UTF-8",decode("UTF-8",$_,sub{warn "Bad byte: @_";""}))' file.xml
A0 is not a valid UTF-8 sequence. The errors you were encountering were XML encoding errors, while this one is a character encoding error.
A0 is the Unicode Code Point for a non-breaking space. It is also the iso-8859-1 and cp1252 encoding of that Code Point.
I would recommend fixing the problem at its source. But if that's not possible, I would recommend using Encoding::FixLatin to fix this new type of error (perhaps via the bundled fix_latin script). It will correctly replace A0 with C2 A0 (the UTF-8 encoding of a non-breaking space).
Combined with your existing script:
perl -i -MEncoding::FixLatin=fix_latin -0777pe'
$_ = fix_latin($_);
utf8::decode($_);
s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
utf8::encode($_);
' file.xml
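If you want to try the bundled fix_latin script on its own first, a sketch (check fix_latin --help on your system for the exact options; file_fixed.xml is just a scratch name):
# mixed UTF-8 / Latin-1 / cp1252 bytes in, clean UTF-8 out
fix_latin file.xml > file_fixed.xml
mv file_fixed.xml file.xml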
Using GNU sed, I'm able to replace some hex values using the following command:
gsed 's/.*\xFF\xD8/\xFF\xD8/g' myfile
I'm on OS X, so the default sed is the BSD one. Unfortunately, the previous command does not work with BSD sed.
Any idea why that is, and how to do what I'm looking for: removing everything before the FF D8 value in my file?
The simplest way to deal with that problem is to use bash's 'ANSI-C Quoting' mechanism:
sed $'s/.*\xFF\xD8/\xFF\xD8/g' myfile
Note that \xFF\xD8 is not valid UTF-8, so you may have problems with the characters, but the basic mechanism works:
$ echo sed $'s/.*\xFF\xD8/\xFF\xD8/g' | odx
0x0000: 73 65 64 20 73 2F 2E 2A FF D8 2F FF D8 2F 67 0A sed s/.*../../g.
0x0010:
$
odx is a hex dump program.
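If you don't have odx, hexdump -C (available on OS X) shows the same bytes, which is a quick way to confirm the shell really expanded the \xFF\xD8 escapes:
$ echo sed $'s/.*\xFF\xD8/\xFF\xD8/g' | hexdump -C
00000000  73 65 64 20 73 2f 2e 2a  ff d8 2f ff d8 2f 67 0a  |sed s/.*../../g.|
00000010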
I'd like to write a clojure function that takes a string in one encoding and converts it to another. The iconv library does this.
For example, let's look at the character "è". In ISO-8859-1 (http://www.ascii-code.com/), that's e8 as hex. In UTF-8 (http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%A8&mode=char), it's c3 a8.
So let's say we have iso.txt, which contains our letter and EOL:
$ hexdump iso.txt
0000000 e8 0a
0000002
Now we can convert it to UTF-8 like this:
$ iconv -f ISO-8859-1 -t UTF-8 iso.txt | hexdump
0000000 c3 a8 0a
0000003
How should I write something equivalent in clojure? I'm happy to use any external libraries, but I don't know where I'd go to find them. Looking around I couldn't figure out how to use libiconv itself on the JVM, but there's probably an alternative?
Edit
After reading Alex's link in the comment, this is so simple and so cool:
user> (new String (byte-array 2 (map unchecked-byte [0xc3 0xa8])) "UTF-8")
"è"
user> (new String (byte-array 1 [(unchecked-byte 0xe8)]) "ISO-8859-1")
"è"
If you want a simple whole-file conversion to UTF-8, slurp allows for specifying the file encoding with the :encoding option and spit will output UTF-8 by default. This method will read the entire file into memory, so large files might require a different approach.
$ printf "\xe8\n" > iso.txt
$ hexdump iso.txt
0000000 e8 0a
0000002
(spit "/Users/path/iso2.txt"
(slurp "/Users/path/iso.txt" :encoding "ISO-8859-1"))
$ hexdump iso2.txt
0000000 c3 a8 0a
0000003
Note: slurp will assume UTF-8 if you do not specify an encoding.
My input foo.txt is this:
Grull^Zn Hernand^Zz
where the ^Z resolves to the control character \x1a (verified with od -x on the file)
When I run the following Perl command:
perl -pe s/\x1a//g foo.txt
I get the output: Grulln Hernandz
as expected. However, when I redirect this to a file:
perl -pe s/\x1a//g foo.txt > out.txt
The files are identical, demonstrated by
diff -c out.txt foo.txt
No differences encountered
How can I force this behavior to work as expected?
I don't know how you're ascertaining that the first version works, but it doesn't for me.
You need to either escape the backslash in the regex, or quote it (quoting it is more common).
$ hexdump -C input
00000000 61 62 63 1a 64 65 66 1a 67 68 69 0a |abc.def.ghi.|
$ perl -pe s/\x1a//g input | hexdump -C
00000000 61 62 63 1a 64 65 66 1a 67 68 69 0a |abc.def.ghi.|
$ perl -pe s/\\x1a//g input | hexdump -C
00000000 61 62 63 64 65 66 67 68 69 0a |abcdefghi.|
$ perl -pe 's/\x1a//g' input | hexdump -C
00000000 61 62 63 64 65 66 67 68 69 0a |abcdefghi.|
I don't think
perl -pe s/\x1a//g foo.txt
does what you think it does. In any sane Solaris shell, an unquoted \x is treated the same as x, and you are running the same thing as
perl -pe s/x1a//g foo.txt
You can test this by executing
echo s/\x1a//g
and seeing what the shell actually passes to perl. You can also try
perl -pe s/\x1a//g foo.txt | od -c
to see whether the control characters really are removed from your input.
The correct thing to do is to enclose your one-line script in single quotes:
perl -pe 's/\x1a//g' foo.txt > out.txt
What I ultimately ended up doing (though I found that mob's solution worked too) was, instead of entering \x1a, to hold Ctrl and press v, then z (i.e. Ctrl-V Ctrl-Z), which inserts the literal control character.
This also has the benefit of being a little more readable.
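For reference, the literal-control-character version of the one-liner looks like this (the ^Z below stands for a single 0x1A byte typed with Ctrl-V Ctrl-Z, not a caret followed by Z):
perl -pe 's/^Z//g' foo.txt > out.txt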
In Windows PowerShell:
echo "string" > file.txt
In Cygwin:
$ cat file.txt
:::s t r i n g
$ dos2unix file.txt
dos2unix: Skipping binary file file.txt
I want a simple "string" in the file. How do I do it? I.e., when I say cat file.txt I need only "string" as output. I am echoing from Windows PowerShell and that cannot be changed.
Try echo "string" | out-file -encoding ASCII file.txt to get a simple ASCII-encoded txt file.
Comparison of the files produced:
echo "string" | out-file -encoding ASCII file.txt
will produce a file with the following contents:
73 74 72 69 6E 67 0D 0A (string..)
however
echo "string" > file.txt
will produce a file with the following contents:
FF FE 73 00 74 00 72 00 69 00 6E 00 67 00 0D 00 0A 00 (ÿþs.t.r.i.n.g.....)
(The byte order mark FF FE indicates the file is UTF-16 (LE). The signature for UTF-16 (LE) is the two bytes 0xFF 0xFE, followed by the text in 2-byte units: xx 00 xx 00 xx 00 for normal 0-127 ASCII characters.)
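You can check that signature from the Cygwin side with a hex dump; this is what the bytes above look like through hexdump -C:
$ hexdump -C file.txt
00000000  ff fe 73 00 74 00 72 00  69 00 6e 00 67 00 0d 00  |..s.t.r.i.n.g...|
00000010  0a 00                                             |..|
00000012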
These two commands are equivalent in that they both use UTF-16 encoding by default:
echo "string" > file.txt
echo "string" | out-file file.txt
You can add an explicit encoding parameter to the latter form (as indicated by jon Z) to produce plain ASCII:
echo "string" | out-file -encoding ASCII file.txt
Alternatively, you could use set-content, which uses ASCII encoding by default:
echo "string" | set-content file.txt
Corollary 1:
Want to convert a unicode file to ASCII in one line?
Just use this:
get-content your_unicode_file | set-content your_ascii_file
which can be abbreviated to:
gc your_unicode_file | sc your_ascii_file
Corollary 2:
Want to get a hex dump so you can really see what is unicode and what is ASCII?
Use the clean and simple Get-HexDump function available on PowerShell.com.
With that in place you can examine your generated files with just:
Get-HexDump file.txt
For anything non-trivial, you can specify how many columns wide you want the output and how many bytes of the file to process with something like this:
Get-HexDump file.txt -width 15 -bytes 150
PowerShell creates Unicode UTF-16 files with a Byte Order Mark (BOM).
Dos2unix 6.0 and higher can read UTF-16 files and convert them to UTF-8 (the default Cygwin encoding) and remove the BOM. Versions prior to 6.0 will see UTF-16 files as binary and skip them, as in your example.
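If you are stuck on a dos2unix older than 6.0, a sketch of the same conversion done with iconv on the Cygwin side (assuming your iconv strips the leading BOM when the source encoding is given as plain UTF-16; file_utf8.txt is just a scratch name):
# decode the UTF-16(+BOM) file to UTF-8, then drop the CR of each CRLF line ending
iconv -f UTF-16 -t UTF-8 file.txt | tr -d '\r' > file_utf8.txt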