I am converting binary data into hex and viewing the hex output with head from a continuous stream.
I run the following, where the conversion method is from here:
echo "ibase=2;obase=10000;$(echo `sed '1q;d' /Users/masi/Dropbox/123/r3.raw`)" \
\
| bc \
\
| head
and I get
(standard_in) 1: illegal character: H
so bc is being fed the wrong kind of data.
How can you do the conversion from binary to hex ASCII with a single command, efficiently?
I run the following code based on Wintermute's comment
hexdump -e '/4 "%08x\n"' r3.raw
For instance, head r3.raw | hexdump -e '/4 "%08x\n"' gives
ffffffff
555eea57
...
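For comparison, roughly the same thing can be done with xxd (a sketch, using the same file; note that hexdump's %08x groups each four bytes into a host-endian integer, so on a little-endian machine the bytes appear swapped relative to xxd's file order):
# 4 bytes per line as 8 hex digits, in file byte order
xxd -p -c 4 r3.raw | head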
Related
I have a bash script which extracts data from an Oracle database. I use spool to extract the data. After extraction I format the file by removing and replacing some characters. My problem is that after formatting, the files are in ANSI encoding instead of UTF-8.
Extraction with spool: the file is UTF-8.
Formatting with cat and tr, redirecting into another file: this file is ANSI.
The same process works fine on an AIX system. I tried iconv but it doesn't work. Do you have an idea why the encoding changes from UTF-8 to ANSI, and how to correct it?
You should consistently use either ISO-8859-1 or UTF-8. In the latter case, don't use tr, as it doesn't (yet?) support multi-byte characters; use sed instead (e.g. sed 's/deletethis//g').
ISO-8859-1:
export LC_CTYPE=fr_FR.ISO-8859-1
export NLS_LANG=French_France.WE8ISO8859P1
# fetch data from Oracle, emulated by the following line
echo 'âêîôû' >test.latin1 # 5 bytes (+lineend)
# perform formatting, eg:
sed 's/ê/[e-circumflex]/g' test.latin1
# or the same with hex-codes:
sed $'s/\xea/[e-circumflex]/g' test.latin1
UTF-8:
export LC_CTYPE=fr_FR.UTF-8
export NLS_LANG=French_France.AL32UTF8
# fetch data from Oracle, emulated by the following line
echo 'âêîôû' >test.utf8 # 10 bytes (+lineend)
# perform formatting, eg:
sed 's/ê/[e-circumflex]/g' test.utf8
# or the same with hex-codes:
sed $'s/\xc3\xaa/[e-circumflex]/g' test.utf8
Note: no conversion (iconv, recode, etc.) is required; just make sure NLS_LANG and LC_CTYPE are compatible. (Also, your terminal emulator should be set accordingly; for PuTTY it is Configuration/Category/Window/Translation/Remote-character-set.)
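To sanity-check which encoding a formatted file actually ended up in, something like this can help (a sketch; the byte counts match the comments above, and file's exact wording varies by platform):
$ wc -c < test.latin1   # 6: five single-byte characters plus the line end
$ wc -c < test.utf8     # 11: five two-byte UTF-8 sequences plus the line end
$ file test.latin1 test.utf8
test.latin1: ISO-8859 text
test.utf8:   UTF-8 Unicode text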
Original answer:
I cannot tell what's wrong with the formatting you perform, but here is a method that can damage utf8-encoded text:
$ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | xxd
00000000: c381 5256 c38d 5a54 c5b0 52c5 9020 54c3 ..RV..ZT..R.. T.
00000010: 9c4b c396 5246 c39a 52c3 9347 c389 500a .K..RF..R..G..P.
$ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | tr -d $'\200-\237' | xxd
00000000: c352 56c3 5a54 c5b0 52c5 2054 c34b c352 .RV.ZT..R. T.K.R
00000010: 46c3 52c3 47c3 500a F.R.G.P.
Here the tr -d $'\200-\237' part deleted half of the utf8-sequences (c381 became c3, c590 became c5), rendering the text unusable.
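A quick way to detect such damage is to run the file through iconv with the same source and target encoding; it exits with an error on invalid sequences (a sketch; damaged.txt is a hypothetical file name):
$ iconv -f UTF-8 -t UTF-8 damaged.txt >/dev/null && echo valid || echo broken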
In another question someone suggested echo -e with \0<sequence> for octal, and \x<sequence> for hex. E.g.:
echo -e "\\0302\\0241" --> ¡
Is there a simple way to convert in the other direction, from UTF-8 character to printed octal/hex sequence?
Yep - use hexdump, like this:
$ echo -n i | hexdump
Which will output something like this:
0000000 0069
0000001
For something more formatted, you could do this:
$ echo ü | hexdump | awk '{print "\\x"toupper(substr($2,3,4)) "\\x"toupper(substr($2,1,2)) "\\x"toupper(substr($3,3,4))}' | head -1
which will print out this:
\xC3\xBC\x0A
Code taken from here: How do you echo a 4-digit Unicode character in Bash?
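If xxd is available, a shorter pipeline gets the same kind of result (a sketch; the sed expression wraps every pair of hex digits in \x):
$ echo -n ü | xxd -p | sed 's/../\\x&/g'
\xc3\xbc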
I am using Perl open for opening a new file on Solaris 10, as follows:
open($fh, ">$filePath");
What is default file character encoding on my system with this call?
The output from locale command is given below
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
This was not as easy a question to answer as I thought it would be.
The default encoding is raw, which is suitable for binary data. Any character with an ordinal value under 256 is passed as is:
$ perl -e 'print chr(0xFF)' | od -c
00000000 377
00000001
The curious thing is what happens when you try to write a character above ordinal value 255. Then it looks like you get UTF-8 encoding.
$ perl -e 'print chr(0x100)' | od -c
00000000 304 200
00000002
I don't know where or if this behavior is documented.
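If you want a defined encoding rather than the raw default, declare an I/O layer explicitly, e.g. open($fh, '>:encoding(UTF-8)', $filePath). A minimal sketch of the effect:
$ perl -e 'binmode(STDOUT, ":encoding(UTF-8)"); print chr(0x100)' | od -c
0000000 304 200
0000002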
I'm having some trouble getting sed to do a find/replace of some hex characters. I want to replace all instances within a file of the following hexadecimal string:
0x0D4D5348
with the following hexadecimal string:
0x0D0A4D5348
How can I do that?
EDIT: I'm trying to do a hex find/replace. The input file does not contain the literal string "0x0D4D5348", but it does contain the raw bytes those hex values represent.
GNU sed v3.02.80, GNU sed v1.03, and HHsed v1.5 by Howard Helman all support the notation \xNN, where NN is a pair of hex digits, 00-FF.
Here is how to replace a HEX sequence in your binary file:
$ sed 's/\x0D\x4D\x53\x48/\x0D\x0A\x4D\x53\x48/g' file > temp; rm file; mv temp file
As @sputnik pointed out, you can use sed's in-place functionality. One caveat though: if you use it on OS/X, you have to add an empty set of quotes:
$ sed -i '' 's/\x0D\x4D\x53\x48/\x0D\x0A\x4D\x53\x48/g' file
because sed's in-place mode on OS/X takes a parameter indicating what extension to add to the backup file name, since it creates a temp file first. But then... OS/X's sed doesn't support \x anyway.
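Since the stock OS/X sed lacks \x, one workaround (not from the answer above, just a common substitute) is perl, whose regexes understand hex escapes on every platform:
# -0777 slurps the whole file so the match cannot straddle a line boundary;
# -pi edits in place
perl -0777 -pi -e 's/\x0D\x4D\x53\x48/\x0D\x0A\x4D\x53\x48/g' file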
This worked for me on Linux and OS X.
Replacing in place:
sed -i '.bk' "s/$(printf '\x03')/foo/g" index.html
(See @Ernest's comment in the answer by @tolitius.)
In Bash on OS/X, you can use a command like this:
# this command creates a variable named a which contains '\r\n' sequences
a=`echo -e "hello\r\nworld\r\nthe third line\r\n"`
echo "$a" | sed $'s/\r//g' | od -c
and now you can see the output characters:
0000000 h e l l o \n w o r l d \n t h e
0000020 t h i r d l i n e \n
0000033
You should notice the difference between 's/\r//g' and $'s/\r//g'.
Based on the above, you can use a command like this to replace a hex string:
echo "$a" | sed $'s/\x0d//g' | od -c
What would an awk script (presumably a one-liner) for removing a BOM look like?
Specification:
print every line after the first (NR > 1)
for the first line: If it starts with #FE #FF or #FF #FE, remove those and print the rest
Using GNU sed (on Linux or Cygwin):
# Removing BOM from all text files in current directory:
sed -i '1 s/^\xef\xbb\xbf//' *.txt
On FreeBSD:
sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt
Advantage of using GNU or FreeBSD sed: the -i parameter means "in place", and will update files without the need for redirections or weird tricks.
On Mac:
The awk solution in another answer works, but the sed command above does not. At least on Mac (Sierra), the sed documentation does not mention support for hexadecimal escapes à la \xef.
A similar trick can be achieved with any program by piping to the sponge tool from moreutils:
awk '…' INFILE | sponge INFILE
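For example, combined with the BOM-stripping awk one-liner from the next answer (file name hypothetical):
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' file.txt | sponge file.txt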
Try this:
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE
On the first record (line), remove the BOM characters. Print every record.
Or slightly shorter, using the knowledge that the default action in awk is to print the record:
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE
1 is the shortest condition that always evaluates to true, so each record is printed.
Enjoy!
-- ADDENDUM --
Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:
Bytes | Encoding Form
--------------------------------------
00 00 FE FF | UTF-32, big-endian
FF FE 00 00 | UTF-32, little-endian
FE FF | UTF-16, big-endian
FF FE | UTF-16, little-endian
EF BB BF | UTF-8
Thus, you can see how \xef\xbb\xbf corresponds to EF BB BF UTF-8 BOM bytes from the above table.
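To identify which of these BOMs a file carries, peek at its first four bytes and compare against the table (a sketch; file.txt is hypothetical):
head -c 4 file.txt | xxd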
Not awk, but simpler:
tail -c +4 UTF8 > UTF8.nobom
To check for BOM:
hd -n 3 UTF8
If BOM is present you'll see: 00000000 ef bb bf ...
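To avoid chopping three bytes off a file that has no BOM, a guarded variant (a sketch, reusing the file names above):
# strip the UTF-8 BOM only when it is actually present
if [ "$(head -c 3 UTF8)" = "$(printf '\357\273\277')" ]; then
  tail -c +4 UTF8 > UTF8.nobom
else
  cp UTF8 UTF8.nobom
fi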
In addition to converting CRLF line endings to LF, dos2unix also removes BOMs:
dos2unix *.txt
dos2unix also converts UTF-16 files with a BOM (but not UTF-16 files without a BOM) to UTF-8 without a BOM:
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16be>bom-utf16be
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16le>bom-utf16le
$ printf '\ufeffä\n'>bom-utf8
$ printf 'ä\n'|iconv -f utf-8 -t utf-16be>utf16be
$ printf 'ä\n'|iconv -f utf-8 -t utf-16le>utf16le
$ printf 'ä\n'>utf8
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be feff00e4000a
bom-utf16le fffee4000a00
bom-utf8 efbbbfc3a40a
utf16be 00e4000a
utf16le e4000a00
utf8 c3a40a
$ dos2unix -q *
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be c3a40a
bom-utf16le c3a40a
bom-utf8 c3a40a
utf16be 00e4000a
utf16le e4000a00
utf8 c3a40a
I know the question was directed at Unix/Linux, but I thought it would be worth mentioning a good option for the Unix-challenged (on Windows, with a UI).
I ran into the same issue on a WordPress project (the BOM was causing problems with the RSS feed and page validation), and I had to look into all the files in a quite big directory tree to find the ones with a BOM. I found an application called Replace Pioneer, and in it:
Batch Runner -> Search (to find all the files in the subfolders) -> Replace Template -> Binary remove BOM (there is a ready-made search-and-replace template for this).
It was not the most elegant solution, and it did require installing a program, which is a downside. But once I found out what was going on, it worked like a charm (and found 3 files out of about 2300 that had a BOM).