I'm trying to write a program to read a pcap file captured on Linux (tcpdump version 4.5.1, libpcap version 1.5.3), but I can't get the byte swapping right. The magic number isn't one of the values I expect (0xa1b2c3d4 or 0xd4c3b2a1) but 0xc3d4a1b2. The 'file' command correctly identifies it (tcpdump capture file (little-endian) - version 2.4 (Ethernet, capture length 65535)) and 'tcpdump -r' reads it, but I don't understand how. The magic number doesn't look little-endian OR big-endian to me. The hexdump looks like:
0000000 c3d4 a1b2 0002 0004 0000 0000 0000 0000
0000010 ffff 0000 0001 0000 6be0 5a87 a747 0008
What byte ordering is this file in?
It is probably just how the data are displayed. I'm assuming you are using hexdump. By default, this program uses a two-byte hexadecimal display, i.e. it reads two bytes at a time and interprets them as an unsigned short in your machine's native byte order (little-endian here), which is why the bytes within each pair appear swapped:
$ hexdump file.pcap
0000000 c3d4 a1b2 ...
To get a byte-wise display you can use for example the -C option:
$ hexdump -C file.pcap
00000000 d4 c3 b2 a1 ...
Or you could use xxd:
$ xxd file.pcap
00000000: d4c3 b2a1 ...
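To convince yourself it is purely a display issue, you can round-trip the four magic bytes through both display modes. A quick sketch, assuming a little-endian machine and a printf that understands \x escapes (bash and GNU coreutils do):
$ printf '\xd4\xc3\xb2\xa1' | hexdump
0000000 c3d4 a1b2
0000004
$ printf '\xd4\xc3\xb2\xa1' | hexdump -C
00000000  d4 c3 b2 a1                                       |....|
00000004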
I have a shell script that I use to remotely clean an XML file, produced by another system, that contains invalid Unicode characters. I am currently using this command in the script to remove the invalid characters:
perl -CSDA -i -pe's/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;' file.xml
and this has worked so far, but now the file has a new error which, as far as I can tell, is an 'xA0' byte. When my perl command reaches that error in the file, it erases the rest of the file. I modified my command to include xA0, but it doesn't work:
perl -CSDA -i -pe's/[^\x9\xA0\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;' file.xml
I have also tried using:
iconv -f UTF-8 -t UTF-8 -c file.xml > file2.xml
but that doesn't do anything. It produces an identical file with the same errors.
Is there a Unix command that I can use that will completely remove all invalid Unicode characters?
EDIT:
some HEX output (note the 1A's and A0's):
3E 1A 1A 33 30 34 39 37 1A 1A 3C 2F 70
6D 62 65 72 3E A0 39 34 32 39 38 3C 2F
You may use the following one-liner:
perl -i -MEncode -0777ne'print encode("UTF-8",decode("UTF-8",$_,sub{""}))' file.xml
You may also extend it to warn about each bad byte:
perl -i -MEncode -0777ne'print encode("UTF-8",decode("UTF-8",$_,sub{warn "Bad byte: $_[0]";""}))' file.xml
A0 is not a valid UTF-8 sequence. The errors you were encountering were XML encoding errors, while this one is a character encoding error.
A0 is the Unicode code point for a non-breaking space. It is also the iso-8859-1 and cp1252 encoding of that code point.
I would recommend fixing the problem at its source. But if that's not possible, I would recommend using Encoding::FixLatin to fix this new type of error (perhaps via the bundled fix_latin script). It will correctly replace A0 with C2 A0 (the UTF-8 encoding of a non-breaking space).
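If Encoding::FixLatin is installed, you can sanity-check that behaviour from the shell. This is just a sketch with a made-up file name, with the result shown via xxd -p:
$ printf 'value:\xa0123\n' > mixed.xml
$ fix_latin < mixed.xml | xxd -p
76616c75653ac2a03132330a
Note that the lone a0 has become c2 a0 while the ASCII bytes are untouched.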
Combined with your existing script:
perl -i -MEncoding::FixLatin=fix_latin -0777pe'
$_ = fix_latin($_);
utf8::decode($_);
s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
utf8::encode($_);
' file.xml
I'd like to write a clojure function that takes a string in one encoding and converts it to another. The iconv library does this.
For example, let's look at the character "è". In ISO-8859-1 (http://www.ascii-code.com/), that's e8 as hex. In UTF-8 (http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%A8&mode=char), it's c3 a8.
So let's say we have iso.txt, which contains our letter and EOL:
$ hexdump iso.txt
0000000 e8 0a
0000002
Now we can convert it to UTF-8 like this:
$ iconv -f ISO-8859-1 -t UTF-8 iso.txt | hexdump
0000000 c3 a8 0a
0000003
How should I write something equivalent in clojure? I'm happy to use any external libraries, but I don't know where I'd go to find them. Looking around I couldn't figure out how to use libiconv itself on the JVM, but there's probably an alternative?
Edit
After reading Alex's link in the comment, this is so simple and so cool:
user> (new String (byte-array 2 (map unchecked-byte [0xc3 0xa8])) "UTF-8")
"è"
user> (new String (byte-array 1 [(unchecked-byte 0xe8)]) "ISO-8859-1")
"è"
If you want a simple whole-file conversion to UTF-8, slurp allows for specifying the file encoding with the :encoding option and spit will output UTF-8 by default. This method will read the entire file into memory, so large files might require a different approach.
$ printf "\xe8\n" > iso.txt
$ hexdump iso.txt
0000000 e8 0a
0000002
(spit "/Users/path/iso2.txt"
(slurp "/Users/path/iso.txt" :encoding "ISO-8859-1"))
$ hexdump iso2.txt
0000000 c3 a8 0a
0000003
Note: slurp will assume UTF-8 if you do not specify an encoding.
I am using Perl open to create a new file on Solaris 10 as follows:
open($fh, ">$filePath");
What is the default character encoding for files written with this call on my system?
The output from locale command is given below
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
This was not as easy a question to answer as I thought it would be.
The default encoding is raw, which is suitable for binary data. Any character with an ordinal value under 256 is passed as is:
$ perl -e 'print chr(0xFF)' | od -c
0000000 377
0000001
The curious thing is what happens when you try to write a character with an ordinal value above 255. Then it looks like you get UTF-8 encoding (and Perl emits a "Wide character in print" warning on STDERR).
$ perl -e 'print chr(0x100)' | od -c
0000000 304 200
0000002
I don't know where or if this behavior is documented.
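If you want a specific encoding rather than relying on the raw default, you can request one explicitly with an I/O layer, e.g. '>:encoding(UTF-8)' in the open call from the question. As a quick sketch of the effect, the analogous one-liner produces the same UTF-8 bytes but without the "Wide character" warning:
$ perl -e 'binmode STDOUT, ":encoding(UTF-8)"; print chr(0x100)' | od -c
0000000 304 200
0000002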
According to this calculator site, converting 3 from decimal to a double should give 4008 0000 0000 0000.
Using the Perl pack function with the "d>*" template, I expected to see 4008 0000 0000 0000 when I write it like this:
print $File pack("d>*", 3);
But when I "hexdump" to the Perl output file, I see 0840 0000 0000 0000.
I thought that it might belong to the big/little endian, but when trying the little endian,
print $File pack("d<*", 3);
I get this: 0000 0000 0000 4008
What shall I do if I want to get the result 4008 0000 0000 0000 from Perl pack output?
By the way, when using "float", everything works as expected.
Your intuition about the byte order in Perl is correct, but the hexdump output doesn't mean what you think it does. By default, hexdump reads the input two bytes at a time and displays each pair as a 16-bit value in your machine's native (little-endian) byte order, so the bytes within each pair appear swapped. Here are some experiments you can run to get your bearings.
# bytes are stored in the order that they are printed
$ perl -e 'print "\x{01}\x{02}\x{03}\x{04}"' | od -c
0000000 001 002 003 004
0000004
# perl reads the bytes in the correct order
$ perl -e 'print "\x{01}\x{02}\x{03}\x{04}"' | perl -ne 'print map{ord,$"}split//'
1 2 3 4
# but the way hexdump displays the bytes is confusing (od -h gives same output)
$ perl -e 'print "\x{01}\x{02}\x{03}\x{04}"' | hexdump
0000000 0201 0403
0000004
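Asking hexdump for a byte-wise display shows that pack("d>", 3) really does produce the bytes in the order you expected, so the apparent swap is only in how hexdump groups them (output from a typical Linux hexdump; spacing may differ slightly):
$ perl -e 'print pack("d>", 3)' | hexdump -C
00000000  40 08 00 00 00 00 00 00                           |@.......|
00000008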
What would an awk script (presumably a one-liner) for removing a BOM look like?
Specification:
print every line after the first (NR > 1)
for the first line: If it starts with #FE #FF or #FF #FE, remove those and print the rest
Using GNU sed (on Linux or Cygwin):
# Removing BOM from all text files in current directory:
sed -i '1 s/^\xef\xbb\xbf//' *.txt
On FreeBSD:
sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt
Advantage of using GNU or FreeBSD sed: the -i parameter means "in place", and will update files without the need for redirections or weird tricks.
On Mac:
The awk solution in another answer works, but the sed command above does not. At least on Mac (Sierra), the sed documentation does not mention support for hexadecimal escapes like \xef.
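A common workaround for BSD/macOS sed (a sketch, not verified on every version) is to let the shell expand the BOM bytes with printf so sed only sees literal characters; note that BSD sed needs an explicit, here empty, backup suffix after -i:
$ sed -i '' "1s/^$(printf '\xef\xbb\xbf')//" *.txt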
A similar trick can be achieved with any program by piping to the sponge tool from moreutils:
awk '…' INFILE | sponge INFILE
Try this:
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE
On the first record (line), remove the BOM characters. Print every record.
Or slightly shorter, using the knowledge that the default action in awk is to print the record:
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE
1 is the shortest condition that always evaluates to true, so each record is printed.
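A quick way to check that it works, assuming gawk (which understands \x escapes in regular expressions) and a throwaway input file:
$ printf '\xef\xbb\xbfhello\n' > INFILE
$ awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE | xxd -p
68656c6c6f0a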
Enjoy!
-- ADDENDUM --
The Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:
Bytes | Encoding Form
--------------------------------------
00 00 FE FF | UTF-32, big-endian
FF FE 00 00 | UTF-32, little-endian
FE FF | UTF-16, big-endian
FF FE | UTF-16, little-endian
EF BB BF | UTF-8
Thus you can see how \xef\xbb\xbf corresponds to the EF BB BF UTF-8 BOM bytes in the table above.
Not awk, but simpler:
tail -c +4 UTF8 > UTF8.nobom
To check for BOM:
hd -n 3 UTF8
If BOM is present you'll see: 00000000 ef bb bf ...
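Since tail -c +4 drops the first three bytes unconditionally, a slightly safer variant (a sketch reusing the same file names) strips them only when they really are a UTF-8 BOM:
if [ "$(head -c 3 UTF8 | xxd -p)" = "efbbbf" ]; then
    tail -c +4 UTF8 > UTF8.nobom    # BOM present: drop the first three bytes
else
    cp UTF8 UTF8.nobom              # no BOM: copy unchanged
fi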
In addition to converting CRLF line endings to LF, dos2unix also removes BOMs:
dos2unix *.txt
dos2unix also converts UTF-16 files with a BOM (but not UTF-16 files without a BOM) to UTF-8 without a BOM:
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16be>bom-utf16be
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16le>bom-utf16le
$ printf '\ufeffä\n'>bom-utf8
$ printf 'ä\n'|iconv -f utf-8 -t utf-16be>utf16be
$ printf 'ä\n'|iconv -f utf-8 -t utf-16le>utf16le
$ printf 'ä\n'>utf8
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be feff00e4000a
bom-utf16le fffee4000a00
bom-utf8 efbbbfc3a40a
utf16be 00e4000a
utf16le e4000a00
utf8 c3a40a
$ dos2unix -q *
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be c3a40a
bom-utf16le c3a40a
bom-utf8 c3a40a
utf16be 00e4000a
utf16le e4000a00
utf8 c3a40a
I know the question was directed at Unix/Linux, but I thought it would be worth mentioning a good option for the Unix-challenged (on Windows, with a UI).
I ran into the same issue on a WordPress project (the BOM was causing problems with the RSS feed and page validation), and I had to look through all the files in a fairly big directory tree to find the ones that had a BOM. I found an application called Replace Pioneer and in it:
Batch Runner -> Search (to find all the files in the subfolders) -> Replace Template -> Binary remove BOM (there is a ready made search and replace template for this).
It was not the most elegant solution and it did require installing a program, which is a downside. But once I found out what was going on, it worked like a charm (and found 3 files out of about 2300 that had a BOM).