According to an online decimal-to-double converter, when converting 3 from decimal to double, I should get 4008 0000 0000 0000.
When using the Perl pack function with the template "d>*", I expected to see 4008 0000 0000 0000 from this call:
print $File pack("d>*", 3);
But when I "hexdump" to the Perl output file, I see 0840 0000 0000 0000.
I thought that it might belong to the big/little endian, but when trying the little endian,
print $File pack("d<*", 3);
I get this: 0000 0000 0000 4008
What should I do to get 4008 0000 0000 0000 from the Perl pack output?
By the way, when using "float", everything works as expected.
Your intuition about the byte order in Perl is correct, but the hexdump output doesn't mean what you think it does. By default, hexdump displays the data as two-byte words in host byte order, so on a little-endian machine the bytes within each pair appear swapped. Here are some experiments you can run to get your bearings.
# bytes are stored in the order that they are printed
$ perl -e 'print "\x{01}\x{02}\x{03}\x{04}"' | od -c
0000000 001 002 003 004
0000004
# perl reads the bytes in the correct order
$ perl -e 'print "\x{01}\x{02}\x{03}\x{04}"' | perl -ne 'print map{ord,$"}split//'
1 2 3 4
# but the way hexdump displays the bytes is confusing (od -h gives same output)
$ perl -e 'print "\x{01}\x{02}\x{03}\x{04}"' | hexdump
0000000 0201 0403
0000004
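If you want to see the bytes of the pack output in the order they actually sit in the file, ask for a byte-wise dump (od -t x1 here; hexdump -C works too). The big-endian pack result is then exactly the value the calculator predicts:
$ perl -e 'print pack("d>", 3)' | od -t x1
0000000 40 08 00 00 00 00 00 00
0000010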
I'm trying to write a program to read a pcap file captured on Linux (tcpdump version 4.5.1, libpcap version 1.5.3), but I can't get the byte swapping correct. The magic number isn't one of the values I expect (0xa1b2c3d4 or 0xd4c3b2a1) but 0xc3d4a1b2. The 'file' command correctly identifies it (tcpdump capture file (little-endian) - version 2.4 (Ethernet, capture length 65535)) and 'tcpdump -r' reads it, but I don't understand how. The magic number doesn't look little-endian OR big-endian to me. The hexdump looks like:
0000000 c3d4 a1b2 0002 0004 0000 0000 0000 0000
0000010 ffff 0000 0001 0000 6be0 5a87 a747 0008
What byte ordering is this file in?
It is probably just how the data are displayed. I'm assuming you are using hexdump. By default, this program uses a two-byte hexadecimal display, i.e. it reads two bytes at a time and interprets them as an unsigned short in host byte order:
$ hexdump file.pcap
0000000 c3d4 a1b2 ...
To get a byte-wise display you can use for example the -C option:
$ hexdump -C file.pcap
00000000 d4 c3 b2 a1 ...
Or you could use xxd:
$ xxd file.pcap
00000000: d4c3 b2a1 ...
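If you end up reading the header yourself, a minimal Perl sketch (not from this answer; the file name is just an example) is to read the 4-byte magic number and pick the byte order from its value:
open(my $fh, '<:raw', 'file.pcap') or die "open: $!";
read($fh, my $magic, 4) == 4 or die "short read";
if    (unpack('V', $magic) == 0xa1b2c3d4) { print "little-endian capture\n" }   # 'V' = 32-bit little-endian
elsif (unpack('N', $magic) == 0xa1b2c3d4) { print "big-endian capture\n" }      # 'N' = 32-bit big-endian
else  { die sprintf("unknown magic: %s\n", unpack('H8', $magic)) }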
I've been trying for a couple hours to create a conceptually trivial filter that I can use on the command line, without success. The task is to filter out all lines containing Hangul Jamo characters, while retaining all other lines (which may contain ASCII, characters in the Hangul Syllable block, etc.).
So for example if the input was
foo
ᅤᆨ
간
the output would contain the first and third lines, but not the second, since the second line contains Jamo characters. (The above is not meant to be real Korean, just a simple test case.)
I'm very disappointed with the GNU grep utility (version 2.20). I would have thought the following would work:
grep -Pv '[\x{1100}-\x{11FF}]'
but instead I get the error message grep: character value in \x{...} sequence is too large. (The \u1100 syntax, which is the actual Perl syntax, simply isn't supported.)
(I do notice that our version 2.20 is rather old. If someone tries the above with a newer version of grep, and it works, I'll certainly consider that an answer--and I'll get our IT folks to upgrade!)
I tried sed, but didn't get any further. (Sorry, I don't remember exactly what sed commands I tried, but sed's support for Unicode blocks doesn't seem any better than grep's.)
Finally, I tried perl (v5.16.3):
perl -ne 'print unless /[\u1100-\u11ff]/'
This at least succeeds in eliminating the Jamo lines while retaining the Hangul Syllable lines, but it also eliminates the ASCII lines, which I don't want. I also would have thought one of the following would work:
perl -ne 'print unless /\p{InHangul_Jamo}/'
perl -ne 'print unless /\p{Block: Hangul_Jamo}/'
but neither appears to have any effect. (AFAIK, I shouldn't have to put a .* on each side of the \p{...}, but I tried that too; no luck.)
Locale: in case it matters, I have LANG=en_US.UTF-8.
I'm sure I could do this in Python, but I'd like to understand why neither grep nor perl seems to work, because they'd be a lot simpler. (And if I'm right about the GNU utilities having poor Unicode support, why that is... and when it will be fixed. It's not like Unicode is new!) Of course I realize the problem may be that I'm not holding my mouth right when I try, but if so, it would be nice for grep at least to have better documentation on Unicode usage. Right now the documentation for grep -P says "This is highly experimental and grep -P may warn of unimplemented features." And it seems to have been that way roughly forever.
Decode inputs, encode outputs. If the encoding in question is UTF-8, the command-line switch -CSD will come in useful.
perl -CSD -ne'print if !/\p{Block: Hangul_Jamo}/'
perl -CSD -ne'print if !/\p{Block: Jamo}/'
perl -CSD -ne'print if !/\p{Blk=Jamo}/'
perl -CSD -ne'print if !/\p{InJamo}/'
perl -CSD -ne'print if !/[\N{U+1100}-\N{U+11FF}]/'
perl -CSD -ne'print if !/[\x{1100}-\x{11FF}]/'
grep -vP '[\x{1100}-\x{11FF}]'
You might want to add the Hangul_Jamo_Extended_A, Hangul_Jamo_Extended_B and Hangul_Compatibility_Jamo blocks.
perl -CSD -ne'print if !/[\p{Block: Hangul_Jamo}\p{Block: Hangul_Jamo_Extended_A}\p{Block: Hangul_Jamo_Extended_B}\p{Block: Hangul_Compatibility_Jamo}]/'
perl -CSD -ne'print if !/[\p{Block: Jamo}\p{Block: JamoExtA}\p{Block: JamoExtB}\p{Block: CompatJamo}]/'
perl -CSD -ne'print if !/[\p{Blk=Jamo}\p{Blk=JamoExtA}\p{Blk=JamoExtB}\p{Blk=CompatJamo}]/'
perl -CSD -ne'print if !/[\p{InJamo}\p{InJamoExtA}\p{InJamoExtB}\p{InCompatJamo}]/'
perl -CSD -ne'print if !/[\N{U+1100}-\N{U+11FF}\N{U+A960}-\N{U+A97F}\N{U+D7B0}-\N{U+D7FF}\N{U+3130}-\N{U+318F}]/'
perl -CSD -ne'print if !/[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]/'
grep -vP '[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]'
Let's look at your failed attempts.
grep -Pv '[\x{1100}-\x{11FF}]'
Actually, this one should work, and it does for me.
$ perl -CSD -e'print "abc\nd\x{1100}f\nghi\n"' | od -t x1
0000000 61 62 63 0a 64 e1 84 80 66 0a 67 68 69 0a
0000016
$ perl -CSD -e'print "abc\nd\x{1100}f\nghi\n"' | grep -Pv '[\x{1100}-\x{11FF}]'
abc
ghi
$ grep --version | head -1
grep (GNU grep) 2.16
I do get your error on an older machine with grep (GNU grep) 2.10.
perl -ne'print unless /\p{Block: Hangul_Jamo}/'
You didn't get any matches from /\p{Block: Hangul_Jamo}/ because you were matching against encoded text (UTF-8 bytes, chars in the range 00..FF) instead of decoded text (Unicode Code Points, chars in the range 00000..10FFFF).
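With decoding enabled on both the producing and consuming side, the same property matches as you'd expect:
$ perl -CSD -e'print "abc\nd\x{1100}f\nghi\n"' | perl -CSD -ne'print unless /\p{Block: Hangul_Jamo}/'
abc
ghi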
perl -ne 'print unless /\p{InHangul_Jamo}/'
\p{Block: X}, \p{Blk=X} and \p{InX} are equivalent.
perl -ne'print unless /[\x{1100}-\x{11FF}]/'
[\x{1100}-\x{11FF}] is equivalent to \p{Block: Hangul_Jamo}.
perl -ne'print unless /[\u1100-\u11ff]/'
You got too many matches because \u in double-quoted string literals and in regex pattern literals titlecases the next character. (e.g. "\uxyz" is equivalent to "Xyz".)
As such, [\u1100-\u11ff] is equivalent to [01f].
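You can see the titlecasing directly in a double-quoted string:
$ perl -e'print "\uabc [\u1100-\u11ff]\n"'
Abc [1100-11ff]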
For what it's worth, this is my own Jamo filter in GNU grep:
noJamo is an alias for
ggrep -vP '[\x{1100}-\x{11FF}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}\x{3130}-\x{318F}]'
However, if you only care about the core Jamo set that maps to the 11,172 precomposed syllables, and don't mind using something other than grep, then this byte-level regex should be extremely fast:
\341\204[\200-\222]|
\341\205[\241-\265]|
\341\206[\250-\277]|\341\207[\200-\202]
If you count the octal ranges in each line, they cover exactly the 19 cho (leading consonants) in row 1, the 21 jung (vowels) in row 2, and the 27 jong (trailing consonants) in row 3 (the 28th jong combination is "no trailing consonant").
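A quick sanity check of the code-point counts behind those octal ranges (they decode to U+1100–U+1112, U+1161–U+1175 and U+11A8–U+11C2):
$ perl -e'printf "%d cho, %d jung, %d jong\n", 0x1112-0x1100+1, 0x1175-0x1161+1, 0x11C2-0x11A8+1'
19 cho, 21 jung, 27 jong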
I did a quick benchmark with a synthetic 5.55 GiB .txt file; the lines that make it through the filter add up to some 4.3 GiB.
The regex's filtering throughput was about 1.55 GiB/s, practically at the limit of my SSD's I/O.
(time (pvE0 < jamotest000001.txt |
  mawk2 'BEGIN{ FS=ORS }
    /\341(\204[\200-\222]|\205[\241-\265]|\206[\250-\277]|\207[\200-\202])/' |
  pvE9 | xxh128sum)) | ecp;
in0: 5.55GiB 0:00:03 [1.55GiB/s] [1.55GiB/s]
[=================>] 100%
out9: 4.29GiB 0:00:03 [1.20GiB/s] [1.20GiB/s]
[ <=> ]
( pvE 0.1 in0 < jamotest000001.txt | mawk2 | pvE 0.1 out9 | xxh128sum; )
3.70s user 2.73s system 178% cpu 3.597 total
f4ef119214a3c39c7c560ad24491b96c stdin
I'd like to write a Clojure function that takes a string in one encoding and converts it to another. The iconv library does this.
For example, let's look at the character "è". In ISO-8859-1 (http://www.ascii-code.com/), that's e8 as hex. In UTF-8 (http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%A8&mode=char), it's c3 a8.
So let's say we have iso.txt, which contains our letter and EOL:
$ hexdump iso.txt
0000000 e8 0a
0000002
Now we can convert it to UTF-8 like this:
$ iconv -f ISO-8859-1 -t UTF-8 iso.txt | hexdump
0000000 c3 a8 0a
0000003
How should I write something equivalent in Clojure? I'm happy to use any external libraries, but I don't know where I'd go to find them. Looking around, I couldn't figure out how to use libiconv itself on the JVM, but there's probably an alternative?
Edit
After reading Alex's link in the comment, this is so simple and so cool:
user> (new String (byte-array 2 (map unchecked-byte [0xc3 0xa8])) "UTF-8")
"è"
user> (new String (byte-array 1 [(unchecked-byte 0xe8)]) "ISO-8859-1")
"è"
If you want a simple whole-file conversion to UTF-8, slurp allows for specifying the file encoding with the :encoding option and spit will output UTF-8 by default. This method will read the entire file into memory, so large files might require a different approach.
$ printf "\xe8\n" > iso.txt
$ hexdump iso.txt
0000000 e8 0a
0000002
(spit "/Users/path/iso2.txt"
(slurp "/Users/path/iso.txt" :encoding "ISO-8859-1"))
$ hexdump iso2.txt
0000000 c3 a8 0a
0000003
Note: slurp will assume UTF-8 if you do not specify an encoding.
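If you'd rather have a small reusable function than a one-off slurp/spit, here is a minimal sketch (not from the answer above, and the function name is made up) that re-encodes a byte array from one charset to another using plain java.lang.String:
;; decode bytes with from-enc, then re-encode the resulting string with to-enc
(defn recode-bytes
  [^bytes bs ^String from-enc ^String to-enc]
  (.getBytes (String. bs from-enc) to-enc))

;; usage: ISO-8859-1 "è" (0xE8) becomes the UTF-8 bytes 0xC3 0xA8
;; (vec (recode-bytes (byte-array [(unchecked-byte 0xe8)]) "ISO-8859-1" "UTF-8"))
;; => [-61 -88]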
I am using Perl open to create a new file on Solaris 10 as follows:
open($fh, ">$filePath");
What is the default character encoding for files written with this call on my system?
The output from the locale command is given below:
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
This was not as easy a question to answer as I thought it would be.
The default encoding is raw, which is suitable for binary data. Any character with an ordinal value under 256 is passed as is:
$ perl -e 'print chr(0xFF)' | od -c
0000000 377
0000001
The curious thing is what happens when you try to write a character above ordinal value 255. Then it looks like you get UTF-8 encoding.
$ perl -e 'print chr(0x100)' | od -c
0000000 304 200
0000002
I don't know where this behavior is documented as such, but Perl does emit a "Wide character in print" warning when it happens, and perldiag's entry for that warning suggests adding an explicit output layer to quiet it.
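If you want a specific encoding rather than the raw default, request it explicitly with an I/O layer when opening the file; a minimal sketch using the question's $filePath:
# write UTF-8 explicitly instead of relying on the default raw layer
open(my $fh, '>:encoding(UTF-8)', $filePath) or die "Can't open $filePath: $!";
print $fh chr(0x100);   # written as the UTF-8 bytes C4 80, with no "Wide character" warning
close($fh) or die "Can't close $filePath: $!";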
What would an awk script (presumably a one-liner) for removing a BOM look like?
Specification:
print every line after the first (NR > 1)
for the first line: If it starts with #FE #FF or #FF #FE, remove those and print the rest
Using GNU sed (on Linux or Cygwin):
# Removing BOM from all text files in current directory:
sed -i '1 s/^\xef\xbb\xbf//' *.txt
On FreeBSD:
sed -i .bak '1 s/^\xef\xbb\xbf//' *.txt
Advantage of using GNU or FreeBSD sed: the -i parameter means "in place", and will update files without the need for redirections or weird tricks.
On Mac:
The awk solution in another answer works, but the sed command above does not. At least on macOS (Sierra), the sed documentation does not mention support for hexadecimal escapes such as \xef.
A similar trick can be achieved with any program by piping to the sponge tool from moreutils:
awk '…' INFILE | sponge INFILE
Try this:
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE
On the first record (line), remove the BOM characters. Print every record.
Or slightly shorter, using the knowledge that the default action in awk is to print the record:
awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE
1 is the shortest condition that always evaluates to true, so each record is printed.
Enjoy!
-- ADDENDUM --
Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding:
Bytes | Encoding Form
--------------------------------------
00 00 FE FF | UTF-32, big-endian
FF FE 00 00 | UTF-32, little-endian
FE FF | UTF-16, big-endian
FF FE | UTF-16, little-endian
EF BB BF | UTF-8
Thus, you can see how \xef\xbb\xbf corresponds to the EF BB BF UTF-8 BOM bytes in the table above.
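If you also want to strip the two-byte UTF-16 BOMs mentioned in the question's spec, a sketch along the same lines is shown below; note that a UTF-16 file usually needs re-encoding anyway, that the UTF-32 forms contain NUL bytes many awks won't handle, and that you may need LC_ALL=C so the high bytes are matched literally:
awk 'NR==1{sub(/^(\xef\xbb\xbf|\xfe\xff|\xff\xfe)/,"")}1' INFILE > OUTFILE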
Not awk, but simpler:
tail -c +4 UTF8 > UTF8.nobom
To check for BOM:
hd -n 3 UTF8
If BOM is present you'll see: 00000000 ef bb bf ...
In addition to converting CRLF line endings to LF, dos2unix also removes BOMs:
dos2unix *.txt
dos2unix also converts UTF-16 files with a BOM (but not UTF-16 files without a BOM) to UTF-8 without a BOM:
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16be>bom-utf16be
$ printf '\ufeffä\n'|iconv -f utf-8 -t utf-16le>bom-utf16le
$ printf '\ufeffä\n'>bom-utf8
$ printf 'ä\n'|iconv -f utf-8 -t utf-16be>utf16be
$ printf 'ä\n'|iconv -f utf-8 -t utf-16le>utf16le
$ printf 'ä\n'>utf8
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be feff00e4000a
bom-utf16le fffee4000a00
bom-utf8 efbbbfc3a40a
utf16be 00e4000a
utf16le e4000a00
utf8 c3a40a
$ dos2unix -q *
$ for f in *;do printf '%11s %s\n' $f $(xxd -p $f);done
bom-utf16be c3a40a
bom-utf16le c3a40a
bom-utf8 c3a40a
utf16be 00e4000a
utf16le e4000a00
utf8 c3a40a
I know the question was directed at Unix/Linux, but I thought it would be worth mentioning a good option for the Unix-challenged (on Windows, with a UI).
I ran into the same issue on a WordPress project (the BOM was causing problems with the RSS feed and page validation), and I had to look through all the files in a fairly large directory tree to find the ones with a BOM. I found an application called Replace Pioneer and used:
Batch Runner -> Search (to find all the files in the subfolders) -> Replace Template -> Binary remove BOM (there is a ready-made search-and-replace template for this).
It was not the most elegant solution and it did require installing a program, which is a downside. But once I figured out what was going on, it worked like a charm (and found 3 files with a BOM out of about 2300).