I'm not sure of the best way to describe this.
Essentially I am attempting to write to a buffer which requires a certain protocol. The first two bytes I would like to write are "10000001" and "11111110" (bit by bit). How can I write these two bytes to a file handle in Perl?
To convert spelled-out binary to actual bytes, you want the pack function with either B or b (depending on the order you have the bits in):
print FILE pack('B*', '1000000111111110');
However, if the bytes are constant, it's probably better to convert them to hex values and use the \x escape with a string literal:
print FILE "\x81\xFE";
How about
# open my $fh, ...
print $fh "\x81\xFE"; # 10000001 and 11111110
Since version 5.6.0 (released in March 2000), perl has supported binary literals as documented in perldata:
Numeric literals are specified in any of the following floating point or integer formats:
12345
12345.67
.23E-10 # a very small number
3.14_15_92 # a very important number
4_294_967_296 # underscore for legibility
0xff # hex
0xdead_beef # more hex
0377 # octal (only numbers, begins with 0)
0b011011 # binary
You are allowed to use underscores (underbars) in numeric literals between digits for legibility. You could, for example, group binary digits by threes (as for a Unix-style mode argument such as 0b110_100_100) or by fours (to represent nibbles, as in 0b1010_0110) or in other groups.
You may be tempted to write
print $fh 0b10000001, 0b11111110;
but the output would be
129254
because 10000001₂ = 129₁₀ and 11111110₂ = 254₁₀.
You want a specific representation of the literals’ values, namely as two unsigned bytes. For that, use pack with a template of "C2", i.e., octet times two. Adding underscores for readability and wrapping it in a convenient subroutine gives
sub write_marker {
    my($fh) = @_;
    print $fh pack "C2", 0b1000_0001, 0b1111_1110;
}
As a quick demo, consider
binmode STDOUT or die "$0: binmode: $!\n"; # we'll send binary data
write_marker *STDOUT;
When run as
$ ./marker-demo | od -t x1
the output is
0000000 81 fe
0000002
In case it’s unfamiliar, the od utility is used here for presentational purposes because the output contains a control character and þ (Latin small letter thorn) in my system’s encoding.
The invocation above commands od to render in hexadecimal each byte from its input, which is the output of marker-demo. Note that 10000001₂ = 81₁₆ and 11111110₂ = FE₁₆. The numbers in the left-hand column are offsets: the special marker bytes start at offset zero (that is, immediately), and there are exactly two of them.
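To round-trip, the marker can also be read back and re-rendered as bits from Perl itself. The following is only a sketch; marker.bin is a hypothetical file containing the same two marker bytes:

open my $in, '<:raw', 'marker.bin' or die "$0: open: $!\n";   # hypothetical file name
read($in, my $buf, 2) == 2 or die "$0: short read\n";
printf "%08b %08b\n", unpack "C2", $buf;                      # prints: 10000001 11111110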
In a Perl script of mine, I have to write a mix of UTF-8 and raw bytes into files.
I have a big string in which everything is encoded as UTF-8. In that "source" string, UTF-8 characters are just as they should be (that is, valid UTF-8 byte sequences), while the "raw bytes" have been stored as if they were codepoints with the value of the raw byte. So, in the source string, a "raw" byte of 0x50 would be stored as a single 0x50 byte; whereas a "raw" byte of 0xff would be stored as the two-byte, UTF-8-valid sequence 0xc3 0xbf. When I write these "raw" bytes back, I need to put them back into single-byte form.
I have other data structures allowing me to know which parts of the string represent what kind of data. A list of fields, types, lengths, etc.
When writing in a plain file, I write each field in turn, either directly (if it's UTF-8) or by encoding its value to ISO-8859-1 if it's meant to be raw bytes. It works perfectly.
Now, in some cases, I need to write the value not directly to a file, but as a record of a BerkeleyDB (Btree, but that's mostly irrelevant) database.
To do that, I need to write ALL the values that compose my record, in a single write operation. Which means that I need to have a scalar that holds a mix of UTF-8 and raw bytes.
Example:
Input Scalar (all hex values): 61 C3 8B 00 C3 BF
Expected Output Format: 2 UTF-8 characters, then 2 raw bytes.
Expected Output: 61 C3 8B 00 FF
At first, I created a string by concatenating the same values I was writing to my file from an empty string. And I tried writing that very string to a "standard" file without adding encoding. I got '?' characters instead of all my raw bytes over 0x7f (because, obviously, Perl decided to consider my string to be UTF-8).
Then, to try and tell Perl that it was already encoded, and to "please not try to be smart about it", I tried to encode the UTF-8 parts into "UTF-8", encode the binary parts into "ISO-8859-1", and concatenate everything. Then I wrote it. This time, the bytes looked perfect, but the parts which were already UTF-8 had been "double-encoded", that is, each byte of a multi-byte character had been seen as its codepoint...
I thought Perl wasn't supposed to re-encode "internal" UTF-8 into "encoded" UTF-8, if it was internally marked as UTF-8. The string holding all the values in UTF-8 comes from a C API, which sets the UTF-8 marker (or is supposed to, at the very least), to let Perl know it is already decoded.
Any idea about what I did miss there?
Is there a way to tell Perl what I want to do is just put a bunch of bytes one after another, and to please not try to interpret them in any way? The file I write to is opened as ">:raw" for that very reason, but I guess I need a way to specify that a given scalar is "raw" too?
Epilogue: I found the cause of the problem. The $bigInputString was supposed to be entirely composed of UTF-8 encoded data. But it did contain raw bytes with big values, because of a bug in C (turns out a "char" (not "unsigned char") is best tested with bitwise operators, instead of a " > 127"... ahem). So, "big" bytes weren't split into a two-bytes UTF-8 sequence, in the C API.
Which means the $bigInputString, created from the bad C data, didn't have the expected contents, and Perl rightfully didn't like it either.
After I corrected the bug, the string correctly encoded to UTF-8 (for the parts I wanted to keep as UTF-8) or LATIN-1 (for the "raw bytes" I wanted to convert back), and I got no further problems.
Sorry for wasting your time, guys. But I still learned some things, so I'll keep this here. Moral of the story: Devel::Peek is GOOD for debugging (thanks ikegami), and one should always double-check instead of assuming. Granted, I was in a hurry on Friday, but the fault is still mine.
So, thanks to everyone who helped, or tried to, and special thanks to ikegami (again), who used quite a bit of his time helping me.
Assuming you have a Unicode string where you know how each codepoint is supposed to be stored (as a UTF-8 sequence or as a single byte), and a way to create a template string in which each character says which form the corresponding character of the Unicode string should use (U for UTF-8, C for a single byte, to keep things simple), you can use pack:
#!/usr/bin/env perl
use strict;
use warnings;

sub process {
    my ($str, $formats) = @_;
    my $template = "C0$formats";
    my @chars = map { ord } split(//, $str);
    pack $template, @chars;
}

my $str = "\x61\xC3\x8B\x00\xC3\xBF";
utf8::decode($str);
print process($str, "UUCC"); # Outputs 0x61 0xc3 0x8b 0x00 0xff
So you have
my $in = "\x61\xC3\x8B\x00\xC3\xBF";
and you want
my $out = "\x61\xC3\x8B\x00\xFF";
This is the result of decoding only some parts of the input string, so you want something like the following:
sub decode_utf8 { my ($s) = @_; utf8::decode($s) or die("Invalid Input"); $s }
my $out = join "",
substr($in, 0, 3),
decode_utf8(substr($in, 3, 1)),
decode_utf8(substr($in, 4, 2));
Tested.
Alternatively, you could decode the entire thing and re-encode the parts that should be encoded.
sub encode_utf8 { my ($s) = @_; utf8::encode($s); $s }
utf8::decode($in) or die("Invalid Input");
my $out = join "",
encode_utf8(substr($in, 0, 2)),
substr($in, 2, 1),
substr($in, 3, 1);
Tested.
You have not indicated how you know which parts to decode and which not to decode, but you indicated that you have this information.
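For illustration, once that information is reduced to an ordered list of segment descriptions, the first approach generalizes to something like the sketch below. The [type => byte length] pair format and the rebuild_record name are assumptions for the example, not anything from the question:

use Encode qw(decode);

# 'utf8' spans are kept as-is; 'raw' spans are decoded so their multi-byte
# sequences collapse back to single characters in the 0..255 range.
sub rebuild_record {
    my ($in, @segments) = @_;
    my ($pos, $out) = (0, '');
    for my $seg (@segments) {
        my ($type, $len) = @$seg;
        my $bytes = substr($in, $pos, $len);
        $out .= $type eq 'raw' ? decode('UTF-8', $bytes, Encode::FB_CROAK) : $bytes;
        $pos += $len;
    }
    return $out;
}

# The question's example: 2 UTF-8 characters (3 bytes), then 2 raw bytes (3 bytes).
my $in  = "\x61\xC3\x8B\x00\xC3\xBF";
my $out = rebuild_record($in, [utf8 => 3], [raw => 3]);
# Printed to a ':raw' handle, $out comes out as the bytes 61 C3 8B 00 FF.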
I'm using MARC::Lint to lint some MARC records, but every now and then I'm getting an error (on about 1% of the files):
utf8 "\xCA" does not map to Unicode at /usr/lib/x86_64-linux-gnu/perl/5.26/Encode.pm line 212.
The problem is that I've tried different methods but cannot find "\xCA" in the file...
My script is:
#!perl -w
use MARC::File::USMARC;
use MARC::Lint;
use utf8;
use open OUT => ':utf8';
my $lint = new MARC::Lint;
my $filename = shift;
my $file = MARC::File::USMARC->in( $filename );
while ( my $marc = $file->next() ) {
    $lint->check_record( $marc );
    # Print the errors that were found
    print join( "\n", $lint->warnings ), "\n";
} # while
and the file can be downloaded here: http://eroux.fr/I14376.mrc
Is "\xCA" hidden somewhere? Or is this a bug in MARC::Lint?
The problem has nothing to do with MARC::Lint. Remove the lint check, and you'll still get the error.
The problem is a bad data file.
The file contains a "directory" of where the information is located in the file. The following is a human-readable rendition of the directory for the file you provided:
tagno|offset|len # Offsets are from the start of the data portion.
001|00000|0017 # Lengths include the single-byte field terminator.
006|00017|0019 # Offset and lengths are in bytes.
007|00036|0015
008|00051|0041
035|00092|0021
035|00113|0021
040|00134|0018
050|00152|0022
066|00174|0009
245|00183|0101
246|00284|0135
264|00419|0086
300|00505|0034
336|00539|0026
337|00565|0026
338|00591|0036
546|00627|0016
500|00643|0112
505|00755|9999 <--
506|29349|0051
520|29400|0087
533|29487|0115
542|29602|0070
588|29672|0070
653|29742|0013
710|29755|0038
720|29793|0130
776|29923|0066
856|29989|0061
880|30050|0181
880|30231|0262
Notice the length of the field with tag 505: 9999. This is the maximum value supported (because the length is stored as four decimal digits). The catch is that the value of that field is far larger than 9,999 bytes; it's actually 28,594 bytes in size.
What happens is that the module extracts 9,999 bytes rather than 28,594. This happens to cut a UTF-8 sequence in half. (The specific sequence is CA BA, the encoding of ʼ.) Later, when the module attempts to decode that text, an error is thrown. (CA must be followed by another byte to be valid.)
Are these records you are generating? If so, you need to make sure that no field requires more than 9,999 bytes.
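If you are, a rough pre-write check along these lines could flag the problem. This is only a sketch: $marc stands for a MARC::Record you are about to write, and the length is approximated from each field's stringified contents, so the real encoded length (with indicators and subfield delimiters) is slightly larger:

use Encode qw(encode);

for my $field ( $marc->fields ) {   # $marc: hypothetical record about to be written
    my $approx_len = length encode('UTF-8', $field->as_string);
    warn sprintf("tag %s is ~%d bytes, too large for a USMARC length entry\n",
                 $field->tag, $approx_len)
        if $approx_len > 9999;
}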
Still, the module should handle this better. It could read until it finds an end-of-field marker (instead of relying on the stated length) when no end-of-field marker appears where it expects one, and/or it could handle decoding errors in a non-fatal manner. It already has a mechanism to report these problems ($marc->warnings).
In fact, if it hadn't died (say if the cut happened to occur in between characters instead of in the middle of one), $marc->warnings would have returned the following message:
field does not end in end of field character in tag 505 in record 1
$hexnumber = 0x09C343C2E95ACABA;
print ("$hexnumber");
This prints 703480471217687226. Can anyone help me solve this issue?
That's expected. What's your question? If you turn on warnings, you will see the following warning:
Hexadecimal number > 0xffffffff non-portable at test.pl line 4.
You may use the bigint pragma, which replaces the hex function with a version that can handle numbers that large.
use bigint qw/hex/;
hex
    Override the built-in hex() method with a version that can handle big integers. This overrides it by exporting it to the current package. Under Perl v5.10.0 and higher, this is not so necessary, as hex() is lexically overridden in the current scope whenever the bigint pragma is active. (from perldoc bigint)
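For example, a sketch of the approach just described; with bigint's hex() the value stays exact even where native integers are only 32 bits:

use bigint qw/hex/;

my $hexnumber = hex("0x09C343C2E95ACABA");   # parsed by bigint's hex()
print "$hexnumber\n";                        # 703480471217687226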
You can print out numbers in hexadecimal using printf:
printf "%X", $hexnumber; # or %x for lower case a-f digits
but note that this isn't portable to perls that use 32-bit integers.
Perhaps you really just want to store it as a string in the first place?
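Following the store-it-as-a-string idea, a sketch might look like this; Math::BigInt (which backs the bigint pragma) is only pulled in when arithmetic or conversion is actually needed:

use Math::BigInt;

my $hexstring = '09C343C2E95ACABA';               # keep the digits as text
print "$hexstring\n";                             # prints exactly what was stored

my $value = Math::BigInt->from_hex($hexstring);   # convert only when needed
print $value->as_hex(), "\n";                     # 0x9c343c2e95acaba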
I'm relatively inexperienced with Perl, but my question concerns the unpack function when getting the bits for a numeric value. For example:
my $bits = unpack("b*", 1);
print $bits;
This results in 10001100 being printed, which is 140 in decimal. In the reverse order it's 49 in decimal. Any other values I've tried seem to give the incorrect bits.
However, when I run $bits through pack, it produces 1 again. Is there something I'm missing here?
It seems that I jumped to conclusions when I thought my problem was solved. Maybe I should briefly explain what it is I'm trying to do.
I need to convert an integer value that could be as big as 24 bits long (the point being that it could be bigger than one byte) into a bit string. This much can be accomplished using unpack and pack as suggested by @ikegami, but I also need to find a way to convert that bit string back into its original integer (not a string representation of it).
As I mentioned, I'm relatively inexperienced with Perl, and I've been trying with no success.
I found what seems to be an optimal solution:
my $bits = sprintf("%032b", $num);
print "$bits\n";
my $orig = unpack("N", pack("B32", substr("0" x 32 . $bits, -32)));
print "$orig\n";
This might be obvious, but the other answers haven't pointed it out explicitly: The second argument in unpack("b*", 1) is being typecast to the string "1", which has an ASCII value of 31 in hex (with the most significant nibble first).
The corresponding binary would be 00110001, which is reversed to 10001100 in your output because you used "b*" instead of "B*". These correspond to the opposite "endian" forms of the binary representation. "Endian-ness" is just whether the most-significant bits go at the start or the end of the binary representation.
Yes, you're missing that different machines support different "endianness". And Perl is treating 1 like '1', i.e. 0x31. So you're seeing 1 -> 1000 (in ascending bit order) and 3 -> 1100.
"Wrong" depends on perspective and whether or not you gave Perl enough information to know what encoding and endianness you wanted.
From pack:
b A bit string (ascending bit order inside each byte, like vec()).
B A bit string (descending bit order inside each byte).
I think this is what you want:
unpack( 'B*', chr(1))
You're trying to convert an integer to binary and then back. While you can do that with pack and then unpack, the better way is to use sprintf or printf with the %b format:
my $int = 5;
my $bits = sprintf "%024b\n", $int;
print "$bits\n";
To go the other way (converting a string of 0s & 1s to an integer), the best way is to use the oct function with a 0b prefix:
my $orig = oct("0b$bits");
print "$orig\n";
As the others explained, unpack expects a string to unpack, so if you have an integer, you first have to pack it into a string. The %b format expects an integer to begin with.
If you need to do a lot of this on bytes, and speed is crucial, you could build a lookup table:
my @binary = map { sprintf '%08b', $_ } 0 .. 255;
print $binary[$int]; # Assuming $int is between 0 and 255
ord("1") is 49. You must want something like sprintf("%064b", 1), although that does seem like overkill.
You didn't specify what you expect. I'm guessing you're expecting 00000001.
That's the correct bits for the byte you provided, at least on non-EBCDIC systems. Remember, the input of unpack is a string (mostly strings of bytes). Perhaps you wanted
unpack('b*', pack('C', 1))
Update: As others have pointed out, the above gives 10000000. For 00000001, you'd use
unpack('B*', pack('C', 1)) # 00000001
You want "B" instead of "b".
$ perl -E'say unpack "b*", "1"'
10001100
$ perl -E'say unpack "B*", "1"'
00110001
See the documentation for pack.
We use Template Toolkit in a Catalyst app. We configured TT to use UTF-8 and had no problems with it before.
Now I call the substr() method on a string variable. Unfortunately, it splits the string after n bytes instead of n characters. If the n-th and (n+1)-th bytes form one Unicode character, the character is split and only its first byte ends up in the substr() result.
How can I fix or work around that behaviour?
[% string = "fööbär";
string.length; # prints 9
string.substr(0, 5); # prints "föö" (1 ascii + 2x 2 byte unicode)
string.substr(0, 4); # prints "fö?" (1 ascii, 1x 2 byte unicode, 1 unknown char)
%]
Until now we had no problems with Unicode chars, neither with ones coming from the database nor with text in the templates.
Edit: This is how I configure the Catalyst::View::TT module in my Catalyst app:
__PACKAGE__->config(
# DEBUG => DEBUG_ALL,
DEFAULT_ENCODING => 'utf-8',
INCLUDE_PATH => My::App->path_to( 'root', 'templates' ),
TEMPLATE_EXTENSION => '.tt',
WRAPPER => "wrapper/default.tt",
render_die => 1,
);
I did some quick testing with the Perl Template module (1.12.2) for MSWin32. It handled all these substr operations properly.
This is my test code:
use Template;

# some useful options (see below for full list)
my $config = {
    # DEFAULT_ENCODING => 'utf-8',
    INCLUDE_PATH => 'd:/devel/perl',  # or list ref
    INTERPOLATE  => 1,                # expand "$var" in plain text
    EVAL_PERL    => 1,                # evaluate Perl code blocks
};

# create Template object
my $template = Template->new($config);

# define template variables for replacement
my $vars = {
    var1 => "abcdef"
};

# specify input filename, or file handle, text reference, etc.
my $input = 'ttmyfile.txt';

# process input template, substituting variables
print $template->process($input, $vars);
ttmyfile.txt
Var = [% var1 %]
[% string = "fööbär" -%]
[% string.length %] # prints 6
[% string.substr(0, 5) %] # prints "fööbä"
[% string.substr(0, 4) %] # prints "fööb"
Output:
Var = abcdef
6 # prints 6
fööbä # prints "fööbä"
fööb # prints "fööb"
1
Everything works fine, even without use utf8 or DEFAULT_ENCODING. The key things here:
Make sure your template .tt files are encoded as UTF-8 with a BOM (Byte Order Mark). This is a must! Template Toolkit detects the Unicode file encoding from the BOM.
You can use Windows Notepad to save a file with a BOM: just do File --> Save --> Encoding: "UTF-8".
You can do the same in Vim by entering set fenc=utf8 and set bomb, then saving the file; the file will then start with a BOM.
Setting the ENCODING parameter, as in Template->new({ ENCODING => 'utf-8' }), forces Template to load template files as UTF-8.
I suggest having use utf8 in your script; it ensures all your inline string literals are treated as UTF-8 properly.
Because Catalyst::View::TT relies on Template, I believe it should work as well! Good luck!
The Wikipedia article on UTF-8 provides a table that shows how non-ASCII characters are encoded. That table illustrates the following simple rules for UTF-8:
If the highest bit of a byte is 0, then the byte denotes an ASCII character.
If the two highest bits of a byte are 11, then this is the start of a multi-byte character, and the number of consecutive 1 bits starting from the highest order bit indicates the total number of bytes in the multi-byte character. Thus, a byte whose bit representation is 110xxxxx is the start of a 2-byte character, 1110xxxx is the start of a 3-byte character, and 11110xxx is the start of a 4-byte character. (You can ignore the hypothetical 5-byte and 6-byte characters because Unicode is limited to being a 21-bit character set rather than a 32-bit character set.)
If the two highest bits of a byte are 10, then this byte is part of a multi-byte character (but not the first byte of that character).
That information should be enough for you to write your own utility functions that are like string.length and string.substring() but work in terms of characters instead of bytes.
Update: The question did not specify the programming language being used, and I was not aware that "Template Toolkit" implied the use of Perl. Once I realised that, I did a Google search and discovered that your problem is likely to be due to the need to add a use utf8 directive to your source code. You can find a discussion about this here.
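For reference, a byte-level character counter following the rules above could look like this in Perl (a sketch; it assumes its argument is a valid UTF-8 byte string):

# Counts characters by skipping 10xxxxxx continuation bytes.
sub utf8_char_length {
    my ($bytes) = @_;
    my $count = 0;
    for my $byte (unpack 'C*', $bytes) {
        $count++ unless ($byte & 0xC0) == 0x80;   # not a continuation byte
    }
    return $count;
}

print utf8_char_length("f\xC3\xB6\xC3\xB6b\xC3\xA4r"), "\n";   # 6 characters in "fööbär"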
The answer is pretty simple (in Perl), fortunately:
use Encode qw{encode decode};
The way this works is that you decode UTF-8 byte strings into Perl character strings, whereupon substr() and length() behave the way you expect, and then you encode them again for output.
With that header:
# $unicodeString = 'fööbär';
my $perlString = decode('UTF-8', $unicodeString);
printf "%d\n", length($perlString); # should be 6
printf "%s\n", substr($perlString, 0, 3); # should be 'föö'
# whatever other processing you want here with $perlString . . .
# Then, you want to reencode that back to a proper UTF-8 string:
my $unicodeString = encode('UTF-8', $perlString);
Would that help?