say pack "A*", "asdf"; # Prints "asdf"
say pack "s", 0x41 * 256 + 0x42; # Prints "BA" (0x41 = 'A', 0x42 = 'B')
The first line makes sense: you're taking an ASCII-encoded string and packing it into a string, still as ASCII. In the second line, the packed form is "\x42\x41" because of the little-endianness of short integers on my machine.
However, I can't shake the feeling that somehow I should be able to treat the packed string from the second line as a number, since that's how (I assume) Perl stores numbers: as a little-endian sequence of bytes. Is there a way to do so without unpacking it? I'm trying to get the correct mental model for the thing that pack() returns.
For instance, in C, I can do this:
#include <stdio.h>

int main(void) {
    char c[2];
    short *x = (short *) c;  // cast needed: char* and short* are incompatible pointer types
    c[0] = 0x42;
    c[1] = 0x41;
    printf("%d\n", *x); // Prints 16706 == 0x41 * 256 + 0x42
    return 0;
}
If you're really interested in how Perl stores data internally, I'd recommend PerlGuts Illustrated. But usually, you don't have to care about stuff like that because Perl doesn't give you access to such low-level details. These internals are only important if you're writing XS extensions in C.
If you want to "cast" a two-byte string to a C short, you can use the unpack function like this:
$ perl -le 'print unpack("s", "BA")'
16706
However, I can't shake the feeling that somehow, I should be able to treat the packed string from the second line as a number,
You need to unpack it first.
To be able to use it as a number in C, you need
char* packed = "\x42\x41";
int16_t int16;
memcpy(&int16, packed, sizeof(int16_t));
To be able to use it as a number in Perl, you need
my $packed = "\x42\x41";
my $num = unpack('s', $packed);
which is basically
use Inline C => <<'__EOI__';
SV* unpack_s(SV* sv) {
    STRLEN len;
    char* buf;
    int16_t int16;

    SvGETMAGIC(sv);
    buf = SvPVbyte(sv, len);
    if (len != sizeof(int16_t))
        croak("usage");

    Copy(buf, &int16, 1, int16_t);
    return newSViv(int16);
}
__EOI__

my $packed = "\x42\x41";
my $num = unpack_s($packed);
since that's how (I assume) Perl stores numbers, as a little-endian sequence of bytes.
Perl stores numbers in one of the following three fields of a scalar:
IV, a signed integer of size perl -V:ivsize (in bytes).
UV, an unsigned integer of size perl -V:uvsize (in bytes). (ivsize=uvsize)
NV, a floating point number of size perl -V:nvsize (in bytes).
In all cases, native endianness is used.
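To see the sizes for your own build, you can ask perl directly; on a typical 64-bit build the output looks something like this (the exact values depend on how your perl was compiled):
$ perl -V:ivsize -V:uvsize -V:nvsize
ivsize='8';
uvsize='8';
nvsize='8';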
I'm trying to get the correct mental model for the thing that pack() returns.
pack is used to construct "binary data" for interfacing with external APIs.
I see pack as a serialization function. It takes as input Perl values, and outputs a serialized form. The fact the output serialized form happens to be a Perl bytestring is more of an implementation detail than a core functionality.
As such, all you're really expected to do with the resulting string is feed it to unpack, though the serialized form is convenient for moving data between processes, hosts, even planets.
If you're interested in reading it back as a number instead, consider using vec:
say vec "BA", 0, 16; # prints 16961
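Note that vec reads 16-bit values in big-endian order, so it agrees with unpack "n" rather than the native "s" used above; a quick comparison (the last line assumes a little-endian machine):
say vec "BA", 0, 16;   # 16961 = 0x4241 (big-endian)
say unpack "n", "BA";  # 16961, same thing
say unpack "s", "BA";  # 16706 = 0x4142 on a little-endian machine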
For a closer look at the string's internal representation, check out Devel::Peek, though you're not going to see anything surprising with a pure ASCII string.
use Devel::Peek;
Dump "BA";
SV = PV(0xb42f80) at 0xb56300
  REFCNT = 1
  FLAGS = (POK,READONLY,pPOK)
  PV = 0xb60cc0 "BA"\0
  CUR = 2
  LEN = 16
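For contrast, dumping a plain number shows the IV slot mentioned earlier instead of a PV; the output will look roughly like this (addresses, and possibly some flags, will differ):
my $n = 16706;
Dump $n;

SV = IV(0x...) at 0x...
  REFCNT = 1
  FLAGS = (IOK,pIOK)
  IV = 16706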
My objective is to take a character which represents the UK pound symbol and convert it to its Unicode equivalent in a string.
Here's my code and output so far from my test program:
#include <iostream>
#include <stdio.h>

int main()
{
    char x = 163;
    unsigned char ux = x;
    const char *str = "\u00A3";
    printf("x: %d\n", x);
    printf("ux: %d %x\n", ux, ux);
    printf("str: %s\n", str);
    return 0;
}
Output
$ ./pound
x: -93
ux: 163 a3
str: £
My goal is to take the unsigned char 0xA3 and put it into a string representing the unicode UK pound representation: "\u00A3"
What exactly is your question? Anyway, you say you're writing C++, but you're using char*, printf and stdio.h, so you're really writing C, and plain C has no built-in notion of Unicode text. Remember that a char in C is not a "character", it's just a byte, and a char* is not an array of characters, it's an array of bytes. When you printf the "\u00A3" string in your sample program, you are not printing a Unicode character as such: the compiler translated the \u00A3 escape into its encoding in the execution character set (typically the two UTF-8 bytes 0xC2 0xA3), and your terminal interprets those bytes as the £ character. The fact that it prints correctly depends on your terminal using the same encoding. You can see this for yourself: if you printf str[0] as a number in your sample program, you will see just one byte of that encoding, not a pound sign.
If you want to use unicode correctly in C you'll need to use a library. There are many to choose from and I haven't used any of them enough to recommend one. Or you'll need to use C++11 or newer and use std::wstring and friends. But what you are doing is not real unicode and will not work as you expect in the long run.
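For what it's worth, you can confirm from Perl (using the core Encode module) that the UTF-8 encoding of U+00A3 is the two bytes 0xC2 0xA3:
$ perl -MEncode -le 'printf "%vX\n", encode("UTF-8", chr 0xA3)'
C2.A3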
I understand that you have a hex string and perform SHA256 on it twice and then byte-swap the final hex string. The goal of this code is to find a Merkle Root by concatenating two transactions. I would like to understand what's going on in the background a bit more. What exactly are you decoding and encoding?
import hashlib
transaction_hex = "93a05cac6ae03dd55172534c53be0738a50257bb3be69fff2c7595d677ad53666e344634584d07b8d8bc017680f342bc6aad523da31bc2b19e1ec0921078e872"
transaction_bin = transaction_hex.decode('hex')
hash = hashlib.sha256(hashlib.sha256(transaction_bin).digest()).digest()
hash.encode('hex_codec')
'38805219c8ac7e9a96416d706dc1d8f638b12f46b94dfd1362b5d16cf62e68ff'
hash[::-1].encode('hex_codec')
'ff682ef66cd1b56213fd4db9462fb138f6d8c16d706d41969a7eacc819528038'
transaction_hex is a regular string of lower case ASCII characters, and the decode() method with the 'hex' argument changes it to a (binary) string (or bytes object in Python 3) with bytes 0x93 0xa0 etc. In C it would be an array of unsigned char of length 64 in this case.
This array/byte string of length 64 is then hashed with SHA256, and its result (another binary string of size 32) is hashed again. So hash is a string of length 32, or a bytes object of that length in Python 3. Then encode('hex_codec') is a synonym for encode('hex') (in Python 2); in Python 3, it replaces it (so maybe this code is meant to work in both versions). It outputs an ASCII (lower hex) string again that replaces each raw byte (which is just a small integer) with a two-character string that is its hexadecimal representation. So the final bit reverses the double hash and outputs it as hexadecimal, in a form which I usually call "lowercase hex ASCII".
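If it helps to see the same steps in Perl (a rough equivalent, using the core Digest::SHA module and pack/unpack for the hex conversions):
use Digest::SHA qw(sha256);

my $transaction_hex = "93a05cac6ae03dd5...";        # the same hex string as in the question
my $transaction_bin = pack "H*", $transaction_hex;  # hex text -> raw bytes
my $hash = sha256(sha256($transaction_bin));        # SHA-256 twice, 32 raw bytes
print unpack("H*", $hash), "\n";                    # forward hex
print unpack("H*", scalar reverse $hash), "\n";     # byte-swapped (reversed) hex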
I'm trying to scan in a float: 13.8518009935297.
The first routine is my own, the second is Mac OS X libc's strtod, the third is GMP's mpf_get_d(), and the fourth is perl's numeric.c:Perl_my_atof2().
I use this snippet to print the mantissa:
union ieee_double {
    struct {
        uint32_t fracl;
        uint32_t frach:20;
        uint32_t exp:11;
        uint32_t sign:1;
    } s;
    double d;
    uint64_t l;
};

union ieee_double l0;
l0.d = ....
printf("... 0x%x 0x%x\n", l0.s.frach, l0.s.fracl);
The return values for the four functions are:
my-func : 0xbb41f 0x4283d21b
strtod : 0xbb41f 0x4283d21c
GMP : 0xbb41f 0x4283d21b
perl : 0xbb41f 0x4283d232
The difference between the first three functions is rounding. However, perl's mantissa is quite out of sync.
If I print all four doubles back to a string, I get the same decimal number; the values seem to be equal.
My question:
The difference between my-func, strtod and GMP is rounding. However, why is perl's mantissa so far out of sync, and yet, when converted back to decimal, it still ends up as the same number? The difference is 22 in the low bits of the mantissa, so it ought to show up in the decimal fraction. How can I explain this?
Append:
Sorry, I think I figured out the problem:
$r = rand(25);
$t = $p->tokenize_str("$r");
tokenize_str() was my implementation of a conversion from string to double.
However, the Perl stringification "$r" prints $r as 13.8518009935297, which is already a truncation.
The actual value of $r is different, so when I compare the binary representations of $t and $r at the end, I get values that diverge.
Here is some perl code to answer your question:
perl -le '($frac1, $frach)=unpack("II", pack "d", .0+"13.8518009935297");
print sprintf("%d %d 0x%03x 0x%04x", ($frach >> 31)&1, ($frach>>20)&0x7ff, $frach & 0xfffff, $frac1)'
-> 0 1026 0xbb41f 0x4283d21c
Perl gives the same result as strtod. The difference was the mistake you indicated in your append.
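To see that truncation directly, note that Perl's default stringification uses about 15 significant digits, while a double generally needs 17 to round-trip exactly; a quick sketch (output will vary with your perl's NV size):
perl -le '$r = 13.85180099352971; print "$r"; printf "%.17g\n", $r'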
I'm trying very hard to implement the SHA-256 algorithm. I have problems with the padding of the message. For SHA-256 you have to append one bit to the end of the message, which I have achieved so far with $message .= (chr 0x80);
The next step should be to fill the empty space (up to the 512-bit block boundary) with 0's.
I calculated it with this formula: l + 1 + k = 448 - l, and appended that to the message.
My problem comes now: append the binary representation of the length of the message in the last 64-bit block and fill the rest with 0's again. Since Perl handles its data types by itself, there is no "byte" datatype. How can I figure out which value I should append?
Please see also the official specification:
http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
If at all possible, pull something off the shelf. You do not want to roll your own SHA-256 implementation because to get official blessing, you would have to have it certified.
That said, the specification is
5.1.1 SHA-1, SHA-224 and SHA-256
Suppose that the length of the message, M, is l bits. Append the bit 1 to the end of the message, followed by k zero bits, where k is the smallest, non-negative solution to the equation
l + 1 + k ≡ 448 mod 512
Then append the 64-bit block that is equal to the number l expressed using a binary representation. For example, the (8-bit ASCII) message “abc” has length 8 × 3 = 24, so the message is padded with a one bit, then 448 - (24 + 1) = 423 zero bits, and then the message length, to become the 512-bit padded message
                                  423 bits         64 bits
                               .-----^-----.   .------^------.
01100001 01100010 01100011  1  00 ....... 00   00 ... 011000
   “a”      “b”      “c”                           '---v---'
                                                     l=24
The length of the padded message should now be a multiple of 512 bits.
You might be tempted to use vec because it allows you to address single bits, but you would have to work around funky addressing.
If bits is 4 or less, the string is broken into bytes, then the bits of each byte are broken into 8/BITS groups. Bits of a byte are numbered in a little-endian-ish way, as in 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80. For example, breaking the single input byte chr(0x36) into two groups gives a list (0x6, 0x3); breaking it into 4 groups gives (0x2, 0x1, 0x3, 0x0).
Instead, a pack template of B* specifies
A bit string (descending bit order inside each byte).
and N
An unsigned long (32-bit) in "network" (big-endian) order.
The latter is useful for assembling the message length. Although pack has a Q parameter for quad, the result is in the native order.
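A quick sanity check of those formats (the Q result assumes a little-endian 64-bit perl):
use feature 'say';
say unpack "H*", pack "B*", "00011000";   # 18       (one byte, most significant bit first)
say unpack "H*", pack "N",  24;           # 00000018 (big-endian 32-bit)
say unpack "H*", pack "Q",  24;           # 1800000000000000 (native order, little-endian here)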
Start with a bit of prep work
our($UPPER32BITS, $LOWER32BITS);
BEGIN {
    use Config;
    die "$0: $^X not configured for 64-bit ints"
        unless $Config{use64bitint};

    # create non-portable 64-bit masks as constants
    no warnings "portable";
    *UPPER32BITS = \0xffff_ffff_0000_0000;
    *LOWER32BITS = \0x0000_0000_ffff_ffff;
}
Then you can define pad_message as
sub pad_message {
    use bytes;
    my($msg) = @_;
    my $l = bytes::length($msg) * 8;
    my $extra = $l % 512;                  # pad to 512-bit boundary
    my $k = (448 - ($extra + 1)) % 512;    # smallest non-negative k with l + 1 + k = 448 (mod 512)

    # append 1 bit followed by $k zero bits
    $msg .= pack "B*", 1 . 0 x $k;

    # add big-endian length
    $msg .= pack "NN", (($l & $UPPER32BITS) >> 32), ($l & $LOWER32BITS);

    die "$0: bad length: ", bytes::length $msg
        if (bytes::length($msg) * 8) % 512;

    $msg;
}
Say the code prints the padded message with
my $padded = pad_message "abc";

# break into multiple lines for readability
for (unpack("H*", $padded) =~ /(.{64})/g) {
    print $_, "\n";
}
Then the output is
6162638000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000018
which matches the specification.
First of all, I hope you are doing this just as an exercise -- the Digest::SHA module that ships with Perl already computes SHA-256 just fine.
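For example (this is the standard "abc" test vector):
use Digest::SHA qw(sha256_hex);
print sha256_hex("abc"), "\n";
# ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad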
Note that $message .= (chr 0x80); appends one byte, not one bit. If you really need bitwise manipulation, take a look at the vec function.
To get the binary representation of an integer, you should use pack. To get it as the 64-bit big-endian value SHA-256 expects, do something like
$message .= pack 'Q>', 8 * length($message);  # length of the original, unpadded message, in bits
Note that the 'Q' formats are only available on 64-bit perls; if yours isn't one, simply concatenate four 0-bytes with a 32-bit big-endian value (pack format N).
I use a select(), sysread(), syswrite() mechanism to handle socket messages: messages are sysread() into $buffer (binary) before they are syswritten.
Now I want to change two bytes of the message, which denote the length of the whole message. At first, I use following code:
my $msglen=substr($buffer,0,2); # Get the first two bytes
my $declen=hex($msglen);
$declen += 3;
substr($buffer,0,2,$declen); # change the length
However, it doesn't work this way. If the final value of $declen is 85, then the modified $buffer will be "0x35 0x35 0x00 0x02...". I inserted a number into $buffer but ended up with ASCII digits!
I also tried this way:
my $msglen=substr($buffer,0,2); # Get the first two bytes,binary
$msglen += 0b11; # Or $msglen += 3;
my $msgbody=substr($buffer,2); # Get the rest part of message, binary
$buffer=join("", $msglen, $msgbody);
Sadly, this method also failed. The result is something like "0x33 0x 0x00 0x02...". I just wonder why two binary scalars can't be joined into one binary scalar?
Can you help me? Thank you!
my $msglen=substr($buffer,0,2); # Get the first two bytes
my $number = unpack("S",$msglen);
$number += 3;
my $number_bin = pack("S",$number);
substr($buffer,0,2,$number_bin); # change the length
Untested, but I think this is what you are trying to do... convert a two-byte string representing a short int into an actual number and then back again.
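If the length field in your protocol is in network byte order (big-endian) rather than the machine's native order, the same idea works with the "n" template instead of "S"; a small sketch with made-up bytes:
my $buffer = "\x00\x52" . "rest of message";      # first two bytes: length 82, big-endian
my $len    = unpack("n", substr($buffer, 0, 2));  # 82
substr($buffer, 0, 2, pack("n", $len + 3));       # first two bytes now encode 85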
I have found another workable way -- using vec
vec($buffer, 0, 16) += 3;
You cannot do arithmetic on raw bytes directly: $msglen += 3 numifies the string (non-numeric bytes become 0), and the result then stringifies back as ASCII digits, which is why you saw 0x33 ('3'). Call unpack to get a number, add to it, then pack the result to get the bytes back; concatenating binary scalars with join or . works fine.