Encoding of ... some sort? - encoding

Forgive me if this has been asked before, but I assure you I've scoured the internet and have turned up nothing, probably because I don't have the right terminology.
I would like to take an integer and convert it to a little-endian(?) hex representation like this:
303 -> 0x2f010000
I can see that the bytes are packed such that the 16's and 1's places are both in the same byte, and that the 4096's place and 256's place share a byte. If someone could just point me to the right terminology for such encoding, I'm sure I could find my answer on how to do it. Thanks!

Use bit-shift operators combined with bitwise AND and OR operators.
Assuming a 32-bit unsigned value:
unsigned int value = 303;
unsigned int result = 0;
for (int i = 0; i < 4; i++)
{
    /* take byte i (0 = least significant) and move it to the mirrored position */
    result |= ((value >> (i * 8)) & 0xFF) << (24 - (i * 8));
}
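For comparison, here is a minimal Python sketch of the byte reversal described above. The function name swap32 is made up for illustration, and the sketch assumes the input fits in 32 bits.

```python
def swap32(value):
    """Reverse the byte order of a 32-bit unsigned integer."""
    result = 0
    for i in range(4):
        # Extract byte i (0 = least significant) ...
        byte = (value >> (i * 8)) & 0xFF
        # ... and place it at the mirrored position.
        result |= byte << (24 - i * 8)
    return result


print(hex(swap32(303)))  # 0x2f010000
```

Applying swap32 twice returns the original value, which is a quick sanity check for any byte-swapping routine.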

Big-endian and little-endian refer to the order of the bytes in memory. A value like 0x2f010000 has no intrinsic endianness; the byte order in memory depends on the CPU architecture.
If you always want to reverse the order of the bytes in a 32-bit value, use the code that Demi posted.
If you always want to get a specific byte order (because you're preparing to transfer those bytes over the network, or store them to a disk file), use something else. E.g. the BSD sockets library has a function htonl() that takes your CPU's native 32-bit value and puts it into big-endian order.
If you're running on a little-endian machine, htonl(303) == 0x2f010000. If you're running on a big-endian machine, htonl(303) == 303. In both cases the result will be represented by bytes [0x00, 0x00, 0x01, 0x2f] in memory.
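If the goal is a fixed byte order regardless of the host CPU, a short Python sketch using the standard struct module shows both layouts for 303. This is only an illustration of the idea; the variable names are arbitrary.

```python
import struct

value = 303  # 0x0000012f

little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first (network order)

print(little.hex())  # 2f010000
print(big.hex())     # 0000012f

# int.to_bytes produces the same bytes without the struct module.
assert value.to_bytes(4, "little") == little
assert value.to_bytes(4, "big") == big
```

The little-endian output matches the 0x2f010000 layout asked about in the question.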

If anyone can put a specific term to what I was trying to do, I'd still love to hear it. I did, however, find a way to do what I needed, and I'll post it here so if anyone comes looking after me, they can find it. There may be (probably is) an easier, more direct way to do it, but here's what I ended up doing in VB.Net to get back the bytecode I wanted:
Private Function Encode(ByVal original As Integer) As Byte()
    Dim twofiftysixes As Integer = CInt(Math.Floor(original / 256))
    Dim sixteens As Integer = CInt(Math.Floor((original - (256 * twofiftysixes)) / 16))
    Dim ones As Integer = original Mod 16
    Dim bytecode As Byte() = {CByte((16 * sixteens) + ones), CByte(twofiftysixes), 0, 0}
    Return bytecode
End Function
This effectively breaks the integer up into its hex digits, then converts the appropriate pairs to bytes with CByte.


perl6: Cannot unbox 65536 bit wide bigint into native integer

I try some examples from Rosettacode and encounter an issue with the provided Ackermann example: When running it "unmodified" (I replaced the utf-8 variable names by latin-1 ones), I get (similar, but now copyable):
$ perl6 t/ackermann.p6
65533
19729 digits starting with 20035299304068464649790723515602557504478254755697...
Cannot unbox 65536 bit wide bigint into native integer
in sub A at t/ackermann.p6 line 3
in sub A at t/ackermann.p6 line 11
in sub A at t/ackermann.p6 line 3
in block <unit> at t/ackermann.p6 line 17
Removing the proto declaration in line 3 (by commenting out):
$ perl6 t/ackermann.p6
65533
19729 digits starting with 20035299304068464649790723515602557504478254755697...
Numeric overflow
in sub A at t/ackermann.p6 line 8
in sub A at t/ackermann.p6 line 11
in block <unit> at t/ackermann.p6 line 17
What went wrong? The program doesn't allocate much memory. Is the natural integer kind-of limited?
In the code from the Rosetta Code Ackermann function page I replaced the 𝑚 with m and the 𝑛 with n, to make it easier to work in the terminal and copy errors, and I tried commenting out the proto declaration. I also asked Liz ;)
use v6;
proto A(Int \m, Int \n) { (state @)[m][n] //= {*} }
multi A(0, Int \n) { n + 1 }
multi A(1, Int \n) { n + 2 }
multi A(2, Int \n) { 3 + 2 * n }
multi A(3, Int \n) { 5 + 8 * (2 ** n - 1) }
multi A(Int \m, 0 ) { A(m - 1, 1) }
multi A(Int \m, Int \n) { A(m - 1, A(m, n - 1)) }
# Testing:
say A(4,1);
say .chars, " digits starting with ", .substr(0,50), "..." given A(4,2);
A(4, 3).say;
Please read JJ's answer first. It's breezy and led to this answer which is effectively an elaboration of it.
TL;DR A(4,3) is a very big number, one that cannot be computed in this universe. But raku(do) will try. As it does you will blow past reasonable limits related to memory allocation and indexing if you use the caching version and limits related to numeric calculations if you don't.
I try some examples from Rosettacode and encounter an issue with the provided Ackermann example
Quoting the task description with some added emphasis:
Arbitrary precision is preferred (since the function grows so quickly)
raku's standard integer type Int is arbitrary precision. The raku solution uses them to compute the most advanced answer possible. It only fails when you make it try to do the impossible.
When running it "unmodified" (I replaced the utf-8 variable names by latin-1 ones)
Replacing the variable names is not a significant change.
But adding the A(4,3) line shifted the code from being computable in reality to not being computable in reality.
The example you modified has just one explanatory comment:
Here's a caching version of that ... to make A(4,2) possible
Note that the A(4,2) solution is nearly 20,000 digits long.
If you look at the other solutions on that page most don't even try to reach A(4,2). There are comments like this one on the Phix version:
optimised. still no bignum library, so ack(4,2), which is power(2,65536)-3, which is apparently 19729 digits, and any above, are beyond (the CPU/FPU hardware) and this [code].
A solution for A(4,2) is the most advanced possible.
A(4,3) is not computable in practice
To quote Academic Kids: Ackermann function:
Even for small inputs (4,3, say) the values of the Ackermann function become so large that they cannot be feasibly computed, and in fact their decimal expansions cannot even be stored in the entire physical universe.
So computing A(4,3).say is impossible (in this universe).
It must inevitably lead to an overflow of even arbitrary precision integer arithmetic. It's just a matter of when and how.
Cannot unbox 65536 bit wide bigint into native integer
The first error message mentions this line of code:
proto A(Int \m, Int \n) { (state @)[m][n] //= {*} }
The state @ is an anonymous state array variable.
By default @ variables use the default concrete type for raku's abstract array type. This default array type provides a balance between implementation complexity and decent performance.
While computing A(4,2) the indexes (m and n) remain small enough that the computation completes without overflowing the default array's indexing limit.
This limit is a "native" integer (note: not a "natural" integer). A "native" integer is what raku calls the fixed width integers supported by the hardware it's running on, typically a long long which in turn is typically 64 bits.
A 64 bit wide index can handle indices up to 9,223,372,036,854,775,807.
But in trying to compute A(4,3) the algorithm generates a 65536 bit (8192 byte) wide integer index. Such an integer could be as big as 2**65536, a 19,729 decimal digit number. But the biggest index allowed is a 64 bit native integer. So unless you comment out the caching line that uses an array, then for A(4,3) the program ends up throwing the exception:
Cannot unbox 65536 bit wide bigint into native integer
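The sizes involved can be checked with a few lines of Python, whose int type is also arbitrary precision. This is only a sketch of the arithmetic, not of what Rakudo does internally; it takes A(4,2) = 2**65536 - 3 from the Rosetta Code comment quoted above.

```python
import math

a42 = 2**65536 - 3      # A(4,2), used as an array index while computing A(4,3)
max_index = 2**63 - 1   # largest signed 64-bit integer

print(a42.bit_length())                 # 65536
print(math.floor(math.log10(a42)) + 1)  # 19729 decimal digits
print(a42 > max_index)                  # True
```

The digit count is derived from the logarithm rather than by converting the number to a string, because recent Python versions limit the length of int-to-str conversions by default.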
Limits to allocations and indexing of the default array type
As already explained, there is no array that could be big enough to help fully compute A(4,3). In addition, a 64 bit integer is already a pretty big index (9,223,372,036,854,775,807).
That said, raku can accommodate other array implementations such as Array::Sparse so I'll discuss that briefly below because such possibilities might be of interest for other problems.
But before discussing bigger arrays, running the code below on tio.run shows the practical limits for the default array type on that platform:
my @array;
@array[2**29]++; # works
@array[2**30]++; # could not allocate 8589967360 bytes
@array[2**60]++; # Unable to allocate ... 1152921504606846977 elements
@array[2**63]++; # Cannot unbox 64 bit wide bigint into native integer
(Comment out error lines to see later/greater errors.)
The "could not allocate 8589967360 bytes" error is a MoarVM panic. It's a result of tio.run refusing a memory allocation request.
I think the "Unable to allocate ... elements" error is a raku level exception that's thrown as a result of exceeding some internal Rakudo implementation limit.
The last error message shows the indexing limit for the default array type even if vast amounts of memory were made available to programs.
What if someone wanted to do larger indexing?
It's possible to create/use other @ (does Positional) data types that support things like sparse arrays etc.
And, using this mechanism, it's possible that someone could write an array implementation that supports larger integer indexing than is supported by the default array type (presumably by layering logic on top of the underlying platform's instructions; perhaps the Array::Sparse I linked above does).
If such an alternative were called BigArray then the cache line could be replaced with:
my @array is BigArray;
proto A(Int \𝑚, Int \𝑛) { @array[𝑚][𝑛] //= {*} }
Again, this still wouldn't be enough to store interim results for fully computing A(4,3) but my point was to show use of custom array types.
Numeric overflow
When you comment out the caching you get:
Numeric overflow
Raku/Rakudo do arbitrary precision arithmetic. While this is sometimes called infinite precision it obviously isn't actually infinite but is instead, well, "arbitrary", which in this context also means "sane" for some definition of "sane".
This classically means running out of memory to store a number. But in Rakudo's case I think there's an attempt to keep things sane by switching from a truly vast Int to a Num (a floating point number) before completely running out of RAM. But then computing A(4,3) eventually overflows even a double float.
So while the caching blows up sooner, the code is bound to blow up later anyway, and then you'd get a numeric overflow that would either manifest as an out of memory error or a numeric overflow error as it is in this case.
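The same kind of failure can be reproduced in Python as a rough analogy (this is not Rakudo's code path, just an illustration of what happens when a huge exact integer meets a double float):

```python
big = 2**65536  # exact as an arbitrary-precision integer

try:
    float(big)  # no 64-bit double can hold a value this large
except OverflowError as exc:
    print("float(big):", exc)

try:
    2.0 ** 65536  # floating-point exponentiation overflows as well
except OverflowError as exc:
    print("2.0 ** 65536:", exc)
```

The integer itself is fine; the overflow only appears once the computation leaves integer arithmetic.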
Array subscripts use native ints; that's why you get the error in line 3, when you use the big ints as array subscripts. You might have to define a new BigArray that uses Ints as array subscripts.
The second problem arises in the ** operator: the result is a Real, and when the low-level operation returns a Num, it throws an exception.
https://github.com/rakudo/rakudo/blob/master/src/core/Int.pm6#L391-L401
So creating a BigArray might not be helpful anyway. You'll have to create your own ** too, that always works with Int, but you seem to have hit the (not so infinite) limit of the infinite precision Ints.

How to truncate a 2's complement output

I have data written into short data type. The data written is of 2's complement form.
Now when I try to print the data using %04x, data with MSB=0 prints fine; e.g. if data=740, the output I get is 0740
But when MSB=1, I don't get the output I expect; e.g. if data=842, the output I get is fffff842
I want the data truncated to 4 bytes, so the expected output is f842
Either declare your data as a type which is 16 bits long, or make sure the printing function uses the right format for 16 bits value. Or use your current type, but do a bitwise AND with 0xffff. What you can do depends on the language you're doing it in really.
But whichever way you go, check your assumptions again. There seem to be a few issues in your question:
2s-complement applies to signed numbers only. There are no negative numbers in your question.
Assuming you mean C's short - it doesn't have to be 16 bits long.
"I get is fffff842 I want the data truncated to 4 bytes" - fffff842 is 4 bytes long. f842 is 2 bytes long.
2-bytes long value 842 does not have the MSB set.
I'm assuming C (or possibly C++) as the language here.
Because of the default argument promotions involved when calling a variable argument function (such as printf), your use of a short will result in an integer promotion, which states that "If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int".
A short is converted to an int by means of sign-extension, and 0xf842 sign-extended to 32 bits is 0xfffff842.
You can use a bitwise AND to mask off the most significant word:
printf("%04x", data & 0xffff);
You could also add the h length specifier to state that you only want to print an (unsigned) short worth of bits from an int:
printf("%04hx", data);
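The effect of the mask can also be seen outside C. The following Python sketch assumes the stored value is -1982, which is the signed 16-bit reading of the bit pattern 0xf842; that number is an inference from the question's output, not something the question states.

```python
value = -1982  # as a signed 16-bit quantity this is the bit pattern 0xf842

# Sign-extended to 32 bits, which is what the question's output shows.
print(format(value & 0xFFFFFFFF, "04x"))  # fffff842

# Masked to the low 16 bits, which is the output the question wants.
print(format(value & 0xFFFF, "04x"))      # f842
```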

F# .Net portable subset Unicode issues

OK, I've made an F# portable library project in VS2012 and I have some integers that represent Utf-32 encoded characters eg: 0x0001D538 which is a double struck A. Normally to make this into a Utf-16 surrogate pair you would use System.Char.ConvertFromUtf32(i), job done. However, Microsoft have kindly decided not to include this method in the .net portable subset. (it runs fine in the interactive window which must be running the full .net). So, what should I do instead to get my favorite surrogate pairs from these integers? They need to be integers because I do some arithmetic on them. Waiting for the next version of things to come out is a viable option.
Here's a quick translation of the C# from Reflector. Can you use this?
open System

type System.Char with
    static member ConvertFromUtf32(utf32) =
        if utf32 < 0 || utf32 > 0x10ffff || (utf32 >= 0xd800 && utf32 <= 0xdfff) then
            invalidArg "utf32" "Out of range"
        elif utf32 < 0x10000 then
            new String(char utf32, 1)
        else
            // subtract the supplementary-plane offset, then split into two 10-bit halves
            let utf32 = utf32 - 0x10000
            new String([| char ((utf32 / 0x400) + 0xd800); char ((utf32 % 0x400) + 0xdc00) |])
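The surrogate arithmetic in the last branch can be checked independently. Here is a small Python sketch of the same calculation for the double-struck A from the question; the function name to_surrogate_pair is hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D538)
print(hex(high), hex(low))  # 0xd835 0xdd38

# Cross-check against Python's own UTF-16 encoder.
expected = bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])
assert chr(0x1D538).encode("utf-16-be") == expected
```

Dividing by 0x400 and taking the remainder, as the F# code does, is the same operation as the shift and mask used here.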

Convert number to hex

I use sprintf for conversion to hex - example >>
$hex = sprintf("0x%x",$d)
But I was wondering, if there is some alternative way how to do it without sprintf.
My goal is convert a number to 4-byte hex code (e.g. 013f571f)
Additionally (and optionally), how can I do such conversion, if number is in 4 * %0xxxxxxx format, using just 7 bits per byte?
sprintf() is probably the most appropriate way. According to http://perldoc.perl.org/functions/hex.html:
To present something as hex, look into printf, sprintf, and unpack.
I'm not really sure about your second question, it sounds like unpack() would be useful there.
My goal is convert a number to 4-byte hex code (e.g. 013f571f)
Hex is a textual representation of a number. sprintf '%08x' returns hex (the eight characters 013f571f). sprintf is specifically designed to format numbers into text, so it's a very elegant solution for that.
...But it's not what you want. You're not looking for hex, you're looking for the 4-byte internal storage of an integer. That has nothing to do with hex.
pack 'N', 0x013f571f; # "\x01\x3f\x57\x1f" Big-endian byte order
pack 'V', 0x013f571f; # "\x1f\x57\x3f\x01" Little-endian byte order
sprintf() is my usual way of performing this conversion. You can do it with unpack, but it will probably be more effort on your side.
For only working with 4 byte values, the following will work though (maybe not as elegant as expected!):
print unpack("H8", pack("N1", $d));
Be aware that this will result in 0xFFFFFFFF for numbers bigger than that as well.
For working pack/unpack with arbitrary bit length, check out http://www.perlmonks.org/?node_id=383881
The perlpacktut will be a handy read as well.
For 4 * %0xxxxxxx format, my non-sprintf solution is:
print unpack("H8", pack("N1",
(((($d>>21)&0x7f)<<24) + ((($d>>14)&0x7f)<<16) + ((($d>>7)&0x7f)<<8) + ($d&0x7f))));
Any comments and improvements are very welcome.
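To make the 4 * %0xxxxxxx layout easier to verify, here is the same bit shuffling as a Python sketch. The function name spread_7bit is made up, and the 28-bit range check is an added assumption about what the format can hold (4 bytes times 7 bits).

```python
def spread_7bit(d):
    """Pack a 28-bit value into 4 bytes, 7 payload bits per byte."""
    if not 0 <= d < 1 << 28:
        raise ValueError("value needs more than 28 bits")
    return (
        ((d >> 21) & 0x7F) << 24
        | ((d >> 14) & 0x7F) << 16
        | ((d >> 7) & 0x7F) << 8
        | (d & 0x7F)
    )


print(format(spread_7bit(128), "08x"))         # 00000100
print(format(spread_7bit(0x0FFFFFFF), "08x"))  # 7f7f7f7f
```

The top bit of every output byte stays 0, which is the property the format asks for.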

MD5/SHA "update" property?

What is the MD5/SHA property that allows you to "update" them? For example, if you have the hash for "test" you can add "case" to get the hash for "testcase". I would like to read up on this property a bit but my searches turn up nothing...
It is merely that they are actually calculated incrementally -- you calculate them by operating on the first n bytes of data (64 bytes, i.e. 512 bits, in the case of MD5; see http://en.wikipedia.org/wiki/MD5#Algorithm), then on the next n bytes of data, etc.
EDIT: This isn't even theoretically possible, due to the 1-bit padding I mention below. In effect, md5("case", seed=md5("test")) == md5("test" + <1-bit> + "case"). There is no way to use md5("test") to incrementally compute md5("test" + "case").
This is theoretically possible if you concatenate 512-bit chunks. It won't work for appending "case" to "test", because the first run of the state machine is polluted by the padding used to turn "case" into a 512-bit chunk.
Additionally, the padding isn't just a bunch of zeros. The message is always first padded with a 1 bit, so that "case" and "case\0" produce different hashes. Thus you can't rely on "case" having the same hash with or without padding.
The MD5 algorithm has the following steps:
1) pad input string to a multiple of 64 bytes
2) split input string into blocks of 64 bytes
3) initialise state (a 4-element array)
4) for each block: state <= transform(state,block)
5) encode state as string
To support situations where you want to hash something in stages (e.g. large files), this can be refactored as follows.
Initialise:
1) initialise state
2) leftover bytes <= ""
Update:
1) append leftover bytes to start of input string
2) split input string into blocks of 64 bytes
3) for each complete block: state <= transform(state,block)
4) leftover bytes <= contents of the incomplete block, if one exists
Digest:
1) pad a copy of the leftover bytes
2) split the padded leftover bytes into blocks of 64 bytes
3) tmp_state <= state
4) for each block: tmp_state <= transform(tmp_state,block)
5) encode tmp_state as string
I've actually implemented this approach in VBA - it seems to work fine. Any suggestions for where I should upload the code?
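The Initialise / Update / Digest split described above is the interface most hash libraries expose. As an illustration only, here is how it looks with Python's standard hashlib module (this assumes MD5 is enabled in the local Python build):

```python
import hashlib

h = hashlib.md5()   # Initialise
h.update(b"test")   # Update with the first piece
h.update(b"case")   # Update with the next piece

# Feeding the data in pieces gives the same digest as hashing it in one go.
assert h.hexdigest() == hashlib.md5(b"testcase").hexdigest()

# Asking for the digest does not finalize the object, so updating can continue.
h.update(b"!")
assert h.hexdigest() == hashlib.md5(b"testcase!").hexdigest()
print(h.hexdigest())
```

Note that this only works because the running state is kept inside the object; as discussed above, the finished digest of "test" alone is not enough to continue with "case".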