Why does this function use a lot of memory? - perl

I'm trying to unpack a binary vector of 140 million bits into a list.
I'm checking the memory usage of this function, but it looks weird: the memory usage rises to 35 GB (GB, not MB). How can I reduce the memory usage?
sub bin2list {
    # This sub translates a binary vector to a list of "1","0"
    my $vector = shift;
    my @unpacked = split //, (unpack "B*", $vector);
    return @unpacked;
}

Scalars contain a lot of information.
$ perl -MDevel::Peek -e'Dump("0")'
SV = PV(0x42a8330) at 0x42c57b8
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x42ce670 "0"\0
CUR = 1
LEN = 16
In order to keep them as small as possible, a scalar consists of two memory blocks[1]: a fixed-size head, and a body that can be "upgraded" to contain more information.
The smallest type of scalar that can contain a string (such as the ones returned by split) is a SVt_PV. (It's usually called PV, but PV can also refer to the name of the field that points to the string buffer, so I'll go with the name of the constant.)
The first block is the head.
ANY is a pointer to the body.
REFCNT is a reference count that allows Perl to know when the scalar can be deallocated.
FLAGS contains information about what the scalar actually contains. (e.g. SVf_POK means the scalar contains a string.)
TYPE contains information about the type of scalar (what kind of information it can contain.)
For an SVt_PV, the last field points to the string buffer.
The second block is the body. The body of an SVt_PV has the following fields:
STASH is not used in the scalars in question since they're not objects.
MAGIC is not used for the scalars in question. Magic allows code to be called when the variable is accessed.
CUR is the length of the string in the buffer.
LEN is the length of the string buffer. Perl over-allocates to speed up concatenation.
The block on the right of the illguts diagram is the string buffer. As you might have noticed, Perl over-allocates; this speeds up concatenation.
Ignore the block on the bottom of the diagram. It's an alternative to the string buffer format for special strings (e.g. hash keys).
How much does all that add up to?
$ perl -MDevel::Size=total_size -E'say total_size("0")'
28 # 32-bit Perl
56 # 64-bit Perl
That's just for the scalar itself. It doesn't take into account the overhead of the memory allocation system for the three memory blocks.
These scalars are in an array. An array is really just a kind of scalar, so an array has overhead of its own.
$ perl -MDevel::Size=total_size -E'say total_size([])'
56 # 32-bit Perl
64 # 64-bit Perl
That's an empty array. You have 140 million of the scalars in yours, so it needs a buffer that can contain 140 million pointers. (In this particular case, the array won't be over-allocated, at least.) Each pointer is 4 bytes on a 32-bit system, 8 on a 64.
That brings the total up to:
32-bit: 56 + (4 + 28) * 140,000,000 = 4,480,000,056
64-bit: 64 + (8 + 56) * 140,000,000 = 8,960,000,064
That doesn't factor in the memory allocation overhead, but it's still very different from the numbers you gave. Why? Well, the scalars returned by split are actually distinct from the scalars inside the array. So for a moment, you actually have 280,000,000 scalars in memory!
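You can sanity-check these estimates on a smaller vector with the same Devel::Size used above (a sketch; exact numbers vary by perl version and platform):
use Devel::Size qw( total_size );

# 100,000 bytes = 800,000 bits, unpacked the same way bin2list does.
my $vector = "\xFF" x 100_000;
my @unpacked = split //, unpack "B*", $vector;

print total_size(\@unpacked), " bytes for ", scalar(@unpacked), " scalars\n";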
The rest of the memory is probably held by lexical variables in subs that aren't currently executing. Lexical variables aren't normally freed on scope exit, since it's expected that the sub will need the memory the next time it's called. That means bin2list continues to hold on to the memory for its 140 million scalars even after it exits.
Footnotes
[1] Scalars that are undefined can get away without a body until a value is assigned to them. Scalars that contain only an integer can get away without allocating a memory block for the body by storing the integer in the same field as an SVt_PV stores the pointer to the string buffer.
The images are from illguts. They are protected by Copyright.

A single integer value in Perl is going to be stored in an SVt_IV or SVt_UV scalar, whose size will be four machine-sized words - so on a 32bit machine, 16 bytes. An array of 140 million of those, therefore, is going to consume 2.2 billion bytes, presuming it is densely packed together. Add to that the SV * pointers in the AvARRAY used to reference them and we're now at 2.8 billion bytes. Now double that, because you copied the array when you returned it, and we're now at 5.6 billion bytes.
That of course was on a 32bit machine - on a 64bit machine we're at double again, so 11.2 billion bytes. This presumes totally dense packing inside the memory - in practice this will be allocated in stages and chunks, so RAM fragmentation will further add to this. I could imagine a total size around the 35 billion byte mark for this. It doesn't sound outlandishly unreasonable.
For a very easy way to massively reduce the memory usage (not to mention CPU time required), rather than returning the array itself as a list, return a reference to it. Then a single reference is returned rather than a huge list of 140 million SVs; this avoids a second copy also.
sub bin2list {
    # This sub translates a binary vector to a list of "1","0"
    my $vector = shift;
    my @unpacked = split //, (unpack "B*", $vector);
    return \@unpacked;
}
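A brief usage sketch for the reference-returning version (assuming $vector already holds the packed bits): the caller works through the reference instead of copying a 140-million-element list.
my $bits = bin2list($vector);

print scalar(@$bits), " bits\n";   # element count, without copying the array
print $bits->[0], "\n";            # first bit, "0" or "1"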

Related

Does Perl's Glob have a limitation?

I am running the following expecting return strings of 5 characters:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}'x5) {
    print "$_\n";
}
but it returns only 4 characters:
anbc
anbd
anbe
anbf
anbg
...
However, when I reduce the number of characters in the list:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m}'x5) {
    print "$_\n";
}
it returns correctly:
aamid
aamie
aamif
aamig
aamih
...
Can someone please tell me what I am missing here, is there a limit of some sort? or is there a way around this?
If it makes any difference, it returns the same result in both perl 5.26 and perl 5.28.
The glob first creates all possible file name expansions, so it will first generate the complete list from the shell-style glob/pattern it is given. Only then will it iterate over it, if used in scalar context. That's why it's so hard (impossible?) to escape the iterator without exhausting it; see this post.
In your first example that's 26^5 strings (11_881_376), each five chars long. So a list of ~12 million strings, with a (naive) total in excess of 56 MB ... plus the overhead for a scalar, which I think at minimum is 12 bytes or such. So on the order of 100 MB, at the very least, right there in one list.†
I am not aware of any formal limits on lengths of things in Perl (other than in regex) but glob does all that internally and there must be undocumented limits -- perhaps some buffers are overrun somewhere, internally? It is a bit excessive.
As for a way around this -- generate that list of 5-char strings iteratively, instead of letting glob roll its magic behind the scenes. Then it absolutely should not have a problem.
However, I find the whole thing a bit big for comfort, even in that case. I'd really recommend writing an algorithm that generates and provides one list element at a time (an "iterator"), and working with that.
There are good libraries that can do that (and a lot more), some of which are Algorithm::Loops recommended in a previous post on this matter (and in a comment), Algorithm::Combinatorics (same comment), Set::CrossProduct from another answer here ...
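For illustration, a minimal hand-rolled iterator might look like the following sketch (the helper name make_string_iter is hypothetical; it just counts through 'a'..'z' positions like an odometer):
# A sketch of a hand-rolled iterator: an odometer over 'a'..'z'
# that returns one string per call, and undef when exhausted.
sub make_string_iter {
    my ($len) = @_;
    my @alphabet = ('a' .. 'z');
    my @digits   = (0) x $len;      # positions into @alphabet
    my $done     = 0;
    return sub {
        return undef if $done;
        my $str = join '', @alphabet[@digits];
        my $i = $len - 1;           # increment rightmost digit, carry left
        while ($i >= 0 && ++$digits[$i] == @alphabet) {
            $digits[$i--] = 0;
        }
        $done = 1 if $i < 0;
        return $str;
    };
}

my $next = make_string_iter(5);
while (defined(my $s = $next->())) {
    print "$s\n";
}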
Also note that, while this is a clever use of glob, the library is meant to work with files. Apart from misusing it in principle, I think that it will check each of (the ~ 12 million) names for a valid entry! (See this page.) That's a lot of unneeded disk work. (And if you were to use "globs" like * or ? on some systems it returns a list with only strings that actually have files, so you'd quietly get different results.)
† I'm getting 56 bytes for the size of a 5-char scalar. While that is for a declared variable, which may take a little more than an anonymous scalar, in the test program with length-4 strings the actual total size is indeed a good order of magnitude larger than the naively computed one. So the real thing may well be on the order of 1 GB, in one operation.
Update: A simple test program that generates that list of 5-char long strings (using the same glob approach) ran for 15-ish minutes on a server-class machine and took 725 MB of memory.
It did produce the right number of actual 5-char long strings, seemingly correct, on this server.
Everything has some limitation.
Here's a pure Perl module that can do it for you iteratively. It doesn't generate the entire list at once and you start to get results immediately:
use v5.10;
use Set::CrossProduct;

my $set = Set::CrossProduct->new( [ ([ 'a'..'z' ]) x 5 ] );
while( my $item = $set->get ) {
    say join '', @$item;
}

perl6: Cannot unbox 65536 bit wide bigint into native integer

I try some examples from Rosettacode and encounter an issue with the provided Ackermann example: When running it "unmodified" (I replaced the utf-8 variable names by latin-1 ones), I get (similar, but now copyable):
$ perl6 t/ackermann.p6
65533
19729 digits starting with 20035299304068464649790723515602557504478254755697...
Cannot unbox 65536 bit wide bigint into native integer
in sub A at t/ackermann.p6 line 3
in sub A at t/ackermann.p6 line 11
in sub A at t/ackermann.p6 line 3
in block <unit> at t/ackermann.p6 line 17
Removing the proto declaration in line 3 (by commenting out):
$ perl6 t/ackermann.p6
65533
19729 digits starting with 20035299304068464649790723515602557504478254755697...
Numeric overflow
in sub A at t/ackermann.p6 line 8
in sub A at t/ackermann.p6 line 11
in block <unit> at t/ackermann.p6 line 17
What went wrong? The program doesn't allocate much memory. Is the natural integer kind-of limited?
I replaced in the code from Ackermann function the 𝑚 with m and the 𝑛 with n for better terminal interaction for copying errors and tried to comment out proto declaration. I also asked Liz ;)
use v6;
proto A(Int \m, Int \n) { (state @)[m][n] //= {*} }
multi A(0, Int \n) { n + 1 }
multi A(1, Int \n) { n + 2 }
multi A(2, Int \n) { 3 + 2 * n }
multi A(3, Int \n) { 5 + 8 * (2 ** n - 1) }
multi A(Int \m, 0 ) { A(m - 1, 1) }
multi A(Int \m, Int \n) { A(m - 1, A(m, n - 1)) }
# Testing:
say A(4,1);
say .chars, " digits starting with ", .substr(0,50), "..." given A(4,2);
A(4, 3).say;
Please read JJ's answer first. It's breezy and led to this answer which is effectively an elaboration of it.
TL;DR A(4,3) is a very big number, one that cannot be computed in this universe. But raku(do) will try. As it does you will blow past reasonable limits related to memory allocation and indexing if you use the caching version and limits related to numeric calculations if you don't.
I try some examples from Rosettacode and encounter an issue with the provided Ackermann example
Quoting the task description with some added emphasis:
Arbitrary precision is preferred (since the function grows so quickly)
raku's standard integer type Int is arbitrary precision. The raku solution uses them to compute the most advanced answer possible. It only fails when you make it try to do the impossible.
When running it "unmodified" (I replaced the utf-8 variable names by latin-1 ones)
Replacing the variable names is not a significant change.
But adding the A(4,3) line shifted the code from being computable in reality to not being computable in reality.
The example you modified has just one explanatory comment:
Here's a caching version of that ... to make A(4,2) possible
Note that the A(4,2) solution is nearly 20,000 digits long.
If you look at the other solutions on that page most don't even try to reach A(4,2). There are comments like this one on the Phix version:
optimised. still no bignum library, so ack(4,2), which is power(2,65536)-3, which is apparently 19729 digits, and any above, are beyond (the CPU/FPU hardware) and this [code].
A solution for A(4,2) is the most advanced possible.
A(4,3) is not computable in practice
To quote Academic Kids: Ackermann function:
Even for small inputs (4,3, say) the values of the Ackermann function become so large that they cannot be feasibly computed, and in fact their decimal expansions cannot even be stored in the entire physical universe.
So computing A(4,3).say is impossible (in this universe).
It must inevitably lead to an overflow of even arbitrary precision integer arithmetic. It's just a matter of when and how.
Cannot unbox 65536 bit wide bigint into native integer
The first error message mentions this line of code:
proto A(Int \m, Int \n) { (state @)[m][n] //= {*} }
The state @ is an anonymous state array variable.
By default @ variables use the default concrete type for raku's abstract array type. This default array type provides a balance between implementation complexity and decent performance.
While computing A(4,2) the indexes (m and n) remain small enough that the computation completes without overflowing the default array's indexing limit.
This limit is a "native" integer (note: not a "natural" integer). A "native" integer is what raku calls the fixed width integers supported by the hardware it's running on, typically a long long which in turn is typically 64 bits.
A 64 bit wide index can handle indices up to 9,223,372,036,854,775,807.
But in trying to compute A(4,3) the algorithm generates a 65536-bit (8192-byte) wide integer index. Such an integer could be as big as 2^65536, a 19,729 decimal digit number. But the biggest index allowed is a 64-bit native integer. So unless you comment out the caching line that uses an array, then for A(4,3) the program ends up throwing the exception:
Cannot unbox 65536 bit wide bigint into native integer
Limits to allocations and indexing of the default array type
As already explained, there is no array that could be big enough to help fully compute A(4,3). In addition, a 64 bit integer is already a pretty big index (9,223,372,036,854,775,807).
That said, raku can accommodate other array implementations such as Array::Sparse so I'll discuss that briefly below because such possibilities might be of interest for other problems.
But before discussing bigger arrays, running the code below on tio.run shows the practical limits for the default array type on that platform:
my @array;
@array[2**29]++; # works
@array[2**30]++; # could not allocate 8589967360 bytes
@array[2**60]++; # Unable to allocate ... 1152921504606846977 elements
@array[2**63]++; # Cannot unbox 64 bit wide bigint into native integer
(Comment out error lines to see later/greater errors.)
The "could not allocate 8589967360 bytes" error is a MoarVM panic. It's a result of tio.run refusing a memory allocation request.
I think the "Unable to allocate ... elements" error is a raku level exception that's thrown as a result of exceeding some internal Rakudo implementation limit.
The last error message shows the indexing limit for the default array type even if vast amounts of memory were made available to programs.
What if someone wanted to do larger indexing?
It's possible to create/use other # (does Positional) data types that support things like sparse arrays etc.
And, using this mechanism, it's possible that someone could write an array implementation that supports larger integer indexing than is supported by the default array type (presumably by layering logic on top of the underlying platform's instructions; perhaps the Array::Sparse I linked above does).
If such an alternative were called BigArray then the cache line could be replaced with:
my @array is BigArray;
proto A(Int \𝑚, Int \𝑛) { @array[𝑚][𝑛] //= {*} }
Again, this still wouldn't be enough to store interim results for fully computing A(4,3) but my point was to show use of custom array types.
Numeric overflow
When you comment out the caching you get:
Numeric overflow
Raku/Rakudo do arbitrary precision arithmetic. While this is sometimes called infinite precision it obviously isn't actually infinite but is instead, well, "arbitrary", which in this context also means "sane" for some definition of "sane".
This classically means running out of memory to store a number. But in Rakudo's case I think there's an attempt to keep things sane by switching from a truly vast Int to a Num (a floating point number) before completely running out of RAM. But then computing A(4,3) eventually overflows even a double float.
So while the caching blows up sooner, the code is bound to blow up later anyway, and then you'd get a numeric overflow that would either manifest as an out of memory error or a numeric overflow error as it is in this case.
Array subscripts use native ints; that's why you get the error in line 3, when you use the big ints as array subscripts. You might have to define a new BigArray that uses Ints as array subscripts.
The second problem arises in the ** operator: the result is a Real, and when the low-level operation returns a Num, it throws an exception.
https://github.com/rakudo/rakudo/blob/master/src/core/Int.pm6#L391-L401
So creating a BigArray might not be helpful anyway. You'll have to create your own ** too, that always works with Int, but you seem to have hit the (not so infinite) limit of the infinite precision Ints.

How are scalars stored 'under the hood' in perl?

The basic types in perl are different from those in most languages, with the types being scalar, array, and hash (but apparently not subroutines, &, which I guess are really just scalar references with syntactic sugar). What is most odd about this is that the most common data types: int, boolean, char, string, all fall under the basic data type "scalar". It seems that perl decides whether to treat a scalar as a string, boolean, or number based on the operator that modifies it, implying the scalar itself is not actually defined as "int" or "String" when saved.
This makes me curious as to how these scalars are stored "under the hood", particularly in regard to its effect on efficiency (yes, I know scripting languages sacrifice efficiency for flexibility, but they still need to be as optimized as possible where flexibility concerns are not affected). It's much easier for me to store the number 65535 (which takes two bytes) than the string "65535", which takes 6 bytes; recognizing that $val = 65535 is storing an int would allow me to use 1/3 of the memory, and in large arrays this could mean fewer cache misses as well.
It's not just limited to saving memory, of course. There are times when I can offer more significant optimizations if I know what type of scalar to expect. For instance, if I have a hash using very large integers as keys, it would be far faster to look up a value if I recognized the keys as ints, allowing a simple modulo for creating my hash key, than if I have to run more complex hashing logic on a string that has 3 times the bytes.
So I'm wondering how perl handles these scalars under the hood. Does it store every value as a string, sacrificing the extra memory and CPU cost of constantly converting string to int in the case that a scalar is always used as an int? Or does it have some logic for inferring the type of scalar used, to determine how to save and manipulate it?
Edit:
TJD linked to perlguts, which answers half my question. A scalar is actually stored as a string, a signed or unsigned integer, a double, or a pointer. I'm not too surprised; I had mostly expected this behavior to occur under the hood, though it's interesting to see the exact types. I'm leaving this question open, though, because perlguts is actually too low level. Other than telling me that 5 data types exist, it doesn't specify how perl works to alternate between them, i.e. how perl decides which SV type to use when a scalar is saved and how it knows when/how to cast.
There are actually a number of types of scalars. A scalar of type SVt_IV can hold undef, a signed integer (IV) or an unsigned integer (UV). One of type SVt_PVIV can also hold a string[1]. Scalars are silently upgraded from one type to another as needed[2]. The TYPE field indicates the type of a scalar. In fact, arrays (SVt_AV) and hashes (SVt_HV) are really just types of scalars.
While the type of a scalar indicates what the scalar can contain, flags are used to indicate what a scalar does contain. This is stored in the FLAGS field. SVf_IOK signals that a scalar contains a signed integer, while SVf_POK indicates it contains a string[3].
Devel::Peek's Dump is a great tool for looking at the internals of scalars. (The constant prefixes SVt_ and SVf_ are omitted by Dump.)
$ perl -e'
use Devel::Peek qw( Dump );
my $x = 123;
Dump($x);
$x = "456";
Dump($x);
$x + 0;
Dump($x);
'
SV = IV(0x25f0d20) at 0x25f0d30 <-- SvTYPE(sv) == SVt_IV, so it can contain an IV.
REFCNT = 1
FLAGS = (IOK,pIOK) <-- IOK: Contains an IV.
IV = 123 <-- The contained signed integer (IV).
SV = PVIV(0x25f5ce0) at 0x25f0d30 <-- The SV has been upgraded to SVt_PVIV
REFCNT = 1 so it can also contain a string now.
FLAGS = (POK,IsCOW,pPOK) <-- POK: Contains a string (but no IV since !IOK).
IV = 123 <-- Meaningless without IOK.
PV = 0x25f9310 "456"\0 <-- The contained string.
CUR = 3 <-- Number of bytes used by PV (not incl \0).
LEN = 10 <-- Number of bytes allocated for PV.
COW_REFCNT = 1
SV = PVIV(0x25f5ce0) at 0x25f0d30
REFCNT = 1
FLAGS = (IOK,POK,IsCOW,pIOK,pPOK) <-- Now contains both a string (POK) and an IV (IOK).
IV = 456 <-- This will be used in numerical contexts.
PV = 0x25f9310 "456"\0 <-- This will be used in string contexts.
CUR = 3
LEN = 10
COW_REFCNT = 1
illguts documents the internal format of variables quite thoroughly, but perlguts might be a better place to start.
If you start writing XS code, keep in mind it's usually a bad idea to check what a scalar contains. Instead, you should request what should have been provided (e.g. using SvIV or SvPVutf8). Perl will automatically convert the value to the requested type (and warn if appropriate). API calls are documented in perlapi.
Footnotes
1. In fact, it can hold a string and either a signed integer or an unsigned integer at the same time.
2. All scalars (including arrays and hashes, excluding one type of scalar that can only hold undef) have two memory blocks at their base. Pointers to the scalar point to its head, which contains the TYPE field and a pointer to the body. Upgrading a scalar replaces the body of the scalar. That way, pointers to the scalar aren't invalidated by an upgrade.
3. An undef variable is one without any uppercase OK flags set.
The formats used by Perl for data storage are documented in the perlguts perldoc.
In short, though, a Perl scalar is stored as a SV structure containing one of a number of different types, such as an int, a double, a char *, or a pointer to another scalar. (These types are stored as a C union, so only one of them will be present at a time; the SV contains flags indicating which type is used.)
(With regard to hash keys, there's an important gotcha to note there: hash keys are always strings, and are always stored as strings. They're stored in a different type from other scalars.)
The Perl API includes a number of functions which can be used to access the value of a scalar as a desired C type. For example, SvIV() can be used to return the integer value of a SV: if the SV contains an int, that value is returned directly; if the SV contains another type, it's coerced to an integer as appropriate. These functions are used throughout the Perl interpreter for type conversions. However, there is no automatic inference of types on output; functions which operate on strings will always return a PV (string) scalar, for instance, regardless of whether the string "looks like" a number or not.
If you're curious what a given scalar looks like internally, you can use the Devel::Peek module to dump its contents.
Others have addressed the "how are scalars stored" part of your question, so I'll skip that. With regard to how Perl decides which representation of a value to use and when to convert between them, the answer is it depends on which operators are applied to the scalar. For example, given this code:
my $score = 0;
The scalar $score will be initialised with an integer value. But then when this line of code is run:
say "Your score is $score";
The double quote operator means that Perl will need a string representation of the value. So the conversion from integer to string will take place as part of the process of assembling the string argument to the say function. Interestingly, after the stringification of $score, the underlying representation of the scalar will now include both an integer and a string representation, allowing subsequent operations to directly grab the relevant value without having to convert again. If a numeric operator is then applied to the string (e.g.: $score++) then the numeric part will be updated and the (now invalid) string part will be discarded.
This is the reason why Perl operators tend to come in two flavours. For example comparing values of numbers is done with <, ==, > while performing the same comparisons with strings would be done with lt, eq, gt. Perl will coerce the value of the scalar(s) to the type which matches the operator. This is why the + operator does numeric addition in Perl but a separate operator . is needed to do string concatenation: + will coerce its arguments to numeric values and . will coerce to strings.
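A quick illustration of that coercion (a sketch): the same pair of scalars orders differently under the numeric and string comparison operators.
my ($x, $y) = (10, 9);

print $x <  $y ? "x < y\n" : "x >= y\n";   # numeric: x >= y
print $x lt $y ? "x < y\n" : "x >= y\n";   # string: x < y ("10" sorts before "9")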
There are some operators that will work with both numeric and string values but which perform a different operation depending on the type of value. For example:
$score = 0;
say ++$score; # 1
say ++$score; # 2
say ++$score; # 3
$score = 'aaa';
say ++$score; # 'aab'
say ++$score; # 'aac'
say ++$score; # 'aad'
With regard to questions of efficiency (and bearing in mind standard disclaimers about premature optimisation etc). Consider this code which reads a file containing one integer per line, each integer is validated to check it is exactly 8 digits long and the valid ones are stored in an array:
my @numbers;
while (<$fh>) {
    if (/^(\d{8})$/) {
        push @numbers, $1;
    }
}
Any data read from a file will initially come to us as a string. The regex used to validate the data will also require a string value in $_. So the result is that our array @numbers will contain a list of strings. However, if further uses of the values will be solely in a numeric context, we could use this micro-optimisation to ensure that the array contains only numeric values:
push @numbers, 0 + $1;
In my tests with a file of 10,000 lines, populating @numbers with strings used nearly three times as much memory as populating it with integer values. However, as with most benchmarks, this has little relevance to normal day-to-day coding in Perl. You'd only need to worry about it in situations where you (a) had performance or memory issues and (b) were working with a large number of values.
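A rough way to reproduce that kind of comparison (a sketch using Devel::Size; the exact byte counts will vary by perl version and platform):
use Devel::Size qw( total_size );

# The same 10,000 eight-digit values, stored as strings and as numbers.
my @as_strings = map { sprintf '%08d', $_ } 1 .. 10_000;
my @as_numbers = map { 0 + $_ } @as_strings;

print total_size(\@as_strings), " bytes as strings\n";
print total_size(\@as_numbers), " bytes as numbers\n";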
It's worth pointing out that some of this behaviour is common to other dynamic languages (e.g.: Javascript will silently coerce numeric values to strings).

Why does Perl reallocate memory following this pattern?

The memory addresses for anonymous arrays are naturally re-used by perl. As this example shows, they cycle between two addresses for empty arrays:
$ perl -E "say [] for (1..6)"
ARRAY(0x37b23c)
ARRAY(0x37b28c)
ARRAY(0x37b23c)
ARRAY(0x37b28c)
ARRAY(0x37b23c)
ARRAY(0x37b28c)
I came up with some theories on why it couldn't reallocate the memory immediately, when I found that the cycle isn't always two addresses long. The following examples' cycles are 3 and 4.
$ perl -E "say [0] for (1..6)"
ARRAY(0x39b23c)
ARRAY(0x39b2ac)
ARRAY(0x39b28c)
ARRAY(0x39b23c)
ARRAY(0x39b2ac)
ARRAY(0x39b28c)
$ perl -E "say [0,0] for (1..6)"
ARRAY(0x64b23c)
ARRAY(0x64b2cc)
ARRAY(0x64b2ac)
ARRAY(0x64b28c)
ARRAY(0x64b23c)
ARRAY(0x64b2cc)
What causes this peculiarity of memory management?
When SVs are freed, they are actually put into a "free" pool. Perhaps the order in which they enter the pool affects the order in which they exit.
Within the set of examples you've given, the number of addresses is not "two, or sometimes more". It's "the number of elements in the anonymous array, plus two". As ikegami said, the SVs go into a pool when freed, so it is to be expected that the addresses will cycle in some fashion, unless a deliberate effort has been made to retrieve them in a random order (which has obviously not been done).
The remaining question, then, is why the length of the cycle is "number of elements + 2". Perhaps it's using one SV for each element of the array, one for the arrayref itself, and one for $_?

MD5/SHA "update" property?

What is the MD5/SHA property that allows you to "update" them? For example, if you have the hash for "test" you can add "case" to get the hash for "testcase". I would like to read up on this property a bit but my searches turn up nothing...
It is merely that they are actually calculated incrementally -- you calculate them by operating on the first n bytes of data (64 in the case of MD5, see http://en.wikipedia.org/wiki/MD5#Algorithm), then on the next n bytes of data, etc.
EDIT: This isn't even theoretically possible, due to the 1-bit padding I mention below. In effect, md5("case", seed=md5("test")) == md5("test" + <1-bit> + "case"). There is no way to use md5("test") to incrementally compute md5("test" + "case").
This is theoretically possible if you concatenate 512-bit chunks. It won't work for appending "case" to "test", because the first run of the state machine is polluted by the padding used to turn "test" into a 512-bit chunk.
Additionally, the padding isn't just a bunch of zeros. The message is always first padded with a 1 bit, so that "case" and "case\0" produce different hashes. Thus you can't rely on "case" having the same hash with or without padding.
The MD5 algorithm has the following steps:
1) pad input string to a multiple of 64 bytes
2) split input string into blocks of 64 bytes
3) initialise state (a 4-element array)
4) for each block: state <= transform(state,block)
5) encode state as string
To support situations where you want to hash something in stages (e.g. large files), this can be refactored as follows.
Initialise:
1) initialise state
2) leftover bytes <= ""
Update:
1) append leftover bytes to start of input string
2) split input string into blocks of 64 bytes
3) for each complete block: state <= transform(state,block)
4) leftover bytes <= contents of the incomplete block, if one exists
Digest:
1) pad a copy of the leftover bytes
2) split the padded leftover bytes into blocks of 64 bytes
3) tmp_state <= state
4) for each block: tmp_state <= transform(tmp_state,block)
5) encode tmp_state as string
I've actually implemented this approach in VBA - it seems to work fine. Any suggestions for where I should upload the code?
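For reference, Perl's Digest::MD5 exposes exactly this update/digest split; here is a minimal check that staged updates match the one-shot hash:
use Digest::MD5 qw( md5_hex );

# Feed the data in two stages through the streaming interface...
my $ctx = Digest::MD5->new;
$ctx->add("test");
$ctx->add("case");

# ...and compare against hashing the concatenation in one go.
print $ctx->hexdigest eq md5_hex("testcase") ? "equal\n" : "different\n";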